Stanford NLP Group Releases Stanza: a Python NLP Toolkit

The Stanford NLP Group recently released Stanza, a new Python natural language processing toolkit. Stanza features both a language-agnostic, fully neural pipeline for text analysis (supporting 66 human languages) and a Python interface to Stanford's CoreNLP Java software.

Stanza version 1.0.0 is the next version of the library previously known as "stanfordnlp". Researchers and engineers building text analysis pipelines can use Stanza's tools for tasks such as tokenization, multi-word token expansion, lemmatization, part-of-speech and morphological feature tagging, dependency parsing, and named-entity recognition (NER). Compared to existing popular NLP toolkits that aid in similar tasks, Stanza aims to support more human languages, increase accuracy in text analysis tasks, and remove the need for any preprocessing by providing a unified framework for processing raw human language text. A table comparing Stanza's features with those of other NLP toolkits can be found in Stanza's associated research paper.
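As a rough illustration of the tasks listed above, the sketch below runs a default English pipeline and flattens the result into one row per word. It assumes Stanza is installed and that the English models can be downloaded; the helper names `word_rows` and `analyze` are made up for this example and are not part of Stanza.

```python
def word_rows(doc):
    """Flatten a parsed document into (text, lemma, upos, head, deprel) tuples."""
    return [
        (word.text, word.lemma, word.upos, word.head, word.deprel)
        for sentence in doc.sentences
        for word in sentence.words
    ]


def analyze(text, lang="en"):
    """Run the default pipeline for `lang` on raw text (hypothetical helper)."""
    import stanza  # imported lazily so word_rows() works without Stanza installed

    stanza.download(lang)             # fetches the models on first use
    nlp = stanza.Pipeline(lang=lang)  # loads the default processors for the language
    return word_rows(nlp(text))
```

Calling `analyze("Stanford University is located in California.")` would return one tuple per word. The `head` field is the 1-based index of the word's governor within its sentence, with 0 marking the root.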

Stanza's pipeline is trained on 112 datasets, including many multilingual corpora such as the Universal Dependencies (UD) treebanks. The UD project attempts to facilitate multilingual parser development, cross-lingual learning, and parsing research from a language typology perspective by developing cross-linguistically consistent treebank annotation for over 70 languages. Stanza's fully neural architecture generalizes well, achieving competitive performance on all languages tested.

The research paper reports results from tests run on the UD treebanks and on a multilingual NER dataset. On the UD treebanks, Stanza shows that its language-agnostic pipeline architecture can adapt to different languages, achieving the highest macro-averaged scores over 100 treebanks covering 66 languages.
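A macro-averaged score gives every treebank equal weight regardless of its size, which makes it a reasonable measure of how evenly a pipeline performs across languages. A minimal sketch of the difference from a size-weighted average, using made-up treebank names, scores, and sizes rather than figures from the paper:

```python
def macro_average(scores):
    """Unweighted mean of per-treebank scores."""
    return sum(scores.values()) / len(scores)


def weighted_average(scores, sizes):
    """Mean weighted by treebank size, shown only for contrast."""
    total = sum(sizes[name] for name in scores)
    return sum(scores[name] * sizes[name] for name in scores) / total


# Hypothetical scores and token counts, for illustration only.
scores = {"treebank_a": 90.0, "treebank_b": 70.0}
sizes = {"treebank_a": 9000, "treebank_b": 1000}

macro = macro_average(scores)               # 80.0
weighted = weighted_average(scores, sizes)  # 88.0
```

The large treebank dominates the weighted figure, while the macro average penalizes weak performance on the small one just as heavily.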

On the NER component, Stanza achieves similar F1 scores to FLAIR (on 75% smaller NER models) and outperforms spaCy.
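For reference, NER F1 is conventionally computed at the entity level: a predicted entity counts as correct only if both its span and its type match the gold annotation. The sketch below assumes that convention and uses invented spans, not data from the paper; the function name `entity_f1` is hypothetical.

```python
def entity_f1(gold, predicted):
    """Entity-level F1 over (start, end, type) triples with exact matching."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Invented annotations: (start_token, end_token, entity_type).
gold = [(0, 2, "ORG"), (5, 6, "LOC"), (9, 10, "PER")]
predicted = [(0, 2, "ORG"), (5, 6, "PER")]  # second span has the wrong type

score = entity_f1(gold, predicted)
```

Here one of two predictions is correct (precision 1/2) and one of three gold entities is found (recall 1/3), so the F1 works out to 0.4.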

Stanza also offers a Python interface to Stanford's Java CoreNLP software, which provides additional tools to NLP practitioners. Taking advantage of CoreNLP's existing server interface, Stanza adds a robust client that starts the CoreNLP server automatically as a local process when the client is instantiated. The client communicates with the server through RESTful APIs.
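To make the client-server exchange more concrete, the sketch below builds (but does not send) the kind of HTTP request a CoreNLP server accepts: annotator settings travel as a JSON `properties` query parameter, and the raw text is the POST body. It deliberately uses only the standard library rather than Stanza's own client (`stanza.server.CoreNLPClient`), and it assumes a server listening on CoreNLP's default port 9000. The function name `build_corenlp_request` is hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request


def build_corenlp_request(text, annotators=("tokenize", "ssplit", "pos"),
                          endpoint="http://localhost:9000"):
    """Build a POST request for a CoreNLP server without sending it."""
    properties = {
        "annotators": ",".join(annotators),
        "outputFormat": "json",
    }
    # The server reads its settings from a URL-encoded JSON query parameter.
    query = urlencode({"properties": json.dumps(properties)})
    return Request(f"{endpoint}/?{query}", data=text.encode("utf-8"), method="POST")


request = build_corenlp_request("Stanza wraps CoreNLP.")
```

With a server running, passing `request` to `urllib.request.urlopen` would return the annotations as JSON. Stanza's client handles this exchange, along with starting and stopping the server process, on the user's behalf.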

In the future, the team behind Stanza hopes to provide an interface for outside researchers to contribute their models, improve computational efficiency, and extend the toolkit's functionality by implementing other processors. The team at spaCy quickly migrated spacy-stanza (which allows users to import Stanza models as spaCy pipelines) to work with this new API.

 
