
NLP Library spaCy 3.0 Features Transformer-Based Models and Distributed Training

AI software maker Explosion has announced version 3.0 of spaCy, its open-source natural-language-processing (NLP) library. The new release includes state-of-the-art Transformer-based pipelines and pre-trained models for 17 languages.

The release was announced on Explosion's blog. In addition to spaCy's pre-trained models, the new version also provides interoperability with custom PyTorch and TensorFlow models and the HuggingFace Transformer library. For training custom models, spaCy introduces a new workflow and a configuration definition system as well as support for distributed training using Ray. A project definition system allows for easy integration with 3rd-party tools such as FastAPI for serving models in production. According to the Explosion team,

Our main aim with the release is to make it easier to bring your own models into spaCy, especially state-of-the-art models like transformers.

Explosion announced the alpha release of spaCy in 2015 and version 1.0 the following year. Version 2.0 was released in 2017 and included pre-trained convolutional neural network (CNN) models for several languages. That same year, Google's landmark paper on Transformers was published, and the following year Google released the BERT model, leading to widespread adoption of the Transformer deep-learning architecture and new state-of-the-art results in the NLP space.

The key concept in using spaCy is the processing pipeline: a sequence of NLP operations performed on the input text, such as part-of-speech (POS) tagging or named-entity recognition (NER). The new version of spaCy includes a Transformer pipeline component which can wrap 3rd-party models, such as those from the HuggingFace Transformers library. The release also includes several pre-built Transformer-based pipelines for English, German, Spanish, French, and Chinese. These models achieve near state-of-the-art performance on several tasks; for example, the English model's POS tagger scores 97.8% accuracy, compared with 97.96% for the leading model. Pre-trained pipelines which do not use Transformers are available for 10 other languages.
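The pipeline idea can be sketched in plain Python. The `Doc` class and components below are toy stand-ins to illustrate the concept of chained components mutating a shared document object; they are not spaCy's actual API:

```python
# Toy illustration of a processing pipeline: an ordered list of
# components, each transforming a shared document object.
class Doc:
    def __init__(self, text):
        self.text = text
        self.tokens = text.split()
        self.pos = []
        self.ents = []

def tagger(doc):
    # Toy POS tagger: tag capitalized tokens as PROPN, everything else as X.
    doc.pos = ["PROPN" if t[0].isupper() else "X" for t in doc.tokens]
    return doc

def ner(doc):
    # Toy NER: treat every PROPN token as an entity.
    doc.ents = [t for t, p in zip(doc.tokens, doc.pos) if p == "PROPN"]
    return doc

pipeline = [tagger, ner]

def process(text):
    doc = Doc(text)
    for component in pipeline:
        doc = component(doc)
    return doc

doc = process("Explosion released spaCy")
```

In spaCy itself, the same shape appears as `nlp(text)` running each registered pipeline component in order over a `Doc` object.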

[Image: spaCy Pipeline]

The release introduces a new configuration system for training custom spaCy pipelines. All configuration information is stored in a single config file. The file explicitly includes all parameters required for training, with no hidden defaults; the goal is to make it easier to document all configuration and track changes. The configuration system also supports importing custom components from deep-learning frameworks, including TensorFlow, PyTorch, and MXNet. The release also includes a project definition system that manages end-to-end machine-learning workflows and integrations with other tools, including Ray for distributed training and FastAPI for hosting model-serving apps.
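As a rough sketch, a spaCy v3 training config is a single INI-style file with explicit, nested sections. The abridged fragment below is illustrative only; real configs are generated with `python -m spacy init config` and contain many more required settings:

```ini
# Abridged, illustrative config.cfg fragment (not a complete working config)
[nlp]
lang = "en"
pipeline = ["transformer","ner"]

[components.transformer]
factory = "transformer"

[components.ner]
factory = "ner"

[training]
max_epochs = 10

[training.optimizer]
@optimizers = "Adam.v1"
```

Because every value must appear explicitly in the file, the config doubles as a complete, versionable record of how a pipeline was trained.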

Although the Explosion team did try to "keep the breaking changes to a minimum," the release does introduce some incompatibilities. First, spaCy 3.0 has dropped support for Python 2; the minimum supported version of Python is 3.6. Several APIs have also been removed, although most had been deprecated for some time. The release documentation includes a migration guide for users upgrading to the new version.

Several users praised spaCy in a discussion about the new release on Hacker News. One user claimed that the new version "practically covers 90% of NLP use-cases with near [state-of-the-art] performance." Explosion co-founder and spaCy developer Matthew Honnibal also joined the discussion. In response to one user's question about adopting the library, he noted:

Overall the theme of what we're doing is helping you to line up the workflows you use during development with something you can actually ship....All that said...if your main goal is to develop a model, run an experiment and publish a paper, you might find spaCy doesn't do much that makes your life easier.

The spaCy 3.0 source code and release notes are available on GitHub.
