
Google Releases TensorFlow.Text Library for Natural Language Processing


Google released TensorFlow.Text (TF.Text), a new text-processing library for its TensorFlow deep-learning platform. The library allows common text pre-processing activities, such as tokenization, to be handled by the TensorFlow graph computation system, improving the consistency and portability of deep-learning models for natural-language processing (NLP).

In a recent blog post, software engineer and TF.Text team-lead Robbie Neale gave a high-level overview of the contents of the new release, focusing on the new tools for tokenizing text strings. The library also includes tools for pattern-matching, n-gram creation, unicode normalization, and sequence constraints. The code is designed to operate on RaggedTensors: variable-length tensors which are better-suited for processing textual sequences. A key benefit of the library, according to Neale, is that these pre-processing steps are now first-class citizens of the TensorFlow compute graph, which gives them all the advantages of that system. In particular, according to the documentation, "[y]ou do not need to worry about tokenization in training being different than the tokenization at inference...." 

Because deep-learning algorithms require all data to be represented as lists of numbers (a.k.a. tensors), the first step in any natural-language processing task is to convert text data to numeric data. Typically this is done in pre-processing scripts before handing the result to the deep-learning framework. The most common operation is tokenization: breaking the text into its individual words. Each unique word is given a numeric ID; often this is simply its index in a list of all known words. The result is a sequence of numbers which can be input to a neural network.
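The steps above can be sketched in plain Python; the helper names here are illustrative, not part of any library:

```python
def tokenize(sentence):
    """Break text into individual word tokens (naive whitespace split)."""
    return sentence.lower().split()

def build_vocab(corpus):
    """Assign each unique word a numeric ID: its index of first appearance."""
    vocab = {}
    for sentence in corpus:
        for token in tokenize(sentence):
            vocab.setdefault(token, len(vocab))
    return vocab

corpus = ["the cat sat", "the dog sat down"]
vocab = build_vocab(corpus)  # {'the': 0, 'cat': 1, 'sat': 2, 'dog': 3, 'down': 4}

# Convert a new sentence into the numeric sequence a network would consume.
ids = [vocab[t] for t in tokenize("the dog sat")]
print(ids)  # [0, 3, 2]
```

A real pipeline would also handle out-of-vocabulary words and padding, but the core mapping from text to tensor is exactly this word-to-ID lookup.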

However, even though tokenization is not strictly part of the neural network model, it is a necessary component of the full NLP "pipeline." Any code that uses a trained neural network for NLP inference must replicate the tokenization and other pre-processing tasks that were included in the training system. Now that TF.Text allows these pre-processing tasks to be represented as operations in TensorFlow, the full pipeline can be saved as part of the model and consistently reproduced at inference time with no extra steps.

A Twitter user pointed out that Keras, the high-level deep-learning API that runs on top of TensorFlow, already has text pre-processing functionality. Neale replied:

Keras has a subset, but not the breadth of TF.Text. We are actively talking with them to fill in gaps we believe language engineers want, but are not provided in the core Keras API, and I wouldn't be surprised if additional Keras layers are provided by TF.Text in the future.

The TensorFlow.Text source code and a tutorial notebook are available on GitHub.
