Microsoft and Google Release New Benchmarks for Cross-Language AI Tasks

Research teams at Microsoft Research and Google AI have announced new benchmarks for cross-language natural-language understanding (NLU) tasks for AI systems, including named-entity recognition and question answering. Google's XTREME covers 40 languages and includes nine tasks, while Microsoft's XGLUE covers 27 languages and eleven tasks.

The two benchmarks and related experiments were described in papers published on arXiv. Microsoft's XGLUE is a cross-language extension of the General Language Understanding Evaluation (GLUE) for English NLU tasks, and includes language-generation scenarios as well as understanding tasks; the Microsoft team claims that XGLUE is the "first attempt" at creating cross-language generation task benchmarks. Google's Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark covers nine NLU tasks across several categories (sentence classification, sentence retrieval, structured prediction, and question answering) in a wide range of "typologically diverse" languages, including several under-studied languages from Africa and southern India. In a post on Google AI's blog, team members Melvin Johnson and Sebastian Ruder wrote,

We hope that XTREME will catalyze research in multilingual transfer learning, similar to how benchmarks such as GLUE and SuperGLUE have spurred the development of deep monolingual models, including BERT, RoBERTa, XLNet, ALBERT, and others.

To evaluate a model with XTREME, the model is first pre-trained on a multilingual text corpus "using objectives that encourage cross-lingual learning"; typically this corpus is the text of Wikipedia in each of the supported languages. Next, the model is fine-tuned on task-specific data, which is available in English only. Finally, XTREME evaluates the model on task-specific test data in the benchmark's other languages.
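
As a rough illustration of this zero-shot transfer protocol (a sketch, not code from either paper), the example below fine-tunes mBERT on English XNLI data and then measures accuracy on other languages. It assumes the Hugging Face transformers and datasets libraries, which the article does not mention, and the hyperparameters and subsample size are placeholders.

```python
# Sketch of XTREME-style zero-shot cross-lingual evaluation on XNLI.
# Assumptions: Hugging Face `transformers` and `datasets`; illustrative
# hyperparameters. Step 1 (multilingual pre-training) is already baked
# into the mBERT checkpoint loaded here.
import numpy as np
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_name = "bert-base-multilingual-cased"  # mBERT, one of the baselines tested
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=3)

def tokenize(batch):
    # XNLI examples are premise/hypothesis pairs with a 3-way label.
    return tokenizer(batch["premise"], batch["hypothesis"],
                     truncation=True, padding="max_length", max_length=128)

# Step 2: fine-tune on English-only task data (subsampled to keep the demo quick).
train_en = (load_dataset("xnli", "en", split="train")
            .shuffle(seed=0).select(range(20_000))
            .map(tokenize, batched=True))

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="xtreme-demo", num_train_epochs=2,
                           per_device_train_batch_size=32),
    train_dataset=train_en,
)
trainer.train()

# Step 3: evaluate zero-shot on the same task in other languages.
for lang in ["de", "sw", "ur"]:  # a few of XNLI's 15 languages
    test = load_dataset("xnli", lang, split="test").map(tokenize, batched=True)
    preds = trainer.predict(test)
    acc = (np.argmax(preds.predictions, axis=-1) == preds.label_ids).mean()
    print(f"{lang}: zero-shot accuracy = {acc:.3f}")
```

The gap between the English score and the other languages' scores is exactly what a benchmark of this kind is designed to expose.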

The Google team tested several state-of-the-art models on their benchmark, including multilingual BERT (mBERT), XLM, XLM-R, and M4. They found that the models achieve "close to human" performance on English, with much lower performance on other languages, particularly on the sentence retrieval and structured prediction tasks. Of the models, XLM-R performed best.

Microsoft's XGLUE uses several of the same tasks as XTREME, including MLQA, XNLI, PAWS-X, NER, and POS. It also includes news classification and web-page ranking tasks, as well as two text-generation tasks: question generation and news-title generation. The Microsoft team also created an extension of the Unicoder pre-trained model for cross-language NLU tasks, and in their experiments compared it with the mBERT, XLM, and XLM-R models. They found that Unicoder outperformed the other models on "almost all tasks."

In response to the publication of these benchmark papers, Alexis Conneau, a Facebook AI researcher and co-inventor of XNLI and XLM, tweeted:

Building a strong and trusted evaluation benchmark is so important although sometimes undervalued. Thanks to [Sam Bowman] and his team, we have GLUE which has helped us all showcase the importance of language model pretraining. And now we have strong XLU benchmarks as well.

The Microsoft team has not yet released the code or models for XGLUE. Code and data for XTREME are available on GitHub. Google's blog also promises an "upcoming website launch with a submission portal and leaderboard" where models can be measured against the benchmark.
