AI Models from Google and Microsoft Exceed Human Performance on Language Understanding Benchmark

Research teams from Google and Microsoft have recently developed natural language processing (NLP) AI models which have scored higher than the human baseline score on the SuperGLUE benchmark. SuperGLUE measures a model's score on several natural language understanding (NLU) tasks, including question answering and reading comprehension.

Both teams submitted their models to the SuperGLUE Leaderboard on January 5. Microsoft Research's model, Decoding-enhanced BERT with disentangled attention (DeBERTa), scored 90.3 on the benchmark, slightly beating Google Brain's entry, based on the Text-to-Text Transfer Transformer (T5) and the Meena chatbot, which scored 90.2. Both exceeded the human baseline score of 89.8. Microsoft has open-sourced a smaller version of DeBERTa and announced plans to release the code and models for the latest version. Google has not published details of its latest model; while the T5 code is open-source, the Meena chatbot is not.

The General Language Understanding Evaluation (GLUE) benchmark was developed in 2018 as a method for evaluating the performance of NLP models such as BERT and GPT. GLUE is a collection of nine NLU tasks based on publicly-available datasets. Because of the rapid pace of improvement in NLP models, GLUE's evaluation "headroom" has diminished, and in 2019 researchers introduced SuperGLUE, a more challenging benchmark.

SuperGLUE contains eight subtasks:

  • BoolQ (Boolean Questions) - a question answering task where the model must answer short yes-or-no questions
  • CB (CommitmentBank) - a textual entailment task where the hypothesis must be extracted from an embedded clause
  • COPA (Choice of Plausible Alternatives) - a causal reasoning task where the model is given a premise and two possible cause-or-effect answers
  • MultiRC (Multi-Sentence Reading Comprehension) - a question answering task where the model must answer a question about a context paragraph
  • ReCoRD (Reading Comprehension with Commonsense Reasoning Dataset) - a question answering task where a model is given a news article and a Cloze-style question about it, in which one entity is masked out. The model must choose the proper replacement for the mask from a list
  • RTE (Recognizing Textual Entailment) - a textual entailment task where the model must determine whether the meaning of one text can be inferred (entailed) from another
  • WiC (Word-in-Context) - a word sense disambiguation task where the model must determine if a single word is used in the same sense in two different passages
  • WSC (Winograd Schema Challenge) - a coreference resolution task where a model must determine a pronoun's antecedent

All these tasks were chosen from published works in the NLP research space. Some of them (WiC, MultiRC, RTE, and ReCoRD) included human performance baselines in their original papers. To determine a baseline human performance for the rest, the SuperGLUE team hired human workers via Amazon Mechanical Turk to annotate the datasets.
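
The underlying data for each task is publicly available. As an illustration of what the tasks look like, the following sketch loads the BoolQ subtask; it assumes the Hugging Face datasets library, which hosts a copy of the SuperGLUE data, and the field names used in that copy.

    # Sketch: inspect one SuperGLUE task (BoolQ) via the Hugging Face datasets library
    from datasets import load_dataset

    # BoolQ pairs a short passage with a yes/no question
    boolq = load_dataset("super_glue", "boolq", split="validation")

    example = boolq[0]
    print(example["passage"][:200])   # context paragraph
    print(example["question"])        # yes/no question about the passage
    print(example["label"])           # binary label (0 = no, 1 = yes)

Each subtask has its own configuration name ("cb", "copa", "multirc", and so on), and a model's overall SuperGLUE score is an average of its per-task scores.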

In early 2020, Google Brain announced their Meena chatbot. Google did not release the code or pre-trained model, citing challenges related to safety and bias. The team did publish a paper describing the architecture, which is based on a 2.6B parameter sequence-to-sequence neural network called Evolved Transformer. By contrast, the T5 transformer used in the new model is open-source, with pre-trained models available at sizes up to 11B parameters. Google has not published details about its leaderboard entry; the description on the SuperGLUE leaderboard says it is a "new way of combining T5 and Meena models with single-task fine-tuning," and that a paper will be published soon.
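
Because the released T5 checkpoints cast every task, including the SuperGLUE subtasks, as text-to-text, they can be exercised with a few lines of code. The sketch below uses the Hugging Face transformers port of the open-source checkpoints; the task prefix in the prompt is illustrative and is not Google's unpublished leaderboard recipe.

    # Sketch: run an open-source T5 checkpoint through its text-to-text interface
    from transformers import T5ForConditionalGeneration, T5Tokenizer

    tokenizer = T5Tokenizer.from_pretrained("t5-small")
    model = T5ForConditionalGeneration.from_pretrained("t5-small")

    # Tasks are framed as text in, text out; a prefix identifies the task
    prompt = ("boolq passage: The SuperGLUE benchmark contains eight tasks. "
              "question: Does SuperGLUE contain eight tasks?")
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=5)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))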

Microsoft announced the latest 1.5B parameter version of their DeBERTa model in a recent blog post. Originally released in mid-2020, DeBERTa improves on BERT-derived architectures using three new techniques: disentangled attention, an enhanced mask decoder, and virtual adversarial training for fine-tuning. Disentangled attention separates each word's position embedding from its content embedding, in contrast to standard BERT models where the two values are summed together. The enhanced mask decoder incorporates absolute position information when predicting masked words. The fine-tuning approach improves stability during adversarial training, which in turn increases model generalization.
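
To make the disentangled-attention idea concrete, the following rough numpy sketch computes attention scores from separate content and relative-position vectors; the shapes and indexing are simplified and do not reproduce the full formulation in the DeBERTa paper.

    # Rough sketch: disentangled attention scores from separate content and position vectors
    import numpy as np

    seq_len, dim = 4, 8
    rng = np.random.default_rng(0)

    H = rng.normal(size=(seq_len, dim))       # per-token content vectors
    P = rng.normal(size=(2 * seq_len, dim))   # relative-position embeddings

    Wq_c, Wk_c = rng.normal(size=(dim, dim)), rng.normal(size=(dim, dim))
    Wq_r, Wk_r = rng.normal(size=(dim, dim)), rng.normal(size=(dim, dim))

    Qc, Kc = H @ Wq_c, H @ Wk_c               # content queries and keys
    Qr, Kr = P @ Wq_r, P @ Wk_r               # position queries and keys

    scores = np.zeros((seq_len, seq_len))
    for i in range(seq_len):
        for j in range(seq_len):
            d_ij = i - j + seq_len            # relative distance, offset to index into P
            d_ji = j - i + seq_len
            scores[i, j] = (
                Qc[i] @ Kc[j]                 # content-to-content
                + Qc[i] @ Kr[d_ij]            # content-to-position
                + Kc[j] @ Qr[d_ji]            # position-to-content
            )
    scores /= np.sqrt(3 * dim)                # scale over the three score terms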

While Microsoft's research teams continue to make progress in the NLP field, including DeBERTa and a 17B parameter model called Turing-NLG, in late 2020 the company announced an exclusive license of OpenAI's 175B parameter NLP model, GPT-3, which was trained on a supercomputer hosted in Microsoft's Azure cloud. OpenAI, like Google, has been slow to release its pre-trained models, citing concerns about misuse.

The code and smaller pre-trained models from Microsoft's 2020 release of DeBERTa are available on GitHub; the company plans to release the latest version soon. Google's T5 code and pre-trained models are also available on GitHub.
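
One convenient way to try the released checkpoints is through the Hugging Face transformers port of DeBERTa; the sketch below loads the smaller 2020 release, not the new 1.5B parameter version.

    # Sketch: load a released DeBERTa checkpoint via the Hugging Face transformers port
    from transformers import DebertaModel, DebertaTokenizer

    tokenizer = DebertaTokenizer.from_pretrained("microsoft/deberta-base")
    model = DebertaModel.from_pretrained("microsoft/deberta-base")

    inputs = tokenizer("DeBERTa keeps content and position embeddings separate.",
                       return_tensors="pt")
    outputs = model(**inputs)
    print(outputs.last_hidden_state.shape)    # (batch, sequence length, hidden size)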
 
