A team of scientists at LMU Munich has developed Pattern-Exploiting Training (PET), a deep-learning training technique for natural language processing (NLP) models. Using PET, the team trained a Transformer NLP model with 223M parameters that outperformed the 175B-parameter GPT-3 by over 3 percentage points on the SuperGLUE benchmark.
PhD student Timo Schick and professor Hinrich Schütze of the university's Center for Information and Language Processing described their process and experimental results in a paper published on arXiv. PET is a technique for fine-tuning a pre-trained language model that generates additional "soft-labeled" training data from unlabeled examples. This helps the model improve performance in "few-shot" scenarios, such as NLP benchmarks which have very few labeled examples for fine-tuning. Using PET, the researchers fine-tuned an ALBERT Transformer model and achieved an average score of 76.8 on the SuperGLUE benchmark, compared to GPT-3's 71.8.
Supervised machine learning often requires large datasets to perform well on tasks such as computer vision or NLP. However, labeling these large datasets can be time-consuming and expensive, as it requires human workers to manually identify objects in images or rate a sentence's sentiment. For NLP tasks, many researchers have turned to transfer learning, where a large model is pre-trained via self-supervised learning on a large unlabeled dataset, such as the contents of Wikipedia. Once a model is pre-trained, it can be "fine-tuned" for a specific task, such as sentiment analysis, using supervised learning on a much smaller labeled dataset. Most state-of-the-art NLP results are achieved by fine-tuning a pre-trained Transformer model.
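As a concrete illustration of this pre-train-then-fine-tune workflow, the sketch below fine-tunes a pre-trained Transformer for sentiment analysis. It is a minimal example assuming the Hugging Face transformers and datasets libraries; the model and dataset choices (albert-base-v2, a small slice of IMDB) are illustrative and not taken from the paper.

```python
# Minimal sketch: fine-tune a pre-trained Transformer on a small labeled dataset.
# Assumes the Hugging Face "transformers" and "datasets" libraries are installed;
# model and dataset names are illustrative, not from the paper.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("albert-base-v2")
model = AutoModelForSequenceClassification.from_pretrained("albert-base-v2", num_labels=2)

# Because the model is already pre-trained, a relatively small labeled set can suffice.
dataset = load_dataset("imdb", split="train[:1000]")
dataset = dataset.map(
    lambda ex: tokenizer(ex["text"], truncation=True, padding="max_length", max_length=128),
    batched=True,
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="albert-sentiment",
                           num_train_epochs=1,
                           per_device_train_batch_size=16),
    train_dataset=dataset,
)
trainer.train()
```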
Few-shot learning is a scenario related to fine-tuning that tests a model's ability to generalize to new tasks given only a few examples of that task, often fewer than one hundred, sometimes as few as one ("one-shot") or even none ("zero-shot"). OpenAI's 175B-parameter GPT-3 showed that a large pre-trained model could perform well in few-shot learning scenarios without any fine-tuning of the model's parameters; instead, placing a textual description of the task along with a few text examples in the model's input "context" was sufficient to produce "near state-of-the-art results" with only 32 examples. However, Schick and Schütze point out some drawbacks of this strategy: limits on the context size restrict the number of examples that can be used, and more importantly, it relies on a model that is so large it is not "usable in many real-world scenarios."
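The sketch below shows what such an in-context few-shot "prompt" might look like: the task description and a handful of labeled examples are concatenated into the model's input, and no parameters are updated. The task, examples, and wording are illustrative assumptions, not GPT-3's actual prompts.

```python
# Minimal sketch of GPT-3-style in-context few-shot prompting.
# The resulting string is fed to the language model as its context;
# the model then continues the text with a predicted label.
task_description = "Classify the sentiment of each review as positive or negative."
examples = [
    ("A moving, beautifully shot film.", "positive"),
    ("Dull plot and wooden acting.", "negative"),
]
query = "One of the best concerts I have ever attended."

prompt = task_description + "\n\n"
for text, label in examples:
    prompt += f"Review: {text}\nSentiment: {label}\n\n"
prompt += f"Review: {query}\nSentiment:"

print(prompt)
```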
In order to achieve similar performance with a smaller model, the researchers developed PET, a semi-supervised training technique that generates additional training data from the few-shot examples. PET works by first converting the input examples into cloze-style phrases, which are used to fine-tune an ensemble of language models; the ensemble then annotates a large unlabeled dataset to produce a "soft-labeled" dataset, and the final model is fine-tuned on that soft-labeled data. To apply PET to SuperGLUE, the team created FewGLUE, a few-shot version of the benchmark containing a small set of labeled training examples per task along with unlabeled data, and used it to fine-tune an ALBERT model that exceeded GPT-3's few-shot performance on the SuperGLUE benchmark.
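The core building block of this process is a pattern (a cloze template) paired with a verbalizer (a mapping from labels to vocabulary tokens). The sketch below shows that idea for a sentiment task using the Hugging Face fill-mask pipeline; the pattern, verbalizer, and example text are illustrative assumptions rather than the paper's own code, and the full method additionally ensembles several pattern-specific models before distilling them into the final classifier.

```python
# Minimal sketch of PET's pattern-verbalizer idea, assuming the Hugging Face
# "transformers" fill-mask pipeline with a masked language model.
# Pattern, verbalizer, and example below are illustrative only.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="albert-base-v2")
mask = fill_mask.tokenizer.mask_token

def pattern(review: str) -> str:
    # Convert a raw input into a cloze-style phrase with one masked token.
    return f"{review} All in all, it was {mask}."

# Verbalizer: map each label to a single token in the model's vocabulary.
verbalizer = {"positive": "great", "negative": "terrible"}

def soft_label(review: str) -> dict:
    # Score each label by the probability the masked LM assigns to its
    # verbalizer token; normalizing gives a "soft label" distribution.
    scores = fill_mask(pattern(review), targets=list(verbalizer.values()))
    probs = {res["token_str"].strip(): res["score"] for res in scores}
    total = sum(probs.values())
    return {label: probs[token] / total for label, token in verbalizer.items()}

print(soft_label("The movie was a complete waste of time."))
```

In the full procedure, soft labels like these, produced over a large unlabeled dataset by an ensemble of fine-tuned models, become the training signal for the final model.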
Lead author Schick answered several questions about the work in discussions on Reddit. Commenters noted that although PET produced better results for NLP benchmarks, GPT-3 appeared more flexible. Schick agreed:
GPT-3 certainly is much better than our approach at generating long sequences of text (e.g., summarization or machine translation).
Schick and Schütze have open-sourced their PET code and FewGLUE dataset on GitHub.