Researchers at Amazon Alexa AI have announced Alexa Teacher Models (AlexaTM 20B), a 20-billion-parameter sequence-to-sequence (seq2seq) language model that exhibits state-of-the-art performance on 1-shot and few-shot NLP tasks. AlexaTM 20B outperforms GPT-3 on the SuperGLUE and SQuADv2 benchmarks while having fewer than 1/8 the number of parameters.
The model and experiments were described in an Amazon Science whitepaper. Unlike large decoder-only language models such as GPT-3 and PaLM, AlexaTM 20B is a seq2seq model; that is, it contains an encoder as well as a decoder. The encoder stage gives AlexaTM 20B better performance on summarization and machine translation (MT) tasks than larger decoder-only models such as PaLM. The model is multilingual and achieves state-of-the-art performance on few-shot MT tasks on the Flores-101 dataset, even on low-resource languages. According to co-author Saleh Soltan:
"All in all, we demonstrated in our work that the proposed style of pretraining enables seq2seq models that outperform much larger decoder-only LLMs across different tasks, both in a few-shot setting and with fine-tuning. We hope our work presents a compelling case for seq2seq models as a powerful alternative to decoder-only models for LLM training."
The Alexa research team noted that their work is subject to several constraints that do not generally apply to language models. The Alexa digital assistant supports multiple languages, and the input text is "spoken-form," which can differ from the written form of text used in training datasets. Further, because their work is intended to be used in an edge device, memory is at a premium and model inference must be low-latency; both of these favor smaller models.
To further reduce model size, the Amazon team investigated knowledge distillation. In a paper to be presented at the upcoming Knowledge Discovery and Data Mining Conference (KDD), the researchers demonstrated using a large model as a teacher to train smaller student models that were only about 0.2% the size of the teacher (for example, 17M parameters vs 9.3B).
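For readers unfamiliar with knowledge distillation, the following is a minimal sketch of the general idea, not the Amazon team's implementation. It assumes the common formulation in which a student is trained to match the teacher's temperature-softened output distribution; the function names, the temperature value, and the example logits are all hypothetical and chosen only for illustration. The last helper simply checks the size ratio quoted above.

```python
import math


def softmax(logits, temperature=1.0):
    """Convert raw logits to probabilities, softened by a temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL divergence from the teacher's softened distribution to the student's.

    Generic textbook formulation (an assumption here, not taken from the
    KDD paper). The temperature**2 factor is the usual rescaling that keeps
    gradient magnitudes comparable across temperatures.
    """
    p = softmax(teacher_logits, temperature)  # teacher (target) distribution
    q = softmax(student_logits, temperature)  # student distribution
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2


def size_ratio_percent(student_params, teacher_params):
    """Student size as a percentage of teacher size."""
    return 100.0 * student_params / teacher_params


if __name__ == "__main__":
    # Hypothetical logits over a three-token vocabulary.
    teacher = [4.0, 1.0, 0.5]
    student = [2.5, 1.5, 0.5]
    print("distillation loss:", distillation_loss(teacher, student))
    # 17M-parameter student vs 9.3B-parameter teacher, as quoted in the article.
    print("student size (% of teacher):", size_ratio_percent(17e6, 9.3e9))
```

The loss is zero when the student reproduces the teacher's distribution exactly and grows as the two diverge. In practice this term is often combined with an ordinary supervised loss on labeled data.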
The researchers evaluated the 20B teacher model on several NLP benchmarks. On the MLSum benchmark, AlexaTM 20B outperformed the state of the art for 1-shot summarization in German, Spanish, and French. It also outperformed the state of the art on 1-shot MT tasks for most language pairs; in particular, on low-resource languages like Telugu, Tamil, and Marathi, the improvement was "significant." The model outperformed GPT-3 on MT tasks "in most English centric cases." Although the model outperformed GPT-3 on most SuperGLUE NLP tasks, it trailed behind Google's much larger PaLM model.
Several users discussed the work in a thread on Hacker News. One pointed out the advantages of AlexaTM 20B over GPT-3:
"Building a model downstream of GPT-3 is difficult and usually yields suboptimal results; however, 20b is small enough that it would be easy to finetune this on a smaller dataset for a specific task. You could then distill that model and end up with something that’s a fraction of the size (6b parameters for example, just under 1/3, would fit on commercial GPUs like 3090s)."
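The commenter's sizing claim can be sanity-checked with back-of-the-envelope arithmetic. The sketch below is an illustration under stated assumptions, not a measurement: it counts only the memory needed to hold model weights (ignoring activations, caches, and optimizer state), assumes 16-bit weights by default, and treats an RTX 3090 as having roughly 24 GB of memory. The helper names are hypothetical.

```python
# Assumed storage cost per parameter for common numeric formats.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}


def weight_memory_gib(num_params, dtype="fp16"):
    """Approximate memory needed for the weights alone, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 2**30


def fits_on_gpu(num_params, vram_gib, dtype="fp16"):
    """True if the weights alone fit in the given amount of GPU memory."""
    return weight_memory_gib(num_params, dtype) <= vram_gib


if __name__ == "__main__":
    rtx_3090_gib = 24  # assumption: treated as approximately 24 GiB
    for name, params in [("6B distilled model", 6e9), ("20B model", 20e9)]:
        need = weight_memory_gib(params)
        verdict = "fits" if fits_on_gpu(params, rtx_3090_gib) else "does not fit"
        print(f"{name}: ~{need:.1f} GiB in fp16, {verdict} on one 24 GiB card")
```

Under these assumptions, a 6B-parameter model needs roughly 11 GiB for its weights in 16-bit precision, while a 20B-parameter model needs roughly 37 GiB, which is consistent with the commenter's point that the distilled model would fit on a single consumer GPU while the full model would not.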
The AlexaTM 20B model has not yet been publicly released, but the researchers have created a repository for it on GitHub and noted that it will be released soon.