A team of researchers from OpenAI recently published a paper describing GPT-3, a deep-learning model for natural-language processing with 175 billion parameters, 100x more than the previous version, GPT-2. The model is pre-trained on nearly half a trillion words and achieves state-of-the-art performance on several NLP benchmarks without fine-tuning.
In a paper published on arXiv, a team of over 30 co-authors described the model and several experiments. The researchers' goal was to produce an NLP system that performs well on a variety of tasks with little or no fine-tuning, and previous work had indicated that larger models might be the solution. To test that hypothesis, the team increased the size of their previous model, GPT-2, from 1.5 billion parameters to 175 billion. For training, the team collected several datasets, including the Common Crawl dataset and the English-language Wikipedia. The model was evaluated against several NLP benchmarks, matching state-of-the-art performance on "closed-book" question-answering tasks and setting a new record for the LAMBADA language modeling task.
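Working with "little or no fine-tuning" means the model is given a natural-language description of the task, and possibly a handful of worked examples, directly in its input prompt, with no gradient updates. The snippet below sketches what such a few-shot prompt looks like, using the English-to-French translation example from the paper; the surrounding Python is only illustrative scaffolding, not OpenAI's evaluation code.

```python
# A hypothetical few-shot prompt: the task description and examples appear as
# plain text in the model's context window; the model is simply asked to
# continue the text, with no parameter updates.
prompt = """Translate English to French.

sea otter => loutre de mer
peppermint => menthe poivrée
cheese =>"""

# A well-trained model should complete the final line with "fromage".
print(prompt)
```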
OpenAI made headlines last year with GPT-2 and their decision not to release the 1.5-billion-parameter version of the trained model due to "concerns about malicious applications of the technology." GPT-2 is one of many large-scale NLP models based on the Transformer architecture. These models are pre-trained on large text corpora, such as the contents of Wikipedia, using self-supervised learning. In this scenario, instead of using a dataset containing inputs paired with expected outputs, the model is given a sequence of text with words "masked" and it must learn to predict the masked words based on the surrounding context. After this pre-training, the models are then fine-tuned with a labelled benchmark dataset for a particular NLP task, such as question-answering.
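To make the self-supervised setup concrete, the sketch below builds (masked input, target) training pairs from raw text. It assumes simple whitespace tokenization and a single "[MASK]" placeholder, which are simplifications: production models operate on subword tokens, and GPT-style models in particular predict each next token from its left context rather than filling in masks.

```python
# A minimal sketch of self-supervised training-pair construction: hide some
# words and record which words (and positions) the model must recover.
# Whitespace tokenization and the "[MASK]" token are simplifying assumptions.
import random

def make_masked_example(text, mask_rate=0.15, seed=1):
    rng = random.Random(seed)
    tokens = text.split()
    inputs, targets = list(tokens), []
    for i, token in enumerate(tokens):
        if rng.random() < mask_rate:
            inputs[i] = "[MASK]"
            targets.append((i, token))  # position and original word to predict
    return inputs, targets

masked, labels = make_masked_example(
    "the model is given a sequence of text with some words hidden")
print(masked)  # tokens with some positions replaced by "[MASK]" (positions depend on the random draw)
print(labels)  # the (position, word) pairs the model is trained to recover
```

No labeled data is needed: the training signal comes entirely from the text itself, which is what allows pre-training on web-scale corpora.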
However, researchers have found that the pre-trained models perform fairly well even without fine-tuning, especially for large models pre-trained on large datasets. Earlier this year, OpenAI published a paper postulating several "laws of scaling" for Transformer models. Based on performance data from several different Transformer-based models, OpenAI concluded that model performance (in this case, the cross-entropy loss on the test dataset) has a power-law relationship with the number of model parameters, the size of the dataset, and the amount of compute used for training. Increasing any of those three variables would thus improve performance.
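As a rough illustration of what such a power law means in practice, the snippet below evaluates a loss-versus-parameter-count curve of the form L(N) = (Nc / N)^alpha. The constants used here are round placeholder values for illustration, not the fitted coefficients reported by OpenAI.

```python
# Illustrative power-law relationship between model size and test loss:
# loss falls smoothly as parameter count grows. The constants below are
# placeholder values, not OpenAI's fitted coefficients.
def power_law_loss(n_params, n_critical=1e14, alpha=0.08):
    """Predicted cross-entropy loss as a function of parameter count N."""
    return (n_critical / n_params) ** alpha

for n_params in (1.25e8, 1.5e9, 1.75e11):  # GPT-3 Small, GPT-2, GPT-3 175B
    print(f"{n_params:.2e} parameters -> predicted loss {power_law_loss(n_params):.2f}")
```

Under a curve of this shape, each 10x increase in parameters buys a steady, predictable reduction in loss, which is the reasoning behind scaling GPT-2 up by two orders of magnitude.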
For pre-training, the team collected a dataset composed of Common Crawl, WebText, English-language Wikipedia, and two corpora of books. To improve data quality, the researchers filtered Common Crawl to remove redundancies. Because Common Crawl is scraped from the internet, it may contain the actual test data for the benchmark evaluations, which would "taint" the training. The team did attempt to remove this contamination; however, they admit:
Unfortunately, a bug in the filtering caused us to ignore some overlaps, and due to the cost of training it was not feasible to retrain the model.
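The overlap filtering they describe can be pictured as an n-gram comparison between training documents and benchmark test sets, roughly as sketched below. The n-gram length and the helper functions are assumptions for illustration; the paper's actual procedure is more involved.

```python
# A minimal sketch of benchmark-contamination filtering: drop any training
# document that shares a long word n-gram with an evaluation example.
# The n-gram length and these helpers are illustrative assumptions.
def ngrams(text, n):
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def is_contaminated(train_doc, test_examples, n=8):
    doc_grams = ngrams(train_doc, n)
    return any(doc_grams & ngrams(example, n) for example in test_examples)

test_set = ["which of the following statements about cheese is true"]
corpus = [
    "an unrelated news story about local weather and road closures",
    "quiz answers: which of the following statements about cheese is true",
]
clean = [doc for doc in corpus if not is_contaminated(doc, test_set, n=5)]  # short n for this toy example
print(len(clean))  # prints 1: the overlapping document is filtered out
```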
The team used this data to train eight versions of the model, ranging in size from 125 million parameters to the full 175 billion. The models were evaluated on dozens of NLP benchmarks, in a wide range of categories, with performance near or above state-of-the-art in many cases. To evaluate the model on a news-article generation task, the team used Amazon Mechanical Turk to hire human judges to guess which of a pair of articles was real and which was generated by GPT-3. Humans chose the real article only 52% of the time, essentially no better than a coin flip. The team also discussed some weaknesses of the model. For example, on text synthesis, "GPT-3 samples still sometimes repeat themselves semantically at the document level, start to lose coherence over sufficiently long passages, contradict themselves, and occasionally contain non-sequitur sentences or paragraphs." The model also has difficulty with "common sense physics" questions, such as, "If I put cheese into the fridge, will it melt?"
Several members of the NLP research community have commented on Twitter about the model's size. AlchemyAPI founder Elliot Turner speculated that the cost to train the largest model could be "nearly $12 million dollars." Prof. Mark Riedl suggested an explanation for the link between model size and performance:
One hypothesis is that GPT-3 has so many parameters (half the number of tokens trained on) that it is starting to act like a memory network.
As with GPT-2, OpenAI has not released the trained model or the code, although there is a GitHub repository containing some of the test datasets as well as a collection of text samples generated by the model.