In January of 2020, the independent research organization OpenAI empirically identified trends in the loss of neural language models across different architectures, model sizes, compute budgets, and dataset sizes, a large-scale empirical study of a kind previously carried out only on fully connected networks. Natural language processing is a subfield of machine learning used in everything from Google Translate to grammar checkers, but state-of-the-art models require large amounts of data, model capacity, and computing power. The authors found that the three key factors governing model scale are the number of non-embedding model parameters (N), the size of the dataset (D), and the amount of training compute (C), while architectural details such as depth and width have only a weak effect on performance. Next, performance exhibits a power-law relationship with each of the three scale factors. Finally, overfitting follows a predictable pattern across a wide variety of models: if N or D is increased while the other is held fixed, the performance penalty scales with the ratio N^0.74 / D, so the two factors must be increased in tandem.
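The paper also reports a combined fit that captures this joint dependence; approximately, L(N, D) = [(N_c / N)^(α_N / α_D) + D_c / D]^α_D, where the fitted constants α_N, α_D, N_c, and D_c are the same ones that appear in the individual scaling laws below.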
When models trained on one text distribution are evaluated on another, they determine that transfer incurs a roughly constant penalty but otherwise improves in step with performance on the training distribution. Furthermore, larger models are more sample-efficient than smaller models, reaching the same performance with fewer optimization steps and fewer data points. In fact, compute-efficient training uses very large models and stops them well short of convergence.
The test loss of a Transformer can be predicted with a power law when performance is limited by the number of non-embedding parameters (N), the dataset size (D), or the optimally allocated compute budget (C). The first scaling law applies to models with a limited number of parameters, trained to convergence on a sufficiently large dataset:
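L(N) = (N_c / N)^α_N, with fitted values reported in the paper of approximately α_N ≈ 0.076 and N_c ≈ 8.8 × 10^13 non-embedding parameters.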
The second scaling law applies to large models trained on a limited dataset (measured in tokens) with early stopping:
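L(D) = (D_c / D)^α_D, with reported values of approximately α_D ≈ 0.095 and D_c ≈ 5.4 × 10^13 tokens.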
The third scaling law is that with a sufficiently large dataset, an optimally sized model, and a sufficiently small batch size, the test loss decreases as a power law in the compute budget:
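L(C_min) = (C_c^min / C_min)^(α_C^min), with reported values of approximately α_C^min ≈ 0.050 and C_c^min ≈ 3.1 × 10^8 PF-days; here C_min denotes the compute used when training at the critical batch size.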
These relationships all hold over eight orders of magnitude. The critical batch size, which sets the trade-off between training time and compute efficiency, is itself a power law in the loss, growing as the loss decreases:
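B_crit(L) = B_* / L^(1/α_B), with reported values of approximately B_* ≈ 2 × 10^8 tokens and α_B ≈ 0.21.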
They train the language models on WebText2, tokenized using byte-pair encoding. The models are parameterized by the hyperparameters n_layer (number of layers), d_model (dimension of the residual stream), d_ff (dimension of the intermediate feed-forward layer), d_attn (dimension of the attention output), and n_heads (number of attention heads per layer), and are trained with the Adam optimizer (Adafactor for the largest models) for a fixed 2.5 × 10^5 steps with a batch size of 512 sequences of 1024 tokens. Model size ranged from 768 to 1.5 billion non-embedding parameters, dataset size ranged from 22 million to 23 billion tokens, and depth, width, attention heads, and feed-forward dimension were all varied. Context length was 1024 tokens, and batch size was 2^19 tokens for most runs. Transformers performed slightly better than LSTMs, but slightly worse per parameter than recurrent Transformers, which reuse parameters at the cost of additional compute.
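For reference, with the standard shape choices d_attn = d_model and d_ff = 4 d_model, the non-embedding parameter count is approximately N ≈ 2 d_model n_layer (2 d_attn + d_ff) ≈ 12 n_layer d_model^2; for example, a hypothetical 12-layer model with d_model = 768 works out to roughly 85 million non-embedding parameters.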