
OpenAI Approximates Scaling Laws for Neural Language Models



In January 2020, the independent research organization OpenAI published an empirical study of how the accuracy of neural language models varies with architecture, model size, compute budget, and dataset size — a large-scale computational effort previously carried out only on fully connected networks. Natural language processing powers applications from Google Translate to grammar checkers, but state-of-the-art models demand large amounts of data, model complexity, and computing power. The authors found, first, that three factors govern model scale: the number of model parameters (N), the size of the dataset (D), and the amount of compute (C), while shape details such as depth and width have only a weak effect on performance. Second, performance exhibits a power-law relationship with each of the three scale factors. Finally, overfitting follows a predictable pattern across a wide variety of models: if N is increased without increasing D, the performance penalty scales with the ratio N^0.74 / D, so N and D must be increased in tandem.
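The combined dependence on N and D can be sketched numerically. A minimal sketch, assuming the fitted constants reported in the paper (α_N ≈ 0.076, α_D ≈ 0.095, N_c ≈ 8.8 × 10^13 non-embedding parameters, D_c ≈ 5.4 × 10^13 tokens); the function name is illustrative, not an official implementation:

```python
# Combined scaling law L(N, D) with the paper's fitted constants.
ALPHA_N, ALPHA_D = 0.076, 0.095
N_C, D_C = 8.8e13, 5.4e13  # non-embedding parameters, tokens

def predicted_loss(n_params: float, n_tokens: float) -> float:
    """Predicted test loss (nats/token) for N parameters and D tokens."""
    return ((N_C / n_params) ** (ALPHA_N / ALPHA_D) + D_C / n_tokens) ** ALPHA_D

# Scaling N alone hits a wall once the dataset term dominates,
# which is why N and D must grow together.
baseline = predicted_loss(1e8, 1e10)
bigger_model = predicted_loss(1e11, 1e10)  # 1000x parameters, same data
bigger_both = predicted_loss(1e11, 1e13)   # also 1000x data
```

Increasing only the parameter count yields diminishing returns as the D-dependent term comes to dominate the predicted loss.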

When training a model, the authors determined that transfer to a different distribution incurs a roughly constant penalty but otherwise improves in line with performance on the training distribution. Furthermore, larger models are more sample-efficient than smaller models, achieving similar performance with fewer optimization steps and fewer data points. In fact, very large models reach optimal performance before convergence.

The test loss of a transformer can be predicted by a power law when performance is limited by the number of non-embedding parameters (N), the dataset size (D), or the optimally allocated compute budget (C). The first scaling law applies to models with a limited number of parameters, trained to convergence on a sufficiently large dataset:

L(N) = (N_c / N)^α_N, with α_N ≈ 0.076 and N_c ≈ 8.8 × 10^13 (non-embedding parameters)

The second scaling law applies to large models trained on a limited dataset with early stopping:

L(D) = (D_c / D)^α_D, with α_D ≈ 0.095 and D_c ≈ 5.4 × 10^13 (tokens)

The third scaling law states that with a sufficiently large dataset, an optimally sized model, and a sufficiently small batch size, the test loss decreases with the compute budget:

L(C_min) = (C_c / C_min)^α_C, with α_C ≈ 0.050 and C_c ≈ 3.1 × 10^8 (PF-days)

These relationships hold across eight orders of magnitude. The critical batch size — the point of diminishing returns for data parallelism — depends only on the current loss, growing as a power of 1/L:

B_crit ≈ B* / L^(1/α_B), with B* ≈ 2 × 10^8 tokens and α_B ≈ 0.21
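As a concrete illustration, the critical batch size relation can be evaluated directly. A minimal sketch, assuming the fitted values B* ≈ 2 × 10^8 tokens and α_B ≈ 0.21 from the paper; the function name is illustrative:

```python
# Critical batch size B_crit(L) = B* / L^(1/alpha_B): as the loss falls
# during training, larger batches can be used without wasting compute.
B_STAR, ALPHA_B = 2e8, 0.21  # fitted values reported in the paper

def critical_batch_size(loss_nats: float) -> float:
    """Critical batch size in tokens at a given test loss (nats/token)."""
    return B_STAR / loss_nats ** (1.0 / ALPHA_B)

early = critical_batch_size(4.0)  # early in training: high loss, smaller batch
late = critical_batch_size(2.0)   # later: lower loss, much larger batch
```

Because the loss appears with a large exponent (1/α_B ≈ 4.8), halving the loss raises the critical batch size by more than an order of magnitude.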

The authors trained the language models on WebText2, tokenized using byte-pair encoding. Model shape was parameterized by the hyperparameters n_layer (number of layers), d_model (dimension of the residual stream), d_ff (dimension of the intermediate feed-forward layer), d_attn (dimension of the attention output), and n_heads (number of attention heads per layer). Training used the Adam optimizer (Adafactor for the largest models) for a fixed 2.5 × 10^5 steps with a batch size of 512 sequences of 1024 tokens. Model size ranged from 768 to 1.5 billion parameters, and dataset size from 22 million to 23 billion tokens; depth, width, attention heads, and feed-forward dimension all varied. Context length was 1024 tokens, and batch size was 2^19 tokens for most runs. Transformers performed slightly better than LSTMs, but slightly worse than recurrent transformers.
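The model sizes above can be sanity-checked with the paper's rough non-embedding parameter count, N ≈ 12 · n_layer · d_model², which assumes the standard shape d_ff = 4 · d_model and d_attn = d_model. The helper below is a sketch under that approximation:

```python
def approx_nonembedding_params(n_layer: int, d_model: int) -> int:
    """Rough non-embedding parameter count, N ~ 12 * n_layer * d_model^2."""
    # Assumes the standard transformer shape d_ff = 4 * d_model
    # and d_attn = d_model.
    return 12 * n_layer * d_model ** 2

# A 48-layer model with d_model = 1600 lands near the 1.5-billion-parameter
# top end of the range studied.
n = approx_nonembedding_params(48, 1600)
```

This approximation counts only the attention and feed-forward weights, which is why it excludes the token and position embeddings.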
