Two minutes NLP — Scaling Laws for Neural Language Models

Relations between model performance, model size, model shape, and compute budget

Fabio Chiusano
NLPlanet
5 min read · Mar 18, 2022

--

Hello fellow NLP enthusiasts! Today we delve into a study of the relations between language model performance and factors like model scale, model shape, and compute budget. This article is a short summary built from extracts of the paper “Scaling Laws for Neural Language Models”, which I highly recommend reading in full. Enjoy! 😄

A study on language modeling performance

The paper Scaling Laws for Neural Language Models contains a study of empirical scaling laws for language model performance on the cross-entropy loss, focusing on the Transformer architecture.

From the experiments, it turns out that the test loss scales as a power law with model size, dataset size, and the amount of compute used for training, with some trends spanning more than seven orders of magnitude. This means that simple equations govern the relationships between these variables, and that these equations can be used to determine an optimally efficient configuration for training a very large language model. Moreover, other architectural details such as network width or depth appear to have minimal effects within a wide range.
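To make this concrete, here is a minimal Python sketch of the three independent power laws, using the approximate fitted constants reported in the paper (α_N ≈ 0.076 with N_c ≈ 8.8 × 10^13 non-embedding parameters, α_D ≈ 0.095 with D_c ≈ 5.4 × 10^13 tokens, α_C ≈ 0.050 with C_c ≈ 3.1 × 10^8 PF-days). The function names are just illustrative, and the constants are fits specific to the paper's setup, so treat them as indicative only.

```python
# Approximate power laws from "Scaling Laws for Neural Language Models".
# The constants are the paper's reported fits; they depend on the tokenizer,
# dataset, and training setup, so treat them as indicative only.

ALPHA_N, N_C = 0.076, 8.8e13   # non-embedding parameters
ALPHA_D, D_C = 0.095, 5.4e13   # dataset size in tokens
ALPHA_C, C_C = 0.050, 3.1e8    # compute in PF-days (optimally allocated)

def loss_vs_model_size(n_params: float) -> float:
    """Test loss when model size is the only bottleneck: L(N) = (N_c / N)^alpha_N."""
    return (N_C / n_params) ** ALPHA_N

def loss_vs_dataset_size(n_tokens: float) -> float:
    """Test loss when data is the only bottleneck: L(D) = (D_c / D)^alpha_D."""
    return (D_C / n_tokens) ** ALPHA_D

def loss_vs_compute(pf_days: float) -> float:
    """Test loss when compute is the only bottleneck: L(C) = (C_c / C)^alpha_C."""
    return (C_C / pf_days) ** ALPHA_C

# Example: a 1.5-billion-parameter model in the regime where N is the bottleneck.
print(f"L(N=1.5e9) ≈ {loss_vs_model_size(1.5e9):.2f} nats/token")
```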

As can be deduced from the experiments and the derived equations, larger models are significantly more sample-efficient; as a consequence, optimally compute-efficient training involves training very large models on a relatively modest amount of data and stopping significantly before convergence.

Experiments

To study language model scaling, a wide variety of models were trained, varying the following factors:

  • Model size (N): from 768 to 1.5 billion non-embedding parameters.
  • Dataset size (D): ranging from 22 million to 23 billion tokens.
  • Model shape: including depth, width, attention heads, and feed-forward dimension.
  • Context length: 1024 for most runs, with some experiments with shorter contexts.
  • Batch size: 2^19 tokens for most runs, with some variations to measure the critical batch size. Training at the critical batch size provides a roughly optimal compromise between time and compute efficiency (see the sketch after the definitions below).

Let’s define the following training variables as well:

  • Let L be the test cross-entropy loss.
  • Let C be the amount of compute used to train a model.
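As a side note on the critical batch size mentioned in the list above: the paper fits it as a power law in the loss alone, roughly B_crit(L) ≈ B* / L^(1/α_B) with B* ≈ 2 × 10^8 tokens and α_B ≈ 0.21. Here is a minimal sketch, treating the fitted constants as approximate and the function name as purely illustrative:

```python
# Critical batch size as a function of the test loss, per the paper's fit:
# B_crit(L) ≈ B* / L^(1/alpha_B). Constants are approximate fitted values.

B_STAR = 2e8      # tokens
ALPHA_B = 0.21

def critical_batch_size(loss: float) -> float:
    """Roughly optimal batch size (in tokens) for a model currently at this loss."""
    return B_STAR / loss ** (1.0 / ALPHA_B)

# Example: at a loss around 3.5 nats/token the critical batch size is roughly
# 5e5 tokens, i.e. close to the 2^19 tokens used for most runs in the paper.
print(f"{critical_batch_size(3.5):,.0f} tokens")
```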

Key findings

Taking inspiration from section 1.1 of the paper, we summarize the results of the experiments.

  • Performance depends strongly on model scale, weakly on model shape: Model performance depends most strongly on scale, which consists of three factors: the number of model parameters N (excluding embeddings), the size of the dataset D, and the amount of compute C used for training. Within reasonable limits, performance depends very weakly on other architectural hyperparameters such as depth vs. width.
  • Smooth power laws: Performance has a power-law relationship with each of the three scale factors N, D, C when not bottlenecked by the other two, with trends spanning more than six orders of magnitude.
Language modeling performance improves smoothly as we increase the amount of compute, dataset size, and model size used for training. For optimal performance, all three factors must be scaled up in tandem. Image from https://arxiv.org/pdf/2001.08361.pdf.

The paper differentiates between embedding and non-embedding parameters because their size correlates differently with model performance. When including embedding parameters, performance appears to depend strongly on the number of layers in addition to the number of parameters. When excluding embedding parameters, the performance of models with different depths converges to a single trend.

Left: When including embedding parameters, performance appears to depend strongly on the number of layers in addition to the number of parameters. Right: When excluding embedding parameters, the performance of models with different depths converges to a single trend. Image from https://arxiv.org/pdf/2001.08361.pdf.
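For reference, the paper approximates the non-embedding parameter count of a Transformer as N ≈ 12 · n_layer · d_model², assuming the standard d_attn = d_model and d_ff = 4 · d_model. Here is a small sketch with a hypothetical GPT-2-like configuration; the concrete values are illustrative only.

```python
# Approximate parameter counts as used in the paper:
# non-embedding N ≈ 12 * n_layer * d_model**2 (attention + feed-forward weights,
# assuming the standard d_attn = d_model and d_ff = 4 * d_model).

def non_embedding_params(n_layer: int, d_model: int) -> int:
    return 12 * n_layer * d_model ** 2

def embedding_params(vocab_size: int, n_ctx: int, d_model: int) -> int:
    # Token embeddings plus learned positional embeddings.
    return (vocab_size + n_ctx) * d_model

# Hypothetical GPT-2-like configuration (illustrative values only).
n_layer, d_model, vocab_size, n_ctx = 48, 1600, 50257, 1024
print(non_embedding_params(n_layer, d_model))         # ≈ 1.5e9
print(embedding_params(vocab_size, n_ctx, d_model))   # ≈ 8.2e7
```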
  • Universality of overfitting: Performance improves predictably as long as we scale up N and D in tandem, but enters a regime of diminishing returns if either N or D is held fixed while the other increases.
The early-stopped test loss depends predictably on the dataset size D and model size N. Left: For large D, performance is a straight power law in N. For a smaller fixed D, performance stops improving as N increases and the model begins to overfit. Right: The extent of overfitting depends predominantly on the ratio N^0.74/D. Image from https://arxiv.org/pdf/2001.08361.pdf.
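The paper combines these two dependencies into a single fit, L(N, D) = [(N_c/N)^(α_N/α_D) + D_c/D]^α_D. A minimal sketch, reusing the approximate constants from the earlier snippet:

```python
# Combined model-size / dataset-size fit from the paper:
# L(N, D) = [ (N_c / N)^(alpha_N / alpha_D) + D_c / D ]^alpha_D
# Constants are approximate fits and indicative only.

ALPHA_N, N_C = 0.076, 8.8e13
ALPHA_D, D_C = 0.095, 5.4e13

def loss_n_d(n_params: float, n_tokens: float) -> float:
    return ((N_C / n_params) ** (ALPHA_N / ALPHA_D) + D_C / n_tokens) ** ALPHA_D

# With plenty of data the N term dominates; with a small fixed dataset the
# D term eventually dominates and adding parameters stops helping (overfitting).
print(loss_n_d(1.5e9, 23e9))   # large-ish model, large-ish dataset
print(loss_n_d(1.5e9, 22e6))   # same model, tiny dataset -> data-limited
```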
  • Universality of training: Training curves follow predictable power laws whose parameters are roughly independent of the model size. By extrapolating the early part of a training curve, it’s possible to roughly predict the loss that would be achieved if the model were trained for much longer (a sketch of the corresponding fit follows the figure captions below).
  • Sample efficiency: Large models are more sample-efficient than small models, reaching the same level of performance with fewer optimization steps and using fewer data points.
A series of language model training runs, with models ranging in size from 10^3 to 10^9 parameters (excluding embeddings). Image from https://arxiv.org/pdf/2001.08361.pdf.
Left: The early-stopped test loss L(N, D) varies predictably with the dataset size D and model size N. Right: After an initial transient period, learning curves for all model sizes N can be fit with an equation parameterized in terms of the number of steps (Smin) when training at large batch size. Image from https://arxiv.org/pdf/2001.08361.pdf.
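The learning-curve fit mentioned in the caption takes the form L(N, S_min) ≈ (N_c/N)^α_N + (S_c/S_min)^α_S, with approximate fitted values α_S ≈ 0.76 and S_c ≈ 2.1 × 10^3. A minimal sketch, again with constants that are indicative only:

```python
# Learning-curve fit from the paper (valid after an initial transient):
# L(N, S_min) ≈ (N_c / N)^alpha_N + (S_c / S_min)^alpha_S,
# where S_min is the number of steps when training at large (critical) batch size.

ALPHA_N, N_C = 0.076, 8.8e13
ALPHA_S, S_C = 0.76, 2.1e3    # approximate fitted values

def loss_n_steps(n_params: float, s_min: float) -> float:
    return (N_C / n_params) ** ALPHA_N + (S_C / s_min) ** ALPHA_S

# Extrapolating an early training curve: the same formula, evaluated at a
# larger step count, gives a rough prediction of the loss reached later on.
print(loss_n_steps(1.5e9, 1e4))   # early in training
print(loss_n_steps(1.5e9, 1e6))   # much later
```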
  • Convergence is inefficient: When working within a fixed compute budget C but without any other restrictions on the model size N or available data D, we attain optimal performance by training very large models and stopping significantly short of convergence.
As more compute becomes available, it’s possible to choose how much to allocate towards training larger models, using larger batches, and training for more steps. This image illustrates this for a billion-fold increase in compute. For optimally compute-efficient training, most of the increase should go towards increased model size. A relatively small increase in data is needed to avoid reuse. Of the increase in data, most can be used to increase parallelism through larger batch sizes, with only a very small increase in serial training time required. Image from https://arxiv.org/pdf/2001.08361.pdf.
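The paper quantifies this allocation with approximate power laws: for compute-efficient training, the optimal model size grows roughly as N ∝ C^0.73, the batch size as B ∝ C^0.24, and the number of serial steps only as S ∝ C^0.03. A small sketch of how a budget increase would be split under these fits (exponents are approximate, function name is illustrative):

```python
# Approximate compute-efficient allocation exponents from the paper:
# model size N ∝ C^0.73, batch size B ∝ C^0.24, serial steps S ∝ C^0.03.

EXP_N, EXP_B, EXP_S = 0.73, 0.24, 0.03   # approximate fitted exponents

def scale_factors(compute_increase: float) -> dict:
    """How much to grow model size, batch size, and serial steps
    when the compute budget grows by `compute_increase`x."""
    return {
        "model_size_x": compute_increase ** EXP_N,
        "batch_size_x": compute_increase ** EXP_B,
        "serial_steps_x": compute_increase ** EXP_S,
    }

# Example: a billion-fold (1e9) increase in compute, as in the figure above.
# Roughly: model size grows ~3.7e6x, batch size ~145x, serial steps only ~1.9x.
print(scale_factors(1e9))
```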

Taken together, these results show that language modeling performance improves smoothly and predictably as we appropriately scale up model size, data, and compute. Conversely, we find very weak dependence on many architectural and optimization hyperparameters. It is expected that larger language models will perform better and be more sample efficient than current models.

Considerations

It’s possible to use the relations between N, D, and L to derive the compute scaling, magnitude of overfitting, early stopping step, and data requirements when training large language models.
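For instance, the overfitting analysis translates into an approximate data requirement: the paper estimates that roughly D ≳ (5 × 10^3) · N^0.74 tokens are needed to avoid a significant overfitting penalty. A minimal sketch of this fit, with the constant treated as approximate:

```python
# Approximate data requirement to avoid a significant overfitting penalty,
# per the paper's fit: D ≳ 5e3 * N^0.74 tokens (constant is approximate).

def min_tokens(n_params: float) -> float:
    return 5e3 * n_params ** 0.74

# Example: a 1.5-billion-parameter model needs on the order of 3e10 tokens.
print(f"{min_tokens(1.5e9):.2e} tokens")
```

Note that the requirement grows sub-linearly in N, which is consistent with the earlier observation that compute-efficient training needs only a relatively modest increase in data as models grow.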

The derived scaling relations can be used as a predictive framework. One might interpret these relations as analogs of the ideal gas law, which relates the macroscopic properties of a gas in a universal way, independent of most of the details of its microscopic constituents.

It would be interesting to investigate whether these scaling relations hold in other generative modeling tasks with a maximum likelihood loss, and perhaps in other settings and domains (such as images, audio, and video models) as well.

Conclusions and next steps

In this article, we saw the relations between language model performance and model size, model shape, and compute budget. These relations can be used to derive the optimal compute budget for a given large language model that we want to train or, vice versa, to derive the most compute-efficient model (in terms of size and shape) to train under a fixed compute budget.

A natural next step is reading the full paper, which contains the detailed fits and equations behind the scaling laws summarized here.

Thank you for reading! If you are interested in learning more about NLP, remember to follow NLPlanet on Medium, LinkedIn, and Twitter!
