Two minutes NLP — Scaling Laws for Neural Language Models
Relations between model performance, model size, model shape, and compute budget
Hello fellow NLP enthusiasts! Today we delve into a study on the relations between language model performance and parameters like model scale, model shape, and compute budget. This article is a short summary built from extracts of the paper “Scaling Laws for Neural Language Models”, which I highly recommend reading in full. Enjoy! 😄
A study on language modeling performance
The paper Scaling Laws for Neural Language Models contains a study of empirical scaling laws for language model performance on the cross-entropy loss, focusing on the Transformer architecture.
From the experiments, it turns out that the test loss scales as a power-law with model size, dataset size, and the amount of compute used for training, with some trends spanning more than seven orders of magnitude. This means that simple equations govern the relationships between these variables, and these equations can be used to create an optimally efficient training configuration for training a very large language model. Moreover, it looks like other architectural details such as network width or depth have minimal effects within a wide range.
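To make the power-law idea concrete, here is a small sketch of the separate single-variable laws L(N) and L(D). The exponents and constants below are the approximate values reported in the paper (loss measured in nats on the WebText2 test set); treat them as illustrative rather than exact.

```python
# Approximate fitted constants reported in the paper (loss in nats);
# these are illustrative values, not exact.
ALPHA_N, N_C = 0.076, 8.8e13   # model size law:   L(N) = (N_C / N) ** ALPHA_N
ALPHA_D, D_C = 0.095, 5.4e13   # dataset size law: L(D) = (D_C / D) ** ALPHA_D

def loss_from_params(n_params):
    """Predicted test loss when only model size N is the bottleneck."""
    return (N_C / n_params) ** ALPHA_N

def loss_from_tokens(n_tokens):
    """Predicted test loss when only dataset size D is the bottleneck."""
    return (D_C / n_tokens) ** ALPHA_D

# A power law means doubling N multiplies the loss by the constant factor
# 2 ** -ALPHA_N (roughly a 5% reduction), independent of the starting size.
print(loss_from_params(1.5e9))                              # ~1.5B-parameter model
print(loss_from_params(3.0e9) / loss_from_params(1.5e9))    # constant ratio
```

The constant multiplicative improvement per doubling is exactly what makes these trends straight lines on a log-log plot.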
As can be deduced from the experiments and the derived equations, larger models are significantly more sample-efficient. As a consequence, optimally compute-efficient training involves training very large models on a relatively modest amount of data and stopping significantly before convergence.
Experiments
To study language model scaling, the authors trained a variety of models, varying factors including:
- Model size (N): ranging in size from 768 to 1.5 billion non-embedding parameters.
- Dataset size (D): ranging from 22 million to 23 billion tokens.
- Model shape: including depth, width, attention heads, and feed-forward dimension.
- Context length: 1024 tokens for most runs, with some experiments using shorter contexts.
- Batch size: 2^19 tokens for most runs, with some variations to measure the critical batch size. Training at the critical batch size provides a roughly optimal compromise between time and compute efficiency.
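The critical batch size mentioned above itself follows a power law, not in N or D but in the loss: the paper fits B_crit(L) ≈ B* / L^(1/α_B). The sketch below uses the roughly fitted values B* ≈ 2×10^8 tokens and α_B ≈ 0.21 reported in the paper; both should be treated as approximate.

```python
# Rough fit from the paper: the critical batch size depends only on the
# loss reached, not directly on model size. Constants are approximate.
B_STAR = 2e8     # tokens
ALPHA_B = 0.21

def critical_batch_size(loss):
    """Approximate critical batch size (in tokens) at a given test loss."""
    return B_STAR / loss ** (1 / ALPHA_B)

# Better models (lower loss) tolerate, and benefit from, larger batches:
for loss in (4.0, 3.0, 2.0):
    print(f"L = {loss}: B_crit ~ {critical_batch_size(loss):.3g} tokens")
```

Note the direction of the trend: as training drives the loss down, the critical batch size grows, which is why large-scale runs can ramp up the batch size over time.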
Let’s define the following training variables as well:
- Let L be the test cross-entropy loss.
- Let C be the amount of compute used to train a model.
Key findings
Taking inspiration from section 1.1 of the paper, we summarize the results of the experiments.
- Performance depends strongly on model scale, weakly on model shape: Model performance depends most strongly on scale, which consists of three factors: the number of model parameters N (excluding embeddings), the size of the dataset D, and the amount of compute C used for training. Within reasonable limits, performance depends very weakly on other architectural hyperparameters such as depth vs. width.
- Smooth power laws: Performance has a power-law relationship with each of the three scale factors N, D, C when not bottlenecked by the other two, with trends spanning more than six orders of magnitude.
The paper differentiates between embedding and non-embedding parameters because their size correlates differently with model performance. When including embedding parameters, performance appears to depend strongly on the number of layers in addition to the number of parameters. When excluding embedding parameters, the performance of models with different depths converges to a single trend.
- Universality of overfitting: Performance improves predictably as long as we scale up N and D in tandem, but enters a regime of diminishing returns if either N or D is held fixed while the other increases.
- Universality of training: Training curves follow predictable power-laws whose parameters are roughly independent of the model size. By extrapolating the early part of a training curve, it’s possible to roughly predict the loss that would be achieved if trained for much longer.
- Sample efficiency: Large models are more sample-efficient than small models, reaching the same level of performance with fewer optimization steps and using fewer data points.
- Convergence is inefficient: When working within a fixed compute budget C but without any other restrictions on the model size N or available data D, we attain optimal performance by training very large models and stopping significantly short of convergence.
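The “universality of overfitting” finding comes from the paper’s combined law for N and D, which reduces to the single-variable power laws when one factor is unconstrained and plateaus when the other is held fixed. The sketch below reuses the approximate single-variable constants for illustration; the paper fits the joint constants separately, so the exact numbers differ.

```python
# Combined N, D scaling law from the paper:
#   L(N, D) = ((N_C / N) ** (ALPHA_N / ALPHA_D) + D_C / D) ** ALPHA_D
# Constants are the approximate single-variable fits, reused for illustration.
ALPHA_N, ALPHA_D = 0.076, 0.095
N_C, D_C = 8.8e13, 5.4e13

def loss_joint(n_params, n_tokens):
    """Predicted test loss as a function of both model and dataset size."""
    return ((N_C / n_params) ** (ALPHA_N / ALPHA_D) + D_C / n_tokens) ** ALPHA_D

# Fix D and grow N: the loss improves at first, then flattens toward the
# data-limited floor (D_C / D) ** ALPHA_D -- diminishing returns.
for n in (1e8, 1e10, 1e12):
    print(loss_joint(n, 1e9))
```

Scaling N and D in tandem keeps both terms shrinking together, which is why performance keeps improving predictably in that regime.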
Taken together, these results show that language modeling performance improves smoothly and predictably as we appropriately scale up model size, data, and compute. Conversely, we find very weak dependence on many architectural and optimization hyperparameters. It is expected that larger language models will perform better and be more sample efficient than current models.
Considerations
It’s possible to use the relations between N, D, and L to derive the compute scaling, magnitude of overfitting, early stopping step, and data requirements when training large language models.
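As an example of such a derivation, the paper reports that as the compute budget C grows, the compute-optimal allocation scales roughly as N ∝ C^0.73 (model size), B ∝ C^0.24 (batch size), and S ∝ C^0.03 (serial training steps). The sketch below applies those exponents relative to a hypothetical reference run; the reference model size is an assumption for illustration, not a value from the paper.

```python
# Compute-optimal allocation exponents reported in the paper (approximate):
# model size N ~ C**0.73, batch size B ~ C**0.24, serial steps S ~ C**0.03.
# The reference model size n0 below is a hypothetical anchor, not from the paper.
def scale_allocation(c_ratio, n0=1.5e9):
    """Given c_ratio times the reference compute budget, return the roughly
    compute-optimal model size and the relative batch size / step count."""
    return {
        "model_size": n0 * c_ratio ** 0.73,
        "batch_rel": c_ratio ** 0.24,
        "steps_rel": c_ratio ** 0.03,
    }

# 10x more compute: spend almost all of it on a bigger model,
# a little on a bigger batch, and almost none on more steps.
print(scale_allocation(10.0))
```

The near-zero exponent on steps is the quantitative version of “convergence is inefficient”: extra compute should mostly buy parameters, not longer training.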
The derived scaling relations can be used as a predictive framework. One might interpret these relations as analogs of the ideal gas law, which relates the macroscopic properties of a gas in a universal way, independent of most of the details of its microscopic constituents.
It would be interesting to investigate whether these scaling relations hold in other generative modeling tasks with a maximum likelihood loss, and perhaps in other settings and domains (such as images, audio, and video models) as well.
Conclusions and next steps
In this article, we saw the relations between language model performance and model size, model shape, and compute budget. These relations can be used to derive the optimally efficient compute budget for a fixed large language model that we want to train or, vice versa, to derive the optimally efficient model (in terms of model size and shape) to train given a fixed compute budget.
Possible next steps are:
- Learn about how BigScience used these findings to efficiently train a large language model on a fixed compute budget.
- Learn about huge sparse language models.