Demystify Transformers: A Guide to Scaling Laws

Yu-Cheng Tsai
Published in Sage Ai · 10 min read · Apr 30, 2024
This image was generated using DALL-E

LLM Scaling Laws

It’s no longer surprising that major cloud service providers and numerous companies are investing heavily in clusters of hundreds of thousands of GPUs and leveraging massive amounts of training data to develop Large Language Models (LLMs). Why are larger models considered better? When it comes to LLMs, the loss of next-token prediction (1 token is about 0.75 words) is predictable and smooth. It turns out that you only need to know two key variables to estimate it: the total number of parameters in the model (N) and the number of text tokens used to train the model (D). The number of parameters reflects the model’s capacity to learn and represent complex relationships in the data, while the amount of training data allows the model to learn from a wider variety of examples and contexts.

With just these two variables (i.e. N and D), we can predict the loss of an LLM on next-token prediction tasks. As we train larger models on more data, we continue to see improvements in accuracy. However, the transformer architecture itself has only been modified moderately over the past seven years. We hope future research will arrive at more optimal architectures, given that LLMs are notoriously energy and data hungry.

The scaling laws of LLMs shed light on how a model’s quality evolves with increases in its size, training data volume, and computational resources. These insights are crucial for navigating the complexities of training larger models and making informed decisions about resource allocation.

Hence, understanding these scaling laws is essential for optimizing the development and deployment of LLMs, ensuring that resources are used efficiently to achieve the desired performance level.

Additionally, the risk of overfitting is closely linked to the ratio of the model’s size to the dataset’s size: to avoid overfitting, it is suggested to have D ≥ 5 x 10³ N^(0.74) (see equation 4.4 in OpenAI’s paper). Training curves tend to follow predictable power-law patterns, allowing for rough predictions of future performance based on current training progress.
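As a rough illustration of that threshold, here is a short Python sketch (assuming N counts non-embedding parameters, as in the paper, and using a 70B-parameter model purely as an example):

# A rough illustration of the overfitting threshold D >= (5 x 10^3) * N^0.74,
# using a 70B-parameter model purely as an example.
n_params = 70e9
d_min = 5e3 * n_params ** 0.74
print(f"D >= {d_min:.2e} tokens (~{d_min / 1e12:.2f}T)")  # ~5.3e11, i.e. roughly 0.53T tokens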

Empirical Experiments From OpenAI and DeepMind

Fundamentally, researchers are trying to answer the central question of LLM scaling laws: given a constrained compute budget, measured in FLOPs (floating-point operations), what is the optimal combination of model size and training data size (measured in number of tokens) that yields the lowest loss?

The compute budget comes in various forms; PF-days (10¹⁵ FLOPs/second x 24 hours x 3,600 seconds/hour = 8.64x10¹⁹ FLOPs) is one of the most common units. More details about FLOPs are in Appendix 1.
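As a quick sanity check of that conversion:

# 1 PF-day: 10^15 FLOPs/second sustained for one day
pf_day = 1e15 * 24 * 3600
print(f"1 PF-day = {pf_day:.2e} FLOPs")  # 8.64e+19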

Fig1: LLM performance improves (i.e. test loss decreases) as model size, dataset size, and the amount of compute are scaled in tandem.

Various experiments have been conducted to study the scaling laws of language models, involving different model sizes, dataset sizes, and other factors such as:

  • Model Size (N): Ranges from 768 to 1.5 billion non-embedding parameters.
  • Dataset Size (D): Varies from 22 million to 23 billion tokens.
  • Model shape: Includes variations in depth, width, number of attention heads, and feed-forward dimension.
  • Context Length: primarily 1024, with some experiments using shorter lengths. Note that one may see context length, context window size, or block size in various papers or code implementations. They all refer to the same thing: the longest sequence of contiguous tokens that a transformer sees during training and can process at inference time. The code in this blog post uses block size.

Training variables:

  • Test Cross-Entropy Loss (L): Represents the model’s performance.
  • Compute (C): The computational resources used for training the model.

Even though these experiments are almost four years old, they provide an insightful understanding of LLM training requirements. When OpenAI researchers studied LLMs, two significant findings stood out:

Fig2: Model loss depends mildly on model shape when the total number of non-embedding parameters N is fixed

Firstly, the impact of scale on model loss is more pronounced than the influence of the model’s architectural structure. Scale refers to the number of parameters (N), the size of the dataset (D), and the computational resources (C, measured in FLOPs, as explained in Appendix 1) used for training. These factors collectively have a more substantial effect on how far the loss can be reduced than architectural details (i.e. changing the feed-forward ratio, aspect ratio, attention head dimension, etc.), as shown in Fig2. This means that increasing the model’s size, using more extensive datasets, and allocating more computational power are likely to yield better results than merely tweaking the model’s structure.

Secondly, as suggested in Fig1, there is a power-law relationship between the performance of the model and each of the scaling factors (N, D, C) when they are not constrained by one another, as demonstrated by the three cases in the Eq1 table:

  1. Case 1: Parameters = N_opt (compute-efficient model size), Data = D_opt (compute-efficient number of data tokens), Compute = C_min (compute-efficient compute budget)
  2. Case 2: Parameters = infinite, Data = D, Compute = early stopping
  3. Case 3: Parameters = N, Data = infinite, Compute = infinite
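For reference, these three limiting cases correspond to the individual power laws reported in the OpenAI paper, roughly:

L(C_min) ≈ (C_c / C_min)^0.050
L(D) ≈ (D_c / D)^0.095
L(N) ≈ (N_c / N)^0.076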

This relationship holds across a wide range of values, indicating a consistent pattern where performance improves predictably as any of these factors increase. In other words, as we scale up the model size, dataset size, or computational resources, we can expect a corresponding and predictable improvement in the model’s performance, following a power-law trend.

Eq1: Scaling law proposed by OpenAI researchers
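For readers without the equation image, Eq1 is the combined form from the OpenAI paper (their equation 1.5):

L(N, D) = [(N_c / N)^(α_N / α_D) + (D_c / D)]^α_D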

Empirically, researchers from OpenAI fit Eq1 (i.e. the case where N and D are finite, with early stopping and a fixed batch size) and estimated N_c = 8.8x10¹³, α_N = 0.076, α_D = 0.095, and D_c = 5.4x10¹³. Using Llama 3 as an example, with 70B parameters and 15T tokens, the predicted loss is L(70x10⁹, 15x10¹²) = 1.72.
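Here is a minimal Python sketch that evaluates Eq1 with these fitted constants (the Llama 3 figures are used purely as an illustration):

# Evaluating Eq1 with the constants quoted above (Llama 3 figures used as an illustration).
N_C, D_C = 8.8e13, 5.4e13
ALPHA_N, ALPHA_D = 0.076, 0.095

def openai_loss(n_params: float, n_tokens: float) -> float:
    """Predicted test loss L(N, D) from Eq1."""
    return ((N_C / n_params) ** (ALPHA_N / ALPHA_D) + D_C / n_tokens) ** ALPHA_D

print(round(openai_loss(70e9, 15e12), 2))  # ~1.72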

Sample-Efficient LLMs

Fig3: LLM’s test loss vs. dataset size or compute, by model size

Large models are more sample-efficient than small models, reaching the same level of performance with fewer optimization steps and fewer data points. For a given number of tokens processed, larger models achieve lower test loss, as shown in Fig3.

Fig4: DeepMind suggests a modified scaling law, indicating that models such as Gopher, GPT-3, and Megatron-Turing NLG could be trained with far fewer parameters. (Left) Optimal number of tokens and parameters for a given FLOP budget. (Right) Overlaid predictions of the modified scaling law and OpenAI’s scaling law.

DeepMind published another paper detailing approaches to the scaling law for training compute-optimal LLMs that differ from OpenAI’s. They trained over 400 LLMs and found that, for compute-optimal training, the model size and dataset size (number of tokens) should be scaled equally.

They applied three approaches to fitting the scaling law and overlaid their findings with OpenAI’s, as shown on the right of Fig4. All three approaches suggest that the same model performance can be achieved with a much smaller model. That is, for a given compute budget, one can reach the same performance with a smaller model trained on more tokens, as shown on the left of Fig4.

Fig5: Parametric loss function fit: IsoLoss contours (left) and IsoFLOPs slices (right)

Here comes my favorite part: when the researchers at DeepMind modeled the loss as a function of model size and number of tokens, and used the constraint FLOPs(N, D) ~ 6ND suggested by OpenAI’s researchers, an interesting way of interpreting the LLM scaling law emerged: plotting IsoLoss contours or IsoFLOPs slices, as shown in Fig5. To understand the IsoLoss contours plot on the left, let’s use first principles. For a given IsoLoss contour (the black lines), we aim to use the least compute, meaning the fewest FLOPs. When you trace the point of fewest FLOPs on each IsoLoss contour, these points can be connected into a blue line. This line is known as the efficient frontier, a concept familiar to those who have studied operations research in finance or business school.

One can use the blue line to extrapolate the optimal model size and predicted loss for a larger compute budget, or in other words, more training FLOPs. For example, for the compute budget used to train Gopher, the projected optimal model size is 40B parameters.

Fig6: Estimated optimal training FLOPs and training tokens for various model sizes

For compute-optimal training, DeepMind suggests having ≥ 20 training tokens for every model parameter, as shown in Fig6. To put this into perspective, Llama 3, released by Meta, has a roughly 215:1 ratio of tokens to parameters (15T tokens for 70B parameters).
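Here is a quick sketch of that rule of thumb, using Llama 3 70B as the example:

# The ~20 tokens-per-parameter rule of thumb, with Llama 3 70B as an illustration.
n_params = 70e9
compute_optimal_tokens = 20 * n_params
actual_tokens = 15e12
print(f"compute-optimal: ~{compute_optimal_tokens / 1e12:.1f}T tokens")         # ~1.4T
print(f"actual ratio: ~{actual_tokens / n_params:.0f} tokens per parameter")    # ~214, i.e. roughly 215:1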

Eq2: Parametric loss function fitted by DeepMind’s scientists
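For readers without the equation image, Eq2 is the parametric form from the DeepMind paper:

L(N, D) = E + A / N^α + B / D^β

where the paper reports α ≈ 0.34 and β ≈ 0.28.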

Unlike OpenAI, scientists at DeepMind use a different scaling equation and fit Eq2 empirically. They learned that E = 1.69, A = 406.4, and B = 410.7. Using Llama 3 as an example, a model with 70B parameters and 15T tokens, the loss is L(70x10⁹, 15x10¹²) = 1.86, which is a bit higher than the estimate of 1.72 from Eq1.
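As before, here is a minimal sketch evaluating Eq2 in Python, using the constants and exponents quoted above (Llama 3 figures as the example):

# Evaluating Eq2 with DeepMind's fitted constants (alpha = 0.34, beta = 0.28 as reported in the paper).
E, A, B = 1.69, 406.4, 410.7
ALPHA, BETA = 0.34, 0.28

def deepmind_loss(n_params: float, n_tokens: float) -> float:
    """Predicted loss L(N, D) from Eq2."""
    return E + A / n_params ** ALPHA + B / n_tokens ** BETA

print(round(deepmind_loss(70e9, 15e12), 2))  # ~1.86, vs ~1.72 from Eq1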

Appendix 1: Computation Budget Measured in FLOPs (Floating point operations)

The compute resources or computational complexity of a model are measured in FLOPs, or floating-point operations. FLOPs are used to estimate the amount of computational resources required to train and run a model, as shown on the y-axis of Fig7.

To put it into perspective, it is estimated that OpenAI used 133 billion petaFLOPs (1.33x10²⁶ FLOPs) to train GPT-4. When calculating FLOPs for large language models, it is important to consider all the operations involved in the training process, including those related to embedding matrices. Here is a breakdown of the FLOPs computation for a transformer model, taken from litgpt: flops_per_param estimates the forward-pass FLOPs, and estimate_flops scales that up for training.

# reference: https://github.com/Lightning-AI/litgpt/blob/410a7126f82ea550d4a43dab89367547b073b5a3/litgpt/utils.py#L321

# note: `num_parameters` and the `GPT` model class are defined elsewhere in litgpt (see the reference above)

def flops_per_param(max_seq_length: int, n_layer: int, n_embd: int, n_params: int) -> int:
    flops_per_token = 2 * n_params  # each parameter is used for a MAC (2 FLOPs) per network operation
    # this assumes that all samples have a fixed length equal to the block size
    # which is most likely false during finetuning
    flops_per_seq = flops_per_token * max_seq_length
    attn_flops_per_seq = n_layer * 2 * 2 * (n_embd * (max_seq_length**2))
    return flops_per_seq + attn_flops_per_seq


def estimate_flops(model: "GPT", training: bool) -> int:
    """Measures estimated FLOPs for MFU.

    Refs:
        * https://ar5iv.labs.arxiv.org/html/2205.05198#A1
        * https://ar5iv.labs.arxiv.org/html/2204.02311#A2
    """
    # using all parameters for this is a naive over estimation because not all model parameters actually contribute to
    # this FLOP computation (e.g. embedding, norm). For this reason, the result will be higher by a fixed percentage
    # (~10%) compared to the measured FLOPs, making those lower but more realistic.
    # For a proper estimate, this needs a more fine-grained calculation as in Appendix A of the paper.
    n_trainable_params = num_parameters(model, requires_grad=True)
    trainable_flops = flops_per_param(
        model.max_seq_length, model.config.n_layer, model.config.n_embd, n_trainable_params
    )
    # forward + backward + gradients (assumes no gradient accumulation)
    ops_per_step = 3 if training else 1
    n_frozen_params = num_parameters(model, requires_grad=False)
    frozen_flops = flops_per_param(model.max_seq_length, model.config.n_layer, model.config.n_embd, n_frozen_params)
    # forward + backward
    frozen_ops_per_step = 2 if training else 1
    return ops_per_step * trainable_flops + frozen_ops_per_step * frozen_flops

It is important to note that flops_per_param only covers the forward pass; during training, estimate_flops accounts for the backward pass by multiplying by roughly 3 (one forward pass plus a backward pass that typically costs about twice as much).

If this is getting too complicated, the rule of thumb is that the compute requirement C is ~6ND (or ~8ND for distributed training), where N is the number of parameters of the transformer and D is the training data size, measured in tokens. C_forward is ~2ND; C_backward is ~4ND.
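Applying the 6ND rule of thumb to Llama 3 (70B parameters, 15T tokens) as a quick sketch:

# The C ~ 6*N*D rule of thumb applied to Llama 3 (70B parameters, 15T tokens).
n_params, n_tokens = 70e9, 15e12
train_flops = 6 * n_params * n_tokens
print(f"C ~ {train_flops:.1e} FLOPs")  # ~6.3e+24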

Fig7: PFLOPs are growing to catch up with the scaling law of LLMs

Appendix 2: Computation Speed Measured in FLOPS (Floating point operations per second)

FLOPS, or floating-point operations per second, is the speed of compute. Fig8 shows a table of Nvidia A100 and H100 GPU performance.

One can estimate the number of GPUs required to train an LLM based on this table and the compute requirement (i.e. the 6ND estimate). For example, GPT-4 requires 133 billion petaFLOPs, which is 1.33 x 10²⁶ FLOPs (1.7 trillion parameters and 13 trillion tokens give 6 x 1.7 x 10¹² x 13 x 10¹² ≈ 1.33 x 10²⁶ FLOPs). Assuming the computation runs on FP16 Tensor Cores with sparsity, an A100 delivers 624 TFLOPS (6.24 x 10¹⁴ floating-point operations per second). It is reported that OpenAI used 25,000 A100s. Hence, 1.33x10²⁶ / (6.24x10¹⁴ x 25,000) = 8,525,641 seconds, or about 98.7 days, which roughly aligns with the report and post. Good, the math works out! :)
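The same back-of-the-envelope estimate in Python, assuming the reported figures and peak sparse FP16 throughput (i.e. 100% utilization):

# Reproducing the estimate above; assumes reported figures and peak sparse FP16 throughput.
total_flops = 1.33e26            # ~6 x 1.7e12 params x 13e12 tokens
a100_flops_per_sec = 624e12      # A100, FP16 Tensor Core with sparsity
n_gpus = 25_000
seconds = total_flops / (a100_flops_per_sec * n_gpus)
print(f"{seconds:,.0f} s ~= {seconds / 86400:.1f} days")  # ~8,525,641 s ~= 98.7 days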

Fig8: Table of the compute speed of Nvidia GPUs, measured in TFLOPS (10¹² FLOPS)

Summary

We cover various scaling laws of LLMs, examining how the loss of models changes with increased training data and parameter count. The discussion includes an explanation of the IsoLoss contours and IsoFLOPs slices used to interpret LLM scaling laws, providing insights into optimizing computational resources.

Finally, we discuss the concepts of FLOPs and FLOPS, which measure computational amount and speed, respectively. Using GPT-4 and Llama 3 as examples, we clarify the complexities involved in training LLMs. In our next blog post, we will explore strategies for training LLMs at scale using techniques such as model parallelism and data parallelism, as well as the scaling laws of fine-tuning LLMs, providing further insights into efficient large-scale training techniques. We hope you find this material useful and look forward to hearing your feedback.

Acknowledgement

We thank our colleagues, including Josh Frazier, Srijith Rajamohan, and Jeremiah Edwards for the discussion and interpretation of these papers.

Yu-Cheng Tsai
Sage Ai

A data scientist passionate about innovative AI products. Ph.D. in MAE from Princeton University; works at Sage AI. https://www.linkedin.com/in/yu-cheng-tsai/