Training at Scale: Chinchilla Scaling Laws for Compute-Optimal Training of LLMs

Zain ul Abideen
6 min read · Jun 26, 2023


Exploring Chinchilla’s scaling laws and Meta’s LLaMA model

Introduction

In this blog post, I will discuss a paper from Google DeepMind in which the authors run a large set of experiments on training large language models to find the relationship between model size, compute budget, and the number of training tokens. I will also cover Meta’s LLaMA model, which was trained using the results of DeepMind’s experiments. This blog is part of my series on large language models; you can view the previous post on advanced prompting techniques here: Navigating the Prompt Space: Techniques for Effective Prompt Exploration. The Chinchilla paper builds heavily on OpenAI’s earlier scaling laws for LLMs, so I will cover the results of that paper first.

Scaling Laws for LLMs by OpenAI

In 2020, OpenAI published the paper “Scaling Laws for Neural Language Models”. They found that the loss scales as a power law with model size, dataset size, and the amount of compute used for training, while network depth and width have minimal effect within a wide range. These relationships led them to conclude that “larger models are significantly more sample efficient, such that optimally compute-efficient training involves training very large models on a relatively modest amount of data and stopping significantly before convergence.”
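Concretely, the Kaplan et al. paper fits the test loss with simple power laws in each factor when the other two are not a bottleneck. The constants and exponents are fit empirically; only the functional form matters for this discussion:

```latex
% Power-law fits from "Scaling Laws for Neural Language Models" (Kaplan et al., 2020).
% N = non-embedding parameters, D = dataset size in tokens, C = training compute.
L(N) = \left(\frac{N_c}{N}\right)^{\alpha_N}, \qquad
L(D) = \left(\frac{D_c}{D}\right)^{\alpha_D}, \qquad
L(C) = \left(\frac{C_c}{C}\right)^{\alpha_C}
```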

For optimal performance, all three factors must be scaled up alongside each other.

Training Compute-Optimal LLMs by DeepMind

This paper was published in 2022. Its main goal was to find the relationship between three factors: model size, number of training tokens, and compute budget. The authors conclude that current LLMs such as GPT-3 (175B), Gopher (280B), and Megatron-Turing NLG (530B) are significantly undertrained: model sizes kept increasing while the amount of training data stayed roughly constant at around 300 billion tokens. For compute-optimal training, they argue, the number of training tokens and the model size must be scaled equally. To establish this, they trained more than 400 language models ranging from 70 million to over 16 billion parameters on 5 to 500 billion tokens.
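As a rough sanity check (this is not the paper’s fitting procedure, just a back-of-envelope sketch), combining the standard approximation C ≈ 6ND for training compute with the roughly 20-tokens-per-parameter ratio implied by the paper’s results almost exactly reproduces Chinchilla’s configuration for Gopher’s budget:

```python
# Back-of-envelope compute-optimal allocation in the spirit of the Chinchilla result.
# Assumptions (not taken verbatim from the paper): training compute C ≈ 6 * N * D FLOPs,
# and the commonly quoted rule of thumb of ~20 training tokens per parameter.

def compute_optimal_allocation(flops_budget: float, tokens_per_param: float = 20.0):
    """Return (params, tokens) that roughly exhaust a FLOP budget.

    Solves C = 6 * N * D with D = tokens_per_param * N, i.e.
    N = sqrt(C / (6 * tokens_per_param)) and D = tokens_per_param * N.
    """
    n_params = (flops_budget / (6.0 * tokens_per_param)) ** 0.5
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

if __name__ == "__main__":
    gopher_budget = 5.76e23  # FLOPs, Gopher's training budget as quoted in the paper
    n, d = compute_optimal_allocation(gopher_budget)
    print(f"params ≈ {n / 1e9:.0f}B, tokens ≈ {d / 1e12:.2f}T")
    # params ≈ 69B, tokens ≈ 1.39T -- close to Chinchilla's 70B / 1.4T
```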

Chinchilla outperforms Gopher and the other large models

After establishing the relationship between the three factors, they trained a new LLM called Chinchilla, which uses the same compute budget as the 280B Gopher but has 70B parameters and roughly 4 times more training data. Chinchilla outperforms Gopher (280B), GPT-3 (175B), Jurassic-1 (178B), and Megatron-Turing NLG (530B). This result contradicts OpenAI’s “Scaling Laws for LLMs”: relatively smaller models can give better performance if they are trained on more data. Smaller models are also easier to fine-tune and have lower latency at inference. Note that, to be compute-optimal, these models are still not trained to their lowest possible loss.

Current LLMs

The central question of their research is: “Given a fixed FLOPs budget, how should one trade off model size and the number of training tokens?” They try three different approaches to answer it, all of which assume a power-law relationship between compute and the optimal model size.
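Formally, the paper frames this as a constrained optimization and models the optima as power laws in compute. The three approaches below are different ways of estimating the exponents a and b, which all come out close to 0.5 (i.e., equal scaling):

```latex
% Optimization problem posed in the Chinchilla paper: minimize pre-training loss
% under a fixed compute budget C, then model the optima as power laws in C.
N_{opt}(C),\; D_{opt}(C) \;=\;
\operatorname*{arg\,min}_{N,\,D \;\mathrm{s.t.}\; \mathrm{FLOPs}(N,D)=C} L(N, D),
\qquad N_{opt} \propto C^{a}, \quad D_{opt} \propto C^{b}
```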

Approach 1: Fix model sizes and vary number of training tokens

In the first approach, they fix a set of model sizes (75M, 250M, 500M, 1B, 2.5B, 5B, 10B) and vary the number of training tokens for each, which yields training curves spanning a range of FLOP counts. For each compute budget, they pick the model size that reaches the lowest loss (the training curve envelope) and fit a power law to these points. This predicts that the optimal model size for Gopher’s compute budget (5.76 × 10^23 FLOPs) is 67B parameters trained on about 1.5 trillion tokens.
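The sketch below illustrates the envelope idea with a hypothetical table of training runs (the flops/params/loss arrays are assumptions for illustration, not the paper’s actual data pipeline): for each compute level, keep the run with the lowest loss, then fit a line in log-log space to recover the power-law exponent.

```python
# Illustrative sketch of the "training curve envelope" idea behind Approach 1.
import numpy as np

def fit_envelope_exponent(flops, params, loss, num_bins=40):
    """For each FLOP bin, keep the model size achieving the lowest loss,
    then fit log(N_opt) = a * log(C) + const to estimate the exponent a."""
    flops, params, loss = map(np.asarray, (flops, params, loss))
    bins = np.logspace(np.log10(flops.min()), np.log10(flops.max()), num_bins)
    idx = np.digitize(flops, bins)

    envelope_c, envelope_n = [], []
    for b in np.unique(idx):
        mask = idx == b
        best = np.argmin(loss[mask])           # point on the loss envelope
        envelope_c.append(flops[mask][best])
        envelope_n.append(params[mask][best])  # best model size for this budget

    a, _ = np.polyfit(np.log(envelope_c), np.log(envelope_n), deg=1)
    return a  # Approach 1 in the paper finds an exponent close to 0.5
```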

Training curve envelope

Approach 2: IsoFLOP profiles

In the second approach, they vary the model size for a fixed set of 9 training FLOP counts (ranging roughly from 10^18 to 10^21 FLOPs), answering the question “For a given FLOP budget, what is the optimal parameter count?” For a model trained on D tokens, they use a cosine learning-rate schedule that decays by approximately 10× over those D tokens. Each IsoFLOP profile has a clear loss minimum, and fitting these minima suggests that the optimal model size for Gopher’s compute budget is 63B parameters trained on 1.4 trillion tokens.
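A minimal sketch of a single IsoFLOP profile fit, assuming hypothetical params/loss arrays measured at one fixed FLOP budget; the paper likewise fits a parabola to loss as a function of model size and reads off the minimum:

```python
# Sketch of one IsoFLOP profile fit (Approach 2): at a fixed FLOP budget,
# fit a parabola to loss as a function of log(model size) and take its minimum.
import numpy as np

def isoflop_optimum(params, loss):
    """Fit loss ≈ c2*x^2 + c1*x + c0 with x = log(params); return argmin model size."""
    x = np.log(np.asarray(params, dtype=float))
    c2, c1, _ = np.polyfit(x, np.asarray(loss, dtype=float), deg=2)
    x_min = -c1 / (2.0 * c2)      # vertex of the parabola
    return float(np.exp(x_min))   # optimal parameter count for this FLOP budget

# Repeating this for each of the 9 FLOP budgets gives (C, N_opt) pairs,
# which are then fit with N_opt ∝ C^a, just as in Approach 1.
```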

IsoFLOP curves.

Approach 3: Fitting a parametric loss function

For the third approach, they model the final losses from the above two approaches as a parametric function of the number of model parameters and the number of training tokens. They propose a functional form and fit its parameters by minimizing a Huber loss; this fit estimates the optimal model size for the Gopher FLOP budget to be 40B parameters.
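The functional form they propose decomposes the loss into an irreducible entropy term plus two power-law corrections, one for finite model size and one for finite data; the constants E, A, B, α, and β are the quantities estimated by minimizing the Huber loss over the observed training runs:

```latex
% Parametric loss fit in Approach 3: E approximates the irreducible loss of natural
% text, A/N^alpha the error due to finite model size, and B/D^beta the error due to
% training on a finite number of tokens.
\hat{L}(N, D) \;=\; E \;+\; \frac{A}{N^{\alpha}} \;+\; \frac{B}{D^{\beta}}
```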

Parametric fit

All three approaches suggest that as the compute budget increases, model size and the amount of training data should be increased in approximately equal proportions. The first and second approaches yield very similar predictions for the optimal model size, while the third approach suggests that somewhat smaller models will be optimal at larger compute budgets. The Chinchilla model trained using these results was trained on MassiveText with the AdamW optimizer and a SentencePiece tokenizer. Interestingly, a model trained with AdamW overtakes one trained with Adam at around 80% of the cosine learning-rate cycle.

LLaMA models

Meta released a collection of models ranging from 7B to 65B parameters, trained efficiently in line with Chinchilla’s scaling laws. These relatively small models are cheaper at inference and were trained on publicly available datasets. LLaMA-13B outperforms GPT-3 on most benchmarks despite being 10× smaller. These models are not the cheapest to train, but they are faster at inference.

Pre-training data

They use byte-pair encoding for tokenization. The 7B and 13B parameter models are trained on 1T tokens, while the 33B and 65B parameter models are trained on 1.4T tokens. The architecture follows the original Transformer, with a few changes drawn from PaLM, GPT-3, and other models: pre-normalization, the SwiGLU activation function, and rotary positional embeddings in place of absolute positional embeddings. They use the AdamW optimizer and an efficient implementation of causal multi-head attention. For efficiency, they also reduce activation recomputation during the backward pass by saving the activations that are expensive to compute.
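To make two of these architectural choices concrete, here is a minimal, illustrative PyTorch sketch of pre-normalization (LLaMA uses RMSNorm) and a SwiGLU feed-forward block. This is not Meta’s implementation, and the hidden sizes are arbitrary:

```python
# Minimal sketch of two LLaMA architectural choices mentioned above:
# RMSNorm applied before the sub-layer (pre-normalization) and a SwiGLU feed-forward.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        # Normalize by the root-mean-square of the features (no mean subtraction).
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * x * rms

class SwiGLUFeedForward(nn.Module):
    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.w_gate = nn.Linear(dim, hidden_dim, bias=False)
        self.w_up = nn.Linear(dim, hidden_dim, bias=False)
        self.w_down = nn.Linear(hidden_dim, dim, bias=False)

    def forward(self, x):
        # SwiGLU: SiLU-gated linear unit, as used in PaLM and LLaMA.
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))

x = torch.randn(2, 16, 512)                      # (batch, sequence, hidden)
block = nn.Sequential(RMSNorm(512), SwiGLUFeedForward(512, 1376))
print(block(x).shape)                            # torch.Size([2, 16, 512])
```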

Zero-shot performance on Common Sense Reasoning tasks.

Other benchmarks on which the LLaMA models have been evaluated are covered in the paper. These open-source, state-of-the-art foundation models show that relatively small models can outperform much larger ones if they are trained efficiently and for longer.

Closing Remarks

In conclusion, applying the Chinchilla scaling laws to the training of large language models has been a breakthrough in optimizing compute utilization and achieving efficient training. By recognizing that models should be trained for longer on a larger number of tokens, the Chinchilla scaling laws offer a compute-optimal approach that improves the performance and capabilities of large language models. The LLaMA models stand as a testament to the effectiveness of this approach, having been trained on 1 trillion tokens or more while remaining efficient. In the next blog post, I will cover language models like Alpaca, Vicuna, and WizardLM. What these models have in common is that all three are fine-tuned versions of LLaMA; I will also explain how they differ in the data they collect for efficient fine-tuning.

Thank you for reading!

Follow me on LinkedIn!
