Chinchilla: Training Compute-Optimal Large Language Models
The paper investigates the optimal model size and number of training tokens for a transformer language model under a fixed compute budget. The authors find that current large language models are significantly undertrained. By training over 400 language models of varying sizes on varying numbers of tokens, they show that for compute-optimal training, model size and the number of training tokens should be scaled equally: for every doubling of model size, the number of training tokens should also double. They test this hypothesis by training a model called Chinchilla, which uses the same compute budget as Gopher but with 70B parameters and…
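The equal-scaling rule can be sketched numerically. The snippet below is a minimal sketch, not the paper's fitting code: it assumes the common approximation that training cost is C ≈ 6·N·D FLOPs (for N parameters and D tokens) and the roughly 20-tokens-per-parameter ratio implied by the paper's fits, then solves for the compute-optimal N and D at a given budget.

```python
import math

def compute_optimal(budget_flops, tokens_per_param=20):
    """Estimate compute-optimal parameter count N and token count D
    for a training budget C (in FLOPs), assuming C ~= 6 * N * D and
    D ~= tokens_per_param * N (an approximation of the paper's fits)."""
    # C = 6 * N * D and D = r * N  =>  C = 6 * r * N**2  =>  N = sqrt(C / (6 * r))
    n_params = math.sqrt(budget_flops / (6 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# Gopher-scale budget: ~6 * 280e9 params * 300e9 tokens ~= 5.04e23 FLOPs.
# Under these assumptions the optimum lands near Chinchilla's 70B / 1.4T.
n, d = compute_optimal(5.04e23)
print(f"{n / 1e9:.0f}B params, {d / 1e12:.2f}T tokens")
```

Note that both N and D scale as the square root of the budget here, which is exactly the "scale equally" claim: increasing compute 4× doubles both the optimal model size and the optimal token count.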