Chinchilla: Training Compute-Optimal Large Language Models

Isaac Kargar · Published in AIGuys · Jan 14, 2023 · 5 min read

The paper investigates the optimal model size and number of training tokens for a transformer language model under a given compute budget. The authors find that current large language models are significantly undertrained: after training over 400 language models across a range of model sizes and token counts, they conclude that for compute-optimal training, the model size and the number of training tokens should be scaled equally. They test this hypothesis by training a model called Chinchilla, which uses the same compute budget as Gopher but has 70B parameters and is trained on 4x more data. Chinchilla outperforms Gopher, GPT-3, Jurassic-1, and Megatron-Turing NLG on a range of downstream evaluation tasks. Because it is smaller, Chinchilla also requires less compute for fine-tuning and inference, and it reaches a state-of-the-art average accuracy of 67.5% on the MMLU benchmark, a 7% improvement over Gopher.
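As a rough back-of-the-envelope illustration of this equal-scaling rule, here is a minimal sketch in Python. It assumes the common C ≈ 6 · N · D approximation for training FLOPs and the roughly 20-tokens-per-parameter ratio implied by Chinchilla's own configuration (70B parameters, about 1.4T tokens); neither of these constants, nor the function name, comes from the article itself.

```python
# Minimal sketch of the "scale parameters and tokens equally" rule.
# Assumptions (not from the article): training compute C ≈ 6 * N * D FLOPs,
# and the ~20 tokens-per-parameter ratio implied by Chinchilla's own
# configuration (70B parameters, about 1.4T tokens).

def compute_optimal_allocation(flops_budget, tokens_per_param=20.0):
    """Split a FLOPs budget C into a parameter count N and a token count D.

    Solves C = 6 * N * D with D = tokens_per_param * N, so
    N = sqrt(C / (6 * tokens_per_param)) and D = tokens_per_param * N.
    Both N and D then grow as the square root of the budget, i.e. in
    equal proportions.
    """
    n_params = (flops_budget / (6.0 * tokens_per_param)) ** 0.5
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# ~5.76e23 FLOPs is roughly the Gopher/Chinchilla budget (6 * 70e9 * 1.4e12).
for budget in (1e21, 1e22, 5.76e23):
    n, d = compute_optimal_allocation(budget)
    print(f"C={budget:.2e} FLOPs -> N ≈ {n / 1e9:.1f}B params, D ≈ {d / 1e9:.0f}B tokens")
```

With the largest budget, the sketch lands close to Chinchilla's 70B-parameter, 1.4T-token setup, whereas Gopher spent a comparable budget on a 280B-parameter model trained on far fewer tokens.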

The paper presents three different approaches to investigating the relationship between model size and the number of training tokens under a fixed FLOPs budget. All three approaches start by training a range of models, varying both model size and the number of training tokens, and then use the resulting training curves to fit an empirical estimate of how the two should scale. The predictions from all three approaches agree: as the compute budget grows, the parameter count and the number of training tokens should be increased in roughly equal proportions.
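One of the paper's approaches fits a parametric loss of the form L(N, D) = E + A/N^α + B/D^β and then, for each FLOPs budget, picks the (N, D) pair along the C ≈ 6 · N · D constraint with the lowest predicted loss. The sketch below illustrates that idea; the coefficient values are placeholders chosen for illustration, not the paper's fitted numbers.

```python
import numpy as np

# Illustrative sketch of the parametric approach: assume a fitted loss surface
# L(N, D) = E + A / N**alpha + B / D**beta, and for each FLOPs budget C ≈ 6 * N * D
# pick the split with the lowest predicted loss. Placeholder coefficients only.
E, A, B = 1.7, 400.0, 400.0
alpha, beta = 0.34, 0.28

def fitted_loss(n_params, n_tokens):
    """Predicted final training loss for n_params parameters and n_tokens tokens."""
    return E + A / n_params**alpha + B / n_tokens**beta

def optimal_split(flops_budget, n_grid=np.logspace(8, 12, 4000)):
    """Numerically minimize the fitted loss along the constraint C = 6 * N * D."""
    d_grid = flops_budget / (6.0 * n_grid)  # token count implied by each candidate N
    best = int(np.argmin(fitted_loss(n_grid, d_grid)))
    return n_grid[best], d_grid[best]

for c in (1e21, 1e22, 1e23):
    n, d = optimal_split(c)
    print(f"C={c:.0e} FLOPs -> N ≈ {n:.2e} params, D ≈ {d:.2e} tokens")
```

Solving the same constrained minimization in closed form gives N_opt ∝ C^(β/(α+β)) and D_opt ∝ C^(α/(α+β)); when α and β are close, both exponents sit near 0.5, which is exactly the equal-proportions conclusion above.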

Isaac Kargar · Co-Founder and CIO @ Resoniks | Ph.D. candidate at the Intelligent Robotics Group at Aalto University | https://kargarisaac.github.io/