Chinchilla: Training Compute-Optimal Large Language Models
The paper investigates the optimal model size and number of training tokens for a transformer language model under a fixed compute budget. The authors find that current large language models are significantly undertrained. By training over 400 language models of varying sizes on varying numbers of tokens, they show that for compute-optimal training, model size and the number of training tokens should be scaled equally: for every doubling of model size, the number of training tokens should also double. They test this hypothesis by training a model called Chinchilla, which uses the same compute budget as Gopher but with 70B parameters and…
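The equal-scaling rule can be sketched numerically. The snippet below is a minimal sketch, not the paper's fitting code: it assumes the common approximation that training cost is C ≈ 6·N·D FLOPs (for N parameters and D tokens) and the roughly 20-tokens-per-parameter ratio implied by the paper's fits, then solves for the compute-optimal N and D at a given budget.

```python
import math

def compute_optimal(budget_flops, tokens_per_param=20):
    """Estimate compute-optimal parameter count N and token count D
    for a training budget C (in FLOPs), assuming C ~= 6 * N * D and
    D ~= tokens_per_param * N (an approximation of the paper's fits)."""
    # C = 6 * N * D and D = r * N  =>  C = 6 * r * N**2  =>  N = sqrt(C / (6 * r))
    n_params = math.sqrt(budget_flops / (6 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# Gopher-scale budget: ~6 * 280e9 params * 300e9 tokens ~= 5.04e23 FLOPs.
# Under these assumptions the optimum lands near Chinchilla's 70B / 1.4T.
n, d = compute_optimal(5.04e23)
print(f"{n / 1e9:.0f}B params, {d / 1e12:.2f}T tokens")
```

Note that both N and D scale as the square root of the budget here, which is exactly the "scale equally" claim: increasing compute 4× doubles both the optimal model size and the optimal token count.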