Training Compute-Optimal Large Language Models: DeepMind’s 70B Parameter Chinchilla Outperforms 530B Parameter Megatron-Turing
Today’s extreme-scale language models have demonstrated astounding performance on natural language processing tasks, attributed mainly to their ever-expanding size, which can surpass 500 billion parameters. But as such models have scaled dramatically in recent years, the amount of training data used to train them has not kept pace, leaving current large models, the DeepMind researchers argue, significantly undertrained for their compute budgets.
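
As a rough sketch of the paper's headline relation: the approximate training cost C ≈ 6ND (FLOPs for a model with N parameters trained on D tokens) and the finding that the compute-optimal N and D both scale roughly as C^0.5 are the paper's; the worked numbers below simply plug in Gopher's and Chinchilla's published sizes to show the two models sit at roughly the same compute budget.

  C \approx 6\,N\,D, \qquad N_{\mathrm{opt}}(C) \propto C^{0.5}, \qquad D_{\mathrm{opt}}(C) \propto C^{0.5}

  % Gopher:     N = 280e9, D = 0.3e12  ->  C ≈ 6 · 280e9 · 0.3e12 ≈ 5.0e23 FLOPs
  % Chinchilla: N = 70e9,  D = 1.4e12  ->  C ≈ 6 · 70e9  · 1.4e12 ≈ 5.9e23 FLOPs

In other words, for a roughly fixed budget, Chinchilla trades a 4x smaller model for roughly 4x more training tokens, which is how a 70B-parameter model can outperform much larger but undertrained ones.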