Is it possible for smaller language models to outperform larger ones?

Mahmoud Bidry
9 min read · Feb 21, 2024

Large Language Models (LLMs) have revolutionized the field of artificial intelligence, demonstrating an unprecedented ability to understand and generate human-like text. From translating languages to writing essays, from answering trivia questions to assisting with code, these models have shown remarkable versatility across a wide range of tasks. However, as we continue to push the boundaries of what LLMs can do, these models are becoming increasingly large. This trend towards bigger models presents both opportunities and challenges. Larger models usually perform better because they can capture more complex patterns in the data and provide more accurate responses. On the other hand, scaling up these models introduces significant computational and resource challenges.

This leads us to a pivotal question that forms the basis of this research: “Is it possible for smaller language models to outperform larger ones?” Answering it requires understanding and navigating the scaling laws that govern the relationship between the size of the model (number of parameters), the size of the dataset (number of tokens), and the amount of computation needed to achieve better performance.

Before we delve into the heart of the matter, it’s crucial to familiarize ourselves with some key concepts that will be pivotal in our discussion. These include parameters, tokens, and computational power, often measured in FLOPS.

Parameter: In the context of Large Language Models (LLMs), a parameter is a variable that the model learns from training data. These parameters, which include things like weights and biases in a neural network, are key to the model’s ability to process language. The more parameters a model has, the larger it is and the more complex language patterns it can understand. These parameters are what enable the model to generate relevant and contextually appropriate responses.

Token: Tokens are the basic units of analysis in natural language processing. In English, a token could be as short as a single character or as long as a word. For example, in the sentence “I love AI”, there are three tokens: “I”, “love”, and “AI”. The size of the dataset in language models is often measured in terms of the number of tokens.
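To make this concrete, here is a tiny sketch using OpenAI's tiktoken tokenizer (my choice for illustration; the article doesn't prescribe a tokenizer, and the exact split depends on which one you use):

```python
# Count tokens in a sentence with tiktoken (illustrative choice of tokenizer).
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
tokens = enc.encode("I love AI")
print(len(tokens), [enc.decode([t]) for t in tokens])
# Likely output: 3 ['I', ' love', ' AI'] (the exact split depends on the tokenizer)
```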

Computational power (FLOPS): FLOPS, or Floating Point Operations Per Second, is a measure of computer performance. In machine learning, the total amount of computation used to train a model is quantified in floating point operations (FLOPs), often expressed in units such as petaFLOP/s-days. More compute allows for more extensive exploration of the model’s parameter space and better optimization of the model’s parameters.

image source

The compute budget for training LLMs:

image source

The chart under discussion provides a comparative analysis of the petaFLOP/s-days (one petaFLOP/s-day is 10¹⁵ floating point operations per second, sustained for a full day) required to pre-train various versions of BERT and RoBERTa, both of which are encoder-only models. It also includes T5, an encoder-decoder model, and GPT-3, a decoder-only model. The distinguishing factor among the models within each family is the number of parameters that were trained, which varies from a few hundred million for BERT Base to a staggering 175 billion for the largest GPT-3 variant. It’s important to note that the y-axis is logarithmic, meaning each vertical increment represents a tenfold increase. For instance, T5 XL, with its three billion parameters, required nearly 100 petaFLOP/s-days. In contrast, the larger GPT-3 model, with 175 billion parameters, required a substantial 3,700 petaFLOP/s-days. This chart makes it clear that a huge amount of compute is required to train the largest models.
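As a rough sanity check on these magnitudes, we can estimate training compute with the widely used approximation C ≈ 6 × N × D (total FLOPs ≈ 6 × parameters × training tokens) and convert it to petaFLOP/s-days. This is my own back-of-the-envelope sketch; the ~300B token figure for GPT-3 is an assumption for illustration, not a value read off the chart:

```python
# Back-of-the-envelope training compute estimate using C ≈ 6 * N * D.
PFLOP_S_DAY = 1e15 * 86_400  # one petaFLOP/s sustained for a full day, in FLOPs

def petaflop_s_days(params: float, tokens: float) -> float:
    """Approximate training compute of a dense transformer, in petaFLOP/s-days."""
    return 6 * params * tokens / PFLOP_S_DAY

# GPT-3 scale: 175B parameters, ~300B training tokens (assumed)
print(f"{petaflop_s_days(175e9, 300e9):,.0f} petaFLOP/s-days")  # ≈ 3,650
```

The result lands in the same ballpark as the ~3,700 petaFLOP/s-days shown for GPT-3 in the chart.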

In 2020, a significant research study was published by OpenAI named “Scaling Laws for Neural Language Models”, also known as the “Kaplan paper”. This research uncovered a power-law relationship between the size of a model and its performance, demonstrating that as the model size increases, so does its performance.

image source

In the context of the chart, the y-axis represents the test loss, which can be viewed as an indicator of model performance: lower values signify better performance. The x-axis represents the compute budget, measured in petaFLOP/s-days. Larger values on this axis can be achieved by increasing the computational power, extending the training duration, or a combination of both. This illustrates the trade-off between computational resources and model performance in the training of Large Language Models.

The study shows that there is a power-law relationship between the number of parameters in an LLM and its performance: according to Kaplan’s law, if the compute budget increases by a factor of 10, optimal performance is achieved by increasing the model size by roughly 5.5x and the number of training tokens by roughly 1.8x.
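As a small sanity check (my own sketch, not taken from the paper), those multipliers can be converted back into the power-law exponents they imply:

```python
import math

# The 5.5x / 1.8x multipliers correspond to power laws N_opt ∝ C^a and
# D_opt ∝ C^b: a 10x compute increase multiplies the optimum by 10^exponent.
model_multiplier, token_multiplier = 5.5, 1.8

a = math.log10(model_multiplier)   # ≈ 0.74
b = math.log10(token_multiplier)   # ≈ 0.26
print(f"N_opt ~ C^{a:.2f}, D_opt ~ C^{b:.2f}")

# Consistency check: with C ≈ 6 * N * D, the exponents should sum to ~1.
print(f"a + b ≈ {a + b:.2f}")  # ≈ 1.00
```

In other words, Kaplan's prescription puts most of any extra compute into a bigger model and only a small share into more data.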

In 2022, a team of researchers at Google DeepMind conducted an in-depth analysis of the performance of various sizes of language models and the amount of training data used. Their research aimed to answer a critical question: “Given a fixed FLOPs budget, how should one trade-off model size and the number of training tokens?”

Kaplan’s study fixed the number of training tokens across model sizes in its analysis, an assumption that prevented it from answering this question.

To answer this question, Google DeepMind published a study named “Training Compute-Optimal Large Language Models”, also known as the “Chinchilla paper”, which concluded that scaling the number of training tokens is as crucial as scaling the model size. This conclusion suggests that current large language models might be significantly undertrained, a consequence of the recent focus on scaling model size while keeping the amount of training data constant.

The figure below clearly shows this problem:

image source

The Chinchilla paper hints that many of the 100-billion-parameter large language models like GPT-3 may be overparameterized, meaning they have more parameters than they need to achieve a good understanding of language, and undertrained, meaning they would benefit from seeing more training data. The authors hypothesized that smaller models may be able to achieve the same performance as much larger ones if they are trained on larger datasets.

The authors suggest that model size and the number of training tokens should be scaled in equal proportions. The figure below shows the difference between the approach of the Chinchilla paper and the Kaplan paper.

image source
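One rough way to apply this “equal proportions” idea is the commonly cited rule of thumb of about 20 training tokens per parameter, combined with the C ≈ 6 × N × D approximation. The sketch below is mine, with Gopher's compute budget approximated from its 280B parameters and 300B training tokens, so treat the numbers as illustrative:

```python
import math

# Heuristic: roughly 20 training tokens per parameter, with C ≈ 6 * N * D.
TOKENS_PER_PARAM = 20  # commonly cited rule of thumb, not an exact law

def chinchilla_optimal(compute_flops: float) -> tuple[float, float]:
    """Return (params, tokens) satisfying D = 20 * N under C = 6 * N * D."""
    n_opt = math.sqrt(compute_flops / (6 * TOKENS_PER_PARAM))
    return n_opt, TOKENS_PER_PARAM * n_opt

gopher_compute = 6 * 280e9 * 300e9  # ≈ 5e23 FLOPs (approximation)
n, d = chinchilla_optimal(gopher_compute)
print(f"params ≈ {n/1e9:.0f}B, tokens ≈ {d/1e12:.1f}T")  # ≈ 65B, ≈ 1.3T
```

Those numbers are close to the 63B to 67B parameters and 1.4T to 1.5T tokens that the first two approaches below arrive at for Gopher's compute budget.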

Let’s delve into the Chinchilla paper:

As shown in the left figure above, for the same compute budget, Chinchilla outperforms much larger LLMs such as Gopher (280B), GPT-3 (175B), and Megatron-Turing NLG (530B) with far fewer parameters, while being trained on far more tokens (right figure).

The table below shows the number of parameters and the number of training tokens for each LLM.

The authors of the Chinchilla paper trained over 400 models, ranging from 70M to 16B parameters and trained on 5B to 500B tokens, taking three approaches:

1. Fix model sizes and vary the number of training tokens:

approach 1

In this approach, the researchers selected several model sizes, varying from 70M to 10B parameters, and trained each of them with four different amounts of training data. By examining the loss of each trained run, they could determine the optimal model (in terms of parameters/tokens) for a given compute budget. Using this approach, they found that a compute-optimal model trained with the same amount of compute as the Gopher model would have 67B parameters and 1.5T tokens.
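The sketch below captures the spirit of this approach with made-up run results: group runs by compute budget, keep the lowest-loss run in each group, and fit a power law through those minima (the data and the bucketing are hypothetical; only the method mirrors the paper's idea):

```python
import numpy as np

# Hypothetical runs: (params, tokens, final_loss), for illustration only.
runs = [
    (70e6, 4e9, 3.20), (70e6, 16e9, 3.05),
    (400e6, 8e9, 2.95), (400e6, 32e9, 2.78),
    (1e9, 20e9, 2.70), (1e9, 80e9, 2.55),
    (10e9, 50e9, 2.45), (10e9, 200e9, 2.30),
]

compute = np.array([6 * n * d for n, d, _ in runs])  # C ≈ 6 * N * D
params = np.array([n for n, _, _ in runs])
loss = np.array([l for _, _, l in runs])

# Bucket runs by order of magnitude of compute, keep the loss-minimizing one.
buckets = np.round(np.log10(compute))
best = [np.argmin(np.where(buckets == b, loss, np.inf)) for b in np.unique(buckets)]

# Fit log N_opt = a * log C + const; 'a' is the scaling exponent.
a, _ = np.polyfit(np.log10(compute[best]), np.log10(params[best]), deg=1)
print(f"fitted exponent a ≈ {a:.2f}")  # toy data; the paper's fit is about 0.5
```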

2. IsoFLOP profiles:

In this approach, the researchers fixed nine compute budgets, ranging from 6x10¹⁸ to 3x10²¹ FLOPs, and varied the model size. The research tries to answer the question: for a given FLOP budget, what is the optimal parameter count?

approach 2

Using this approach, a compute-optimal model trained with the same amount of compute as Gopher would have 63B params and 1.4T tokens.
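Here is a minimal sketch of the IsoFLOP idea with hypothetical numbers: at one fixed FLOP budget, vary the model size (so the token count is forced to D = C / (6N)), fit a parabola to loss versus log model size, and read off the valley:

```python
import numpy as np

C = 1e21  # one fixed FLOP budget (illustrative)

# Hypothetical (model size, final loss) pairs along this IsoFLOP curve.
n_values = np.array([0.5e9, 1e9, 2e9, 4e9, 8e9])
losses = np.array([2.60, 2.48, 2.43, 2.46, 2.58])

# Fit loss = p2*x^2 + p1*x + p0 with x = log10(N); the parabola's vertex
# is the compute-optimal model size for this budget.
x = np.log10(n_values)
p2, p1, p0 = np.polyfit(x, losses, deg=2)
n_opt = 10 ** (-p1 / (2 * p2))
d_opt = C / (6 * n_opt)  # tokens implied by the C ≈ 6 * N * D approximation

print(f"optimal ≈ {n_opt/1e9:.1f}B params, ≈ {d_opt/1e9:.0f}B tokens")
# With this toy data: roughly 2B params and ~80B tokens.
```

Repeating this for each of the nine budgets gives one optimal (N, D) point per budget, and a power law can then be fitted through those points, just as in the first approach.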

3. Fitting a parametric loss function:

In this approach, the researchers modeled all final losses from approaches 1 and 2 as a parametric function of the model parameter count and the number of tokens seen.

approach 3

In the 3D plot presented in the study, the loss decreases as we move towards the top right and increases as we move towards the bottom left. This indicates a correlation between increased compute power, larger datasets, and improved model performance. Interestingly, for a compute budget of 10²² FLOPs, a 1B-parameter model trained on more data matches the performance of a 40B-parameter model. This suggests that increasing the amount of training data can compensate for a smaller model size. Furthermore, given the same compute budget as the Gopher model, the optimal model size is found to be 40B parameters.
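For reference, the parametric form used in the paper is L(N, D) = E + A/N^α + B/D^β. The sketch below uses approximate fitted constants (treat them as illustrative rather than exact) and shows both the scaling exponents this form implies under C = 6·N·D and the 1B-vs-40B comparison mentioned above:

```python
# Approximate fitted constants for L(N, D) = E + A/N**alpha + B/D**beta;
# illustrative values, not exact quotes from the paper.
E, A, B = 1.69, 406.4, 410.7
alpha, beta = 0.34, 0.28

def loss(n_params: float, n_tokens: float) -> float:
    return E + A / n_params**alpha + B / n_tokens**beta

# Minimizing L(N, C / (6 * N)) analytically gives N_opt ∝ C**a, D_opt ∝ C**b
# with a = beta / (alpha + beta) and b = alpha / (alpha + beta).
a = beta / (alpha + beta)   # ≈ 0.45
b = alpha / (alpha + beta)  # ≈ 0.55
print(f"N_opt ~ C^{a:.2f}, D_opt ~ C^{b:.2f}")

# Two ways to spend the same 1e22 FLOPs budget:
C = 1e22
print(loss(40e9, C / (6 * 40e9)))  # 40B params on ~42B tokens -> ≈ 2.23
print(loss(1e9, C / (6 * 1e9)))    # 1B params on ~1.7T tokens -> ≈ 2.20
```

With these illustrative constants, the smaller model trained for longer comes out slightly ahead, consistent with the observation above.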

In the table below, the researchers aimed to identify the optimal balance between the number of parameters and the number of training tokens for a compute-optimal model. They discovered that with the compute budget used for the Gopher model, a model with 67B parameters would have been more optimal, rather than the 280B parameter model that was used. Furthermore, they found that to optimally train a 280B parameter model, they would need a computing budget that is approximately 17.2 times larger than the budget used for the Gopher model.

These results show that the Gopher model (and many modern large language models) is unnecessarily overparameterized.

After training more than 400 models to establish the power law relationship between the number of training tokens and the performance of Large Language Models (LLMs), the researchers used the same computational budget as the Gopher 280B model to train the Chinchilla 70B model. This model is 4 times smaller than Gopher but was trained with 1.4 trillion tokens, which is four times more data than Gopher. The results were significant. Chinchilla outperformed Gopher (280B), GPT-3 (175B), Jurassic-1 (178B), and Megatron-Turing NLG (530B) across a wide range of downstream evaluation tasks.

Chinchilla benchmarks:

In this article, we will focus on presenting the results from the MMLU benchmark and the BIG-bench benchmark. For a more detailed discussion of other benchmarks, please refer to section 4.2 of the Chinchilla paper, which provides an in-depth analysis of the various benchmarks used in the study.

MMLU:

The performance of Chinchilla on this benchmark is remarkable. Despite its smaller size, Chinchilla significantly outperforms Gopher, achieving an average accuracy of 67.6%, a 7.6 percentage point improvement over Gopher. Even more impressive, Chinchilla surpasses the expert forecast for June 2023, which predicted an accuracy of 63.4%. Furthermore, Chinchilla outshines Gopher in nearly all tasks, outperforming it on 51 out of 57 tasks.

BIG-bench:

In this benchmark, Chinchilla demonstrates superior performance over Gopher in the majority of tasks, outperforming it on 56 out of 62 tasks, with a 10.7% increase in average performance (for a comprehensive view of the results, refer to Table A7 in the Chinchilla paper).

Conclusion

Revisiting our initial question, “Is it possible for smaller language models to outperform larger ones?”, the answer is a BIG YES: given the same compute budget, a smaller model trained on enough data, like Chinchilla, can outperform much larger ones.

Thank you for reading

Please don’t hesitate to reach out with any questions. If you share a passion for AI and its transformative potential, I invite you to connect with me on LinkedIn or explore my GitHub profile for further insights.
