Exploring the Mechanics of Language Models

Why does performance vary, and what does it mean for prompt engineering?

Mark Craddock
Prompt Engineering

--

These are my notes from a presentation by Jason Wei of OpenAI, who has researched how large language models scale across 202 tasks from the BIG-bench corpus.

Introduction

We delve into the intricate world of language models, exploring the foundational principles that govern their remarkable capabilities. We trace the path from the core task of next-word prediction to the profound impact of scaling on model performance and the emergence of unexpected abilities. Through research insights and an examination of scaling patterns, we uncover the underlying mechanisms that enable language models to achieve their impressive linguistic prowess.

Probabilities and Training

In the task of next-word prediction, a language model outputs a probability distribution over the entire vocabulary, estimating the likelihood of each word being the next in the sequence. The goal is to train the model to maximise the probability of the correct word, minimising the “loss” between the predicted and actual next words.

Loss is the penalty for a bad prediction. That is, loss is a number indicating how bad the model’s prediction was on a single example. If the model’s prediction is perfect, the loss is zero; otherwise, the loss is greater. The goal of training a model is to find a set of weights and biases that have low loss, on average, across all examples.

The model learns by adjusting its internal parameters to bring the probability of the true next word as close to 1 as possible, while reducing the probabilities of incorrect options. This process of gradient descent optimisation lies at the heart of how language models acquire their impressive linguistic capabilities.
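As a minimal sketch (with made-up numbers, not any model’s actual training code), the snippet below shows how the loss for a single next-word prediction can be computed: softmax turns the model’s raw scores into a probability distribution over the vocabulary, and the loss is the negative log of the probability assigned to the true next word.

```python
import math

# Hypothetical vocabulary and raw model scores (logits) for the prompt
# "Students like to ____". The numbers are illustrative only.
vocab = ["study", "banana", "sleep", "code"]
logits = [3.2, -1.0, 1.5, 0.7]

# Softmax: convert logits into a probability distribution over the vocabulary.
exps = [math.exp(x) for x in logits]
probs = [e / sum(exps) for e in exps]

# Cross-entropy loss for this single example: -log P(true next word).
true_next_word = "study"
loss = -math.log(probs[vocab.index(true_next_word)])

print({w: round(p, 3) for w, p in zip(vocab, probs)})
print("loss:", round(loss, 3))  # zero only if the model put probability 1 on "study"
```

Training nudges the model’s parameters so that, averaged over billions of such examples, this loss goes down.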

Next-Word Prediction

At the core of many language models, such as the GPT (Generative Pre-trained Transformer) family, lies the deceptively simple task of “next-word prediction.” This involves predicting the most probable subsequent word in a sentence based on the preceding context. For instance, given the prompt “Students like to _____”, a language model might confidently suggest “study” as the next word, drawing upon its vast linguistic knowledge.

However, this seemingly straightforward task conceals a profound complexity. Each word-prediction challenge the model encounters can be viewed as a distinct learning endeavour, collectively contributing to the model’s overall linguistic capability. In essence, next-word prediction serves as a massive multi-task learning exercise, simultaneously encompassing grammar, semantics, world knowledge, and even specialised domains like sentiment analysis or translation. Every sentence processed during training presents a unique learning opportunity, cumulatively enhancing the model’s linguistic proficiency.

For instance, consider a training scenario with the sentence fragment “In my free time I like to ______”; possible completions might include “code,” “drink,” or “study.” The model evaluates the context and assigns higher probabilities to words that fit grammatically, thereby refining its understanding of syntax and appropriate word use.

Further illustrating this, take a sentence like “I went to the store to buy papaya, dragon fruit, and _______.” Here, the model learns to predict “durian” as more likely than “squirrel,” understanding not just the syntax but the semantic relationships and real-world knowledge.

This exposure helps the model grasp the nuances of context and the logical relationships among items, improving its capability to make more accurate and contextually appropriate predictions.

Example tasks learned through next-word prediction:

Grammar — Models learn syntactic rules and linguistic structures through next-word prediction tasks. Example: “In my free time, I like to _____ {code, banana}.”

Lexical Semantics — Models understand word relationships and meanings. Example: “I went to the store to buy papaya, dragonfruit, and _____ {durian, squirrel}.”

World Knowledge — By encountering diverse contexts, models gain an understanding of real-world concepts and their relationships. Example: “The capital of Azerbaijan is ____ {Baku, London}.”

Sentiment Analysis — Predicting words in emotional contexts enables models to discern and interpret sentiments. Example: “Movie review: I was engaged and on the edge of my seat the whole time. The movie was ______ {good, bad}.”

Translation — Exposure to multiple languages allows models to develop translation capabilities through next-word prediction. Example: “The word for pretty in Spanish is ______ {bonita, hola}.”

Spatial Reasoning — Models track entities and their locations across sentences. Example: “Mark went to the kitchen to make tea. Standing next to Mark, Lucy pondered her destiny. Mark left the ______ {kitchen, store}.”

Maths — Models pick up basic arithmetic from numerical contexts. Example: “Arithmetic exam answer: 3 + 8 + 4 = ______ {15, 11}.”

Millions of tasks = Extreme multi-task learning
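All of these cloze-style tasks can be scored the same way: ask the model for the probability of each candidate completion and check whether the correct one ranks highest. A sketch of that scoring loop is below; `next_word_probability` is a hypothetical stand-in for a real model’s API, not an actual library call.

```python
def next_word_probability(prompt: str, candidate: str) -> float:
    """Hypothetical: return P(candidate | prompt) from some language model."""
    raise NotImplementedError  # replace with a call to your model of choice

def solves_task(prompt: str, candidates: list[str], answer: str) -> bool:
    """Mark the task correct if the true answer gets the highest probability."""
    probs = {c: next_word_probability(prompt, c) for c in candidates}
    return max(probs, key=probs.get) == answer

tasks = [
    ("The capital of Azerbaijan is", ["Baku", "London"], "Baku"),
    ("The word for pretty in Spanish is", ["bonita", "hola"], "bonita"),
]
# accuracy = sum(solves_task(*t) for t in tasks) / len(tasks)
```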

The Power of Scale: Computational Resources and Model Size

A pivotal insight into the effectiveness of language models lies in the role of scaling — both in terms of data and model size. Groundbreaking research, notably the seminal study by Kaplan et al. in 2020, has demonstrated that increasing compute and model size systematically reduces loss. In essence, larger models reliably have a greater capacity to retain information (memorisation of facts) and to execute more complex heuristics (such as accurately handling grammar and solving mathematical problems).
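Kaplan et al. report these trends as power laws: loss falls smoothly and predictably as compute (and data and parameters) grows. A rough sketch of the power-law form is below; the constants are placeholders for illustration, not the paper’s fitted values.

```python
def scaling_law_loss(compute: float, c_critical: float = 1.0, alpha: float = 0.05) -> float:
    """Power-law loss in training compute: L(C) = (C_c / C) ** alpha (illustrative constants)."""
    return (c_critical / compute) ** alpha

for c in [1e0, 1e3, 1e6, 1e9]:
    print(f"compute={c:.0e}  predicted loss={scaling_law_loss(c):.3f}")
# Loss declines smoothly, with no sudden jumps, as compute increases.
```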

This finding has profound implications for the development of language models, as it suggests that continuously pushing the boundaries of scale could unlock unprecedented levels of performance and capability. However, it also raises questions about the practical limits of scaling and the potential emergence of unforeseen challenges or constraints as models grow exponentially larger.

Hence the land grab for GPUs by the large tech companies. We have not yet found the point of saturation.

The improvement in language model performance with scaling can be attributed to two key factors: memorisation capability and heuristic complexity. As the model size increases, the sheer number of parameters allows the model to memorise a vast amount of facts and knowledge from the training data, rather than being selective due to limited capacity.

Additionally, larger models can devote more computational resources to executing more sophisticated heuristics and strategies to accurately predict the next token. While smaller models may struggle with basic grammatical rules, the extra capacity of larger models enables them to tackle even complex mathematical problems and logical reasoning.

Non-Linear Performance Curves and Task-Specific Scaling

While the overall loss in language models improves smoothly as the model size and computational resources increase, the performance on individual tasks can sometimes improve suddenly (emergent). This is because the overall loss is a weighted sum of the losses for each individual task, such as grammar, sentiment analysis, world knowledge, and even mathematical reasoning.

As the model scales up, it can allocate more capacity to handle the various sub-tasks more effectively, leading to non-linear performance curves for the individual tasks. This means that certain tasks may see rapid improvements in accuracy, while others may progress more gradually.

For example, a powerful language model may already excel at basic grammatical and sentiment analysis tasks, leaving little room for further improvement in those areas. However, the model’s ability to handle more complex mathematical reasoning could see a sudden surge (emergent) as it gains the necessary computational (compute + data) resources to tackle those challenges.

This can explain why a given model generation is already very good at grammar but not so good at knowledge or maths. We may see little gain in grammar performance from one generation to the next, yet with the additional compute and data behind a larger generation we can see a sudden improvement in maths performance.
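A toy way to see how smooth improvement can look like a sudden jump (my own illustration, not from the presentation): if a task is only scored correct when every token of a multi-token answer is right, exact-match accuracy stays near zero while per-token quality climbs, and then rises sharply.

```python
# Toy model: a task whose answer is k tokens long is only "correct" if all
# k tokens are predicted correctly. Per-token probability improves smoothly
# with scale, yet exact-match accuracy (p ** k) jumps late and abruptly.
k = 10  # hypothetical answer length in tokens
for per_token_prob in [0.5, 0.7, 0.9, 0.95, 0.99]:
    exact_match = per_token_prob ** k
    print(f"per-token prob {per_token_prob:.2f} -> exact-match accuracy {exact_match:.3f}")
```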

The Uneven Advancement of Tasks in Language Models

As language models grow in scale and capability, their performance on individual tasks does not advance uniformly. The data reveals a diverse landscape, with some tasks exhibiting smooth scaling, others showing flat or inverse scaling, and a significant portion demonstrating emergent abilities as models become larger.

This uneven progress highlights the complexity of language understanding and the need for careful analysis of a model’s strengths and limitations across a wide range of sub-tasks.

Jason Wei from OpenAI has researched how these tasks scale. He looked at 202 tasks using the BIG-bench corpus and examined the distribution of scaling behaviours across them.

Emergent performance is inherently unpredictable. When training smaller models up to a certain threshold, it might seem that certain tasks are simply beyond the capabilities of the language model. Yet these tasks often only begin to show improvement when training is conducted with larger, more capable models, highlighting the unexpected nature of model scaling.

The U-Shaped Curve: Inverse Scaling and Partial Knowledge Pitfalls

While the general trend of improved performance with increased scale is well-documented, some scaling studies have revealed an unexpected behaviour: inverse scaling or U-shaped performance curves. In these cases, a model’s performance on certain tasks might initially decrease as its size grows moderately, only to improve significantly as the scale continues to expand even further.

This counterintuitive phenomenon can be explained by considering the potential pitfalls of partial knowledge. At moderate scales, models may begin to “overthink” problems, applying incomplete or incorrect fixes based on their limited understanding. However, as the model’s scale increases substantially, it gains access to a more comprehensive knowledge base, enabling it to strike a better balance between various learning tasks and overcome the limitations of partial understanding.

For example, with this prompt:

Repeat after me.
All that glitters is not glib
All that glitters is not _____

The goal is to predict the next word. The correct answer is ‘glib’, because the task is to repeat the line exactly, not to correct the quotation. A model with only partial knowledge may instead “fix” the misquote to ‘gold’, which is wrong for this task.

Extra Small Model: Achieves 100% accuracy in the “Repeat” task but 0% in both “Fix a Quote” and “Follow Instructions”.

Small Model: Scores 100% in both the “Repeat” and “Fix a Quote” tasks, yet fails to achieve any success in “Follow Instructions”, maintaining 0% accuracy.

Large Model: Demonstrates complete proficiency with 100% accuracy across all tasks: “Repeat”, “Fix a Quote”, and “Follow Instructions”.
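Read together, these sub-skills suggest how a U-shaped curve can arise on the prompt above. The mapping from sub-skills to predicted words in the sketch below is my reading of the example rather than something stated in the presentation.

```python
# Toy reading of the example: which word each model size would predict for
# "Repeat after me. All that glitters is not glib / All that glitters is not ____".
behaviours = {
    "extra small": "glib",  # can only repeat, which happens to be correct here
    "small":       "gold",  # can fix the quote but cannot follow the instruction
    "large":       "glib",  # follows the instruction and repeats the misquote
}
answer = "glib"
for size, prediction in behaviours.items():
    print(f"{size:>11}: {prediction} ({'correct' if prediction == answer else 'wrong'})")
```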

The example above illustrates the pitfall of partial knowledge: at moderate scales the model knows the quotation well enough to “fix” it, but does not yet understand the instruction well enough to follow it, so performance dips. Only at larger scales does it balance both skills, producing the U-shaped curve.

The Importance of Plotting Scaling Curves

The insights gained from studying the complex relationships between scale and performance have led to a crucial practical recommendation for researchers: the importance of plotting scaling curves in their studies. By systematically examining how changes in scale impact a model’s performance across various tasks, researchers can develop a deeper understanding of the optimal conditions for their models and more accurately predict future performance improvements.

Plotting scaling curves not only provides valuable insights into the dynamics of learning and emergence but also serves as a diagnostic tool for identifying potential bottlenecks or areas that may require further investigation or optimisation. Additionally, these curves can inform resource allocation decisions, helping researchers determine the most effective way to allocate computational resources and data to achieve their desired performance goals.
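A minimal sketch of such a plot, using made-up numbers and matplotlib, with compute on a log scale so that smooth and emergent-looking curves are easy to tell apart:

```python
import matplotlib.pyplot as plt

# Illustrative, made-up data: training compute (arbitrary units) vs task accuracy.
compute = [1e18, 1e19, 1e20, 1e21, 1e22, 1e23]
grammar_acc = [0.85, 0.90, 0.93, 0.95, 0.96, 0.97]  # smooth scaling
maths_acc = [0.02, 0.03, 0.04, 0.05, 0.35, 0.80]    # emergent-looking jump

plt.semilogx(compute, grammar_acc, marker="o", label="grammar (smooth)")
plt.semilogx(compute, maths_acc, marker="s", label="maths (emergent)")
plt.xlabel("Training compute (arbitrary units)")
plt.ylabel("Task accuracy")
plt.legend()
plt.title("Illustrative scaling curves")
plt.show()
```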

Conclusions

As tech companies continue to explore the scaling of both model size and data, they anticipate further enhancements in accuracy across various tasks. While the overall reduction in loss generally progresses smoothly, the improvement in performance for specific tasks can occur abruptly and unexpectedly. This phenomenon underscores the complex and dynamic nature of language model development.

Further research and exploration are required to fully understand the implications of scaling on different aspects of language processing. By continuing to push the boundaries of what is technically feasible, researchers may unlock new capabilities and applications for language models, paving the way for advancements in natural language understanding and artificial intelligence.

--
