Large Language Models’ Emergent Abilities: How Do They Solve Problems They Were Not Trained to Address?

BoredGeekSociety
6 min read · Dec 12, 2022

There is no doubt about it. Large language models are going to play a very important role in our lives from now on!

The example of ChatGPT shows how these models are capable of tackling a variety of complex tasks such as

  • Content generation,
  • Text summarization,
  • Translation,
  • Code generation,
  • Question answering,
  • and so on.

Check this article to read more on its abilities and limitations: “Why ChatGPT Isn’t Ready to Kill Google Yet”.

Yet, the most fascinating part is the fact that none of these models were explicitly trained to tackle these tasks in the first place! These capabilities are what the AI community refers to as “emergent abilities”.

What are emergent abilities and how is this possible?

Training large language models — unsupervised learning on a massive dataset

Training a large language model requires two things: an input dataset (massive and secret in the case of most LLMs today), and an objective function, i.e., the function the model needs to optimize to perform well on the task at hand.

The objective function of a large language model such as GPT-3 is to predict the next word in a sequence of words.

The model does this by

  1. taking in a sequence of words as input
  2. and outputting a probability distribution over the possible words that could come next in the sequence.

The model is trained to maximize the likelihood that it will predict the correct next word in the sequence.

Yes, that’s basically it. Its goal is to predict the next word. We don’t even need to label the data, the process is unsupervised, which makes it a simple yet highly scalable approach for language models!
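The two steps above can be sketched with a toy next-word predictor. To be clear, this is not how GPT-3 works internally (GPT-3 is a transformer over subword tokens with billions of parameters); it is just a minimal bigram counter, on a made-up corpus, to illustrate the idea of outputting a probability distribution over the next word:

```python
from collections import Counter, defaultdict

# Toy training "dataset": the only objective is predicting the next word.
corpus = "the cat sat on the mat the cat ate the fish".split()

# Count, for every word, which words follow it (a bigram model).
next_counts = defaultdict(Counter)
for current, nxt in zip(corpus, corpus[1:]):
    next_counts[current][nxt] += 1

def next_word_distribution(word):
    """Return a probability distribution over possible next words."""
    counts = next_counts[word]
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

print(next_word_distribution("the"))
# → {'cat': 0.5, 'mat': 0.25, 'fish': 0.25}
```

No labels were needed: the “supervision signal” (the next word) comes for free from the text itself, which is exactly what makes the approach so scalable.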

The Fascinating Emergent Abilities

“An ability is emergent if it is not present in smaller models but is present in larger models” (source)

Prompting large language models

While a language model such as GPT-3 is primarily trained to predict the next word in a sequence of words, it proved to be useful for a variety of other natural language processing tasks, as mentioned before. These abilities are called “emergent abilities”: the model was not trained to address them, and smaller models can’t tackle them in a satisfactory way; yet, when the model reaches a certain tipping point in terms of size, it suddenly displays new abilities, such as translation, entity recognition, summarization, arithmetic computation, code generation, and so on.

“The figure below shows three examples of emergent abilities. The ability to perform arithmetic, take college-level exams (Multi-task NLU), and identify the intended meaning of a word all become non-random only for models with sufficient scale (in this case, we measure scale in training FLOPs). Critically, the sudden increase in performance is not predictable simply by extrapolating from the performance of smaller models.” (source)

Prompting Strategies and Chain-of-Thought

It’s even possible to enhance the model’s ability to solve new tasks by tweaking your prompting strategy and asking the model to detail its reasoning step by step (either by requesting it in the prompt or by including worked examples).

Chain of Thought Prompting example (source)
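To make the contrast concrete, here is a sketch of the two prompting styles as plain strings; the exact wording is illustrative, not a prescription:

```python
# Standard prompting: ask for the answer directly.
standard_prompt = (
    "Q: Roger has 5 tennis balls. He buys 2 cans of 3 balls each. "
    "How many balls does he have now?\n"
    "A:"
)

# Chain-of-thought prompting: demonstrate intermediate reasoning steps
# in the example, so the model follows the same pattern for the new question.
cot_prompt = (
    "Q: Roger has 5 tennis balls. He buys 2 cans of 3 balls each. "
    "How many balls does he have now?\n"
    "A: Let's think step by step. Roger starts with 5 balls. "
    "2 cans of 3 balls is 6 balls. 5 + 6 = 11. The answer is 11.\n\n"
    "Q: The cafeteria had 23 apples. It used 20 and bought 6 more. "
    "How many apples are there?\n"
    "A: Let's think step by step."
)
```

With the first prompt, large models often jump straight to a (sometimes wrong) number; with the second, they tend to reproduce the step-by-step pattern and get arithmetic problems right far more often.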

How can we explain emergent abilities?

Short answer: there is no satisfactory explanation today; we’re learning every day, from experience, what such models are capable of doing and what their limitations are.

Longer answer: we can, however, still theorize about it. Several factors plausibly contribute to the emergent abilities of large language models: the size of the model, the quality of the training data, and the model’s inner semantic representation.

Dataset Size & Quality

Size: One of the key factors in the performance of a large language model is its size, or the number of parameters it has. Large models with a high number of parameters are able to capture more complex patterns in the data, and therefore can often produce more accurate and diverse outputs.

Quality: The quality and quantity of the training data also play a significant role in the performance of a large language model. Models trained on a large and diverse dataset (e.g., containing text from a variety of topics, in different languages, including code and instructions, etc.) are more likely to generalize to a wide range of inputs, and therefore produce more accurate and relevant outputs.

Semantic Representation

Emergent abilities in large language models could be based on the model’s ability to build an implicit semantic representation of the learned data (usually referred to as the “embedding space”). This means that the model is able to represent* the context of the words and phrases in the input data, and can use this information to generate more accurate and relevant outputs.
*I carefully use the word “represent” instead of “understand” on purpose.
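A hand-wavy way to picture this “embedding space”: each word is a vector, and semantically related words end up close together. The 3-dimensional vectors below are entirely made up for illustration; real models learn hundreds or thousands of dimensions from data:

```python
import math

# Hypothetical 3-D embeddings (real models learn these; these are invented).
embeddings = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.7, 0.2],
    "apple": [0.1, 0.2, 0.9],
}

def cosine_similarity(u, v):
    """Similarity of direction: close to 1.0 means semantically close."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

sim_royalty = cosine_similarity(embeddings["king"], embeddings["queen"])
sim_fruit = cosine_similarity(embeddings["king"], embeddings["apple"])
assert sim_royalty > sim_fruit  # "king" sits closer to "queen" than to "apple"
```

It is this kind of geometry, learned at massive scale, that the illustrations below lean on.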

(Overly simplified) Illustrations:

  • The prompt is an instruction: the model finds the relevant part of the network semantically close to instruction related concepts. Thus the model will follow similar patterns as those ingested during the learning phase and will complete the text accordingly.
  • The prompt asks for code: the model probably encoded code related knowledge in a different “code” subspace, and can tap into it to solve its issue.
  • The prompt asks the model to detail its reasoning step by step: it can probably relate to exam-like answers and enhance its output based on the same patterns.

Why is this explanation plausible? The Tunisian dialect example.

Disclaimer: this is an overly simplified explanation that aims at providing a conceptual understanding of what might be happening under the hood; please share with me your own interpretation, or any scientific paper relevant to the topic (whether it agrees with this explanation or not).

A while ago, I tackled a sentiment analysis problem involving the Tunisian dialect. The hardest part about this dialect? Tunisians speak a mix of Tunisian Arabic, French, and English, and write it using both the Latin alphabet (including numbers to represent certain sounds) and the Arabic alphabet.

The most efficient approach was to build a custom language model (based on BERT) and then fine-tune it to perform the classification task, reaching about 95% accuracy.

Let’s look at the embedding space projection in two dimensions.

Model’s embedding representation of the Tunisian dialect, test set sentences’ view (in French)

How to read this representation? We computed the embedding representation of a sample of Tunisian sentences.

  • In blue: positive sentiment
  • In orange: negative sentiment

Where the blue and orange dots mix in the middle: that’s where the ambiguous sentences lie (sentences that can carry both positive and negative meaning).

  • Top samples: written in Latin alphabet
  • Bottom samples: written in Arabic alphabet

The language model learned, by itself, to treat the different alphabets differently; the classifier learned to classify sentiments based on the supervised training dataset.
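The 2-D view above comes from projecting high-dimensional sentence embeddings down to two dimensions. The article’s figure doesn’t say which projection method was used; here is a minimal sketch of the idea using PCA (via SVD) on made-up embeddings, with two synthetic clusters standing in for the Latin-alphabet and Arabic-alphabet sentences:

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up 16-D "sentence embeddings": two clusters standing in for the
# Latin-alphabet and Arabic-alphabet sentences (the real ones came from BERT).
latin = rng.normal(loc=+1.0, scale=0.3, size=(50, 16))
arabic = rng.normal(loc=-1.0, scale=0.3, size=(50, 16))
X = np.vstack([latin, arabic])

# PCA via SVD: center the data, keep the top 2 principal components.
X_centered = X - X.mean(axis=0)
_, _, vt = np.linalg.svd(X_centered, full_matrices=False)
projection_2d = X_centered @ vt[:2].T  # shape (100, 2)

# The two alphabet clusters separate along the first principal component,
# which is what produces the top/bottom split visible in the figure.
print(projection_2d.shape)  # (100, 2)
```

Tools like t-SNE or UMAP are also commonly used for this kind of visualization; PCA just keeps the sketch short.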

Now, imagine this for thousands of dimensions across multiple topics and languages!

The inner semantic representation, in the neural network, could be key to helping the model quickly find the most relevant sub networks, and then leverage them to perform what it was trained for: complete the text!

What’s next?

Emergent abilities are quite fascinating! What else could these models perform? How could we integrate them into modern applications? What are their risks and limitations?

The AI community is barely scratching the surface of what large language models can do. We are all learning as we go. With larger and richer models, such as GPT-4, being announced, the next big thing will be “multi-modal very large language models”, and they’ll probably come with more emergent abilities and astonishing surprises! Stay tuned!

If you liked this post, please consider supporting us: 🔔 clap & follow 🔔


BoredGeekSociety

Wassim Jouini, CTO and Head of AI @LegalPlace. 12yrs+ Building AI & Automation products for Scale-Up Startups | Ph.D. AI