Paper From Google Shines Light On The Future Of AI ⭕

Scaling Models Causes New Abilities To Emerge

The Decoding
CodeX
4 min read · Jan 13, 2023


Last year, large language models (LLMs) broke record after record. ChatGPT reached 1 million users faster than Facebook, Spotify, or Instagram did. They helped create billion-dollar companies, and, most notably, they helped us recognize the divine nature of ducks.

2023 has started, and ML progress is likely to continue at breakneck speed. This is a great time to take a look at one of the most interesting papers from last year.

Emergent Abilities in LLMs

In a recent paper from Google Brain, Jason Wei and his colleagues allowed us a peek into the future. This beautiful research showed how scaling LLMs might allow them, among other things, to [1]:

  • Become better at math
  • Understand even more subtleties of human language
  • Reduce hallucinations and answer truthfully

(See the plot on break-out performance below for a full list)

Some Context:

If you have played around with ChatGPT or any of the other LLMs, you will likely have been as impressed as I was. However, you have probably also seen the models go off the rails here and there. The model might hallucinate gibberish, give untrue answers, or simply fail at math.

Why does this happen?

LLMs are commonly trained by maximizing the likelihood over all tokens in a body of text [2]. Put more simply, they learn to predict the next word in a sequence of words.
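
To make this concrete, here is a minimal sketch of that objective (with random tensors standing in for a real model and real text, not anything from the paper): maximizing the likelihood of the training text is equivalent to minimizing the cross-entropy between the predicted distribution and the actual next token.

```python
import torch
import torch.nn.functional as F

# Toy stand-ins: in a real LLM, `logits` come from the model and
# `targets` are the true next tokens of the training text.
vocab_size, seq_len = 50_000, 8
logits = torch.randn(seq_len, vocab_size)            # predicted scores for each position
targets = torch.randint(0, vocab_size, (seq_len,))   # actual next token at each position

# Maximizing the likelihood of the text is the same as minimizing this cross-entropy.
loss = F.cross_entropy(logits, targets)
print(loss.item())
```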

Hence, if such a model learns to do any math at all, it learns it by figuring out the concepts present in human language (and thereby in written math).

Let’s look at the following sentence.

“The sum of two plus two is …”

The model figures out that the most likely missing word is “four”.
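
You can try this yourself. Here is a minimal sketch, assuming the Hugging Face transformers library and the small GPT-2 model purely for illustration; a model this small may well get the answer wrong, which is exactly the paper's point about scale.

```python
from transformers import pipeline

# Greedy completion with a small, publicly available model (illustrative only).
generator = pipeline("text-generation", model="gpt2")
out = generator("The sum of two plus two is", max_new_tokens=3, do_sample=False)
print(out[0]["generated_text"])
```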

The fact that LLMs learn this at all is mind-bending to me! However, once the math gets more complicated, LLMs begin to struggle.

There are many other cases where the models fail to capture the elaborate interactions and meanings behind words. One example is words that change their meaning with context. When the model encounters the word “bed”, it needs to figure out from the context whether the text is talking about a “river bed” or a “bed” to sleep in.
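
Modern models handle this because each word’s vector depends on its surroundings. A minimal sketch, assuming bert-base-uncased purely for illustration (a masked language model, used here only because it conveniently exposes one vector per token): the vector for “bed” comes out different in the two sentences.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def vector_for_bed(sentence: str) -> torch.Tensor:
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]          # one vector per token
    bed_id = tokenizer.convert_tokens_to_ids("bed")
    position = inputs["input_ids"][0].tolist().index(bed_id)   # first occurrence of "bed"
    return hidden[position]

river = vector_for_bed("The river bed was completely dry after the hot summer.")
sleep = vector_for_bed("She was so tired that she went straight to bed.")

# The two "bed" vectors differ because the surrounding context differs.
print(torch.cosine_similarity(river, sleep, dim=0).item())
```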

What they discovered:

For smaller models, the performance on the challenging tasks outlined above remains approximately random. However, performance shoots up once a certain number of training FLOPs (a proxy for model size) is reached.

The figure below visualizes this effect on eight benchmarks. The critical number of training FLOPs is around 10²³. The big version of GPT-3 already lies to the right of this point, but we seem to be at the beginning stages of performance increases.

Break-Out Performance At Critical Scale

They observed similar improvements for (few-shot) prompting strategies, such as multi-step reasoning and instruction following. If you are interested, I also encourage you to check out Jason Wei’s personal blog, where he lists a total of 137 emergent abilities observed in LLMs.

Looking at the results, one could be forgiven for thinking: simply making models bigger will make them more powerful. That would only be half the story.

(Language) models are primarily scaled along three dimensions: the number of parameters, the amount of training compute, and the dataset size. Hence, emergent abilities are likely to also appear with, for example, bigger and/or cleaner datasets.

There is other research suggesting that current models, such as GPT-3, are undertrained [3]. Therefore, scaling datasets promises to boost performance in the near term, without adding more parameters.

So what does this mean exactly?

This beautiful paper shines a light on the fact that our understanding of how to train these large models is still very limited. That lack of understanding is largely due to the sheer cost of training LLMs. Running the same number of experiments as people do for smaller models would cost hundreds of millions of dollars.

However, the results strongly hint that further scaling will continue the exhilarating performance gains of the last few years.

Such exciting times to be alive!

Thank you for reading! It was a privilege to make this for you and I sincerely hope you found it useful.

At The Decoding ⭕, I send out a thoughtful newsletter about ML research and the data economy once a week.
No Spam. No Nonsense. Click here to sign up!

References:

[1] J. Wei, Y. Tay, R. Bommasani, C. Raffel, B. Zoph, S. Borgeaud, D. Yogatama, M. Bosma, D. Zhou, D. Metzler, E. Chi, T. Hashimoto, O. Vinyals, P. Liang, J. Dean, W. Fedus, Emergent Abilities of Large Language Models (2022), Transactions on Machine Learning Research

[2] A. Radford, K. Narasimhan, T. Salimans, I. Sutskever, Improving Language Understanding by Generative Pre-Training (2018), OpenAI Blog

[3] J. Hoffmann, S. Borgeaud, A. Mensch, E. Buchatskaya, T. Cai, E. Rutherford, D. de Las Casas, L. Hendricks, J. Welbl, A. Clark, T. Hennigan, E. Noland, K. Millican, G. van den Driessche, B. Damoc, A. Guy, S. Osindero, K. Simonyan, E. Elsen, J. Rae, O. Vinyals, L. Sifre, Training Compute-Optimal Large Language Models (2022), arXiv preprint arXiv:2203.15556
