Visualizing the deep learning revolution

Richard Ngo
Jan 5 · 14 min read


The field of AI has undergone a revolution over the last decade, driven by the success of deep learning techniques. This post aims to convey three ideas using a series of illustrative examples:

  1. There have been huge jumps in the capabilities of AIs over the last decade, to the point where it’s becoming hard to specify tasks that AIs can’t do.
  2. This progress has been primarily driven by scaling up a handful of relatively simple algorithms (rather than by developing a more principled or scientific understanding of deep learning).
  3. Very few people predicted that progress would be anywhere near this fast; but many of those who did also predicted that we might face existential risk from AGI in the coming decades.

I’ll focus on four domains: vision, games, language-based tasks, and science. The first two have more limited real-world applications, but provide particularly graphic and intuitive examples of the pace of progress.


Image recognition

Image recognition has been a focus of AI for many decades. Early research focused on simple domains like handwriting; performance has since improved dramatically, surpassing humans on many benchmark datasets. However, raw benchmark scores are hard to interpret intuitively, so we’ll focus on domains where progress can be visualized more easily.

Image generation

In 2014, AI image generation advanced significantly with the introduction of Generative Adversarial Networks (GANs). However, the first GANs could only generate very simple or blurry images, like the ones below.

Images with yellow borders are real; all others are GAN-generated.

Over the next 8 years, image generation progressed at a very rapid rate; the figure below shows images generated by state-of-the-art systems in each year. Over the last two years in particular, these systems made a lot of progress in generating complex creative scenes in response to language prompts.

This is an astounding rate of progress. What drove it? In part, it was the development of new algorithms — most notably GANs, transformers and diffusion models. However, the key underlying factor was scaling up the amount of compute and data used during training. One demonstration of this comes from the Parti series of image models, which includes four networks of different sizes (with parameter counts ranging from 350 million to 20 billion). Although they were all trained in the same way, for the three prompts below you can clearly see how much better the bigger models are than the smaller ones (e.g. by watching the ability to render text gradually emerge).

Prompt: A portrait photo of a kangaroo wearing an orange hoodie and blue sunglasses standing on the grass in front of the Sydney Opera House holding a sign on the chest that says Welcome Friends!
Prompt: A map of the United States made out of sushi. It is on a table next to a glass of red wine.
Prompt: A squirrel gives an apple to a bird

Video generation

An early example of video generation came from DeepMind’s 2019 DVD-GAN model. Although its videos have some realistic features, almost all of them are noticeably malformed.

More recently, researchers have focused on producing videos in response to text prompts. Here are four sets of examples from different systems released between April and October 2022, showcasing noticeable improvements in that period alone:

Video diffusion (April 2022)
CogVideo (May 2022)
Make-A-Video (September 2022)
Imagen Video (October 2022)


Video games

In 2013, the state of the art for using ML to learn to play video games was beating humans at a few very simple Atari games, like Pong and Breakout.

By 2018, this had progressed to beating professionals at the most complex esports, in particular StarCraft and DOTA2, in which players take thousands of consecutive actions over games lasting up to an hour.

DeepMind’s AlphaStar model, which beat a top professional at StarCraft
OpenAI Five, which beat a team of top professionals at DOTA2

And by 2022, capabilities had extended from narrow games to open-ended environments like Minecraft, where training on videos of human players allowed AIs to learn to take complex sequences of actions involving many different steps. Below you can see the many subtasks involved in making a diamond pickaxe, and a video of the AI carrying them out.

Strategy games

2016 and 2017 saw DeepMind’s AlphaGo system beat the world’s best players at Go, a board game far more complex than chess.

Notably, a later version of AlphaGo trained without any human data, only by playing against itself, reached superhuman level (indicated by the green line in the graph below) after less than three days of training, which involved rediscovering many aspects of Go theory and strategy that humans had developed over thousands of years.

However, chess and Go are still quite simple games — players can only interact in very limited ways. In 2022, AIs reached human-level performance at the board game Diplomacy (a strategy game similar to Risk), which requires players to form hidden alliances via complex language-based negotiations, and to deceive other players when betraying those alliances.

Language-based tasks

The level of language capability displayed in Diplomacy negotiations was enabled by the development over the last few years of large language models (LLMs) trained on billions of words of text. Despite only being trained to predict the next word, in order to do so very accurately LLMs develop a range of emergent skills, including broad knowledge about the world which can be accessed by giving them the right prompts. The main LLMs we’ll discuss in this section are OpenAI’s GPT-2, GPT-3 and ChatGPT models (the last of which you can try online here) and several variants of Google’s PaLM model.
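The next-word-prediction objective itself is strikingly simple. Here is a minimal sketch of the cross-entropy loss it minimizes, with hypothetical toy probability distributions standing in for a real model’s outputs:

```python
import math

def next_token_loss(predicted_dists, actual_next_tokens):
    """Average cross-entropy loss for next-word prediction.

    predicted_dists: one probability distribution over the vocabulary
    per position, giving the model's guess about the next token.
    actual_next_tokens: the tokens that actually came next in the text.
    """
    total = 0.0
    for dist, token in zip(predicted_dists, actual_next_tokens):
        total -= math.log(dist[token])  # low probability on the truth -> high loss
    return total / len(actual_next_tokens)

# Toy 3-word vocabulary: a confident correct guess earns a lower loss
# than a hesitant one, so training pushes the model toward accurate prediction.
confident = next_token_loss([[0.1, 0.1, 0.8]], [2])
hesitant = next_token_loss([[0.4, 0.3, 0.3]], [2])
assert confident < hesitant
```

Everything else described in this section emerges as a side effect of driving this single number down across billions of words of text.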

Text generation

While it’s hard to compare the results from different prompts, some samples from different models can give an intuitive sense for the rate of progress.

In 2011, state-of-the-art language models couldn’t even put together grammatical sentences:


The meaning of life is the tradition of the ancient human reproduction: it is less favorable to the good boy for when to remove her bigger. In the show’s agreement unanimously resurfaced. The wild pasteured with consistent street forests were incorporated by the 15th century BE. In 1996 the primary rapford undergoes an effort that the reserve conditioning, written into Jewish cities, sleepers to incorporate the .St Eurasia that activates the population. Mar??a Nationale, Kelli, Zedlat-Dukastoe, Florendon, Ptu’s thought is. To adapt in most parts of North America, the dynamic


A big jump in text generation came with the development of GPT-2 in 2019, which was consistently able to generate coherent sentences containing relevant knowledge about the world.


The 19-year-old singer was caught on camera being escorted out of the store by security guards.
The singer was wearing a black hoodie with the label ‘Blurred Lines’ on the front and ‘Fashion Police’ on the back
The singer was also wearing a pair of black-rimmed glasses, a black jacket, black jeans and black sandals.
She was carrying a pair of black and white striped gloves and a small black bag.

A further jump came with GPT-3 in 2020, which was trained in a very similar way to GPT-2, but with around 100 times more parameters.


After two days of intense debate, the United Methodist Church has agreed to a historic split — one that is expected to end in the creation of a new denomination, one that will be “theologically and socially conservative,” according to the Washington Post. The majority of delegates attending the church’s annual General Conference in May voted to strengthen a ban on the ordination of LGBTQ clergy and to write new rules that will “discipline” clergy who officiate at same-sex weddings. But those who opposed these measures have a new plan: They say they will form a separate denomination by 2020, calling their church the Christian Methodist denomination.
The Post notes that the denomination, which claims 12.5 million members, was in the early 20th century the “largest Protestant denomination in the US”, but that it has been shrinking in recent decades. The new split will be the second in the church’s history. The first occurred in 1968, when roughly 10% of the denomination left to form the Evangelical United Brethren Church. The Post notes that the proposed split “comes at a critical time for the church, which has been losing members for years,” which has been “pushed toward the brink of a schism over the role of LGBTQ people in the church.” Gay marriage is not the only issue that has divided the church. In 2016, the denomination was split over ordination of transgender clergy, with the North Pacific regional conference voting to ban them from serving as clergy, and the South Pacific regional conference voting to allow them.

The next model in the series was ChatGPT, released in 2022. ChatGPT significantly surpasses GPT-3 on a range of tasks; as one example, see its creative writing response to the prompt below:

Common-sense question-answering

Historically, one of the key bottlenecks in AI has been common-sense reasoning. Over the last few years, though, large language models have become capable of giving straightforwardly correct common-sense answers to a wide range of questions. (Note, however, that whether an LLM responds correctly to a question often depends sensitively on how the question is asked.)

Here are some explanations of jokes from Google’s 2022 PaLM model (with the top box containing examples given to the model, and then the boxes below containing the model’s own responses):

Here are a range of further capabilities showcased by PaLM:

The most impressive question-answering results come from measurements of the performance of GPT-3.5 (the model behind ChatGPT) and GPT-4 on a range of AP high school exams. GPT-3.5 already beats most humans on many exams; GPT-4 does much better on many of its weaker exams, especially quantitative exams — e.g. going from worse than almost all humans to around median on the AP Calculus exam. The jump is particularly striking given that these models were trained only a few months apart.

Planning and strategic thinking

LLMs have also developed the capability to describe long-term plans, and break them down into a series of tasks. For example, here’s ChatGPT:

While their ability to act on these plans is currently limited, a range of work aims to adapt LLMs to perform computer-based tasks. Here’s an example of the ACT-1 model carrying out a task using a computer interface:

And here’s a video of Google’s PaLM-SayCan robot generating and carrying out a multi-step plan in response to user instructions:


Science

While success at the language-based tasks described above might allow AIs to automate much white-collar work, in the long term the most impactful and concerning capabilities will likely involve AIs capable of automating the process of scientific and technological advancement (including progress in AI development itself). Over the last few years we’ve seen rapid progress towards AIs which are useful in scientific research.


Coding

An early step towards automating coding came with GPT-2, where the researchers who trained it noticed a surprising phenomenon: it was capable of autocompleting realistic (although usually buggy) code, because its training data had included a range of code from across the internet.

That was in 2019. GPT-3, released in 2020, displayed much more sophisticated coding capabilities, and the Codex language model (a descendant of GPT-3, released in 2021) was capable of following language instructions to write whole functions. Here’s an example of Codex writing code to implement a feature in a simple game, based on language instructions.

The full video can be found here.

Here’s an example of Codex writing code to plot a graph:

The full video can be found here.

DeepMind’s AlphaCode system was trained to solve competitive programming problems, and would have placed in the top 54% of humans who participated in the coding competitions it was tested on.

ChatGPT is also capable of answering high-level coding questions, including some at a comparable level to interview questions for hiring software engineers:


Mathematics

The last few years have also seen incredible progress on applying AI to mathematics. AI had already been used to help prove some mathematical theorems, but those systems required humans to manually specify which parts of the solution to search for. By contrast, modern language models are capable of answering free-form mathematical questions, including some questions used in math competitions.

Here are some examples of Google’s MINERVA model (a variant of PaLM) answering math questions:

Here’s an example of a model from OpenAI which is capable of writing both informal proofs and formal proofs in the Lean programming language:

Although none of the examples above show AIs doing novel research, DeepMind’s AlphaTensor model discovered a new algorithm for matrix multiplication which was faster than any designed by humans. While AlphaTensor was specifically developed for this purpose (as opposed to the more general systems discussed above), the result is notable because matrix multiplication is the key step in training neural networks.
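AlphaTensor’s discoveries concern larger matrices, but the classic precedent for what it was searching for is Strassen’s 1969 algorithm, which multiplies two 2×2 matrices using 7 scalar multiplications instead of the naive 8. A sketch for comparison:

```python
def naive_2x2(A, B):
    # The textbook method: 8 scalar multiplications.
    return [[A[0][0]*B[0][0] + A[0][1]*B[1][0], A[0][0]*B[0][1] + A[0][1]*B[1][1]],
            [A[1][0]*B[0][0] + A[1][1]*B[1][0], A[1][0]*B[0][1] + A[1][1]*B[1][1]]]

def strassen_2x2(A, B):
    # Strassen (1969): the same product with only 7 multiplications.
    (a, b), (c, d) = A
    (e, f), (g, h) = B
    p1 = a * (f - h)
    p2 = (a + b) * h
    p3 = (c + d) * e
    p4 = d * (g - e)
    p5 = (a + d) * (e + h)
    p6 = (b - d) * (g + h)
    p7 = (a - c) * (e + f)
    return [[p5 + p4 - p2 + p6, p1 + p2],
            [p3 + p4, p1 + p5 - p3 - p7]]

A, B = [[1, 2], [3, 4]], [[5, 6], [7, 8]]
assert strassen_2x2(A, B) == naive_2x2(A, B)
```

Applied recursively in blocks, saving that one multiplication is what pushes the asymptotic cost of matrix multiplication below O(n³); AlphaTensor found schemes that beat Strassen’s for certain matrix sizes and arithmetic settings.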

Life sciences

The biggest result thus far for AI in the life sciences has been the success of AlphaFold 2, DeepMind’s AI for predicting protein structures, which was described by experts in the field as having solved the protein folding problem (one of the core open problems in biology). While this claim may be somewhat overblown, it was nevertheless a huge leap forward compared with previous attempts:

Some examples of AlphaFold 2’s predictions, compared with the experimental findings:

A more worrying advance comes from a team of drug development researchers, who inverted their algorithm for generating non-toxic drug candidates to show how it could instead design chemical weapons. The chart below shows the predicted toxicity of compounds generated by their AI; many of them are predicted to be lethal at lower doses than VX (one of the deadliest nerve agents known).

Source: Dual use of artificial-intelligence-powered drug discovery (Urbina et al., 2022)

Looking forward

Just to highlight how absurd the rate of progress in AI has been, note that around half of the breakthroughs I’ve mentioned happened in 2022 alone! Overall we’re seeing AI become increasingly good at taking actions in complex environments, doing strategic thinking, and connecting that thinking to the real world. You might worry that showcasing individual examples is a haphazard way to track this progress, compared with measuring more quantitative metrics. However, it’s becoming difficult to design benchmark tasks which remain unsolved for more than a few years, even when they’re intended to track long-term progress. The chart below shows that, while it used to take decades to reach human level on benchmarks, many new benchmarks are being saturated in only a couple of years.

The pace of progress is surprising even to experts in the field. For example, consider the MATH dataset of challenging competition mathematics problems, released in 2021. Professional forecasters predicted that the best accuracy achieved by mid-2022 would be 12.7%; in fact, it was over 50%, well outside their confidence intervals. They also significantly underestimated progress on the 2020 Massive Multitask Language Understanding dataset, which covers elementary mathematics, US history, computer science, law, and more. These forecasters were not experts in ML, but predicted much faster progress than most experts did, and still dramatically underestimated progress on these two benchmarks (although on two other benchmarks their predictions were more accurate).

Source: AI Forecasting: One Year In (Steinhardt, 2022)

One reason that it’s hard to make accurate forecasts is that the compute used in training machine learning systems has been scaling up so dramatically. To get a sense for this, see the graph below showing the computational resources used to train over 100 cutting-edge machine learning systems. Note in particular that the y-axis is on a logarithmic scale — the most compute-intensive system pictured on the graph was trained using over 10 million times more compute than the biggest system from ten years earlier (around the start of the deep learning era).

Source: Compute Trends Across Three Eras of Machine Learning (Sevilla et al., 2022)
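To get a feel for what a 10-million-fold increase over a decade implies, here is an illustrative back-of-the-envelope calculation (the exact growth rates differ between the eras Sevilla et al. identify):

```python
import math

growth_factor = 10_000_000  # ~10^7x more training compute over the decade
years = 10

doublings = math.log2(growth_factor)          # about 23.3 doublings
months_per_doubling = years * 12 / doublings  # about 5.2 months

print(f"{doublings:.1f} doublings, one every {months_per_doubling:.1f} months")
```

For comparison, Moore’s law corresponds to a doubling roughly every two years, so most of this growth has come from spending ever more money and hardware on individual training runs rather than from faster chips alone.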

The people who have come closest to correctly forecasting how fast progress would be have typically done so by focusing on compute scaling, even when it led to absurd-sounding consequences. In the 1990s, Hans Moravec predicted human-level AI by the 2030s, based on forecasting compute scaling. In 2005, Ray Kurzweil used a similar method to predict a “profound and disruptive transformation in human capability” (which he called the “singularity”) by 2045. In 2010, before the deep learning revolution, DeepMind co-founder Shane Legg predicted human-level AI by 2028 using compute-based estimates. OpenAI co-founder Ilya Sutskever, whose AlexNet paper sparked the deep learning revolution, was also an early proponent of the idea that scaling up deep learning would be transformative. Whether or not the specifics of these predictions are right, they all seem much more plausible now than they did a decade ago, especially when compared to the mainstream consensus that progress in AI would be nowhere near this fast. More recently, in 2020 a team at OpenAI developed scaling laws which claimed that scaling up the amount of compute used in large language models would predictably improve their performance, a claim which has largely been borne out since then (although there have been recent corrections to the specific scaling laws they proposed).
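The scaling laws in question take a power-law form: loss falls as compute is raised to a small negative exponent. A sketch with illustrative constants (roughly the magnitudes reported by Kaplan et al., not their exact fitted values):

```python
def loss_from_compute(compute, c_scale=3.1e8, alpha=0.050):
    """Power-law scaling: L(C) = (C_c / C) ** alpha.

    compute: training compute (in the paper's units, PF-days).
    c_scale and alpha are constants fit to data; the values here
    are illustrative rather than authoritative.
    """
    return (c_scale / compute) ** alpha

# The signature of a power law: every 10x of compute cuts loss by the
# same fixed fraction, which is what makes improvement predictable.
r1 = loss_from_compute(1e2) / loss_from_compute(1e1)
r2 = loss_from_compute(1e4) / loss_from_compute(1e3)
assert abs(r1 - r2) < 1e-9
```

This predictability is why labs could budget enormous training runs with some confidence about the resulting performance, even as later work (such as the 2022 Chinchilla paper) revised how compute should be split between model size and training data.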

Another thing that many of these forecasters share is serious concern about the possibility that AGI won’t be aligned with human values, and that this might lead to catastrophic outcomes. Shane started and led the DeepMind technical AGI safety team; more recently, Ilya has been spending 50% of his time on alignment research; and the core team behind the scaling laws paper founded Anthropic, an alignment-focused research lab. The possibility that we might be in the run-up to AGI naturally focuses the mind on the risks of bringing such a powerful technology into the world.