What Happened to AI Learning Games From Pixels?

In 2015, DeepMind trained a neural network to play 49 different Atari games from pixels, and its performance was about as good as a human’s.

Since then, a lot has happened. AlphaGo in 2016, followed by AlphaZero in 2017, shocked the world — Go hadn’t been expected to fall to AIs for another ten years. In 2019, OpenAI Five and AlphaStar, working with direct information about the game state and a few other advantages, won games against pros in Dota 2 and Starcraft 2. In 2020, MuZero took AlphaZero’s approach and reworked it so that the agent learns its own world model rather than relying on an outside engine to tell it what future game states will look like — which meant it could now learn to play Atari from pixels. Its Atari performance was state of the art, which by 2020 meant far exceeding the average human benchmark and in fact breaking the human world record on many of the games. EfficientZero then modified MuZero to be more sample-efficient and, on a subset of 26 of the easier Atari games, beat the average human benchmark with only 2 hours of game time, the same amount of time the humans had to familiarise themselves with each game before setting the benchmark.

So… in light of all of this amazing progress in reinforcement learning (RL), why are Atari games still the biggest games that any agent has learned from pixels?

Part of the reason is that the Atari benchmark is just a really good benchmark. Since the 2015 agent, more games have been added, for a total of 57. There’s a wide variety of games, but all with the same basic constraints: the screens are 210x160 RGB images, typically preprocessed into 84x84 greyscale images before they’re given to the agent; there are at most 18 different possible actions available on any given frame; and every Atari game has the simple goal of getting a high score.
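
For concreteness, here’s a minimal sketch of that standard preprocessing step, assuming OpenCV and NumPy are available (details like cropping and frame-pooling vary between implementations):

```python
import cv2  # pip install opencv-python
import numpy as np

def preprocess(frame: np.ndarray) -> np.ndarray:
    """Turn a raw 210x160 RGB Atari frame into the 84x84 greyscale image
    that most Atari agents actually see."""
    grey = cv2.cvtColor(frame, cv2.COLOR_RGB2GRAY)                   # (210, 160)
    small = cv2.resize(grey, (84, 84), interpolation=cv2.INTER_AREA)
    return small.astype(np.uint8)                                    # (84, 84)
```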

An illustration of the original DQN from 2015. As you can see, in terms of possible actions you can take, there’s just one joystick and one button. Enumerating all of the different possible combinations of joystick directions and pressing the button or not pressing the button gives you 18 different actions.
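
That count is easy to verify: nine joystick positions (centred plus eight directions) times two button states gives 18 combinations.

```python
from itertools import product

joystick = ["CENTRE", "UP", "DOWN", "LEFT", "RIGHT",
            "UP-LEFT", "UP-RIGHT", "DOWN-LEFT", "DOWN-RIGHT"]
button = [False, True]

# Every (joystick position, button pressed?) pair is one Atari action.
actions = list(product(joystick, button))
print(len(actions))  # 18
```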

Because you’re running the same agent on 57 different games, you can’t build in tricks that help on one game without harming performance on the others, which makes it a very good test of how generally powerful your agent is.

And if you start to beat the benchmark too soundly, as has begun to happen — modern state-of-the-art agents simply never die on many of the games, racking up score for as long as they’re allowed to — you can add constraints that make it more difficult, as in the Atari 100k benchmark that EfficientZero attempted, where the agent is only allowed approximately 2 hours of playtime to learn from.
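
That “approximately 2 hours” comes from the usual accounting: 100,000 agent actions, each repeated for the standard frame-skip of 4 emulator frames, at Atari’s 60 frames per second.

```python
agent_steps = 100_000   # the Atari 100k interaction budget
frame_skip = 4          # each agent action is repeated for 4 emulator frames
fps = 60                # Atari 2600 frame rate

hours = agent_steps * frame_skip / fps / 3600
print(f"{hours:.2f} hours")  # ~1.85 hours of real-time play
```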

So the Atari benchmark is great. But that’s not the only reason RL has stayed there. The other reason is that Atari games are extremely easy compared to, say, the average game you’ll find on Steam.

Cult of the Lamb, a recent hit

The same approaches that work so well on Atari fall apart under a number of conditions:

- Bigger state space: if you need to take in large images in order to get the information necessary to perform well, you’re in trouble. Even MuZero downsized Atari frames to 96x96 RGB images. RL has had very little luck learning anything at a three-digit resolution.

- Bigger action space: if there are a lot of different actions available to you, again, you’re probably screwed. This is especially true for MuZero/EfficientZero, since they rely on MCTS (Monte Carlo Tree Search), which gets far more difficult as the number of possible actions rises. Atari has just one joystick and one button. RL agents can handle more actions than that, but not many more.

- Memory requirements: your standard Atari agent, like the original Deep Q-Network used in the 2015 paper, has no memory and simply takes in the past 4 frames of the game to recover information like velocity (there’s a code sketch of this below). That’s actually all you need for most Atari games, but if the Steam game you have your eye on requires remembering information that isn’t immediately visible on the screen, you’re in trouble, because we have no good way of doing this. The current best method, LSTMs, can remember things for about 10 seconds under ideal circumstances, at the cost of drastically slowing down training.

- Deceptive environments: Atari games typically have a very forgiving reward structure, where just moving randomly leads you to do a few good things like shooting bad guys, which gets you reward. When the environment is even a tiny bit less obvious about telling you what you should be doing, RL agents have a very hard time. On Atari Pitfall, where moving randomly does worse than staying still because hitting an obstacle costs you points, MuZero played for 20 billion frames — over 10 years of game time — and all it learned was to stay perfectly still for a score of 0. Meanwhile, the human who played for 2 hours scored 6,464.

Atari Pitfall, Slayer of MuZero
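
To make the memory point concrete, here is roughly what the standard “no memory, just stack the last 4 frames” setup looks like. This is a minimal sketch, not any particular paper’s implementation:

```python
from collections import deque
import numpy as np

class FrameStack:
    """Keeps the last 4 preprocessed frames; this stack is the agent's entire
    'memory'. Anything that left the screen more than 4 frames ago is gone."""

    def __init__(self, k: int = 4):
        self.k = k
        self.frames = deque(maxlen=k)

    def reset(self, first_frame: np.ndarray) -> np.ndarray:
        self.frames.clear()
        for _ in range(self.k):
            self.frames.append(first_frame)
        return self.observation()

    def step(self, new_frame: np.ndarray) -> np.ndarray:
        self.frames.append(new_frame)
        return self.observation()

    def observation(self) -> np.ndarray:
        # Shape (4, 84, 84): enough to infer velocity, but nothing older.
        return np.stack(self.frames, axis=0)
```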

So in order for a modern RL agent to be successful, we’re hoping for a small state space, a small action space, all of the necessary information visible on the screen at all times, and some positive reward for moving completely randomly. Think of a game you like on Steam — how many of those conditions does it satisfy? If even one of them isn’t met, you’re already in trouble: Atari Pitfall satisfied the first three of the four, but because the reward was deceptive, MuZero and most other Atari agents flunk it.

Now, there are agents specifically designed to perform well on environments like Atari Pitfall, and they can get good results there. If your game fails just one condition, you’ve still got a shot. But if more than one of these conditions isn’t met, then chances are that even an all-out attempt by a major AI research organization will fail to produce human-level results.

As an example of this, let’s look at the most important project in reinforcement learning in years: VPT.

VPT: Bringing Scale To Minecraft

While RL has been languishing in Atari, LLMs (large language models) like GPT-3 have been making incredible progress off the back of training transformer models on massive amounts of text. In light of this, it’s natural to ask — if RL took a page from GPT’s book, what would happen?

Video PreTraining (VPT): Learning to Act by Watching Unlabeled Online Videos is a recent paper by OpenAI that aims to find out. It’s an ambitious project that uses big data and scale to take on an extremely difficult RL task, Minecraft — specifically, crafting a diamond pickaxe.

The path to crafting a diamond pickaxe in Minecraft

Minecraft is a good candidate for a big data approach because there is a massive amount of Minecraft gameplay available on YouTube. Unfortunately, you can’t tell what keys the players are pressing in those videos. So first, they pay some people to play Minecraft while recording their keystrokes, producing labelled data where they know exactly which actions were taken and when. They then use that labelled data to train an inverse dynamics model (IDM) that can fill in what actions were taken, given the gameplay footage. Since this is a supervised learning problem where the IDM can look at both past and future frames to determine what action was taken at a given moment, it is highly accurate. The IDM is then used to label the YouTube videos. Through this, they acquire a large dataset of humans playing Minecraft that their agent can learn from.
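
In outline, the data pipeline has three stages. The sketch below uses hypothetical function names (none of them are from the paper’s code); the key structural detail is that the IDM gets to be non-causal while the final policy must be causal.

```python
def train_idm(contractor_frames, contractor_actions):
    """Stage 1: supervised learning on the paid, keystroke-labelled data.
    The IDM predicts the action at time t from frames both before AND
    after t, which is why it can be so accurate."""
    ...

def pseudo_label(idm, youtube_videos):
    """Stage 2: run the trained IDM over unlabelled YouTube gameplay,
    turning it into (frames, inferred actions) training data."""
    return [(video, idm(video)) for video in youtube_videos]

def train_foundation_model(labelled_youtube_data):
    """Stage 3: behavioural cloning, GPT-style: predict the next action
    from PAST frames only, so the result can be run as a policy."""
    ...
```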

Next, they train a transformer model, which they call the VPT foundation model, to predict the next move a player would make from a given Minecraft state, the same way that GPT would predict the next word in an essay. This model takes in the past 128 frames while playing the game at 20 FPS, meaning it uses about the last 6.4 seconds of gameplay to predict the next action. They then fine-tune the foundation model, first on videos specifically focused on the early game, and then with reinforcement learning, having it play Minecraft itself and giving it a reward for each step it takes along the path to a diamond pickaxe.
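
To give a feel for what “predict the next action from the last 128 frames” means architecturally, here is a minimal, hypothetical sketch in PyTorch. The real VPT model is far larger and uses a different image backbone, so treat this purely as an illustration of the shape of the problem:

```python
import torch
import torch.nn as nn

class VideoPolicy(nn.Module):
    """Toy VPT-style policy: embed each frame, run a causal transformer over
    the last `context` frame embeddings, and predict the next action."""

    def __init__(self, n_actions: int, d_model: int = 512, context: int = 128):
        super().__init__()
        # Per-frame encoder: 128x128 RGB frame -> d_model vector.
        self.frame_encoder = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=8, stride=4), nn.ReLU(),   # 128 -> 31
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),  # 31 -> 14
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),  # 14 -> 12
            nn.Flatten(),
            nn.Linear(64 * 12 * 12, d_model),
        )
        self.pos_emb = nn.Parameter(torch.zeros(1, context, d_model))
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=4)
        self.action_head = nn.Linear(d_model, n_actions)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, 128, 3, 128, 128), i.e. the last 6.4 s at 20 FPS.
        b, t = frames.shape[:2]
        x = self.frame_encoder(frames.flatten(0, 1)).view(b, t, -1)
        x = x + self.pos_emb[:, :t]
        # Causal mask: each timestep may only attend to earlier frames.
        mask = torch.triu(
            torch.full((t, t), float("-inf"), device=frames.device), diagonal=1
        )
        x = self.transformer(x, mask=mask)
        return self.action_head(x[:, -1])  # logits for the next action
```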

This was a huge project. Training the VPT Foundation Model “took ~9 days on 720 V100 GPUs”. After that: “Experiments ran for approximately 6 days (144 hours) on 80 GPUs (for policy optimization) and 56,719 CPUs (mostly for collecting rollouts from Minecraft).”

There was also a step up in the number of pixels. Gone are the days of neural networks requiring tiny 84x84 or 96x96 images: the Minecraft frames were downsampled to 128x128. We’ve finally broken into triple digits!

How did it do? The big result, and a breakthrough for RL on Minecraft, is that it can get a diamond pickaxe 2.5% of the time. When human players were asked to get a diamond pickaxe, they succeeded 12% of the time, so this is a subhuman result. And to be precise, it’s not “subhuman performance in Minecraft” — it’s subhuman performance at making a diamond pickaxe in Minecraft, a much smaller and more specific task.

This isn’t meant to knock the paper — as I said before, I think this is the most important paper in RL in years. But its importance is as a null result, showing that the breakthroughs in language modelling do not yet show any sign of producing huge breakthroughs in RL. It also shows how difficult reinforcement learning is in bigger, though still objectively tiny, environments. Previously, that difficulty went unstated because the big players didn’t even attempt such environments; if they tackled a larger environment at all, they gave their agent direct information about the game state and various other advantages over humans in order to put up eye-catching headlines. Now the difficulty is out in the open for everyone to see.

What Does This Mean?

There are various problems in RL that we are nowhere close to solving. World modelling is one of them — the current state-of-the-art approach to modelling and planning in Atari, the learned model plus MCTS used by MuZero/EfficientZero, is a complete nonstarter for larger environments. Memory is another — the state-of-the-art method, the LSTM, again seems to be a nonstarter. Exploration is also an area with many good ideas, none of which seem to work well. Agents that do break through into higher-dimensional environments tend to sidestep all of these problems: they have no world model, no memory beyond taking in the past X frames or turns of gameplay, and exploration is handled either randomly or by first learning from human play.

And it’s because they have to sidestep all of these problems that breakthroughs are so rare and so restricted. Did you know that after AlphaZero crushed Go in 2017, it took five years before DeepMind cracked a larger board game, Stratego? They had to ditch the tree search used by AlphaZero/MuZero in order to achieve it, and their agent, DeepNash, still didn’t achieve superhuman performance (it reached rank 3 in the world).

So the main thing to take from this is that reinforcement learning is very hard and progressing slowly, with multiple unsolved problems and likely many more lurking down the road. This is what Yann LeCun is talking about in his thread on AGI (his first point is just semantics, but the rest of his points take the problem of achieving human-level AI seriously). And attempts to leverage what we’ve learned from LLMs like GPT-3 don’t seem to lead to huge breakthroughs — VPT had video of people successfully crafting diamond pickaxes in its training set and it still couldn’t learn to do the task reliably.

Predicting the future is hard, but an AI that can interact with the real world at fruit-fly level seems to be quite a ways off.

Nothin personnel, VPT
