MuZero: The Walkthrough (Part 1/3)

Teaching a machine to play games using self-play and deep learning…without telling it the rules 🤯

David Foster
Applied Data Science
7 min read · Dec 2, 2019


If you want to learn how one of the most sophisticated AI systems ever built works, you’ve come to the right place.

In this three-part series, we’ll explore the inner workings of the DeepMind MuZero model — the younger (and even more impressive) brother of AlphaZero.

👉 Part 2

👉 Part 3

Also check out my latest post, about how to train reinforcement learning agents for multi-player board games using self-play!

👉 Self-Play in Multiplayer Environments

We’ll be walking through the pseudocode that accompanies the MuZero paper — so grab yourself a cup of tea and a comfy chair and let’s begin.

The story so far…

On 19th November 2019 DeepMind released their latest model-based reinforcement learning algorithm to the world — MuZero.

This is the fourth in a line of DeepMind reinforcement learning papers that have continually smashed through the barriers of possibility, starting with AlphaGo in 2016.

To read about the full history from AlphaGo through to AlphaZero — check out my previous blog 👇

AlphaZero was hailed as the general algorithm for getting good at something, quickly, without any prior knowledge of human expert strategy.

So…what now?

MuZero

Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model

MuZero takes the ultimate next step. Not only does MuZero deny itself human strategy to learn from; it isn’t even shown the rules of the game.

In other words, for chess, AlphaZero is set the following challenge:

Learn how to play this game on your own — here’s the rulebook that explains how each piece moves and which moves are legal. It also tells you how to recognise when a position is checkmate (or a draw).

MuZero on the other hand, is set this challenge:

Learn how to play this game on your own — I’ll tell you what moves are legal in the current position and when one side has won (or it’s a draw), but I won’t tell you the overall rules of the game.

Alongside developing winning strategies, MuZero must therefore also develop its own dynamics model of the environment, so that it can understand the implications of its choices and plan ahead.

Imagine trying to become better than the world champion at a game where you are never told the rules. MuZero achieves precisely this.

In the next section we will explore how MuZero achieves this amazing feat, by walking through the codebase in detail.

The MuZero pseudocode

Alongside the MuZero preprint paper, DeepMind have released Python pseudocode detailing the interactions between each part of the algorithm.

In this section, we’ll pick apart each function and class in a logical order, and I’ll explain what each part is doing and why. We’ll assume MuZero is learning to play chess, but the process is the same for any game, just with different parameters. All code is from the open-sourced DeepMind pseudocode.

Let’s start with an overview of the entire process, starting with the entrypoint function, muzero.

Overview of the MuZero self-play and training process

The entrypoint function muzero is passed a MuZeroConfig object, which stores important information about the parameterisation of the run, such as the action_space_size (number of possible actions) and num_actors (the number of parallel game simulations to spin up). We’ll go through these parameters in more detail as we encounter them in other functions.
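
For reference, the chess run is configured in the pseudocode by building a MuZeroConfig from a generic board-game configuration. Condensed to the parameters that matter for now (num_actors is set to 3000 inside make_board_game_config itself):

```python
# Condensed from the pseudocode's make_chess_config / make_board_game_config.
# Only the parameters discussed so far are shown.
def make_chess_config() -> MuZeroConfig:
    return make_board_game_config(
        action_space_size=4672,   # number of possible chess moves
        max_moves=512,            # cap on self-play game length
        dirichlet_alpha=0.3,      # root exploration noise
        lr_init=0.1)              # initial learning rate
```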

At a high level, there are two independent parts to the MuZero algorithm — self-play (creating game data) and training (producing improved versions of the neural network). The SharedStorage and ReplayBuffer objects can be accessed by both halves of the algorithm and store neural network versions and game data respectively.
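
The entrypoint itself is short. Lightly condensed from the pseudocode, it looks like this (launch_job simply runs its argument as a parallel job; train_network is covered in Part 3):

```python
def muzero(config: MuZeroConfig):
    storage = SharedStorage()
    replay_buffer = ReplayBuffer(config)

    # Self-play half: launch num_actors jobs that generate game data.
    for _ in range(config.num_actors):
        launch_job(run_selfplay, config, storage, replay_buffer)

    # Training half: produce improved network versions from buffered games.
    train_network(config, storage, replay_buffer)

    return storage.latest_network()
```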

Shared Storage and the Replay Buffer

The SharedStorage object contains methods for saving a version of the neural network and retrieving the latest neural network from the store.
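
Lightly condensed from the pseudocode:

```python
class SharedStorage(object):

    def __init__(self):
        self._networks = {}  # training step -> network version

    def latest_network(self) -> Network:
        if self._networks:
            return self._networks[max(self._networks.keys())]
        else:
            # Before any training: uniform policy, zero value and reward.
            return make_uniform_network()

    def save_network(self, step: int, network: Network):
        self._networks[step] = network
```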

We also need a ReplayBuffer to store data from previous games. This takes the following form:
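
(The version below is condensed from the pseudocode; the sampling methods used during training are left for Part 3.)

```python
class ReplayBuffer(object):

    def __init__(self, config: MuZeroConfig):
        self.window_size = config.window_size  # max number of stored games
        self.batch_size = config.batch_size
        self.buffer = []

    def save_game(self, game):
        # Once the buffer is full, the oldest game is dropped.
        if len(self.buffer) > self.window_size:
            self.buffer.pop(0)
        self.buffer.append(game)

    # sample_batch / sample_game / sample_position feed the training loop
    # and are covered in Part 3.
```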

Notice how the window_size parameter limits the maximum number of games stored in the buffer. In MuZero, this is set to the latest 1,000,000 games.

Self-play (run_selfplay)

After creating the shared storage and replay buffer, MuZero launches num_actors parallel game environments that run independently. For chess, num_actors is set to 3000. Each runs a run_selfplay function that grabs the latest version of the network from the store, plays a game with it (play_game) and saves the game data to the shared buffer.
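
The run_selfplay function itself is tiny (lightly reformatted from the pseudocode):

```python
def run_selfplay(config: MuZeroConfig, storage: SharedStorage,
                 replay_buffer: ReplayBuffer):
    while True:
        # Always play with the most recent network version...
        network = storage.latest_network()
        game = play_game(config, network)
        # ...and make the finished game available to the training job.
        replay_buffer.save_game(game)
```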

So in summary, MuZero is playing thousands of games against itself, saving these to a buffer and then training itself on data from those games. So far, this is no different to AlphaZero.

To end Part 1, we will cover one of the key differences between AlphaZero and MuZero — why does MuZero have three neural networks, whereas AlphaZero only has one?

The 3 Neural Networks of MuZero

Both AlphaZero and MuZero utilise a technique known as Monte Carlo Tree Search (MCTS) to select the next best move.

The idea is that in order to select the next best move, it makes sense to ‘play out’ likely future scenarios from the current position, evaluate their value using a neural network and choose the action that maximises the future expected value. This seems to be what we humans are doing in our head when playing chess, and the AI is also designed to make use of this technique.

However, MuZero has a problem. As it doesn’t know the rules of the game, it has no idea how a given action will affect the game state, so it cannot imagine future scenarios in the MCTS. It doesn’t even know how to work out what moves are legal from a given position, or whether one side has won.

The stunning development in the MuZero paper is to show that this doesn’t matter. MuZero learns how to play the game by creating a dynamics model of the environment within its own imagination and optimising within this model.

The diagram below shows a comparison between the MCTS processes in AlphaZero and MuZero:

Whereas AlphaZero has only one neural network (prediction), MuZero needs three (prediction, dynamics, representation)

The job of the AlphaZero prediction neural network f is to predict the policy p and value v of a given game state. The policy is a probability distribution over all moves and the value is just a single number that estimates the future rewards. This prediction is made every time the MCTS hits an unexplored leaf node, so that it can immediately assign an estimated value to the new position and also assign a probability to each subsequent action. The values are backfilled up the tree, back to the root node, so that after many simulations, the root node has a good idea of the future value of the current state, having explored lots of different possible futures.

MuZero also has a prediction neural network f, but now the ‘game state’ that it operates on is a hidden representation that MuZero learns how to evolve through a dynamics neural network g. The dynamics network takes the current hidden state s and a chosen action a and outputs a reward r and a new hidden state. Notice how in AlphaZero, moving between states in the MCTS tree is simply a case of asking the environment. MuZero doesn’t have this luxury, so it needs to build its own dynamics model!

Lastly, in order to map from the current observed game state to the initial representation, MuZero uses a third representation neural network, h.

There are therefore two inference functions MuZero needs, in order to move through the MCTS tree making predictions:

  • initial_inference for the current state: h followed by f (representation followed by prediction).
  • recurrent_inference for moving between states inside the MCTS tree: g followed by f (dynamics followed by prediction). A sketch of how these compose follows the diagram below.
The two types of inference in MuZero
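
In the pseudocode the Network class is just a stub (the real model architectures live in the paper), but it helps to see how the two inference calls compose the three networks. The sketch below is mine, not the pseudocode: the representation, dynamics and prediction methods stand in for h, g and f, and Action is the pseudocode’s action class.

```python
import typing
from typing import Dict, List


class NetworkOutput(typing.NamedTuple):
    value: float                        # predicted future value v
    reward: float                       # predicted immediate reward r
    policy_logits: Dict[Action, float]  # predicted policy p
    hidden_state: List[float]           # the hidden state s


class Network(object):

    def initial_inference(self, image) -> NetworkOutput:
        # h followed by f: encode the real observation into a hidden state,
        # then predict policy and value. No reward at the root.
        hidden_state = self.representation(image)              # h
        policy_logits, value = self.prediction(hidden_state)   # f
        return NetworkOutput(value, 0, policy_logits, hidden_state)

    def recurrent_inference(self, hidden_state, action) -> NetworkOutput:
        # g followed by f: imagine taking `action` from the current hidden
        # state, then predict policy and value for the imagined next state.
        reward, next_hidden_state = self.dynamics(hidden_state, action)  # g
        policy_logits, value = self.prediction(next_hidden_state)        # f
        return NetworkOutput(value, reward, policy_logits, next_hidden_state)
```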

The exact models aren’t provided in the pseudocode, but detailed descriptions are given in the accompanying paper.

In summary, in the absence of the actual rules of chess, MuZero creates a new game inside its mind that it can control and uses this to plan into the future. The three networks (prediction, dynamics and representation) are optimised together so that strategies that perform well inside the imagined environment, also perform well in the real environment.

Amazing stuff.

This is the end of Part 1 — in Part 2, we’ll start by walking through the play_game function and see how MuZero makes a decision about the next best move at each turn.

Please clap if you’ve enjoyed this post and I’ll see you in Part 2!

This is the blog of Applied Data Science Partners, a consultancy that develops innovative data science solutions for businesses. To learn more, feel free to get in touch through our website.

Author of the Generative Deep Learning book :: Founding Partner of Applied Data Science Partners