How To Build Your Own MuZero AI Using Python (Part 1/3)

Teaching a machine to play games using self-play and deep learning…without telling it the rules 🤯

David Foster
Dec 2 · 7 min read

If you want to learn how one of the most sophisticated AI systems ever built works, you’ve come to the right place.

In this three-part series, we’ll explore the inner workings of the DeepMind MuZero model — the younger (and even more impressive) brother of AlphaZero.

👉 Part 2

👉 Part 3

We’ll be walking through the pseudocode that accompanies the MuZero paper — so grab yourself a cup of tea and a comfy chair and let’s begin.

The story so far…

This is the fourth in a line of DeepMind reinforcement learning papers that have continually smashed through the barriers of possibility, starting with AlphaGo in 2016.

To read about the full history from AlphaGo through to AlphaZero — check out my previous blog 👇

AlphaZero was hailed as the general algorithm for getting good at something, quickly, without any prior knowledge of human expert strategy.

So…what now?

MuZero

MuZero takes the ultimate next step. Not only does MuZero deny itself human strategies to learn from, it isn’t even shown the rules of the game.

In other words, for chess, AlphaZero is set the following challenge:

Learn how to play this game on your own — here’s the rulebook that explains how each piece moves and which moves are legal. It also tells you how to recognise when a position is checkmate (or a draw).

MuZero, on the other hand, is set this challenge:

Learn how to play this game on your own — I’ll tell you what moves are legal in the current position and when one side has won (or it’s a draw), but I won’t tell you the overall rules of the game.

Alongside developing winning strategies, MuZero must therefore also develop its own dynamics model of the environment so that it can understand the implications of its choices and plan ahead.

Imagine trying to become better than the world champion at a game where you are never told the rules. MuZero achieves precisely this.

In the next section, we will explore how MuZero achieves this amazing feat by walking through the codebase in detail.

The MuZero pseudocode

In this section, we’ll pick apart each function and class in a logical order, and I’ll explain what each part is doing and why. We’ll assume MuZero is learning to play chess, but the process is the same for any game, just with different parameters. All code is from the open-sourced DeepMind pseudocode.

Let’s start with an overview of the entire process, starting with the entrypoint function, muzero.

Overview of the MuZero self-play and training process

The entrypoint function muzero is passed a MuZeroConfig object, which stores important information about the parameterisation of the run, such as the action_space_size (number of possible actions) and num_actors (the number of parallel game simulations to spin up). We’ll go through these parameters in more detail as we encounter them in other functions.
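To make this concrete, here is a heavily trimmed sketch of MuZeroConfig, keeping only the parameters we have met so far (the real pseudocode carries many more, covering MCTS and training). The values shown are illustrative, roughly in line with the board-game configuration in the pseudocode:

```python
class MuZeroConfig(object):
    """Trimmed-down sketch of the run configuration."""

    def __init__(self, action_space_size: int, num_actors: int):
        ### Self-play
        self.action_space_size = action_space_size  # e.g. 4672 possible moves in chess
        self.num_actors = num_actors                # parallel self-play jobs

        ### Replay buffer
        self.window_size = int(1e6)  # max number of games kept in the buffer
        self.batch_size = 2048       # positions sampled per training step


# Illustrative values for chess
config = MuZeroConfig(action_space_size=4672, num_actors=3000)
```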

At a high level, there are two independent parts to the MuZero algorithm — self-play (creating game data) and training (producing improved versions of the neural network). The SharedStorage and ReplayBuffer objects can be accessed by both halves of the algorithm and store neural network versions and game data respectively.
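The entrypoint itself is only a few lines. Here is a lightly condensed sketch, close to the pseudocode (launch_job simply invokes the function it is passed):

```python
def muzero(config: MuZeroConfig):
    storage = SharedStorage()
    replay_buffer = ReplayBuffer(config)

    # Spin up num_actors independent self-play jobs, all writing to the buffer
    for _ in range(config.num_actors):
        launch_job(run_selfplay, config, storage, replay_buffer)

    # Meanwhile, train the network on data sampled from the buffer
    train_network(config, storage, replay_buffer)

    return storage.latest_network()
```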

Shared Storage and the Replay Buffer
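The SharedStorage object is essentially a dictionary of network checkpoints keyed by training step: self-play always reads the most recent checkpoint and training writes new ones. A sketch along the lines of the pseudocode:

```python
class SharedStorage(object):

    def __init__(self):
        self._networks = {}  # training step -> network checkpoint

    def latest_network(self):
        # Return the most recently saved Network
        if self._networks:
            return self._networks[max(self._networks.keys())]
        # Before any training has happened: uniform policy, zero value and reward
        return make_uniform_network()

    def save_network(self, step: int, network):
        self._networks[step] = network
```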

We also need a ReplayBuffer to store data from previous games. This takes the following form:
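(A condensed sketch that keeps only the storage logic; the full pseudocode also contains the methods for sampling training batches.)

```python
class ReplayBuffer(object):

    def __init__(self, config: MuZeroConfig):
        self.window_size = config.window_size  # max number of games retained
        self.batch_size = config.batch_size
        self.buffer = []

    def save_game(self, game):
        # Once full, the oldest game is evicted to make room for the new one
        if len(self.buffer) > self.window_size:
            self.buffer.pop(0)
        self.buffer.append(game)
```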

Notice how the window_size parameter limits the maximum number of games stored in the buffer. In MuZero, this is set to the latest 1,000,000 games.

Self-play (run_selfplay)
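Each actor runs the same simple loop forever: fetch the latest network from SharedStorage, play one full game with it, and save that game to the ReplayBuffer. A sketch, close to the pseudocode:

```python
def run_selfplay(config: MuZeroConfig, storage: SharedStorage,
                 replay_buffer: ReplayBuffer):
    while True:
        network = storage.latest_network()  # most recent checkpoint
        game = play_game(config, network)   # one complete game of self-play (Part 2)
        replay_buffer.save_game(game)
```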

So in summary, MuZero is playing thousands of games against itself, saving these to a buffer and then training itself on data from those games. So far, this is no different to AlphaZero.

To end Part 1, we will cover one of the key differences between AlphaZero and MuZero — why does MuZero have three neural networks, whereas AlphaZero only has one?

The 3 Neural Networks of MuZero

The idea is that in order to select the next best move, it makes sense to ‘play out’ likely future scenarios from the current position, evaluate their value using a neural network and choose the action that maximises the future expected value. This seems to be what we humans are doing in our heads when playing chess, and MuZero is designed to make use of the same technique.

However, MuZero has a problem. As it doesn’t know the rules of the game, it has no idea how a given action will affect the game state, so it cannot imagine future scenarios in the Monte Carlo Tree Search (MCTS). It doesn’t even know how to work out what moves are legal from a given position, or whether one side has won.

The stunning development in the MuZero paper is to show that this doesn’t matter. MuZero learns how to play the game by creating a dynamics model of the environment within its own imagination and optimising within this model.

The diagram below shows a comparison between the MCTS processes in AlphaZero and MuZero:

Whereas AlphaZero has only one neural network (prediction), MuZero needs three (prediction, dynamics, representation)

The job of the AlphaZero prediction neural network f is to predict the policy p and value v of a given game state. The policy is a probability distribution over all moves and the value is just a single number that estimates the future rewards. This prediction is made every time the MCTS hits an unexplored leaf node, so that it can immediately assign an estimated value to the new position and also assign a probability to each subsequent action. The values are backfilled up the tree, back to the root node, so that after many simulations, the root node has a good idea of the future value of the current state, having explored lots of different possible futures.

MuZero also has a prediction neural network f, but now the ‘game state’ that it operates on is a hidden representation that MuZero learns how to evolve through a dynamics neural network g. The dynamics network takes the current hidden state s and chosen action a and outputs a reward r and a new hidden state. Notice how in AlphaZero, moving between states in the MCTS tree is simply a case of asking the environment. MuZero doesn’t have this luxury, so it needs to build its own dynamics model!

Lastly, in order to map from the current observed game state to the initial representation, MuZero uses a third representation neural network, h.

There are therefore two inference functions MuZero needs, in order to move through the MCTS tree making predictions:

  • initial_inference for the current state: h followed by f (representation followed by prediction).
  • recurrent_inference for moving between states inside the MCTS tree: g followed by f (dynamics followed by prediction).
The two types of inference in MuZero

The exact models aren’t provided in the pseudocode, but detailed descriptions are given in the accompanying paper.
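What the pseudocode does provide is the interface that the three networks sit behind. Here is a lightly simplified sketch (the real pseudocode keys the policy logits by an Action class rather than plain integers), with placeholder outputs standing in for the actual models:

```python
import typing
from typing import Dict, List


class NetworkOutput(typing.NamedTuple):
    value: float                     # estimated value of the (hidden) state
    reward: float                    # predicted immediate reward
    policy_logits: Dict[int, float]  # one logit per possible action
    hidden_state: List[float]        # MuZero's internal representation of the game


class Network(object):

    def initial_inference(self, image) -> NetworkOutput:
        # representation h, followed by prediction f
        return NetworkOutput(0, 0, {}, [])

    def recurrent_inference(self, hidden_state, action) -> NetworkOutput:
        # dynamics g, followed by prediction f
        return NetworkOutput(0, 0, {}, [])
```

During the MCTS, initial_inference is called once at the root node (from the real observation) and recurrent_inference at every step deeper into the tree (from the previous hidden state and a candidate action).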

In summary, in the absence of the actual rules of chess, MuZero creates a new game inside its mind that it can control and uses this to plan into the future. The three networks (prediction, dynamics and representation) are optimised together so that strategies that perform well inside the imagined environment also perform well in the real environment.

Amazing stuff.


This is the end of Part 1 — in Part 2, we’ll start by walking through the play_game function and see how MuZero makes a decision about the next best move at each turn.

Please clap if you’ve enjoyed this post and I’ll see you in Part 2!


This is the blog of Applied Data Science Partners, a consultancy that develops innovative data science solutions for businesses. To learn more, feel free to get in touch through our website.
