Lessons From Implementing AlphaZero

DeepMind’s AlphaZero publication was a landmark in reinforcement learning (RL) for board game play. The algorithm achieved superhuman performance in chess, shogi, and go, each with under 24 hours of self-play, using almost no specialized or hardcoded knowledge of the games other than the rules.

We thought it would be instructive to replicate and extend their results in novel ways, discovering in the process how various choices affect the performance of the algorithm. Machine learning as a whole is only slowly being put on firm theoretical foundations, and so there’s often much to gain by exploring new directions.

This article is the first in a series where we share insights gleaned during our exploration.


For this first article we’ll just give an overview of the unmodified AlphaZero algorithm. Instead of making you read the paper yourself, we’ll try to lay it out in a way comprehensible to a relative novice in RL, linking to external resources for more detail.

At a high level, it uses a modified Monte-Carlo Tree Search (MCTS) backed by a deep residual neural network (or “ResNet”). AlphaZero plays against itself, with each side choosing moves selected by MCTS. The results of these self-play games are used to continually improve the ResNet. The self-play and ResNet training occur in parallel, each improving the other.

Let’s unpack that, starting with the ResNet.


The Neural Network

ResNet architectures are a popular way to train deep networks, often for image recognition. For AlphaZero’s network, the input is the board state, and there are two outputs:

  1. The estimated value (v) of the position, ranging from 1 (win) to -1 (loss).
  2. A vector of prior probabilities (p) for playing each next possible action.

The network is initialized randomly. After some self-play has occurred, training data consists of randomly chosen board positions from played games, in the form of (board state, game result, child visit counts from MCTS).

MCTS

This video does an excellent walk-through of standard MCTS. To choose a move given a board position, we perform 800 iterations of the following algorithm. We build a tree, initially having just one node (representing the current board state).

  1. Start at the root node (current board state). Walk to the child that gives the best tradeoff between exploiting current information and exploring new moves (formalized by the UCB1 formula). Recurse ’til you hit a leaf node.
  2. If this is the first visit to that node, perform a rollout: randomly simulate moves until end of game, and then use that game result to update the values of all nodes from the leaf back to the root.
  3. If it is the second visit, expand it (i.e., create its children), and visit+rollout one of them.
  4. There is no third visit, because it is no longer a leaf node.

In AlphaZero, rollouts are replaced by fetching predictions from the NN, and UCB1 is replaced by PUCT (polynomial upper confidence tree). The algorithm looks like this:

  1. Start at the root. Walk to the child with best PUCT score. Recurse ’til you hit a leaf node.
  2. On the first visit, invoke the NN to fetch (1) the estimated game score (or value, v) and (2) the suggested probabilities (p) for visiting each child, to be used in PUCT. Create the children but do not visit them.
  3. There is no second visit.

Tying it Together

After 800 iterations of MCTS, a move is chosen by selecting the child that was visited most. Then the other side plays in the same way. This continues until the game is over. At that point, each board position from the game is annotated with the game score and the child visit counts. These samples get added to the training set for the ResNet. As the ResNet trains, it gets used in subsequent MCTS self-play.

This back-and-forth between MCTS self-play and ResNet training is very powerful. The ResNet helps to generalize the learnings from MCTS, and the self-play uses that generalized knowledge to learn the game more deeply. As we saw from the paper, this allows AlphaZero to outperform basically all existing algorithms — the best of which have been painstakingly hand-tuned with the help of masters of their respective games.


That’s a very rough overview of AlphaZero training. We’ve left out a plethora of details, which we will explore in upcoming episodes.

Stay tuned!

Parts 2, 3, 4, 5, and 6 are now out.

Like what you read? Give Aditya Prasad a round of applause.

From a quick cheer to a standing ovation, clap to show how much you enjoyed this story.