AlphaZero for dummies!

michelangelo
6 min read · Aug 10, 2022


We will see how to develop a simple but working implementation of AlphaZero, a revolutionary AI algorithm developed by DeepMind.

Photo by Andrea De Santis on Unsplash

If you have followed the previous articles, you know that we have now reached the most exciting part of the series, and it is time to take flight: we are going to integrate the previous MCTS implementation with a deep neural network, in order to train an expert policy to play video games!

If you do not think this is a big deal, keep in mind that the AlphaZero algorithm, along with its predecessor AlphaGo and its successor MuZero, “solved” the game of Go, which at the time was considered by the AI research community to be at least 10 years away from being solved.

This is incredible, and even more incredible is the elegance and simplicity of these algorithms. Yes, their full implementations contain many optimizations that may be difficult to understand at first, but at their core there are a few basic ideas that make them work, and we will build on those core ideas to create a small-scale AlphaZero implementation.

If you need a refresher on the MCTS algorithm, have a look at the previous article. I will also assume some familiarity with basic neural network concepts; if you have none, I would suggest the MIT introductory course.

Let’s see in more detail what all of this is about.

The problem with the MCTS algorithm is that, even though it is better than brute-force search because its search is guided towards more promising states, for complex games the state space is still so big that gathering a statistically valid amount of data takes an unfeasible amount of time, and the algorithm falls short.

The core idea of the AlphaZero algorithm is to use a neural network to “guide” the search along the tree, producing values and exploration probabilities for the nodes. In the end, the MCTS algorithm is based on statistics and approximated data, and that is exactly what neural networks do: they approximate (arbitrarily complex) functions.

We will need to estimate two things:

  • Value estimation: every time we need to do a rollout from a node, instead of performing an actual rollout, we will ask the neural network for the (estimated) value of the node.
  • Policy estimation: the probability of exploring each node will depend not only on the number of times it has been visited, but also on a probability given by the neural network, which will approximate the MCTS probability distribution.

We will use the combination of the Value and Policy estimations to guide our search along the tree towards more promising nodes.

The closer the Value estimation is to the real state value, the better the Policy will be and, vice versa, the better the Policy, the better the Value estimation.

When the neural networks closely approximate the required values, we can then use them to choose an action at each step of the game, and play it like an expert.

Let’s see how to code it!

Here is the usual code to import an OpenAI Gym environment:
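A minimal version could look like the following (a sketch assuming the classic Gym API; the exact snippet in the repository may differ):

```python
import gym

# Create the CartPole environment, version 1 (maximum total reward = 500).
env = gym.make('CartPole-v1')

# In older Gym versions reset() returns just the observation;
# in Gym >= 0.26 / Gymnasium it returns (observation, info).
observation = env.reset()

print(env.action_space)       # Discrete(2): push the cart left or right
print(env.observation_space)  # Box(4,): cart position/velocity, pole angle/velocity
```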

Notice, however, that this time we are using version 1 of the environment (i.e. CartPole-v1), whose maximum possible sum of rewards is 500.

Now we have to define the Deep Neural Networks that we will use to approximate the Value and Policy of the tree nodes.

We are using two small, fully connected neural networks:

  • The first will approximate the value of the node, hence it will have 1 output value.
  • The second will approximate the probabilities of choosing each child node, hence it will have as many output values as actions available in the game.
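A minimal sketch in PyTorch could look like this (the layer sizes, variable names and optimizers are my own choices, not necessarily those of the original code):

```python
import torch
import torch.nn as nn

obs_size = env.observation_space.shape[0]   # 4 for CartPole
n_actions = env.action_space.n              # 2 for CartPole
hidden = 64                                 # hypothetical hidden layer size

# Value network: observation -> a single scalar estimate of the state value.
value_model = nn.Sequential(
    nn.Linear(obs_size, hidden),
    nn.ReLU(),
    nn.Linear(hidden, 1),
)

# Policy network: observation -> one logit per available action.
policy_model = nn.Sequential(
    nn.Linear(obs_size, hidden),
    nn.ReLU(),
    nn.Linear(hidden, n_actions),
)

value_optimizer = torch.optim.Adam(value_model.parameters(), lr=1e-3)
policy_optimizer = torch.optim.Adam(policy_model.parameters(), lr=1e-3)
```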

Here is our MCTS implementation, now combined with the use of the neural networks:
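Condensed to its essence, the part that changes looks roughly like the sketch below. The attribute and constant names (N for the visit count, T for the accumulated value, prob for the prior given by the policy network, the exploration constant c) are placeholders of mine, and the UCB formula is a PUCT-style assumption rather than the exact one used in the repository:

```python
import math
import torch

c = 1.0  # exploration constant (hypothetical value)

class Node:
    def __init__(self, game, observation, done, parent=None, prob=0.0):
        self.game = game                # a copy of the environment at this node
        self.observation = observation  # the state reached at this node
        self.done = done
        self.parent = parent
        self.prob = prob                # prior probability, e.g. softmax(policy_model(parent observation))
        self.child = {}                 # action -> child Node
        self.T = 0.0                    # sum of the values backed up through this node
        self.N = 0                      # visit count

    def rollout(self):
        # Instead of playing random moves until the end of the game,
        # ask the value network for an estimate of this state's value.
        if self.done:
            return 0.0
        obs = torch.tensor(self.observation, dtype=torch.float32)
        with torch.no_grad():
            return value_model(obs).item()

    def getUCBscore(self):
        # Exploitation: average value of the rollouts that passed through this node.
        value_score = self.T / self.N if self.N > 0 else 0.0
        # Exploration: the policy prior, scaled down as the node gets visited more often.
        parent_N = self.parent.N if self.parent else 1
        explore_score = c * self.prob * math.sqrt(parent_N) / (1 + self.N)
        return value_score + explore_score
```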

I will not go into the details of the basic MCTS algorithm, as it has been explained in a previous article, but let’s focus on the key differences:

  • The rollout function now uses both the Value and Policy neural networks to approximate the values needed for guiding the MCTS node selection.
  • The getUCBscore function uses those values to balance exploitation (i.e. trying to choose a good action) and exploration (i.e. trying new actions). Exploitation is represented by the Value score (the average of all the rollouts started from that node and from its children, whose values are propagated back up towards the root), while exploration is represented by the Policy score and is also affected by how many times the node has already been visited (nodes with a low visit count get a higher score, so as to favor their exploration).

The rest is the usual MCTS machinery: we have only changed the selection mechanism, using the neural networks to approximate the node values and policy.

We also need a structure where we can store the information about game episodes that we will use to train our neural networks to predict accurate values. We will call this structure a ReplayBuffer.

The ReplayBuffer stores, for each step of the game:

  • The observation (i.e. state) of the game environment
  • The target Value (i.e. the sum of rewards obtained in the episode)
  • The observation (i.e. state) of the game environment at the previous step
  • The target Policy according to visit counts

We will then train our neural networks to accurately predict the target Value and target Policy, given the game state observations.
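A minimal version of such a buffer could be as simple as a bounded list with random sampling (the field names, the maximum size and the batch size here are assumptions of mine):

```python
import random
import numpy as np

class ReplayBuffer:
    def __init__(self, max_size=10_000):
        self.buffer = []
        self.max_size = max_size

    def add(self, observation, target_value, previous_observation, target_policy):
        # One entry per step of a finished episode.
        if len(self.buffer) >= self.max_size:
            self.buffer.pop(0)
        self.buffer.append((observation, target_value, previous_observation, target_policy))

    def sample(self, batch_size=64):
        batch = random.sample(self.buffer, min(batch_size, len(self.buffer)))
        obs, values, prev_obs, policies = zip(*batch)
        return (np.array(obs, dtype=np.float32),
                np.array(values, dtype=np.float32),
                np.array(prev_obs, dtype=np.float32),
                np.array(policies, dtype=np.float32))
```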

The strategy we will use to play the game is, as usual, to perform a certain number of searches in the tree to collect reliable statistics, and then pick the best action, which will be the one corresponding to the child of the current state that has been visited the most (because, as we have seen in the MCTS algorithm, a high visit count is related to a good value anyway, given how the value of a node is calculated).
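In code, that decision boils down to something like the following sketch (explore() stands for the usual select/expand/rollout/backpropagate step from the previous article, and all the names come from the hypothetical pieces sketched above):

```python
def choose_action(root, n_searches=100):
    # Run a fixed number of MCTS searches from the current root.
    for _ in range(n_searches):
        root.explore()
    # Pick the action leading to the most visited child.
    visit_counts = {action: child.N for action, child in root.child.items()}
    best_action = max(visit_counts, key=visit_counts.get)
    # The normalized visit counts are also the target Policy stored in the ReplayBuffer.
    total = sum(visit_counts.values())
    target_policy = {action: n / total for action, n in visit_counts.items()}
    return best_action, target_policy
```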

And here we are, defining the loop that will allow us to train the neural networks and plot the improvements over time:
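A simplified version of such a loop, built on the hypothetical pieces sketched above (the number of episodes, searches per move and batch size are arbitrary choices of mine), could look like this:

```python
import torch
import torch.nn.functional as F
from copy import deepcopy

episodes = 200
searches_per_move = 100
batch_size = 64

replay_buffer = ReplayBuffer()
rewards, value_losses, policy_losses = [], [], []

for episode in range(episodes):
    observation = env.reset()
    done, episode_reward, episode_steps = False, 0.0, []

    # Self-play: play one episode guided by MCTS + the networks.
    while not done:
        root = Node(deepcopy(env), observation, done)
        action, target_policy = choose_action(root, searches_per_move)
        observation_next, reward, done, info = env.step(action)  # 5 return values in Gym >= 0.26
        episode_steps.append((observation, target_policy))
        episode_reward += reward
        observation = observation_next

    # Store one entry per step; the target Value is the episode's total reward.
    for i, (obs, target_policy) in enumerate(episode_steps):
        prev_obs = episode_steps[i - 1][0] if i > 0 else obs
        policy_vector = [target_policy.get(a, 0.0) for a in range(env.action_space.n)]
        replay_buffer.add(obs, episode_reward, prev_obs, policy_vector)
    rewards.append(episode_reward)

    # Training: fit both networks on a batch sampled from the buffer.
    obs_b, value_b, prev_obs_b, policy_b = replay_buffer.sample(batch_size)
    obs_t = torch.tensor(obs_b)
    value_t = torch.tensor(value_b).unsqueeze(1)
    policy_t = torch.tensor(policy_b)

    value_loss = F.mse_loss(value_model(obs_t), value_t)
    value_optimizer.zero_grad()
    value_loss.backward()
    value_optimizer.step()

    log_probs = F.log_softmax(policy_model(obs_t), dim=1)
    policy_loss = -(policy_t * log_probs).sum(dim=1).mean()  # cross-entropy vs. the MCTS visit counts
    policy_optimizer.zero_grad()
    policy_loss.backward()
    policy_optimizer.step()

    value_losses.append(value_loss.item())
    policy_losses.append(policy_loss.item())
```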

Notice how we define some hyperparameters and then proceed to play games and train the neural networks, so that they incrementally approximate the actual Value and Policy of the MCTS nodes more and more closely.

And finally, these are the plots of the rewards over the episodes, along with the moving average over the last 20 episodes, the Value loss and the Policy loss.

Plot of rewards over 200 episodes in blue, with moving average in orange.
Value loss over episodes.
Policy loss over episodes.
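For reference, plots like these only take a few lines of matplotlib, assuming the rewards, value_losses and policy_losses lists collected in the loop above:

```python
import numpy as np
import matplotlib.pyplot as plt

window = 20
moving_avg = np.convolve(rewards, np.ones(window) / window, mode='valid')

plt.plot(rewards, label='reward per episode')
plt.plot(range(window - 1, len(rewards)), moving_avg, label='20-episode moving average')
plt.xlabel('episode')
plt.ylabel('reward')
plt.legend()
plt.show()

plt.plot(value_losses, label='Value loss')
plt.plot(policy_losses, label='Policy loss')
plt.xlabel('episode')
plt.legend()
plt.show()
```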

We can see that learning is definitely happening, and in the final episodes of the training AlphaZero consistently reaches a reward of 500 per episode, which is the maximum for this environment!

You can write the code on your own (suggested) or find the repository on GitHub.

I hope that this was both instructive and fun. For me, it is astonishing that with so few basic ideas we are able to write algorithms that learn to play by themselves.

As you may have noticed, however, AlphaZero requires knowledge of the “rules” of the game. Every time we take a step, we use a copy of the game environment to simulate an action and get all the required information about the evolution of the state of the game.

Wouldn’t it be amazing if we could get the same results while knowing nothing about the game we are playing? If AlphaZero is not enough magic for you, then follow me into the next chapter, where we will implement MuZero, an evolution of AlphaZero with no knowledge of the game at all, and we will see how it is possible to learn everything about the game using neural networks!
