Monte Carlo Tree Search (MCTS) algorithm for dummies!

michelangelo
6 min read · Aug 4, 2022

If we really want to have a chance at understanding how AlphaZero and MuZero work, we first have to unfold the Monte Carlo Tree Search (MCTS) algorithm, which is at the foundation of both.

Photo by DeepMind on Unsplash

Roughly speaking, MCTS lets us search for the best moves to play by relying on statistics rather than on full exploration, which allows us to navigate large state spaces with far fewer limitations than brute-force approaches.
The search is guided so that promising nodes are explored more often, and the more statistics we collect, the more reliable the results become.

More precisely, the MCTS is divided in 4 different steps:

  • Selection: we need a smart way to explore the tree nodes that lead to more promising outcomes, so we need a way to assign a value to nodes in order to choose which one to pick.
  • Expansion: once we reach a leaf node (i.e. a certain state of the game), we expand the tree by considering all the valid actions from that node (i.e. the subsequent game states reached by taking those actions).
  • Simulation: we need to be able to play a completely random game from a certain node (i.e. game state). A single rollout gives a noisy result on its own but, the more rollouts we run from a node, the more accurate its average value estimate becomes.
  • Backpropagation: after the rollout, the result needs to be back-propagated up the tree, so that the information is stored in the upper nodes and flows from the leaves to the root.
The 4 steps of the MCTS algorithm, courtesy of Wikipedia.

To sum up, the more we apply the 4 steps of the MCTS algorithm, the more reliable and statistically valid information we gather about the next best action to take from a particular state of the game. Apply this for every move, and we have a strategy for playing the game like a boss!

Let’s see how to implement it and how to use it to play an OpenAI Gym video game.

The first thing we need to do is create an OpenAI Gym environment, which will be the test bed for our algorithm, and check the number of possible actions and the observation space of the game.
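Something along these lines should do (a minimal sketch; GAME_NAME, GAME_ACTIONS and GAME_OBS are names I will reuse in the later snippets, and the gym calls assume the classic pre-0.26 reset/step interface):

```python
# Minimal environment setup: create the game and inspect its action/observation spaces.
import gym

GAME_NAME = 'CartPole-v0'
env = gym.make(GAME_NAME)

GAME_ACTIONS = env.action_space.n           # number of discrete actions (2 for CartPole)
GAME_OBS = env.observation_space.shape[0]   # size of the observation vector (4 for CartPole)

print('Possible actions:', GAME_ACTIONS)
print('Observation size:', GAME_OBS)
```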

For the CartPole environment, whose goal is to balance a pole by applying forces to the cart to which it is attached, there are 2 possible actions (push the cart to the left or to the right), and the observation consists of 4 values (cart position, cart velocity, pole angle and pole angular velocity). You can find detailed information about this environment here.

Next, the definition of the Node class, which will represent a node of the tree:
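A minimal sketch of the class could look like this (the attributes are described in detail right after):

```python
class Node:
    def __init__(self, game, done, parent, observation, action_index):
        self.child = None                 # dictionary of child nodes, one per action
        self.T = 0                        # sum of the values of rollouts started from this node
        self.N = 0                        # visit count of this node
        self.game = game                  # copy of the environment in the state of this node
        self.observation = observation    # observation of the game in this node
        self.done = done                  # True if the game has ended in this state
        self.parent = parent              # link to the parent node, for backpropagation
        self.action_index = action_index  # action taken by the parent to reach this node
```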

The Node class will store information that we need about the game, the children (i.e. the next states of the game), and value statistics. In particular:

  • child: a dictionary of child nodes which, for every possible action from this state, tells us what the next state of the game is after taking that action
  • T: the sum of the values of the rollouts that have been started from this node
  • N: the visit count of the node, i.e. how many times we have visited it
  • game: the game environment in the state represented by the node. It is a copy of the original game we are playing, so that we can use it to simulate the search. It represents the state of the game that results from applying all the actions from the root down to this node
  • observation: the current state of the game in this node; in our example, the 4 values described above
  • done: tells us whether the game has ended at this point, so that we can stop the search
  • parent: a link to the parent node, used for backpropagation
  • action_index: the index of the action that led to this node (i.e. the action that the parent of this node took to get here)

Now let’s define the methods of the Node class that we will use for the search.

The first one is the function that gives a value to a node. This is fundamental, as it will guide our search along the tree:
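Here is a sketch of it, attached to the Node class defined above so the snippet stays short (the value of the constant c is one I picked; tune it for your problem):

```python
from math import sqrt, log

c = 1.0  # tunable exploration constant (an assumed value)

def getUCBscore(self):
    # Unexplored nodes get the highest possible score, so each child is tried at least once
    if self.N == 0:
        return float('inf')

    # The parent node provides the visit count used in the exploration term
    top_node = self.parent if self.parent else self

    # exploitation (average rollout value) + exploration (visit-count bonus)
    return (self.T / self.N) + c * sqrt(log(top_node.N) / self.N)

# attach the method to the Node class defined above
Node.getUCBscore = getUCBscore
```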

The formula is called UCB (Upper Confidence Bound) and balances exploitation (i.e. pick the best known action) and exploration (i.e. explore new actions):

  • The exploitation term is the current estimated value of the node (self.T / self.N).
  • The exploration term is inversely proportional to the number of times the node has been visited, with respect to the number of visits of its parent node (sqrt(log(top_node.N) / self.N)).

There is also a tunable constant c that gives more or less weight to the exploration term.

Next we need a function that creates the children of the current node:
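A sketch of this expansion step, again attached to the Node class (it assumes the environment can be deep-copied, which is the case for CartPole):

```python
from copy import deepcopy

def create_child(self):
    # Nothing to expand if the game has already ended in this state
    if self.done:
        return

    child = {}
    for action in range(GAME_ACTIONS):
        game = deepcopy(self.game)                        # copy the environment in its current state
        observation, reward, done, _ = game.step(action)  # apply the action to the copy
        child[action] = Node(game, done, self, observation, action)

    self.child = child

Node.create_child = create_child
```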

Notice how we always use all the available game actions (GAME_ACTIONS) and create a child for each one, associating to it a copy of the game environment, to which we apply that action (game.step(action)), and storing the information of the resulting game state in the child node.

We are now ready for the core of the algorithm, where the search happens:
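A sketch of one search iteration, putting the four steps together (the details may differ slightly from the repository linked at the end):

```python
import random

def explore(self):
    current = self

    # Selection: walk down the tree, always picking a child with the best UCB score
    while current.child:
        children = current.child
        max_U = max(node.getUCBscore() for node in children.values())
        best_actions = [a for a, node in children.items() if node.getUCBscore() == max_U]
        current = children[random.choice(best_actions)]   # break ties at random

    # Expansion + simulation: on the first visit just roll out from the leaf,
    # otherwise expand it and roll out from a randomly chosen child
    if current.N < 1:
        value = current.rollout()
    else:
        current.create_child()
        if current.child:
            current = random.choice(list(current.child.values()))
        value = current.rollout()

    # Backpropagation: update value and visit count from the leaf up to the root
    while current:
        current.N += 1
        current.T += value
        current = current.parent

Node.explore = explore
```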

The search starts from a given node and recursively picks the child that maximizes the UCB formula defined before. This guides the search until a leaf node is reached. At that point, if it is the first time we visit that leaf, we play a rollout and store the statistics; if it is not, we create its children and play a rollout from one randomly chosen child. In either case, the updated statistics, both value and visit count, are backpropagated through the parent nodes up to the root.

We are almost there; we just need to define how to do a random rollout, using a copy of the game from the current node (i.e. in the current game state):
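A sketch of the rollout: it plays random actions on a copy of the environment until the game ends and returns the total reward collected (again assuming the classic gym step signature):

```python
from copy import deepcopy

def rollout(self):
    if self.done:
        return 0

    total_reward = 0
    game = deepcopy(self.game)   # never touch the environment stored in the node
    done = False
    while not done:
        action = game.action_space.sample()               # completely random move
        observation, reward, done, _ = game.step(action)
        total_reward += reward

    return total_reward

Node.rollout = rollout
```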

And we need to define how to pick the next action after the search is done:
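A sketch of this move-selection step; it returns both the chosen child (so we can keep searching from there) and the corresponding action:

```python
import random

def next(self):
    if self.done:
        raise ValueError('game has ended')
    if not self.child:
        raise ValueError('no children found and the game has not ended')

    # Pick the child with the highest visit count, breaking ties at random
    max_N = max(node.N for node in self.child.values())
    best_children = [node for node in self.child.values() if node.N == max_N]
    best_child = random.choice(best_children)

    return best_child, best_child.action_index

Node.next = next
```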

In particular, we pick the child of the node we are searching from that has the maximum visit count: thanks to how the UCB formula works (nodes with higher estimated values are visited more often), a high visit count is also associated with a good value.

We are all set now; we just need to define the policy that we will use to play the game:
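A sketch of the policy (the value of MCTS_POLICY_EXPLORE is one I picked; the higher it is, the stronger but slower the player):

```python
MCTS_POLICY_EXPLORE = 200   # number of MCTS iterations per move (assumed value)

def Policy_Player_MCTS(mytree):
    # Collect statistics with the search...
    for _ in range(MCTS_POLICY_EXPLORE):
        mytree.explore()

    # ...then pick the best next move and keep the corresponding subtree as the new root
    next_tree, next_action = mytree.next()
    next_tree.parent = None   # detach it so the rest of the old tree can be discarded

    return next_tree, next_action
```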

The policy is straightforward: every time we need to take an action in the game, we use the MCTS to explore and pick the best possible next move, as just discussed. The search runs for a fixed number of iterations, set by the MCTS_POLICY_EXPLORE parameter, and, as discussed, the more time we spend collecting statistics with the search, the more reliable the results will be.

We will apply this policy for every move of the game, so as to have an MCTS-guided expert player.

And here we go: the code below lets us play a game choosing moves according to the MCTS player just defined, plotting the results at the end!
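A sketch of the full game loop (again assuming the classic gym API; the number of episodes is a value I picked):

```python
import matplotlib.pyplot as plt
from copy import deepcopy

episodes = 10   # assumed number of games to play
rewards = []

for e in range(episodes):
    game = gym.make(GAME_NAME)
    observation = game.reset()
    done = False
    total_reward = 0

    # the root of the search tree is the initial state of the game
    mytree = Node(deepcopy(game), False, None, observation, None)

    while not done:
        # run the MCTS search and play the selected move in the real game
        mytree, action = Policy_Player_MCTS(mytree)
        observation, reward, done, _ = game.step(action)
        total_reward += reward

    print('episode', e + 1, 'reward:', total_reward)
    rewards.append(total_reward)
    game.close()

# plot the reward obtained in each episode
plt.plot(rewards)
plt.xlabel('episode')
plt.ylabel('total reward')
plt.show()
```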

The CartPole-v0 environment used as an example is quite simple, and the MCTS algorithm is able to collect enough statistics in a reasonable amount of time, making it a good player that consistently reaches 200 points per game (the maximum allowed by the environment)!

Congratulations, you made it!

I hope you had as much fun as I did coding the MCTS algorithm.

You can write the code on your own (recommended) or find the repository on GitHub.

For me, the excitement of seeing such a generic algorithm play different video games is very high, but it is nothing compared to what we will see next: how we can combine deep neural networks with the MCTS algorithm to create a simple but working implementation of AlphaZero.
