General Game-Playing With Monte Carlo Tree Search

A primer on the game-playing algorithm behind AlphaGo.

This article explains the idea behind Monte Carlo tree search, while the next one goes into the implementation, up to a fully functional MCTS framework in Node.js that the reader can modify for their own projects.

This article assumes some computer science knowledge on the reader’s part, in particular how recursion works, and how the tree data structure works. It would also be beneficial to have some prior knowledge of the classical adversarial game-playing algorithm minimax, but this is not strictly required. Similarly, some knowledge of JavaScript will be useful to make sense of the skeleton code at the end of the post, but this is not strictly required either.

Introduction

Monte Carlo tree search (MCTS) is a general game-playing algorithm to find the best move from any given game state of any game. There are other such algorithms, most notably minimax, but there are significant benefits to using MCTS over vanilla minimax:

  1. To determine which moves are good, depth-limited minimax needs a function that gives the estimated strength of any given game state. This heuristic function may be difficult to obtain, for example in the case of Go. MCTS does not need such a heuristic function, making it aheuristic.
  2. MCTS efficiently deals with games with a high branching factor. As it gains information, MCTS increasingly favors moves that are more likely to be good, making its search asymmetric.
  3. Minimax needs to run to completion to give the best move, which makes its runtime (and run-space) non-flexible. For games with large state spaces like chess and Go, this exhaustive search may even be intractable. MCTS does not need to run to completion; it outputs stronger plays the longer it runs, but its search can be stopped at any point. Having this flexible property, we say that MCTS is anytime.

While minimax is an elegant algorithm for solving simple games, MCTS is a more powerful alternative for more complex games — even as it gives us approximate solutions instead of absolute ones. Being aheuristic, asymmetric, and anytime makes MCTS an attractive option for complex general game-playing.

MCTS was first described as a framework for game AI in a paper published about ten years ago. Ever since, there has been a lot of research on MCTS, the most high-profile one being Google DeepMind’s research with AlphaGo. AlphaGo uses MCTS and deep learning to achieve superhuman performance in the game of Go, previously thought by experts to be at least a decade away.

MCTS in Brief

There are many many many good explanations out there on how MCTS works. This is my own.

The first three levels of the game tree for Tic-Tac-Toe

Games can be modeled as trees, where the nodes represent states, and the edges represent moves. The above is the game tree for Tic-Tac-Toe. At level 0, we have the root node representing the starting state of the game, which is an empty 3×3 Tic-Tac-Toe board. At level 1, we have the three states possible after X makes the starting move (there are effectively only 3 moves due to symmetry). The next level down, we have the states possible after O responds. This goes all the way down, to the completion of the game.

There are many ways to search for the best move in such a game tree. The simplest one is to search every path in the tree, choosing the move that is guaranteed to produce the best outcome even against an optimally playing oponent. This is what minimax does. However, for many complex games, this is not possible because the game tree is very large.

MCTS deals with large trees effectively by sampling many paths down the game tree. This means it repeatedly goes down many, but not all, of the paths. As it tries more paths, it gains better estimates for which paths are good. Each one of these sample trials is called a Monte Carlo simulation, and each one of these simulations plays a sample game to the end to see which player wins. A lot of simulations have to be played before MCTS can get enough information to output a good move.

To store the statistical information gained from these simulations, MCTS builds its own search tree from scratch, node by node, during the simulations. The search tree corresponds to the game tree, and its nodes additionally stores the statistical information needed by MCTS to choose good moves. The diagram below illustrates the procedure.

MCTS algorithm, diagram from Chaslot (2006)

The above figure details the actual procedure. In phase (1), existing information is used to repeatedly choose successive child nodes down to the end of the search tree. Next, in phase (2), the search tree is expanded by adding a node. Then, in phase (3), a simulation is run to the end to determine the winner. Finally, in phase (4), all the nodes in the selected path are updated with new information gained from the simulated game. This 4-phase algorithm is run repeatedly until enough information is gathered to produce a good move.

It’s important to note that the search tree is identical in structure to the game tree, and that the search tree additionally contains statistical information gathered from the simulations. We do not actually have the entire game tree in our program, nor do we attempt to build one; it would be too massive! Instead, we build the search tree from scratch and its structure corresponds to a subset of the game tree’s structure, likely a very small subset.

MCTS in Detail

Now, let’s talk about each of the 4 phases above in greater detail. In the tree diagrams below, each circular node corresponds to a game state and each line corresponds to a move that can be made to get from one state to another. Moves which do not yet have corresponding nodes in the search tree are represented by lines ending in black dots. The game is between blue and red, and blue has the first move. Several iterations of the 4-phase algorithm have been played through to get the existing tree, and we inspect the steps taken by the algorithm in the subsequent iteration.

Phase 1: Selection

Starting from the root node of the search tree, we go down the tree by repeatedly (1) selecting a legal move and (2) advancing to the corresponding child node. If one, several, or all of the legal moves in a node does not have a corresponding node in the search tree, we stop selection.

Our path selection should achieve two goals: We should explore new paths to gain information, and we should use existing information to exploit paths known to be good. In order to help us achieve these two goals, we need to select child nodes using a selection function that balances exploration and exploitation.

One (bad) way to do this is to select paths randomly. Random selection certainly does explore well, but it does not exploit at all. Another (equally bad) way is to use the average win rate of each node. This achieves good exploitation, but it scores poorly on exploration. Luckily, some very smart people have figured out a good selection function that balances exploration with exploitation well, called UCB1 (Upper Confidence Bound 1). When applied to MCTS, the combined algorithm is named UCT (Upper Confidence Bound 1 applied to trees). So, MCTS + UCB1 = UCT. The UCB1 selection function is:

  • wᵢ : this node’s number of simulations that resulted in a win
  • sᵢ : this node’s total number of simulations
  • sₚ : parent node’s total number of simulations
  • c : exploration parameter

The left term (wᵢ / sᵢ) is the exploitation term. It is simply the average win rate, going larger the better a node has historically performed. The right term (sqrt(ln sₚ / sᵢ)) is the exploration term. It goes larger the less frequently a node is selected for simulation. The exploration parameter c is just a number we can choose that allows us to control how much the equation favors exploration over exploitation; the usual number chosen is c = sqrt(2).

Note the numbers inside the nodes in the tree diagram. The numbers are the statistics for that node, corresponding to number of wins and total number of simulations (wᵢ and sᵢ). Each time we need to select between multiple child nodes, we use the UCB1 selection function to get a UCB1 value for each child node, and we select the child node with the maximum value.

Note: The problem of having an agent simultaneously balance exploration and exploitation between several choices when the payout of each choice is unknown is called the multi-armed bandit problem, and it’s a well-known problem in planning and resource allocation.

Phase 2: Expansion

After selection stops, there will be at least one unvisited move in the search tree (hereafter referred to as unexpanded moves). Now, we randomly choose one unexpanded move and we then create the child node corresponding to that move (bolded in the diagram). We add this node as a child to the last selected node in the selection phase, expanding the search tree. The statistics information in the node is initialized with 0 wins out of 0 simulations (wᵢ = 0, sᵢ = 0).

Some implementations choose to expand the tree by multiple nodes per simulation, but the most memory-efficient implementation is to create just one node per simulation. The search tree can get very large, especially for games with large branching factors.

Phase 3: Simulation

Continuing from the newly-created node in the expansion phase, moves are selected randomly and the game state is repeatedly advanced. This repeats until the game is finished and a winner emerges. No new nodes are created in this phase.

This is a good time to remind ourselves that nodes in the search tree correspond to nodes in the game tree. In this phase, we are simply applying the rules of the game to repeatedly (1) find all legal moves in the current game state, (2) choose one legal move randomly, then (3) advance the game state. No part of this process is stored. This phase ends when we reach a state where the game is finished.

Note: A game played during the simulation phase is either called a simulation or a playout. This may get confusing because one iteration of the whole 4-phase process is also sometimes called a simulation or a playout. Be aware of this and use your discretion to disambiguate.

Phase 4: Backpropagation

After the simulation phase, the statistics on all the visited nodes (bolded in the diagram) are updated. Each visited node has its simulation count sᵢ incremented. Depending on which player wins, its win count wᵢ may also be incremented. In the diagram, blue wins, so each visited red node’s win count is incremented. This flip is due to the fact that each node’s statistics are used for its parent node’s choice, not its own.

It should be noted that, for a two-player game like Letterpress, MCTS tries to find the best moves down a path for each player respectively. So, just as it finds blue’s best move for nodes where blue is next to move, it also finds red’s best move for nodes where red is next to move.

Now, let’s take a second look at the UCB1 algorithm:

  • wᵢ : this node’s number of simulations that resulted in a win
  • sᵢ : this node’s total number of simulations
  • sₚ : parent node’s total number of simulations
  • c : exploration parameter

In the selection phase, MCTS uses the UCB1 selection function to make a decision on which child node to select. The UCB1 function, in turn, uses the numbers of wins wᵢ and simulations sᵢ of the children nodes, and the number of simulations of the parent node sₚ, to generate the UCB1 values for each child node. This information is updated during the backpropagation phase, coming full circle. Thus, the information gained this iteration will affect the selection of moves for the next iteration, and so on.


The above diagram shows the search tree before and after one iteration of the algorithm. You can see that one unexpanded move is expanded, resulting in the creation of one new node in the tree. Additionally, the statistics of all the visited nodes are updated with the result of the simulation.

So that’s the gist of the 4-phase MCTS algorithm. The beauty of MCTS(UCT) is that, due to its asymmetrical nature, the tree selection and growth gradually converges to better moves. At the end, you get the child node with the highest number of simulations sᵢ and that’s your best move according to MCTS.

Are you getting antsy yet? I know I get restless with too much theory. Let’s start writing up some code!

JavaScript Implementation

If you’re just interested in the high-level framework of MCTS code, read on. If you’re set on understanding the details of a full implementation, you can jump ahead to the sequel, either now or at the end of this article.

This section is heavily influenced by this other article, which details another good implementation of MCTS in Python. I chose to write my implementation in Javascript on Node.js (v8.11.3 LTS).

Let’s begin writing the skeleton of the Game and MonteCarlo classes for our app, starting with the Game class in game.js:

/** Class representing the game board. */
class Game {
  /** Generate and return the initial game state. */
start() {
// TODO
return state
}
  /** Return the current player’s legal moves from given state. */
legalPlays(state) {
// TODO
return plays
}
  /** Advance the given state and return it. */
nextState(state, move) {
// TODO
return newState
}
  /** Return the winner of the game. */
winner(state) {
// TODO
return winner
}
}
module.exports = Game

An important point to note is that the Game class is state-agnostic. It is an encapsulation of the rules of our game and nothing else.

Next up, the MonteCarlo class in monte-carlo.js:

/** Class representing the Monte Carlo search tree. */
class MonteCarlo {
  /** From given state, repeatedly run MCTS to build statistics. */
runSearch(state, timeout) {
// TODO
}
  /** Get the best move from available statistics. */
bestPlay(state) {
// TODO
// return play
}
}
module.exports = MonteCarlo

The MonteCarlo class is responsible for running simulations, and building a search tree containing the information gained from these simulations. When asked for the best play, the MonteCarlo class should return the best move based on the information it gained during the simulations. Note that runSearch() here runs many iterations of the 4-phase algorithm: selection, expansion, simulation, and backpropagation, until the specified timeout.

So that’s the high-level overview of how this is going to be done. The MonteCarlo class will handle running the simulations, handling things like move choice, not caring about the rules of the game whatsoever. Conversely, the Game class will encapsulate the rules of the game, handling things like generating the legal moves for any given state, not caring about the Monte Carlo simulations whatsoever.

Now, for the starting point of our Node app, index.js:

const Game = require('./game.js')
const MonteCarlo = require('./monte-carlo.js')
let game = new Game()
let mcts = new MonteCarlo(game)
let state = game.start()
let winner = game.winner(state)
// From initial state, take turns to play game until someone wins
while (winner === null) {
mcts.runSearch(state, 1)
let play = mcts.bestPlay(state)
state = game.nextState(state, play)
winner = game.winner(state)
}
console.log(winner)

Once we fill in all the TODOs, this program should play a game against itself, each player taking 1 second to build up the search tree. This gives us a good framework to fill in in the next article.

Further Reading


Edit (18 Oct 2017): Remove specific Letterpress game-playing to make the article more broadly relatable.

Edit (20 Oct 2017): Add Further Reading section.

Edit (11 Jul 2018): Add sequel article, modify skeleton code.