Tic Tac Toe at the Monte Carlo

Creating a Tic Tac Toe AI that uses the Monte Carlo tree search algorithm.

Published in

The Startup

10 min readJan 17, 2018

Tic Tac Toe, like other turn based games where information is not hidden and game mechanics are not reliant on chance, is a perfect information game. This type of game allows every player to predict all possible outcomes from someone’s action. Since the game if fully deterministic, a tree can be constructed with all the possible outcomes from the game with each node of the tree given a value determining its win or loss ratio for each player.

An AI can then traverse this tree and choose a node it deems most likely to lead to a victory. If the AI’s strategy in choosing nodes is to pick the node with the smallest possibility of losing, it would be utilizing a minimax strategy. While in theory this sounds like a feasible game plan, in practice the amount of time to render a full game tree and then traverse it can be unrealistic especially for games with large amounts of possible moves (high branching factor) like Go. This is where the Monte Carlo tree search algorithm comes in.

Monte Carlo tree search (MCTS) is a heuristic search algorithm that is employed for a large number of game playing AIs. Most notable of them is the Go AI, Alpha Go. MCTS shines in games with high branching factors because unlike minimax which needs a full game tree, MCTS can be configured to stop after a desired amount of time and can select a sufficiently optimal solution based on the partially constructed game tree. While a pure Monte Carlo process would run a multitude of randomly simulated game states, MCTS keeps statistics of each nodes and then applies the Upper Confidence Bound (UCB) algorithm to the node’s stats.

The UCB function determines a confidence interval for win ratios of each node and returns the highest value in the confidence interval. The function also giving nodes with more simulations a narrower confidence interval while enlarging the confidence intervals of nodes with fewer simulations. The MCTS algorithm selects the node with the highest UCB value to run its simulations on and as a result, although nodes that have a high upper confidence interval may be selected multiple times at first, other nodes with less simulations will have increasingly wider confidence intervals. As their confidence interval grows, their upper bounds may emerge higher than that of the original promising node. Described mundanely, as we test one node more times, our confidence on its payouts become more definitive. We become more certain of its payouts and can narrow our confidence interval for it. We become less confident about the less simulated nodes and widen their confidence intervals to reflect our uncertainty.

The UCB formula balances the exploration/exploitation problem by making sure that no potential node will be starved of simulated playouts but at the same time making sure that favorable branches are played more often than their counterparts. The UCB formula that we will be using is displayed below.

w stands for the number of simulated wins from that node after the i-th move.
n stands for the number of simulations that have occured for that node after the i-th move.
c stands for the exploration parameter. We will be using √2.
t stands for the total number of simulations that have occured after i moves.

The first part of the equation corresponds to exploitation. The higher the number of wins = higher win ratio = higher number for the first part of the equation. The second part of the equation corresponds to exploration. A large number of total simulations but a small number of simulations for that node will lead to a larger number for the second part of the equation. Together, the two parts of the UCB function creates a balance between prioritizing promising nodes but also exploring unvisited ones.

There are four basic phases to MCTS: selection, expansion, simulation,and back-propagation.

In selection, we start at the root node, R, and we select successive child node with the highest UCB value until we reach a node with no more children, L.

Unless the L node results in a game state that ends the game, we then go into the expansion phase and create all possible outcomes from that game state and randomly select one of L’s child nodes, C.

In the next phase, simulation, we generate a random playout from node C to an end game state (win, loss, tie).

We use the resulting game state during back-propagation to update the information within the nodes from C to R with whether the new playout resulted in a win, loss or tie.

Altogether it looks like:

The algorithm will cycle through all 4 phases until the allotted time has passed. Afterwards, it will choose the child node of the root with the highest amount of visits due to the fact that more visits equals more selections which only occurs if it has the highest UCB value. This then repeats for each turn of the game.

I will now be showing an example of MCTS built with Javascript. This is just a general game playing formula and can be tailored for any turn-based game. In this case specifically, the algorithm will be used for a Tic Tac Toe AI. While Tic Tac Toe is a much smaller game than Go with a vastly smaller branching factor, they’re both using the same principles.

The main goal of MCTS is to find the best possible move at your current state. To reach this goal, we should create a function called findNextMove. I will explain each part of the following function later on.

findNextMove(board, playerNo) {
  let opponent = 3 - playerNo;
  let tree = new Tree();
  let rootNode = tree.root;
  rootNode.state.board = new Board(board);
  rootNode.state.playerNo = opponent;  let startTime = Date.now();
  while ((Date.now() - startTime) < 1000) {
    let promisingNode = selectPromisingNode(rootNode);
    
    if (promisingNode.state.board.checkStatus() === IN_PROGRESS) {
      expandNode(promisingNode);
    }    let nodeToExplore = promisingNode;
    if (nodeToExplore.childArray.length > 0) {
      nodeToExplore = promisingNode.getRandomChild();
    }    let playoutResult = simRanPlayout(nodeToExplore, opponent)
    backpropagation(nodeToExplore, playoutResult);
  }  let winnerNode = rootNode.getChildWithMaxScore();
  return winnerNode.state.board;
}

Let us first define what node, state, and board are. The board is an object that keeps track of the size of the board, what pieces have been played and where and what values the status of the game means. The node object keeps track of the state, The state is an object that keeps track of the board, the current player, the number of times the node has been visited, and the win score of the node.

Essential information we will need to find the optimal move would be the state of the board as well as whose turn it is, both of which are given to the function through its parameters. The function then creates the tree it will be traversing. The tree’s root node will have the board’s current state.

The function then goes into a while loop that runs for a set amount of time. The longer we allow it to run, the more confident we are about the AI’s next move. You can see the 4 phases of MCTS in the while loop.

selection phase = selectPromisingNode()
expansion phase = expandNode()
simulation phase = simRanPlayout()
back-propagation phase = backpropagation()

Let us expand the selectPromisingNode function.

selectPromisingNode(rootNode) {
  let node = rootNode;
  while (node.childArray.length !== 0) {
    node = UCB.findBestNodeWithUCB(node);
  }
  return node;
}

In this function, we traverse the tree starting from the root node and find the most promising leaf node below it. This is achieved by looking at the UCB values of each child node and selecting the node with the highest value. The process repeats until a node has no more child nodes. The functions to compute the UCB values is below.

UCB = {  findBestNodeWithUCB: (node) => {
    let parentVisitCt = node.state.visitCount;
    let childUCB = [];
    node.childArray.forEach(child => {
      childUCB.push(UCB.ucbValue(parentVisitCt, child.state.winScore, child.state.visitCount));
    }    let max = Math.max(...childUCB);
    let idx = childUCB.indexOf(max);
    return node.childArray[idx];  };ucbValue: (totalVisit, nodeWinScore, nodeVisit) => {    if (nodeVisit === 0) {
      return Number.MAX_SAFE_INTEGER;
    }
    return (nodeWinScore / nodeVisit) + 1.41 * Math.sqrt ( Math.log (totalVisit) / nodeVisit);
  }}

Corresponding to the UCB formula above, we use the number of total visits from its parent node, t, the number of simulated wins the current node has generated, w, and the amount of times the node has been selected, n, to find the UCB value. If the node has never been visited before, we give its value the maximum integer possible to increase its chances of being selected. This makes sure that every node will be simulated at least once. The findBestNodeWithUCB function generates a UCB value for every child node and returns the child node with the highest value.

Once we’ve selected the node we are going to use, we check the status of the game board. The checkStatus function return -1 if the game is still in progress, 0 if it results in a tie, or 1 and 2 depending on which player won the game. If the game status is a non-game ending state, -1, we will then go into the expansion phase. Let’s take a look at the expandNode function.

expandNode(node) {  let possibleStates = node.state.getAllPossibleStates();  possibleStates.forEach(state => {
    let newNode = new Node(state);
    newNode.parent = node;
    newNode.state.playNo = node.state.getOpponent();
    node.childArray.push(newNode);
  });}

This function first finds all possible outcomes from the current nodes state. It then creates a node with every possible move from the current state and pushes it into the child array of the current node. Although expansion seems like a very grandiose term, the phase is extremely simple. It just populates the childArray with possible game states unless the selected node is a node with an end game status. The next lines in the algorithm:

let nodeToExplore = promisingNode;
if (nodeToExplore.childArray.length > 0) {
  nodeToExplore = promisingNode.getRandomChild();
}

checks if the current array had any possible childs and randomly select one of the nodes to run for its simulated gameplay. The simulated gameplay is done in the simulateRandomPlayout function.

simulateRandomPlayout(node, opponent) {  let tempNode = new Node(null, node);
  let tempState = tempNode.state;
  let boardStatus = tempState.board.checkStatus();  if (boardStatus === opponent) {
    tempNode.parent.state.winScore = Number.MIN_SAFE_INTEGER;
    return boardStatus;
  }  while(boardStatus === board.IN_PROGRESS) {
    tempState.togglePlayer();
    tempState.randomPlay();
    boardStatus = tempState.board.checkStatus();
  }
  return boardStatus;}

Here the function first checks the node’s board to see what its status is. If the node results in an opponent’s victory, it would mean that if the player had made the selected move, his opponent will have a subsequent move that can result in the opponents immediete victory. Because the move the player chose can lead to a definite loss, the function lowers the parents node’s winScore to the lowest possible integer to prevent future moves to that node. Otherwise the algorithm alternates random moves between the two player until the board results in a game ending state. The function then returns the final game status.

The final step of MCTS is backpropogation.

backpropagation(nodeToExplore, playoutResult) {
  let tempNode = nodeToExplore;
  while (tempNode !== undefined) {
    tempNode.state.visitCount++;
    if (tempNode.state.playNo === playoutResult) {
      tempNode.state.addScore(10);
    }
    tempNode = tempNode.parent;
  }
}

Here we travel back from the child of the leaf node that we selected to the root node. Along the way we increment the nodes visitCount. If the playoutResult that we get from the return of simulateRandomPlayout equals the player number whose turn we are simulating, we then increment the win scores of each node. These updated statistics are then used for the next iteration of the selection phase when calculating UCBs.

These steps run for as long as we specify, in this case 1000 milliseconds or 1 second. Once the time runs out, we select the child node of the root with the most visit counts since a node will only be visited if it has the highest UCB which theoretically means that move will lead to the highest chance of victory.

We can test our algorithm with the following function that simulates a game of Tic Tac Toe between two AIs:

simulateAIPlay() {
  let board = new Board();
  let player = 1;
  let totalMove = 9;
  for (var i = 0; i < totalMove; i++) {
    board = findNextMove(board, player);
    if (board.checkStatus() !== -1) {
      break;
    }
    player = 3 - player;
  }
  let winStatus = board.checkStatus();
  return winStatus;
}

Since both AI’s are using this super awesome algorithm and both will be selecting the most optimal moves, each game will result in a tie.

You can check out my full code here.

Much thanks to this article for helping me create the MCTS functions.

Tic Tac Toe at the Monte Carlo

Creating a Tic Tac Toe AI that uses the Monte Carlo tree search algorithm.

This story is published in The Startup, Medium’s largest entrepreneurship publication followed by 286,184+ people.

Subscribe to receive our top stories here.

Written by Max Peng