# Mastering the game of Go with deep neural networks and tree search

This article summarizes the AlphaGo paper presented by the DeepMind team (David Silver et al.) in 2016.

# Introduction

In artificial intelligence, the game of Go has long been considered the most challenging game of perfect information due to its complexity. In principle, this type of game can be solved with a search tree by considering each possible move from a state *s* and computing, for each of them, a value according to an optimal value function *v\*(s)* that gives, for every board position, the outcome of the game, that is, the probability of winning from that position. The problem with the game of Go is the complexity of such a tree: the number of possible moves to evaluate in each state is far too large.

However, it is possible to reduce the size of the tree with two principles. First, the depth of the tree can be reduced by position evaluation: the tree is truncated at a given state and that state's value is estimated with an approximate value function *v(s)* ≈ *v\*(s)*. Second, the breadth of the tree can be reduced by sampling moves from a policy *p(a|s)*, a probability distribution over the possible moves *a* in state *s*.
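The two reductions can be illustrated with a toy negamax search. Everything here is a hypothetical stand-in for illustration: a state is a tuple of played values, the "value function" is just the sum of the state, and the "policy" samples moves uniformly rather than from a learned *p(a|s)*.

```python
import random

random.seed(0)

def v_approx(state):
    # toy approximate value function v(s): sum of the state vector
    return float(sum(state))

def legal_actions(state):
    # toy game: play +1 or -1 for at most three moves
    return [1, -1] if len(state) < 3 else []

def policy_sample(state, actions, breadth):
    # breadth reduction: sample a subset of legal moves
    # (uniformly here; AlphaGo samples from p(a|s))
    return random.sample(actions, min(breadth, len(actions)))

def search(state, depth, breadth):
    """Negamax search with both reductions: depth is cut by truncating
    at depth 0 and evaluating with v_approx, breadth by sampling moves."""
    actions = legal_actions(state)
    if depth == 0 or not actions:
        return v_approx(state)          # truncate and evaluate the position
    return max(-search(state + (a,), depth - 1, breadth)
               for a in policy_sample(state, actions, breadth))
```

With a full-width, full-depth tree this would be exhaustive minimax; the `depth` and `breadth` parameters make explicit how each principle trades exactness for tractability.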

Here, the DeepMind team introduces a new approach that uses deep neural networks for position evaluation (‘value networks’) and for move sampling (‘policy networks’). These neural networks, combined with Monte Carlo simulation, form the new search algorithm behind AlphaGo, the program that defeated the human European Go champion.

# Methods

The neural networks were trained using a pipeline of several machine learning algorithms. The search algorithm then combines the policy and value networks in a Monte Carlo Tree Search (MCTS) algorithm.

## Supervised learning of policy networks

The first stage of the training pipeline consists of training a supervised learning (SL) policy network *p_{σ}* directly on expert human moves. The weights σ of a 13-layer neural network are trained on randomly sampled state–action pairs from a data set of 30 million positions from the KGS Go Server, using stochastic gradient ascent to maximize the likelihood of selecting the human move in a given state. The network takes as input a representation of the board state and outputs a probability distribution over all legal moves. The SL policy network provides fast and efficient learning updates with immediate feedback and high-quality gradients.
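The SL update can be sketched with a single softmax layer standing in for the real 13-layer convolutional network (the dimensions and learning rate below are hypothetical). One step of stochastic gradient ascent on log *p_{σ}(a|s)* increases the probability of the human expert move *a*:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the SL policy network: one softmax layer mapping a
# board-feature vector to a distribution over moves.
n_features, n_moves = 8, 5
W = np.zeros((n_moves, n_features))

def p_sigma(s):
    logits = W @ s
    e = np.exp(logits - logits.max())
    return e / e.sum()                  # distribution over moves

def sl_step(s, a, lr=0.1):
    """One stochastic gradient ascent step on log p_sigma(a|s),
    maximizing the likelihood of the human expert move a."""
    global W
    p = p_sigma(s)
    grad = -np.outer(p, s)              # softmax part of d log p(a|s) / dW
    grad[a] += s                        # indicator term for the played move
    W += lr * grad

# repeated updates on one (state, expert move) pair raise its probability
s, a = rng.normal(size=n_features), 2
before = p_sigma(s)[a]
for _ in range(20):
    sl_step(s, a)
```

The gradient here is the standard log-likelihood gradient of a softmax classifier, which is what "maximize the likelihood of the human move" amounts to at each sampled pair.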

A faster but less accurate rollout policy *p_{π}* is also trained, using a linear softmax of small pattern features with weights π. This rollout policy is used for fast node evaluation in the search algorithm.

## Reinforcement learning of policy networks

The second stage of the training pipeline consists of training a reinforcement learning (RL) policy network *p_{ρ}* that improves the SL policy network through self-play. To avoid the overfitting that could occur if every game were played between the current policy network and itself, games are played between the current policy network *p_{ρ}* and a randomly selected previous iteration of itself, which stabilizes training.

The weights ρ, initially set to σ, are trained using stochastic gradient ascent to optimize the final outcome of the self-play games. In that way the policy adjusts towards the correct goal of winning games, rather than maximizing predictive accuracy.
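A minimal REINFORCE-style sketch of this update, again with a toy linear-softmax policy in place of the real network: the log-likelihood gradient of the played move is scaled by the game outcome *z* (+1 for a win, −1 for a loss), so moves that led to wins are reinforced and moves that led to losses are suppressed.

```python
import numpy as np

def rl_step(W, s, a, z, lr=0.1):
    """One RL policy update: ascend log p_rho(a|s) scaled by outcome z."""
    logits = W @ s
    e = np.exp(logits - logits.max())
    p = e / e.sum()
    grad = -np.outer(p, s)
    grad[a] += s                    # d log p_rho(a|s) / dW
    return W + lr * z * grad        # ascend if z = +1, descend if z = -1

def prob(W, s, a):
    e = np.exp(W @ s - (W @ s).max())
    return (e / e.sum())[a]

rng = np.random.default_rng(1)
W = np.zeros((5, 8))                # hypothetical dimensions
s = rng.normal(size=8)
W_win = rl_step(W, s, a=0, z=+1)    # move 0 led to a win
W_loss = rl_step(W, s, a=0, z=-1)   # move 0 led to a loss
```

This makes concrete the contrast in the text: the SL stage maximizes agreement with human moves, while the RL stage maximizes the expected game outcome.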

## Reinforcement learning of value networks

The final stage of the training pipeline consists of estimating a value function that predicts the outcome, from state *s*, of games played by the RL policy network *p_{ρ}* against itself. This value function is approximated using a value network: *v_{θ}(s)* ≈ *v^{p_{ρ}}(s)* ≈ *v\*(s)*.

The weights θ are trained by regression on state-outcome pairs, using stochastic gradient descent to minimize the mean squared error (MSE) between the predicted value and the corresponding outcome.
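A sketch of this regression with a linear model standing in for the value network (the features, learning rate, and step count are hypothetical): stochastic gradient descent on the squared error between the predicted value and the game outcome *z*.

```python
import numpy as np

def value_step(theta, s, z, lr=0.1):
    """One SGD step on the squared error (v_theta(s) - z)^2 / 2."""
    v = theta @ s                       # predicted value v_theta(s)
    return theta - lr * (v - z) * s     # gradient of the squared error

rng = np.random.default_rng(2)
theta = np.zeros(8)
s = rng.normal(size=8)
s /= np.linalg.norm(s)                  # normalized toy state features
z = 1.0                                 # game outcome: a win
err0 = (theta @ s - z) ** 2
for _ in range(50):
    theta = value_step(theta, s, z)
```

After the updates, the predicted value has moved toward the observed outcome, which is exactly what minimizing the MSE over state–outcome pairs does at scale.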

To avoid overfitting to the strongly correlated positions within games, a new data set of uncorrelated self-play positions was generated, consisting of 30 million distinct positions, each sampled from a separate game.
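The decorrelation scheme is simple to state in code: sample exactly one position per self-play game, so no two training positions share a game. The games below are hypothetical placeholders.

```python
import random

random.seed(0)

# Hypothetical self-play games, each a list of (position, outcome) pairs;
# positions within one game are strongly correlated.
games = [[(f"game{g}_move{t}", 1 if g % 2 == 0 else -1) for t in range(20)]
         for g in range(5)]

# One position sampled per game -> every training pair comes from a
# distinct game, removing the within-game correlation.
dataset = [random.choice(game) for game in games]
```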

## Searching with policy and value networks

Once trained, the policy and value networks are integrated in an asynchronous MCTS algorithm. The search tree is traversed by simulation; each edge *(s, a)* of the tree stores an action value *Q(s, a)*, a prior probability *P(s, a)* and a visit count *N(s, a)*. In order to efficiently combine MCTS with deep neural networks, multiple simulations are executed in parallel on separate search threads. For each simulation, the MCTS algorithm proceeds in four stages.

**a. Selection**

The first in-tree phase consists of selecting, at each node between the root and a leaf, the action that maximizes the action value *Q(s, a)* plus a bonus *u(s, a)* that is proportional to the prior probability *P(s, a)* but decays with the visit count *N(s, a)*, in order to encourage exploration.
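The selection rule can be sketched as follows; the bonus here uses a PUCT-style form in which *u* grows with the prior *P* and shrinks as the visit count *N* grows, and the exploration constant `c_puct` is a hypothetical value.

```python
import math

def select(edges, c_puct=5.0):
    """Pick the action maximizing Q(s, a) + u(s, a), where the bonus u
    is proportional to the prior P(s, a) but decays with N(s, a)."""
    total_n = sum(e["N"] for e in edges.values())
    def score(e):
        u = c_puct * e["P"] * math.sqrt(total_n) / (1 + e["N"])
        return e["Q"] + u
    return max(edges, key=lambda a: score(edges[a]))

edges = {
    "a": {"Q": 0.0, "P": 0.8, "N": 0},    # high prior, never visited
    "b": {"Q": 0.1, "P": 0.2, "N": 10},   # slightly better Q, heavily visited
}
```

In this toy state the unvisited high-prior move wins the selection despite its lower *Q*, which is the exploratory behavior the decaying bonus is designed to produce.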

**b. Expansion**

When a simulation reaches a leaf node *s_L* whose visit count exceeds a threshold, the leaf may be expanded into a new node that is processed once by the SL policy network *p_{σ}*. The output probabilities are then stored as prior probabilities *P* for each legal action.
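A sketch of that expansion step; the threshold value and the dictionary node layout are hypothetical, and the SL priors are passed in rather than computed by a network.

```python
N_THRESHOLD = 40   # hypothetical expansion threshold

def maybe_expand(node, sl_priors):
    """Expand a leaf once its visit count exceeds the threshold, storing
    the SL policy output as the prior P(s, a) of each new child edge."""
    if node["N"] > N_THRESHOLD and not node["children"]:
        node["children"] = {a: {"P": p, "N": 0, "Q": 0.0}
                            for a, p in sl_priors.items()}
    return node

leaf = {"N": 41, "children": {}}
maybe_expand(leaf, {"a": 0.7, "b": 0.3})
```

Gating expansion on a visit threshold keeps the expensive policy-network evaluation reserved for nodes the search actually cares about.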

**c. Evaluation**

The next phase consists of evaluating the leaf node *s_L* as a weighted average, controlled by a mixing parameter λ, of the value network output *v_{θ}(s_L)* and the outcome *z_L* of a random rollout played out to a terminal state using the fast rollout policy *p_{π}*.
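In the paper this mixed evaluation is *V(s_L)* = (1 − λ) *v_{θ}(s_L)* + λ *z_L*, which reduces to the value network alone at λ = 0 and the rollout alone at λ = 1:

```python
def evaluate_leaf(v_theta, z_rollout, lam):
    """Mixed leaf evaluation: V(s_L) = (1 - lam) * v_theta(s_L) + lam * z_L."""
    return (1 - lam) * v_theta + lam * z_rollout
```

The two extremes correspond to the λ = 0 variant tested in the Results section and to plain Monte Carlo rollouts, with λ = 0.5 weighting both sources equally.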

**d. Backup**

The final phase consists of updating the action values and visit counts of all traversed edges of the tree in a backward pass. At the end of the search, each edge has accumulated the visit count and mean evaluation of all simulations passing through it, and the algorithm chooses the most visited move from the root.
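The backup can be sketched as an incremental running mean over the edges on the simulated path (the edge representation is the same hypothetical dictionary layout as above):

```python
def backup(path, value):
    """Backward pass: increment the visit count of every traversed edge
    and fold the leaf evaluation into its running mean action value Q."""
    for edge in path:
        edge["N"] += 1
        edge["Q"] += (value - edge["Q"]) / edge["N"]   # incremental mean

edge = {"N": 0, "Q": 0.0}
backup([edge], 1.0)   # first simulation through this edge evaluated to 1.0
backup([edge], 0.0)   # second evaluated to 0.0
```

After the two simulations, *Q* holds their mean and *N* their count, which is exactly the per-edge statistic the final most-visited-move rule relies on.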

# Results

The strength of computer Go programs can be evaluated by running an internal tournament and measuring the Elo rating of each program. All Go programs were executed on a single machine with identical specifications, and received a maximum of 5 s of computation time per move.

The results of the tournament show that AlphaGo is significantly stronger than any previous Go program, winning 494 out of 495 games. Going further, AlphaGo also played games with four handicap stones, that is, free moves for the opponent, and won more than 75% of these games against the strongest Go programs (Crazy Stone, Zen and Pachi).

To demonstrate the efficiency of the value network for position evaluation, the team also played games with variants of AlphaGo in which position evaluation used the value network alone (λ = 0). Even in that case, AlphaGo exceeded the performance of all other Go programs, although the mixed evaluation (λ = 0.5) performed best.

Finally, a distributed version of AlphaGo exploiting multiple machines, which won 77% of its games against single-machine AlphaGo and 100% of its games against other Go programs, was evaluated against Fan Hui, the European Go champion of 2013 to 2015, in a formal five-game match. AlphaGo won the match 5–0, becoming the first computer Go program to defeat a human professional player.

# Discussion

The aim of this work was to demonstrate the capabilities of artificial intelligence by addressing one of the most complex two-player strategy games, the game of Go. The Google DeepMind team developed a Go program that plays at the level of the strongest human players, a feat previously thought to be at least a decade away. Go was a grand challenge for artificial intelligence because of its difficult decision-making task, its intractable search space, and an optimal solution so complex that it appeared infeasible to approximate directly with a policy or value function.

However, the DeepMind team developed, for the first time, effective move selection and position evaluation functions using deep convolutional neural networks trained by a novel combination of supervised and reinforcement learning. These neural networks were combined with Monte Carlo rollouts in a high-performance tree search engine.

By reaching a professional level in Go, AlphaGo has provided hope that human-level performance can now be achieved in other seemingly intractable artificial intelligence domains.