The game of Go and AlphaGo

Sara Ghibaudo
Published in Betacom · Nov 8, 2021

Introduction

Welcome to this new article on reinforcement learning. Today we will introduce the game of Go and we will look at some algorithms written by DeepMind, a British artificial intelligence company, to play this game.

Let’s get started!

The game of Go

First of all, let’s introduce the game of Go. It is an ancient Chinese deterministic strategy game, often compared to chess, but, even though the rules are simple, the games themselves can become very complicated, since there are many alternatives for each move, as we can see in the image below.

Comparison between Go and chess

To play Go, we need a board with a grid of black lines, usually 19x19, and a set of black and white playing pieces, called stones, which are placed on the intersections of the lines, as we can see in the image above. Each player aims to surround more territory than the opponent with their stones. The players alternately place a stone on an empty intersection and, once placed, a stone cannot be moved. The only way a stone leaves the board is when it is completely surrounded by the opponent’s stones: in that case it is captured and removed. As we stated, the game is very simple, and essentially there are only two rules: every stone, or connected group of stones, must have at least one liberty, i.e. at least one adjacent empty intersection, otherwise it is captured; and a move cannot recreate a previous configuration of the board. The game proceeds until neither player wishes to make a move or one of them resigns.
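To make the liberty rule concrete, here is a minimal Python sketch (not AlphaGo’s code): it counts the liberties of the connected group containing a given stone on a hypothetical board represented as a 2D list, where 0 is empty, 1 is black and 2 is white. A group whose liberty count drops to zero is captured.

```python
def group_liberties(board, row, col):
    """Count the liberties of the connected group containing (row, col).

    board: 2D list, 0 = empty, 1 = black, 2 = white.
    A group with zero liberties would be captured.
    """
    colour = board[row][col]
    assert colour != 0, "no stone at this intersection"
    size = len(board)
    visited, liberties = set(), set()
    stack = [(row, col)]
    while stack:
        r, c = stack.pop()
        if (r, c) in visited:
            continue
        visited.add((r, c))
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < size and 0 <= nc < size:
                if board[nr][nc] == 0:
                    liberties.add((nr, nc))      # empty neighbour = one liberty
                elif board[nr][nc] == colour:
                    stack.append((nr, nc))       # same-colour stone joins the group
    return len(liberties)

# A black stone surrounded on three sides by white has only one liberty left.
board = [[0, 2, 0, 0],
         [2, 1, 2, 0],
         [0, 0, 0, 0],
         [0, 0, 0, 0]]
print(group_liberties(board, 1, 1))  # -> 1
```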

It is one of the most complex games due to the enormous number of possible game variations. The large board and the lack of restrictions allow a huge number of possible moves and, moreover, decisions made on one part of the board can affect a distant region of it.

At this point, we can look at a couple of innovative algorithms written by DeepMind to play Go.

AlphaGo

An intuitive idea for an algorithm that plays Go is to build a tree that contains all the possible states of the game, i.e. all the possible configurations of stones on the board. We can label each node with a value that records, for example, the number of won matches that passed through that state. In this way, it is possible to find the optimal strategy by choosing, at each step, the action that leads to the child node with the best value. However, such a tree becomes huge for problems with many available alternatives per move. Indeed, as early as the 11th century it was estimated that the number of possible board positions is around 10¹⁷², which is far greater than the number of atoms in the observable universe (about 10⁸⁰). For this reason, we need to employ different and more sophisticated techniques.
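As a toy illustration of this naive idea (and definitely not of how AlphaGo works), the sketch below uses a hypothetical GameNode class that stores how many simulated games passing through a state were won, and greedily picks the child with the best win rate:

```python
from dataclasses import dataclass, field

@dataclass
class GameNode:
    """One board configuration in the (intractably large) full game tree."""
    wins: int = 0                                   # games through this node that were won
    visits: int = 0                                 # games through this node in total
    children: dict = field(default_factory=dict)    # move -> GameNode

    def best_move(self):
        # Greedily choose the child with the highest empirical win rate.
        return max(self.children,
                   key=lambda m: self.children[m].wins / max(1, self.children[m].visits))

root = GameNode(children={"A": GameNode(wins=7, visits=10),
                          "B": GameNode(wins=3, visits=10)})
print(root.best_move())  # -> "A"
```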

The first algorithm that we will analyse is called AlphaGo. It is “the first computer program to defeat a professional human Go player, the first to defeat a Go world champion, and is arguably the strongest Go player in history”, as we can read on the dedicated DeepMind page. To tackle such a problem, we need very advanced techniques like reinforcement learning (check the introductory article if you missed it!). Indeed, building a tree with all the possible states of the game and searching it for the optimal strategy with uninformed search is infeasible due to the high complexity of the game and the many available alternatives for each move.

Let’s look at a general description of the algorithm.

As we stated, we need to employ reinforcement learning to solve the problem. In particular, AlphaGo combines Monte Carlo Tree Search (MCTS) with two deep neural networks: a value network, which outputs the probability of winning from the current position by evaluating how good the current configuration of the game is, and a policy network, whose output is a probability vector defined over all the actions available from the current position and which is used to select moves.
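To make the roles of the two networks concrete, here is a heavily simplified, hypothetical PyTorch sketch (the real AlphaGo networks are much larger convolutional networks with richer input features; the 3-plane encoding and layer sizes below are assumptions for illustration only):

```python
import torch
import torch.nn as nn

BOARD_SIZE = 19
NUM_ACTIONS = BOARD_SIZE * BOARD_SIZE + 1   # every intersection plus "pass"

# Toy stand-ins for AlphaGo's two networks.
policy_net = nn.Sequential(                 # move probabilities from a board encoding
    nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.Flatten(),
    nn.Linear(32 * BOARD_SIZE * BOARD_SIZE, NUM_ACTIONS),
)
value_net = nn.Sequential(                  # scalar estimate of how good the position is
    nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.Flatten(),
    nn.Linear(32 * BOARD_SIZE * BOARD_SIZE, 1),
    nn.Tanh(),                              # value squashed into [-1, 1]
)

# A fake position: 3 planes (own stones, opponent stones, empty) over a 19x19 board.
position = torch.zeros(1, 3, BOARD_SIZE, BOARD_SIZE)
move_probs = torch.softmax(policy_net(position), dim=-1)  # vector over all moves
win_estimate = value_net(position)                        # how promising the position looks
print(move_probs.shape, win_estimate.item())
```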

MCTS is a policy search algorithm used to find the best action from each configuration of the game by estimating the action-value of each state-action pair (s, a). The action-value measures how good it is for the agent to be in s and choose a: if it is high, there is a good chance of winning the match. Moreover, MCTS builds the tree recursively and asymmetrically, since it focuses on the most promising nodes and discards the parts that correspond to the worst ones. Thanks to this approach, we have to deal with a much smaller, tractable tree.

Starting from the root, the algorithm descends the tree by selecting actions with a high value while, at the same time, favouring the exploration of new states. To do so, it uses a score that shrinks with the number of visits to the node and is proportional to a prior probability of selecting the action, returned by the policy network trained with supervised learning to predict expert moves. Once the algorithm reaches the most promising node, the one it wants to study, it selects one of that state’s children. At this point, an estimate of the state-value is obtained from the value network, which is initialised with supervised learning and trained with reinforcement learning from games of self-play. In addition, a fast simulation (rollout) is played until the end of the episode, the winner is recorded, and the state-value is defined by averaging the real outcome with the estimate returned by the neural network. Finally, the state-value is backpropagated through the nodes just visited, updating their values. In this way, in the next simulation from the root, actions will be selected using these updated, more precise values.
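The selection step can be made more concrete with a small sketch. The score below follows the PUCT-style rule described in the AlphaGo papers: an exploitation term Q(s, a) plus an exploration bonus that grows with the policy network’s prior P(s, a) and shrinks as the visit count N(s, a) grows. The constants, the dictionary layout and the 0.5 mixing default are simplifying assumptions, not the production code.

```python
import math

def puct_score(q, prior, parent_visits, visits, c_puct=1.0):
    """PUCT-style selection score used to descend the search tree.

    q             : current action-value estimate Q(s, a)
    prior         : policy-network prior probability P(s, a)
    parent_visits : total visits of the parent state
    visits        : visits N(s, a) of this action
    """
    exploration = c_puct * prior * math.sqrt(parent_visits) / (1 + visits)
    return q + exploration

def select_action(children):
    """children: dict mapping action -> {'q', 'prior', 'visits'}."""
    total = sum(child["visits"] for child in children.values())
    return max(children,
               key=lambda a: puct_score(children[a]["q"], children[a]["prior"],
                                        total, children[a]["visits"]))

def leaf_value(value_net_estimate, rollout_outcome, mixing=0.5):
    """In AlphaGo, a leaf's value mixes the value network's estimate
    with the outcome of a fast rollout played to the end of the game."""
    return (1 - mixing) * value_net_estimate + mixing * rollout_outcome

# The rarely visited move can win the selection thanks to the exploration bonus.
children = {
    "D4":  {"q": 0.52, "prior": 0.30, "visits": 12},
    "Q16": {"q": 0.48, "prior": 0.25, "visits": 3},
}
print(select_action(children))  # -> "Q16"
```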

This is a very innovative and powerful algorithm: in 2016 it beat the South Korean professional 9-dan player Lee Sedol four games to one. The match was filmed and is the subject of the interesting documentary AlphaGo.

AlphaGo Zero

A further improvement of AlphaGo is its successor, AlphaGo Zero, which uses reinforcement learning alone, without human knowledge. The problem with supervised learning is that expert data are expensive and sometimes unreliable or unavailable, so it is better to avoid them and use RL only.

AlphaGo Zero is trained on its own past experience and, in principle, it can surpass human abilities and succeed where humans fail. In contrast to AlphaGo, which employed two neural networks, AlphaGo Zero uses a single residual network. Its input is a tensor that represents the current and past positions of the black and white stones on the board. The use of a single network is a groundbreaking choice, since it combines the policy and value networks in one architecture with two outputs. Like AlphaGo, it uses MCTS, which employs this single network as a kind of tree policy to evaluate positions and select moves. However, it does not run any Monte Carlo rollouts as AlphaGo did, so the state-value is given only by the estimate returned by the neural network.
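The “one network, two heads” idea can be sketched as follows. This is again a toy PyTorch stand-in: the real model is a deep residual tower, the 17 input planes encoding current and recent positions of both players plus the colour to play follow the paper, and the layer sizes here are made up.

```python
import torch
import torch.nn as nn

BOARD_SIZE = 19
NUM_ACTIONS = BOARD_SIZE * BOARD_SIZE + 1

class DualHeadNet(nn.Module):
    """Toy two-headed network: one shared trunk, a policy head and a value head."""
    def __init__(self, in_planes=17, channels=32):
        super().__init__()
        self.trunk = nn.Sequential(                      # stand-in for the residual tower
            nn.Conv2d(in_planes, channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(),
        )
        self.policy_head = nn.Sequential(
            nn.Flatten(), nn.Linear(channels * BOARD_SIZE**2, NUM_ACTIONS),
        )
        self.value_head = nn.Sequential(
            nn.Flatten(), nn.Linear(channels * BOARD_SIZE**2, 1), nn.Tanh(),
        )

    def forward(self, x):
        h = self.trunk(x)
        return self.policy_head(h), self.value_head(h)   # (move logits p, value v)

net = DualHeadNet()
stones = torch.zeros(1, 17, BOARD_SIZE, BOARD_SIZE)      # current + past positions
p_logits, v = net(stones)
print(p_logits.shape, v.shape)                           # (1, 362) and (1, 1)
```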

The output of the network is the pair (p, v), where p is the policy vector defined over all the actions available from the state s that the network takes as input, and v is the value of s, i.e. the probability of winning from it. Using MCTS, the algorithm then defines a vector of probabilities 𝞹 for picking each action from each state. Usually, 𝞹 selects stronger moves than the ones proposed by the raw network, therefore Monte Carlo Tree Search can be seen as a policy improvement operator. For this reason, the parameters of the network are updated to bring the action probabilities and value (p, v) closer to the search probabilities and self-play winner (𝞹, z). The new weights are then used in the next iteration of self-play.
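The parameter update can be sketched as minimising a loss that pulls (p, v) towards (𝞹, z), along the lines described in the AlphaGo Zero paper; the weight-decay term and optimiser details are omitted here, and the function and variable names are mine.

```python
import torch
import torch.nn.functional as F

def alphago_zero_loss(p_logits, v, pi, z):
    """Combined training loss sketched from the AlphaGo Zero paper.

    p_logits : network move logits for a batch of positions
    v        : network value predictions, shape (batch, 1)
    pi       : MCTS search probabilities (the improved policy), same shape as p_logits
    z        : self-play outcomes in {-1, +1} from the current player's point of view
    """
    value_loss = F.mse_loss(v.squeeze(-1), z)                              # (z - v)^2
    policy_loss = -(pi * F.log_softmax(p_logits, dim=-1)).sum(-1).mean()   # -pi^T log p
    return value_loss + policy_loss                                        # + weight decay in practice

# Tiny example with a batch of 2 fake positions and 362 possible moves.
p_logits = torch.randn(2, 362)
v = torch.randn(2, 1)
pi = torch.softmax(torch.randn(2, 362), dim=-1)
z = torch.tensor([1.0, -1.0])
print(alphago_zero_loss(p_logits, v, pi, z))
```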

Self-play. This image is taken from “Mastering the game of Go without human knowledge” (Silver et al., 2017)

In the image above, we can see a representation of self-play. Starting from sₜ, the algorithm runs an MCTS using the latest network, obtains a search probability vector 𝞹, and chooses action aₜ according to it. It proceeds in this way until the end of the game, when the final state is scored according to the rules to compute the winner z.

At the end of the search, AlphaGo Zero updates the vector 𝞹, assigning to each action a a probability proportional to the number of times a was selected while descending the tree. The game ends when one of the two players resigns, i.e. when its estimated value falls below a certain threshold.
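The conversion from visit counts to the search probabilities 𝞹 can be sketched like this, where temperature is the parameter that controls how greedy the resulting policy is (the helper below is an illustrative assumption, not DeepMind’s code):

```python
import numpy as np

def search_policy(visit_counts, temperature=1.0):
    """Turn MCTS visit counts N(s, a) into the search probabilities pi.

    Each action gets probability proportional to N(s, a)^(1/temperature);
    a small temperature makes the choice almost greedy.
    """
    counts = np.asarray(visit_counts, dtype=np.float64) ** (1.0 / temperature)
    return counts / counts.sum()

print(search_policy([10, 30, 5, 55]))        # exploratory distribution over 4 moves
print(search_policy([10, 30, 5, 55], 0.1))   # nearly deterministic: picks the most visited move
```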

Finally, we can evaluate the performance of AlphaGo Zero and note that it outperformed AlphaGo after just 36 hours of training, while the latter had been trained over several months. On the other hand, if we compare the accuracy in predicting professional moves, the supervised-learning approach is better than the pure reinforcement-learning one. Since AlphaGo Zero nevertheless plays better, we can imagine that it learns strategies that are qualitatively different from the ones used by humans.

Conclusion

In this article, we analysed a famous and powerful algorithm. After AlphaGo Zero, DeepMind published further papers introducing AlphaZero and MuZero, which can play other games or handle more complex situations. Indeed, all these techniques can be applied to other problems in different settings.

See you next time!

References

Silver, D. et al. “Mastering the game of Go without human knowledge.” Nature 550, 354–359 (2017).