How deep is your mind, AlphaGo?
March 15, 2016, was a happy day for all AI enthusiasts and probably a sad day for the rest of humanity. On that day, the AI-powered AlphaGo program (designed and trained by the Google DeepMind lab) defeated Lee Sedol, one of the world's top players holding a 9-dan rank, by four games to one.
Many lances have been broken over what this means for all of us. Indeed, it is the first time in history that an AI has prevailed over a human in a game as rich and abstract as Go. However, for engineering folks like us, it is more interesting to see what is under the hood of AlphaGo.
This post is based on the paper Mastering the Game of Go with Deep Neural Networks and Tree Search from Google DeepMind.
Why is AlphaGo the next big step?
The fundamental difference between AlphaGo and previous AI milestones like IBM Deep Blue lies in the search. Deep Blue, like most chess-playing computers, relied on a brute-force minimax search with alpha-beta pruning (with plenty of smart tweaks, for sure), recursively evaluating the best move for every reachable game state. That approach cannot be applied efficiently to Go.
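For contrast, here is a minimal sketch of that brute-force idea: plain minimax search over a hypothetical game-state interface (the legal_moves, play, is_terminal, and score methods are placeholders, not anything from Deep Blue's actual engine). With roughly 35 legal moves per chess position this is workable; with roughly 250 legal moves per Go position it explodes.

```python
def minimax(state, depth, maximizing):
    """Exhaustive game-tree search in the spirit of classic chess engines:
    recursively evaluate every legal move down to a fixed depth."""
    if depth == 0 or state.is_terminal():
        return state.score()                      # hand-crafted evaluation at the horizon
    child_values = (minimax(state.play(move), depth - 1, not maximizing)
                    for move in state.legal_moves())
    return max(child_values) if maximizing else min(child_values)
```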
The game of Go has long been viewed as the most challenging of classic games for artificial intelligence due to its enormous search space and the difficulty of evaluating board positions and moves.
As is widely accepted, the number of possible board configurations in Go exceeds the number of atoms in the universe. Indeed, since each of the 361 points can be empty, black, or white, this number can be estimated as 3^361 ≈ 1.74×10^172 (compare with an estimated 10^80 atoms in the observable universe).
Due to this, a search over all possible positions is simply infeasible. AlphaGo, however, solves this problem gracefully with a smart combination of MCTS and deep learning.
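A quick back-of-the-envelope check of those figures in Python:

```python
from math import log10

# Each of the 361 intersections on a 19x19 board can be empty, black, or white,
# so the raw count of board configurations is 3^361.
exponent = 361 * log10(3)                              # log10(3^361) ~ 172.24
mantissa = 10 ** (exponent % 1)                        # ~1.74
print(f"3^361 ~ {mantissa:.2f} x 10^{int(exponent)}")  # 1.74 x 10^172

# Commonly cited estimate: ~10^80 atoms in the observable universe.
print(f"~10^{int(exponent) - 80} times more board configurations than atoms")
```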
Basic AlphaGo’s architecture
AlphaGo uses two types of neural networks — a value network to evaluate board positions and a policy network to select moves.
Since the theoretical breadth of Go (the number of possible moves at each position) is enormous, the policy network tries to reduce it by predicting the most probable next moves, based on what it learned during training.
The value network estimates the strength of the current position or, in other words, rapidly estimates the probability of winning from any given position.
As a result, AlphaGo evaluates thousands of times fewer positions than Deep Blue did by selecting positions more intelligently (using the policy network) and evaluating them more precisely (using the value network).
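In code terms, the division of labour looks roughly like this. This is a conceptual sketch with hypothetical policy_net and value_net callables, not AlphaGo's actual interfaces:

```python
def candidate_moves(position, policy_net, top_k=5):
    """Policy network: cut the search breadth by keeping only the most
    probable moves instead of all ~250 legal ones."""
    priors = policy_net(position)                     # dict: move -> probability
    return sorted(priors, key=priors.get, reverse=True)[:top_k]

def estimate_position(position, value_net):
    """Value network: cut the search depth by scoring a position directly
    instead of playing every line out to the end."""
    return value_net(position)                        # scalar estimate of the expected outcome
```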
Both neural networks are trained by a combination of supervised learning and reinforcement learning from games of self-play. So let's look at how this training is done.
Supervised Learning of Policy Networks
The policy network consists of 12 convolutional layers; a final softmax layer outputs a probability distribution over all legal moves. Since the purpose of the network is to predict the opponent's next move, it was trained on a database of real Go games available online (~150K games, ~30 million positions). Interestingly, this dataset did not contain the games of top professionals but rather the games of Go enthusiasts playing on the KGS Go Server.
The network takes a 19×19×48 stack of input feature planes, encoding the raw board position together with hand-crafted Go features such as liberties, capture counts, and ladder status.
The achieved prediction accuracy is about 57% on a held-out test set using all input features, and 55.7% using only the raw board position and move history as inputs. These seemingly modest numbers mostly mean that human players are hard subjects to predict.
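As an illustration, a policy network of this shape takes only a few lines of PyTorch. The 192 filters and the 5×5/3×3 kernels below follow the paper's description, but this is a simplified sketch, not DeepMind's implementation:

```python
import torch
import torch.nn as nn

class PolicyNet(nn.Module):
    """Convolutional policy network: 19x19x48 feature planes in,
    probability distribution over the 361 board points out."""
    def __init__(self, planes=48, filters=192):
        super().__init__()
        layers = [nn.Conv2d(planes, filters, kernel_size=5, padding=2), nn.ReLU()]
        for _ in range(11):                                   # 3x3 convolutional layers
            layers += [nn.Conv2d(filters, filters, kernel_size=3, padding=1), nn.ReLU()]
        layers += [nn.Conv2d(filters, 1, kernel_size=1)]      # final 1x1 conv: one logit per point
        self.body = nn.Sequential(*layers)

    def forward(self, x):                                     # x: (batch, 48, 19, 19)
        logits = self.body(x).flatten(1)                      # (batch, 361)
        return torch.softmax(logits, dim=1)                   # move probabilities
                                                              # (the real network masks illegal moves)

probs = PolicyNet()(torch.zeros(1, 48, 19, 19))               # sanity check: each row sums to 1
```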
Reinforcement Learning of Policy Networks
The next step is reinforcement learning (RL). At this stage, different versions of the policy network play against each other. Reinforcement learning improves the SL policy network by optimising for the final outcome of these self-play games, adjusting the policy towards the true goal of winning rather than maximising predictive accuracy.
This makes the RL policy network significantly stronger than the initial SL policy network: in head-to-head play, the RL version won more than 80% of its games against the SL version.
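In code, the update rule is essentially REINFORCE: moves from games the network went on to win are made more likely, moves from lost games less likely. A minimal PyTorch-style sketch, assuming a policy network like the one sketched above and a finished self-play game recorded as lists of states and moves plus the final outcome:

```python
import torch

def reinforce_update(policy_net, optimizer, game_states, game_moves, outcome):
    """One REINFORCE step on a finished self-play game.
    outcome: +1 if the network being trained won the game, -1 if it lost."""
    optimizer.zero_grad()
    loss = 0.0
    for state, move in zip(game_states, game_moves):  # state: (1, 48, 19, 19) feature tensor
        probs = policy_net(state)                     # (1, 361) move probabilities
        log_prob = torch.log(probs[0, move])
        loss = loss - outcome * log_prob              # push up winning moves, push down losing ones
    loss.backward()
    optimizer.step()
```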
Reinforcement Learning of Value Networks
The final stage of the training pipeline focuses on position evaluation: predicting the game's outcome from the current position. The value network has an architecture similar to the policy network's but outputs a single scalar prediction instead of a probability distribution.
The naive approach of predicting game outcomes from data consisting of complete games leads to overfitting because successive positions are strongly correlated, differing by just one stone, but the regression target is shared for the entire game. When trained on the real games dataset, the value network memorised the game outcomes rather than generalising to new positions.
To mitigate this problem, the AlphaGo team generated a new self-play dataset of 30 million distinct positions, each sampled from a separate game. Every such game was played by the RL policy network against itself until termination.
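A sketch of that data generation and training step, with self_play_games and value_net as hypothetical placeholders (each game is a list of position tensors plus the final ±1 outcome):

```python
import random
import torch
import torch.nn as nn

def build_value_dataset(self_play_games):
    """One randomly chosen position per game, labelled with that game's final
    outcome, so strongly correlated successive positions never end up in the
    training set together."""
    dataset = []
    for positions, outcome in self_play_games:        # positions: list of (48, 19, 19) tensors
        dataset.append((random.choice(positions), float(outcome)))
    return dataset

def train_value_net(value_net, dataset, optimizer):
    """Plain regression from position to expected game outcome (MSE loss)."""
    loss_fn = nn.MSELoss()
    for position, outcome in dataset:
        prediction = value_net(position.unsqueeze(0)).squeeze()
        loss = loss_fn(prediction, torch.tensor(outcome))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```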
Searching with Policy and Value Networks
Finally, AlphaGo combines the policy and value networks in an MCTS algorithm.
To efficiently combine MCTS with deep neural networks, AlphaGo uses an asynchronous multi-threaded search that executes simulations on CPUs, and computes policy and value networks in parallel on GPUs. The final version of AlphaGo used 40 search threads, 48 CPUs, and 8 GPUs.
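The heart of that search is the selection rule applied at every tree node: pick the move maximising the action value Q(s, a) plus an exploration bonus u(s, a) that is proportional to the policy network's prior P(s, a) and decays with the visit count N(s, a). A minimal sketch of this rule, with a hypothetical node/edge structure:

```python
from math import sqrt

def select_move(node, c_puct=5.0):
    """PUCT-style selection during tree traversal:
    argmax over moves of Q(s, a) + u(s, a), where
    u(s, a) = c_puct * P(s, a) * sqrt(sum_b N(s, b)) / (1 + N(s, a))."""
    total_visits = sum(edge.visits for edge in node.edges.values())

    def puct(edge):
        exploration = c_puct * edge.prior * sqrt(total_visits) / (1 + edge.visits)
        return edge.value + exploration               # Q(s, a) + u(s, a)

    return max(node.edges, key=lambda move: puct(node.edges[move]))
```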