“A Lego stormtrooper on sand” by Daniel Cheung
Jonathan Hui
May 13, 2018 · 16 min read

How does reinforcement learning join force with deep learning to beat the Go master? Since it sounds implausible, the technology behind it must be impossible to comprehend. In reality, it is not. If you know how to train a CNN classifier, you will have the knowledge to understand most of the material here. In this article, we cover how AlphaGo is trained and how it makes moves to beat the human.

Imagine that you have a Go master named Yoda as your advisor in a tournament. Whenever it is your turn, you ask Yoda for the next move. After a while, you decide to collect all his previous games and imitate whatever Yoda does. However, there are way too many move combinations (states). So we model a function f which takes a Go board and outputs a probability distribution over all possible moves. To make a move, we can pick one with the highest probability or sample a move from the probability distribution.

Sound odd but it works, and the basic concept is just like a deep network classifier.

In fact, we repurpose a deep learning classifier to model f. The classifier composes of 13 layers containing alternative convolutional filters and rectifiers followed by a softmax classifier. Since this network is created by supervised learning, it is named SL policy network.

A Go board has a 19 × 19 grid. The following are 2 different board positions (states).

2 different board positions

However, the classifier does not take a raw image. It takes a 19× 19 × 48 input feature to represent the board. But these features are pretty simple and easy to derive. (No complicated handcrafted features)

Input features to the SL policy networks

Let’s introduce a few RL terms so the equations will be less intimidating. action a is our move, state s is our board position and policy p is the probability distribution of taken action a.

Training

To train the SL policy network, AlphaGo collects moves for 30 million board positions. Then it applies the backpropagation in deep learning to train the model parameters σ. (This is the same way you train a deep network classifier.)

AlphaGo uses 50 GPUs to train the network and it takes 3 weeks. The SL policy network achieves a 57% accuracy. It sounds not too accurate but this policy network can beat an advanced amateur already.

Rollout policy

During game simulation (discussed later), we need something faster to narrow our search for moves. We create another policy called rollout policy π which use a more simple linear softmax classifier. It takes 2 μs to compute each move instead of the 3 ms in the SL policy network. This rollout policy has a lower 24.2% accuracy (compared to 55.7%) but it is 1500x faster.

Reinforcement learning (RL) of policy networks

Now we have a policy network approximate the master moves. But it is not accurate enough to beat a master yet. In the second stage of training, AlphaGo starts playing games with itself to formulate better policy.

First, we duplicate SL policy network and call it RL policy network ρ. We use the policy gradient reinforcement learning to improve RL policy network iteratively. We start playing games using the new RL policy network. For our opponent, we randomly reuse one of the older RL policy networks. We don’t play each other with the same policy network because it will overfit the current policy. (i.e. it will memorize the moves rather than generalize it for different opponents.) We play until the game is finished. For time step t, the game result zt is equal to 1 if we win at the end or -1 if we lose. We modify the model parameter ρ to make the RL policy easier to win. In fact, this is almost the same as the deep learning (DL) backpropagation except we use the game result z to determine which direction to move.

The RL policy network takes about one day to be trained with 50 GPUs.

Reinforcement learning (RL) of value networks

Master Yoda is hard to understand and worse unpredictable. The RL policy network suffers the same issue. The result of a Go game can vary with a small change. A small “inaccuracy” in the RL policy network can change a win to a loss. A human prioritizes the search for moves and evaluates a board position accordingly. In the last stage of AlphaGo training, we want to duplicate the human capability in evaluating a board position. We train a deep network to estimate the value of our positions. In Go, the value equals 1 if we win or -1 if we lose. Later, we complement our policy network and value network with each other to search and to make better moves.

In RL learning, value measures how good to be in a state s. We often refer value as value function.

To train the value network, we play games against each other using the same RL policy network. The design of the value network is similar to the policy network except it outputs a scalar value instead of a probability distribution. We compute the mean square error MSE of our value function v(s) with the game result z. Just like supervised learning, we use backpropagation to modified our model parameters θ.

From the start to the finish of a game, we can collect many board positions from the move sequence and add them to the training dataset. However, we do not collect more than one board position per game. All board positions in a game lead to the same result (a win or a loss), they are strongly related. To train a model effectively, we want samples in the training dataset to be independent. Therefore, we use the RL policy network to play more than 30 million games and collect only one position from each game into the training dataset. The AlphaGo value network is trained with 50 GPUs for one week.

Here is a visualization of the prediction from the policy network and the value network for the corresponding board position. Similar to human, AlphaGo starts the game near the corner.

Recap

So what do we achieve so far? Our training is done and we use deep learning and reinforcement learning to build

  • a policy network that tells us what moves are promising, and
  • a value network that tells us how good a board position is.

We know both networks are not perfect and tiny mistakes may change the game result. Our last task is to complement both networks with each other to make better searches for our final move.

Monte Carlo Tree Search (MCTS)

Intuition

Simulating moves suggested by the policy and the value network is the best way to find the next move. We simulate games starting from the current board position. We search for moves and play simulated games until the end. If we play enough games, we will get an idea which move likely gets the most wins. In the process, we build a search tree recording sequences of moves we simulated and how many wins for each move. But the search space for Go is too large, we need to prioritize it so we can find the best moves in fewer games. Luckily, we get great advice from the policy and the value network. We use both information to prioritize our searches.

But that is not enough. We should explore moves that we know little in our simulation so far (exploration). Maybe the advice is off so this helps us to verify moves better. On the other hand, if we win a simulated game, we want to try the corresponding moves more often (exploitation).

Hence, our search priority depends on:

  • predictions from the policy network,
  • predictions from the value network,
  • how many times we have pick the move, and
  • simulated game results with wins.

MCTS

Monte Carlo Tree Search (MCTS) is the algorithm we use to prioritize and build this search tree. It composes of 4 steps below.

Selection

MCTS simulates many games to find out how good are the moves. The purpose of the selection step is to prioritize moves for further simulations. If our priority is correct, we can play fewer games to pick a better move.

For each simulation, we first select a path in our search tree. (We will discuss later on how to build and to expand this search tree). Let’s assume our search tree looks like the one on the left-hand side below. The top of the tree is our current board position and the tree has 3 more board positions. Each edge is the move it takes to get from one position to another. i.e. each edge (s, a) represents an action a to transit from state s to state s’. On the right side, we compute Q + u for all nodes and pick the children with the greatest value for each parent. This is the promising path (the red arrows) that we want a further simulation.

Q and u are the functions defined below. Q is for the exploitation and u is for the exploration.

Don’t get scared! The first equation means picking the action a (move) that have the highest value of Q + u. If we unrolled Q & u, our selected move depends on:

These are the same criteria discussed in the intuition section before: just formulate them in equations only.

For every move (edge) we selected in the path, we increment the visit count N(s, a) by 1. The equation 1(s, a, i) below equals to 1 if the move (s, a) is selected (0 otherwise). (i represents the simulated game i.)

The rest of equations will be discussed later.

Expansion

Our search tree starts with the current position.

Then we add more positions into the tree reflecting what moves (edges) we have tried.

Whenever we expand a tree, we initialize new edges (s, a) with:

P(s, a) is the probability indicating how good the move a is. It is initialized by the SL policy network and will not be updated anymore. It is used in the exploration in computing u. You may ask why it is not initiated by the RL policy network. It turns out we want to explore more diversify moves. The top moves in RL policy network are good but less diversified. SL policy network is learned from human experience. It turns out human tries more diverse options in exploring moves.

In the following example, our selected path is in red.

We add a new leaf node (SL) to expand the tree. We will compute its value function from the value network.

And we initialize all the edges from the leaf node and compute P (aka P(s, a)) using the SL policy network σ.

So what is Q? We use it to pick the path in the selection. Why do we use Q instead of P to evaluate how good a move is?

Both P and Q(s, a) both are called action-value function. It measures how good to take a move. However, P is computed from the SL policy network only. Q runs a weighted sum for the leaf nodes’ value functions

and the game results z.

So Q combines more information. Also, the value function is computed from the value network θ. It is trained by the RL policy network which is considered better than the SL policy network. So Q is more accurate and we use it for exploitation. P is more diverse and we use it for exploration in calculating u.

Evaluation

In the evaluation phase, we simulate the rest of the game using Monte Carlo Rollout starting from the leaf node. It means finish playing a game using a policy and find out whether you win or loss.

In this case, we use the rollout policy instead of the SL policy or the RL policy. Even it is lower in accuracy but it is 1500x faster. We need the speed to simulate many rollouts. To play the rest of the game, we sample moves using the rollout policy.

Backup

After the evaluation, we know whether our moves win or lose the game. Now we compute Q to remember how well to make a move from the game results and the leaf node’s value function.

We update Q with:

i.e. Q is a weighted average on the result z of simulate games and the value function v(s) of the leaf node.

After many game simulations, we should have a reasonable estimation on how well to take a move based on Q. Nevertheless, AlphaGo does not use the Q to determine the next move for the current board position. It uses the move with the highest N (how often we take this move in the game simulations). As a simulation progress, exploration decreases and exploitation dominates. The most frequent visited move is the most promising one also. But N is less prone to outliner than Q.

After so many steps, we finally decide the move to beat the master!

This search tree will be reused for the next move. The selected move becomes the root. We keep all the nodes and their statistics under the current root and discard the rest.

What is the RL policy network for?

If you look at the equations in MCTS, the RL policy network is not used to select any moves. We use the SL policy network for exploration. The value network and the game results for exploitation. We spend so much effort training RL policy network. Why are we not using it? Actually, we do. We use the RL policy network to train the value network that is heavily used in the exploitation.

How to run the MCTS?

To run MCTS efficiently, the searches (simulation) are run on 48 CPU in 40 threads. The policy and value evaluations run on 8 GPU.

What AlphaGo improves?

The search space for Go is too big. To win, Go players need to prioritize searches and to evaluate positions very well. Using supervised learning, we create a policy network to imitate expert moves. With this policy, we can play Go at the advanced amateur level. Then, we let the policy network to play games with itself. Using reinforcement learning, we apply the game results to refine the policy network further. We also train a value network to evaluate positions. Our training is done. But we cannot beat the masters yet. To decide the next move, we simulate games to find the best move. But it is not that simple. We use the policy network and the value network to narrow down the search. To mitigate errors in our evaluation, we compute a weighted average from our game results and our board evaluation. Even we can only simulate a limited number of games, we hope it is enough and the weighted average will be more accurate. AlphaGo is right. This strategy beats the Go masters. In the table below, we see how AlphaGo improves its Elo rating (a Go player rating) in applying different combinations of policy network, value network and Monte Carlo Rollouts.

Why AlphaGo win?

I am no expert in Go to give the right answer. But if you watch Ke Jie (a Go champion) interviews on his match with AlphaGo, you may find some clues. A small part may be contributed to the player anxiety. But largely, the way to beat an opponent does not work anymore for AlphaGo. A human develops insight to solve problems. We recognize patterns, and apply learned solutions (often subconsciously) or avoid them. The pre-trained policy and value networks are the AlphaGo insights. AlphaGo has more computation power, and it can develop models that can be more subtle than us. On the human side, our mental capacity is limited, we break down the game plan into intermediate goals and define strategies and tactics to achieve them. In AlphaGo, “win” or “loss” is the only objective function in the calculation and it uses long sequences of lookahead learned from the simulated games. AlphaGo objective is much direct. If it can pull it off, the solution will be more optimal than our objectives. No wonder Jie feel AlphaGo moves are well-rounded. AlphaGo uses different methodology to create moves. It can look further ahead. Half of the moves are not what Jie guessed. Human insight is still useful but it is not in full gear. In fact, we can learn from AlphaGo to develop new insights and new moves. In additional, if we want to win, we need to explore opponent weakness. By no mistakes, deep learning can make far more stupid mistakes than human. But since we train the deep network with a stochastic method, it is more random and much harder to exploit without access to those million parameters in the deep network model.

Human losses the Go game to AI. But it is not a fair comparison because our approaches are different. A human does not have the computation speed of computers, and therefore we rely on abstract thinking and analysis to solve problems. AlphaGo can take advantage of the computation power, and little less on the abstract thinking part. Researchers find ways to search the Go space far more efficient. The technology is actually not too hard to comprehend but there are a lot of details to make it happens. AlphaGo proves that by combining raw speed, deep learning, and reinforcement learning, it can beat us.

We may say we make the computer smarter. But I prefer to say intelligence is simpler than we thought.

Further readings

In late 2017, DeepMind released AlphaGo Zero. This is a must read for those interested in AlphaGo. AlphaGo Zero not only beat AlphaGo easily, it does not need any real games or human domain knowledge to train the deep network.

Reference

Network design

The policy network composes of 12 hidden layers of convolution filters with zero padding and stride 1 to maintain the spatial dimension. The network takes 19×19×48 input features to represent the 19×19 board. The first layer uses 5×5×49×192 filters while the rest use 3×3×192×192 filters. For each layer, we add rectifiers to introduce non-linearity. The final layer is a 1×1×192×1 filter with different biases for each location followed by a softmax function. The value network looks similar. But hidden layer 12 is an additional convolution layer, layer 13 is a 1×1×192×1 filter and layer 14 is a fully connected layer with 256 rectifiers. The output layer is a fully connected layer with a single tanh output. AlphaGo uses 192 filters with the highest accuracy and an acceptable processing time:

Paper & credits

Most of the tables and figures are modified from the original paper:

Mastering the game of Go with deep neural networks and tree search

Ke Jie and Fan Hui on AlphaGo

Search algorithm

In the AlphaGo implementation, it adopts 2 counters to count the visit. Because the simulation is run in multiple threads, the extra counter is artificially inflated temporarily so other threads are discouraged to pick the same moves when this thread is still running. This encourages parallel threads to explore different moves at the same time.

During expansion, we add a leaf node to the search tree when the parent is selected in our path. In AlphaGo, a node is added only if the parent is visited more than 40 times (controlled by the expansion threshold hyperparameter). But this threshold will be self-adjusted in runtime. And the node to be added will be selected by another policy similar to the rollout policy but using more input features.

Hyperparameter configuration

Expansion threshold and exploration constant are higher than the illustration examples. It indicates the important of exploring in tree search to beat the master.

Rollout policy features

For 1500x faster processing speed, AlphaGo uses a rollout network instead of the policy network to sample moves in the Monte Caro rollout. However, it needs to handcraft input features to improve the accuracy.

Visualization

Let’s see what moves (the red circle below) are recommended by different methods.

Diagram a below is the estimation from the value network. We show the positions with the highest values from the value network. The red one has the highest value from the value network. Diagram b is the value Q calculated from MCTS when we use the value network but ignore the rollout result (λ = 0). Diagram c use the rollout result but ignore the value network (λ = 1). Diagram d is from our SL policy network. Diagram e is from the count N. Diagram e is the path with maximum visit count.

Welcome to a place where words matter. On Medium, smart voices and original ideas take center stage - with no ads in sight. Watch
Follow all the topics you care about, and we’ll deliver the best stories for you to your homepage and inbox. Explore
Get unlimited access to the best stories on Medium — and support writers while you’re at it. Just $5/month. Upgrade