Lessons From AlphaZero (part 4): Improving the Training Target

This is the fourth installment in our series on lessons learned from implementing AlphaZero. Check out Part 1, Part 2, and Part 3.

One innovation in our AlphaZero approach involves the target for the value output of the network. We have found an alternative training target for the network’s value output that, in our Connect Four experiments, outperforms the target used by AlphaZero.

Training Target Review

The original paper specifies two outputs for the neural network: a policy output and a value output. The policy output, which predicts the best move in the current game position, is trained against a value represented by π, which is a probability distribution based on the visit count accumulated by MCTS search during self-play. The value output, which predicts the result of the game, is trained against the value z, which is the result of the self-play game from the perspective of the current player. In other words, if the current player ultimately won the game from which the position was sampled, the z value for training would be +1, and if it lost the game the z value would be -1. Drawn games are scored as 0. If this description is confusing, the following diagram may help:

Let’s unpack the process for generating a training target using z during the AlphaZero self-play process. Starting with an empty board (and player 1 to move), a game is played as follows:

  1. Run 800 simulations from the current state.
  2. Generate a policy from the visit counts of the children of the current state.
  3. Probabilistically play a move based on the policy, resulting in a new state.
  4. Repeat steps 1 through 3 for the opponent.
  5. Repeat until the game is over. The game result is in {-1, 0, 1}.
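
Steps 2 and 3 above can be sketched in a few lines of Python. This is a minimal illustration, not the original implementation: the function names, the `temperature` parameter and the visit counts are assumptions made for the example.

```python
import random

def policy_from_visits(visit_counts, temperature=1.0):
    """Turn the root's child visit counts into a probability distribution."""
    # Assumed convention: counts are raised to 1/temperature, so a
    # temperature below 1 sharpens the policy and above 1 flattens it.
    scaled = [n ** (1.0 / temperature) for n in visit_counts]
    total = sum(scaled)
    return [s / total for s in scaled]

def sample_move(policy, rng=random):
    """Pick a move index at random, weighted by the policy."""
    return rng.choices(range(len(policy)), weights=policy, k=1)[0]

# Hypothetical visit counts for the 7 columns of a Connect Four position
# after 800 simulations.
visits = [10, 50, 120, 400, 150, 50, 20]
pi = policy_from_visits(visits)
move = sample_move(pi)
```

With a temperature of 1 the policy is simply the visit counts divided by their sum, so the most-visited column is the most likely move but not the only possible one.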

At the end of the game, every state gets labelled with the policy and the game result (taking care to negate the value for positions where it’s player 2 to move). The two outputs of the neural network are trained against this policy (π) and game result (z). If the MCTS simulation portion is still fuzzy, there is a very detailed explanation here.
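
The labelling step might look like the sketch below. It assumes that the first recorded state has player 1 to move, that the players alternate every ply, and that the game result is given from player 1's perspective. The function name and the tuple layout are hypothetical.

```python
def label_with_z(states, policies, result_for_player_1):
    """Label every position of one finished game with (state, pi, z).

    result_for_player_1 is +1, 0 or -1 from player 1's point of view.
    """
    examples = []
    for ply, (state, pi) in enumerate(zip(states, policies)):
        player_1_to_move = (ply % 2 == 0)
        # Negate the result for positions where player 2 is to move.
        z = result_for_player_1 if player_1_to_move else -result_for_player_1
        examples.append((state, pi, z))
    return examples
```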

An Alternative Training Target

There is another potential target for training the value output. During the MCTS search, each node accumulates an expected result of the game through the backup step. This value is known as the Q value for the node and is simply W/N, where W is the total score that has been propagated up the tree during simulation and N is the number of visits to the node. The root node of the search tree represents the current position in the game and therefore its Q represents the expected result of the game from this position. When we save the π value based on the visit count, we could also save the Q value of this root node as q and train the network against q instead of z. The process for generating a training target using q can be seen below:
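
To make W, N and Q concrete, here is a small sketch of a search node and the backup step. The sign convention is an assumption: each node stores values from the perspective of the player to move at that node, so the value flips at every level on the way up. Other implementations store values from the perspective of the player who just moved.

```python
class Node:
    """Minimal search-tree node holding only the backup statistics."""

    def __init__(self):
        self.w = 0.0  # W: total value backed up through this node
        self.n = 0    # N: number of visits to this node

    def q(self):
        """Q = W / N, or 0.0 for a node that has never been visited."""
        return self.w / self.n if self.n else 0.0

def backup(path, leaf_value):
    """Propagate one leaf evaluation up the tree.

    path is ordered from root to leaf; leaf_value is the evaluation from
    the perspective of the player to move at the leaf.
    """
    value = leaf_value
    for node in reversed(path):
        node.w += value
        node.n += 1
        value = -value  # switch perspective for the parent
```

After all simulations for a move have been backed up, `root.q()` is the value that would be saved as q for that position.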

The self-play steps for generating this training target are exactly the same as the five steps listed above. The only difference is in the labelling process at the end of the game. Instead of using the game result, every state gets labelled with the policy and the root node’s Q value. The two outputs of the neural network are trained against this policy (π) and Q value (q).
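
A sketch of this labelling step follows. It assumes the root Q recorded at each move is already expressed from the perspective of the player to move in that position, in which case no negation is needed. The `Example` type and function name are hypothetical.

```python
from typing import Any, NamedTuple, Sequence

class Example(NamedTuple):
    state: Any
    pi: Sequence[float]  # policy target from the visit counts
    q: float             # value target: Q of the search root

def label_with_q(states, policies, root_qs):
    """Label every position of one game with (state, pi, q)."""
    if not (len(states) == len(policies) == len(root_qs)):
        raise ValueError("expected one policy and one root Q per state")
    return [Example(s, pi, q) for s, pi, q in zip(states, policies, root_qs)]
```

Unlike z, the q label is available as soon as the search for a move finishes, without waiting for the game to end.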

A similar target was used in Thinking Fast and Slow with Deep Learning and Tree Search, with one crucial difference. Their MCTS used full playouts, whereas AlphaZero uses a truncated policy, backing up values directly from the neural network predictions instead of playing out the game for each simulation. The authors suggest this training target is a good proxy for z, but requires a smaller dataset to be significant.

The Difference Between z and q

When training with either z or q, we are attempting to encode information learned from MCTS into the network, but there are meaningful differences between the two approaches. Training against z attempts to encode the expected result of the game by condensing all of the simulations from the playout into a single discrete value: win, loss or draw. In contrast, training against q attempts to encode the expected result of the game as a continuous value using only the 800 simulations from the current position.

Based on these descriptions, z seems superior, but there is one large drawback of using z: each game has only a single result, and that single result can be heavily influenced by randomness. For example, imagine a position early in the game where the network has made the correct move, but later ends up choosing a suboptimal move and losing the game. In this case z will be -1 and training will incorrectly associate a low score with the position.

With enough training data, one would hope that these mistakes get overshadowed by correct play. Unfortunately, it is impossible to completely eradicate the mistakes, because the network explores during self-play due to its probabilistic policy. We theorize that this is one of the reasons dropping the temperature after 30 moves was so important to AlphaZero. Otherwise, randomness in move choice towards the end of game play could compromise the accuracy of z.

Training against q doesn’t suffer from the randomness problem. It doesn’t matter if the network ultimately lost the game. If the simulations give a good result for a move, the network will train against a positive value. In one sense, you can think of q as the average of 800 self-play games instead of one. These “games” are guesses by the network so they aren’t as accurate as 800 true z values would be, but they can achieve more consistency than the z values with a much smaller set of self-play games.
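
The consistency argument can be illustrated with a toy simulation. This is a deliberately simplified model and not a measurement of real self-play: it assumes a position with a fixed, hypothetical 60% win probability, treats each simulation as an independent win or loss, and ignores draws and the fact that real simulations are correlated network estimates.

```python
import random
import statistics

def single_game_target(p_win, rng):
    """A z-like label: one game, scored +1 for a win and -1 for a loss."""
    return 1.0 if rng.random() < p_win else -1.0

def mean_of_simulations(p_win, rng, n=800):
    """A q-like label: the average of n independent simulated outcomes."""
    return sum(single_game_target(p_win, rng) for _ in range(n)) / n

rng = random.Random(0)
p_win = 0.6  # hypothetical; the expected result is 2 * 0.6 - 1 = 0.2

z_like = [single_game_target(p_win, rng) for _ in range(200)]
q_like = [mean_of_simulations(p_win, rng) for _ in range(200)]

spread_z = statistics.pstdev(z_like)  # spread of the single-game labels
spread_q = statistics.pstdev(q_like)  # spread of the averaged labels
```

In this toy setting the averaged labels cluster tightly around the expected result while the single-game labels can only ever be +1 or -1, which is the sense in which q is more consistent per example.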

Unfortunately, q is not a perfect solution. It can suffer from something called the “horizon effect”. This can happen when the simulations return a positive result, but there is a killer response that is just beyond the search horizon, i.e. not visited during the 800 simulations. Additionally, q is somewhat meaningless for early moves in the first few generations of training because the network doesn’t know how to evaluate the position. In this case q stays close to 0 and it can take many generations for the value output to become significant.

Testing the New Target

Our conclusion from the above analysis is that both z and q have the potential to work, but each may suffer from unique drawbacks. Empirically, we can train a network successfully using q instead of z. In fact, for a short game like Connect Four, training against q works slightly better than training against z. But can we use our understanding to significantly improve AlphaZero’s approach?

Our experimentation shows that we can achieve better results by using both q and z together. One way to combine q and z is to average them for each example position and use that average to train the network. This seems to give the benefits of both: z helps counteract q’s horizon effect and q helps counteract z’s randomness. Another promising approach is to begin by training against z, but linearly fall off to q over a number of generations.
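
The averaging variant is a one-line change to the value target. The sketch below uses a hypothetical function name and made-up (z, q) pairs purely for illustration.

```python
def average_target(z, q):
    """Value target that weights the game result and the root Q equally."""
    return 0.5 * (z + q)

# Hypothetical (z, q) pairs for three training positions.
pairs = [(1.0, 0.4), (-1.0, 0.6), (0.0, -0.2)]
targets = [average_target(z, q) for z, q in pairs]
```

The second pair shows the intended effect: a game that was eventually lost (z = -1) from a position the search liked (q = 0.6) produces a target of about -0.2 instead of a flat -1.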

During the early cycles of training Connect Four, averaging and falloff perform equally well. Ultimately, the falloff approach achieves a slightly lower error percentage than averaging. Training speed with both of these approaches was a significant improvement over using either z or q alone. Below you can see a graph of the error percentage achieved by the first 20 generations of our Connect Four model using each of the four targets. In the linear falloff case we start by using 100% z for the first generation and transition to 100% q by the 20th generation.
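
One way to express the linear falloff is sketched below. It assumes generations are numbered from 1 and that the weight on q moves in equal steps from 0 in the first generation to 1 in the last. The function names and the exact interpolation are assumptions, since the schedule is only described in words above.

```python
def falloff_weight(generation, total_generations=20):
    """Weight on q: 0.0 in generation 1, 1.0 from total_generations onward."""
    if total_generations <= 1:
        return 1.0
    fraction = (generation - 1) / (total_generations - 1)
    return min(1.0, max(0.0, fraction))  # clamp outside the schedule

def falloff_target(z, q, generation, total_generations=20):
    """Blend z and q according to the linear schedule."""
    w = falloff_weight(generation, total_generations)
    return (1.0 - w) * z + w * q
```

For example, `falloff_target(z, q, 1)` trains purely against z and `falloff_target(z, q, 20)` purely against q, with each generation in between shifting the weight by 1/19.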

We have begun experimenting with training using both q and z for longer games, and so far the results look promising. It would be fascinating to see this approach applied to larger AlphaZero efforts like Leela Chess Zero.

Part 5 is now out.