Welcome back! In our last installment, we briefly reviewed how DeepMind’s AlphaZero algorithm quickly learns to play games better than the best humans (and sometimes the best existing algorithms). Recall that it uses a neural network-based Monte-Carlo Tree Search (MCTS) to play against itself, and uses the results of those self-play games to further train the network, in a self-reinforcing cycle.
In the next few posts, we’re going to share some lessons we learned from teaching it to play Connect Four.
Why Connect Four?
To implement and test AlphaZero, we first had to choose a game for it to play. Unlike the games Go (at one extreme of difficulty) or tic-tac-toe (at the other), Connect Four seemed to offer enough complexity to be interesting, while still being small enough to rapidly iterate on. In particular, we chose the common 6x7 variant (as pictured above).
It also has another benefit: since Connect Four is a fully solved game, we could test our model against the optimal strategy. That is, once we trained our network using AlphaZero, we could feed it test board positions for which we know the correct answer, to get a meaningful “ground truth” accuracy.
Additionally, it allowed us to discover how close to perfect our network could become with a given architecture. Let’s take a look at how that works.
The training and evaluation process goes like this:
- Use a Connect Four solver (i.e., a program that tells you the correct move for any board position) to generate labelled training and test sets.
- Select a neural network architecture. The AlphaZero paper gives a good starting point, but it stands to reason that different games will benefit from tweaks to it.
- Use supervised learning to train the network on the labelled training set. This gives us an upper bound on how well the architecture ought to be able to learn.
- Train AlphaZero using the same network architecture, and evaluate it with the labelled test set, to see how close it gets to that upper bound.
Suppose the AlphaZero-trained network achieves a test accuracy of 90%. What does this tell us?
- If the same architecture achieved 99% with supervised training, then the network can learn more than AlphaZero taught it, and we should improve the self-play component of AlphaZero.
- If supervised training only gets to 91%, then the network architecture itself is probably the limiting factor, and should be improved.
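That diagnosis can be captured in a small helper. This is just a sketch of the reasoning above; the function name and the 2% tolerance are our own hypothetical choices, not part of the actual pipeline:

```python
def diagnose(az_accuracy: float, supervised_accuracy: float,
             tolerance: float = 0.02) -> str:
    """Compare AlphaZero's test accuracy against the supervised upper bound.

    If supervised training does substantially better, the network could
    learn more than self-play taught it, so self-play needs work.
    Otherwise the architecture itself is the likely bottleneck.
    """
    if supervised_accuracy - az_accuracy > tolerance:
        return "improve self-play"
    return "improve architecture"
```

For the example above, `diagnose(0.90, 0.99)` points at self-play, while `diagnose(0.90, 0.91)` points at the architecture.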
Using this strategy helped us tune the network architecture, find bugs, and make other improvements in our MCTS self-play.
When generating labelled training data, we feed random board positions to a Connect Four solver. The question is, what constitutes “random?”
Because AlphaZero trains on its own increasingly well-played games, its network will eventually stop seeing poorly-played positions, so it would be unfair to test it on them. On the other hand, Connect Four is simple enough that there are only a small handful of perfect games, so it’s not possible to build a large training set using only positions from perfect play.
The other difficulty is determining what constitutes a correct answer for a position. In any given position, there may be multiple moves that lead to a win. Are faster wins better? How much better? If there is not a guaranteed win, should you pick the move that loses most slowly (assuming a perfect opponent), or one that has the highest potential to trip them up (if they are imperfect)? How do you make that precise?
In the end we decided to generate our labelled sets using positions from highly imperfect games, putting our AlphaZero-trained network at a disadvantage in the comparison. Yet it did surprisingly well with this handicap, as we shall soon see.
We also decided to use two different definitions of “best move.” When generating “strong” training/test sets, we label a move correct only if it leads to the fastest possible win (or slowest possible loss). In the “weak” case, all winning moves are given equal weight. To keep the comparison fair, we also trained AlphaZero in strong and weak modes, encouraging it to prefer faster wins in the former case (by scaling final result values based on game length). Let’s look at that briefly.
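Concretely, here is one way the two labelling rules could be implemented. The solver output format used here (a per-column pair of game-theoretic result and optimal game length) is our own sketch, not the actual solver interface:

```python
def correct_moves(outcomes: dict, strong: bool) -> set:
    """Label the 'correct' columns for a position.

    outcomes maps each playable column to (result, length), where result
    is +1 win / 0 draw / -1 loss for the player to move, and length is
    the total game length under optimal play after that move.
    """
    def preference(entry):
        result, length = entry
        # Prefer better results; among wins prefer shorter games,
        # among losses prefer longer ones.
        return (result, -length if result > 0 else length)

    best = max(outcomes.values(), key=preference)
    if strong:
        # Strong: only the fastest win (or slowest loss) counts.
        return {c for c, entry in outcomes.items() if entry == best}
    # Weak: every move with the best *result* counts, regardless of speed.
    return {c for c, (result, _) in outcomes.items() if result == best[0]}
```

For a position where columns 0 and 3 both win (in 9 and 7 moves respectively), the strong label is only column 3, while the weak label includes both.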
Strong vs Weak AlphaZero
During MCTS simulation, non-terminal positions get an initial value from the network, indicating the expected game result (in the range [-1, 1]). Terminal positions, on the other hand, don’t need such an estimate; we can assign them +/-1 directly. This is fine for weak training, where two winning positions are considered equally good no matter how long it took to get there. For strong training, we scale terminal results by the game length.
The fastest possible Connect Four win takes 7 total moves, and the longest game lasts 42 (using up the whole board). Some quick algebra shows that if n is the game length in moves, the expression 1.18 - (9n/350) yields a value in [0.1, 1] for a win: it equals 1 when n = 7 and 0.1 when n = 42. This makes MCTS prefer quick wins (and slow losses).
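A quick sanity check of that expression (the helper name is our own):

```python
def scaled_win_value(n: int) -> float:
    """Terminal value of a win after n total moves, scaled so the fastest
    win (n = 7) maps to 1.0 and the slowest (n = 42) maps to 0.1.
    A loss would use the negated value."""
    assert 7 <= n <= 42
    return 1.18 - (9 * n / 350)
```

Since the value decreases linearly with game length, MCTS backups accumulate more value along lines that win sooner.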
We ran two kinds of evaluations:
- Network-only: Feed the network a test board position and check whether its predicted move is correct.
- MCTS: Run MCTS (with 800 simulations) backed by the network and evaluate its preferred move.
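Both evaluations reduce to the same accuracy computation; only the move-selection function differs (the raw network versus an 800-simulation MCTS backed by it). A sketch, with hypothetical callables standing in for the real components:

```python
def accuracy(positions, choose_move, labelled_moves) -> float:
    """Fraction of positions where the chosen move is labelled correct.

    choose_move(p): the move picked for position p (by network or MCTS).
    labelled_moves(p): the set of correct moves from the solver.
    """
    hits = sum(choose_move(p) in labelled_moves(p) for p in positions)
    return hits / len(positions)
```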
Recall that we used two different training methods:
- AlphaZero (“AZ”) training.
- Supervised training (i.e., using labelled training data from the solver).
This gives us four combinations to evaluate: AZ-Network, AZ-MCTS, Supervised-Network, Supervised-MCTS.
And finally, there are two training modes (strong vs. weak). We’ll look at each in turn.
Strong mode results:
- Supervised-Network: 96.20%
- Supervised-MCTS: 97.29%
- AZ-Network: 95.70%
- AZ-MCTS: 96.95%
What this tells us is that:
- Our AlphaZero training does nearly as well as supervised training (despite being at the disadvantage mentioned previously).
- Although the network alone does remarkably well, using it in MCTS further reduces the error rate by about 29% (from ~4% to ~3%).
Weak mode results:
- Supervised-Network: 98.93%
- Supervised-MCTS: 99.79%
- AZ-Network: 98.83%
- AZ-MCTS: 99.76%
This time, AlphaZero training got even closer to supervised training, nearly matching it. Additionally, MCTS reduces the error rate by a whopping ~80%, getting nearly every position correct.
In the weak test set, there are on average 4.07 correct moves in a position (compared to 2.16 for the strong test set). This means that randomly guessing amongst the 7 possible moves would only get us to 58.1%.
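That baseline is simple arithmetic; the corresponding strong-set figure (about 30.9%) is the same calculation applied to the numbers above:

```python
columns = 7  # possible moves per position on a 6x7 board

weak_baseline = 4.07 / columns    # avg. correct moves in the weak test set
strong_baseline = 2.16 / columns  # avg. correct moves in the strong test set

print(f"weak: {weak_baseline:.1%}, strong: {strong_baseline:.1%}")
# weak: 58.1%, strong: 30.9%
```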
Connect Four is a great game to use for training (and finding bugs in) your implementation of AlphaZero. It allows for quick iteration, and is complex enough to be interesting. Since it is a solved game, it’s also possible to get an objective measure of how successful the training is.
When only looking for any optimal move (regardless of how quickly it wins, or avoids losing), and coupled with MCTS, the AlphaZero-trained network only makes a wrong move once in 400 positions. Even when having to find the fastest win (or non-loss), and using the network prediction alone, it only makes mistakes once in 25 positions. Pretty good!
In coming weeks, we’ll look at some of the tweaks we made to get the best results. See you then!
Part 3 is now published here.