In the process of developing GraphPipe at work, we selected a variety of use cases to help us better understand the characteristics of distributed deep learning workflows. A particularly interesting test case was AlphaZero, which we decided to implement because of its need for large-scale distributed computation.
What exactly is AlphaZero? AlphaZero is an algorithm developed by DeepMind as a generalization of AlphaGo, the famous program that beat the world’s best Go player. AlphaZero is particularly interesting because it can learn any perfect information game, such as Chess, Go, Reversi, and, you guessed it, Connect Four. It learns to play these games without any knowledge of game strategy: all it needs is the game rules (well, and maybe a little parameter tweaking). For the most part, the details of how to actually play the game are figured out by having the AI play thousands or even millions of games against itself.
As part of our open source release of GraphPipe, we provided pre-trained models for Connect Four. These are the models that constitute the AI behind AZFour.
If you are familiar with Connect Four, you should be able to start playing right away, but there are some additional controls on the page that are designed to help you explore the progression of learning throughout AlphaZero training.
AZFour lets you play against a selection of neural networks that have had varying amounts of AlphaZero training. The UI looks like this:
Let’s look more closely at each of these game control components.
The area immediately below the game board represents the neural network’s evaluation of the game board from the perspective of the current player. Famously, the AlphaZero algorithm trains its network with two outputs, Policy and Value, which are displayed in the app like this:
- Policy Output: In AlphaZero, the policy output is the algorithm’s evaluation of the current move choices. In the AZFour UI, higher percentages are associated with moves that the algorithm thinks are best.
- Value Output: This is the neural network’s belief about what the game outcome will be. To make this value a bit more tangible for users, the UI represents this output as text (like ‘I am 71% confident yellow will win’).
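To make the Value Output text concrete, here is a minimal sketch of how a scalar value could be turned into such a sentence. It is not AZFour's actual code: the function name `describe_value`, the assumption that the value lies in [-1, 1] from the current player's perspective, and the mapping from value to a confidence percentage are all hypothetical choices made for illustration.

```python
def describe_value(value, current_player, opponent):
    """Turn a value-head output into a human-readable sentence.

    Assumptions (not taken from AZFour): `value` is in [-1, 1], where +1
    means a certain win for `current_player` and -1 a certain loss, and
    confidence is the linear rescaling (1 + |value|) / 2.
    """
    if not -1.0 <= value <= 1.0:
        raise ValueError("value must be in [-1, 1]")
    confidence = (1 + abs(value)) / 2
    # A non-negative value favors the player to move, a negative one the opponent.
    favored = current_player if value >= 0 else opponent
    return f"I am {round(confidence * 100)}% confident {favored} will win"


print(describe_value(0.42, "yellow", "red"))
```

Under these assumptions, a value of 0 maps to 50% confidence, and a negative value names the opponent as the likely winner.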
In the Model dropdown, there is a selection of models labeled Generation 1 through 50, where higher generations correlate with better playing ability. For example, a model that was trained with only one batch of data (aka 1 Generation) plays more or less randomly, while a model trained for 50 Generations is quite strong. You can select a different model for each player.
You can control the skill of a particular model by using the Skill slider:
But what exactly does this slider do? What does ‘Skill’ mean?
When an AlphaZero network evaluates a position, its policy output generates a probability distribution across all possible moves. During typical competitive game play, the computer selects the argmax of this policy distribution as the next move.
However, to make AZFour more fun for humans (so that play varies from game to game, and so that the AI can show some weakness), the app is configured to select its next move probabilistically, but still based on the weights of the policy distribution. This is exposed through a setting called “Skill”, which mimics the temperature parameter (τ) of AlphaZero to control the shape of the policy distribution before selecting a move. To illustrate, here is the plot of the reshaped policy distribution after various Skill values have been applied to the policy output:
As you can see, the higher the Skill setting, the more closely move selection resembles argmax, which corresponds to a lower temperature in AlphaZero.
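The effect of the slider can be sketched in a few lines. This is an illustration, not AZFour's implementation: it raises each policy probability to a power and renormalizes, analogous to how AlphaZero applies its temperature to visit counts with exponent 1/τ. The names `reshape_policy`, `choose_move`, and `argmax_move` and the `sharpness` parameter are hypothetical, and how the Skill slider maps onto the exponent is an assumption.

```python
import random


def reshape_policy(policy, sharpness):
    """Raise each probability to `sharpness` (playing the role of 1/τ) and renormalize.

    sharpness == 1 leaves the distribution unchanged; larger values push
    the distribution toward its argmax.
    """
    if sharpness <= 0:
        raise ValueError("sharpness must be positive")
    powered = [p ** sharpness for p in policy]
    total = sum(powered)
    return [p / total for p in powered]


def choose_move(policy, sharpness, rng=random):
    """Sample a move index from the reshaped policy distribution."""
    reshaped = reshape_policy(policy, sharpness)
    return rng.choices(range(len(reshaped)), weights=reshaped, k=1)[0]


def argmax_move(policy):
    """Competitive play: always pick the most probable move."""
    return max(range(len(policy)), key=lambda i: policy[i])


policy = [0.5, 0.3, 0.2]  # made-up policy output over three moves
for sharpness in (1, 2, 10):
    print(sharpness, [round(p, 3) for p in reshape_policy(policy, sharpness)])
```

With a large exponent, nearly all of the probability mass lands on the top move, so sampling behaves almost exactly like `argmax_move`. With an exponent near 1, weaker moves are still chosen fairly often.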
If you select the Auto-play checkbox, the computer will automatically play moves for that color. By default, Auto-play is checked for Red, which means that when the page loads you are playing as Yellow.
You can play against the computer as either player. You can even select Auto-play for both players and watch the computer battle itself!
If you think you can beat the computer without any AI assistance, feel free to Hide Hints. This will hide the move percentage hints, and you’ll be on your own!
To illustrate how the network’s beliefs about board positions change as training progresses, consider the following position:
With the Generation 3 model, Yellow judges its best move (with 29% certainty) to be the center position. This is clearly a losing move: if Red then plays in either column 2 or 5, it will have two ways to win.
By Generation 7, the model evaluates the position like so:
Now the model correctly identifies that it needs to block Red from constructing an open-ended 3-in-a-row position, but it does not think it has an advantage. Now let’s see what our best model thinks about the position:
Yellow still identifies that it should choose position 2 or 5, and has nearly eliminated any other move as a possible play. Further, it now believes that Yellow will win this game. Indeed, this is a winning board position for Yellow (after 19 moves).
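The tactical point in this example, that an open-ended three-in-a-row gives Red two ways to win, can be checked mechanically. The sketch below counts a player's immediate winning columns. The board is a hypothetical position invented for illustration, not the one from the post, and the helpers `drop`, `has_four`, and `winning_columns` are likewise hypothetical. Columns are 0-indexed here, so the center is column 3.

```python
ROWS, COLS = 6, 7


def drop(board, col, piece):
    """Return a new board with `piece` dropped into `col`, or None if the column is full."""
    for row in range(ROWS - 1, -1, -1):  # scan from the bottom row upward
        if board[row][col] == ".":
            cells = [list(r) for r in board]
            cells[row][col] = piece
            return ["".join(r) for r in cells]
    return None


def has_four(board, piece):
    """True if `piece` has four in a row horizontally, vertically, or diagonally."""
    for r in range(ROWS):
        for c in range(COLS):
            for dr, dc in ((0, 1), (1, 0), (1, 1), (1, -1)):
                line = [(r + i * dr, c + i * dc) for i in range(4)]
                if all(0 <= rr < ROWS and 0 <= cc < COLS and board[rr][cc] == piece
                       for rr, cc in line):
                    return True
    return False


def winning_columns(board, piece):
    """Columns where `piece` wins immediately by playing there."""
    wins = []
    for col in range(COLS):
        nxt = drop(board, col, piece)
        if nxt is not None and has_four(nxt, piece):
            wins.append(col)
    return wins


# Hypothetical position, Yellow to move (top row first, "." is empty).
board = [
    ".......",
    ".......",
    ".......",
    ".......",
    "...Y...",
    "...RR..",
]

careless = drop(drop(board, 3, "Y"), 2, "R")  # Yellow plays center, Red extends to three
blocked = drop(drop(board, 2, "Y"), 5, "R")   # Yellow blocks one side first
print(winning_columns(careless, "R"), winning_columns(blocked, "R"))
```

In this made-up position, the careless center move leaves Red with two winning columns at once, which Yellow cannot both block, while blocking first leaves Red with a single threat that Yellow can answer.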
To see exactly how I deployed the AZFour production site, read my next post.
Want to see more escapades with the AlphaZero algorithm? Check out these posts: