AlphaZero from scratch in PyTorch for the game of Chain Reaction — Part 1

Bentou
Sep 24, 2022

Implementing one of the most fascinating AI algorithms from the last decade, exploring ideas that make it more efficient, and future directions.

In this article, we’ll be implementing DeepMind’s AlphaZero[1] from scratch in PyTorch for the game of Chain Reaction[2]. To make the learning process more efficient, we’ll also be using a relatively recent improvement called “Playout Cap Randomization”[3], along with some other techniques from [4]. During training, we’ll use multiprocessing to simulate multiple games in parallel. We’ll also discuss some future directions for AlphaZero, with pointers to relevant research papers.

Please note that the objective of these posts is not to build the best Chain Reaction bot using AlphaZero (as that would require a huge amount of computational resources), but rather to build a decent bot that can at least defeat a random agent, and to understand how AlphaZero works using Chain Reaction as an example.

The next section explains how Chain Reaction works. Feel free to skip it and move directly to the AlphaZero section if you only want to understand how AlphaZero works.

Chain Reaction

Let’s first understand the game of Chain Reaction. It’s a perfect-information game. We used to play it with friends in college, and after a few moves the board would often look quite chaotic and unpredictable to us, so we were curious how powerful an AlphaZero agent trained on this game would be. Chain Reaction can be played by many players, but we will restrict ourselves to two players in this article.

Rules of Chain Reaction

Let’s start with the rules of this game. We have a board with M rows and N columns, and we have two players. Each player has an assigned color. For the purposes of this article, say we have a red and a green player, and red moves first. The picture below shows some intermediate states in the game.

The game of Chain Reaction — some intermediate states. Note that the board’s color tells us whose turn it is.

The board has M rows and N columns; in the picture above, M = N = 5, so there are M*N = 25 cells. All cells are empty at the beginning of the game.

Also, these red and green circular objects are called orbs. The picture below shows the orbs that we can have in the game (1, 2 or 3 in number, and red or green).

Chain Reaction orbs

On a move, the player clicks any cell that is either vacant or colored the same as that player, which increases the number of orbs in that cell by one. The gif below shows some moves in the game.

Some moves in the beginning

There is a limit to how many orbs can reside in a particular cell: a cell can hold at most (number of orthogonally adjacent neighbors of that cell - 1) orbs. For cells in the middle that number is 3, for cells on the edges it is 2, and for corner cells it is 1. The picture below shows this maximum number of orbs for each cell in a 5x5 board.

Left: The maximum number of orbs that can reside in each cell. Right: All filled cells except the one in the center-left.
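As a quick sketch (an illustrative helper of our own, not code from the article), this per-cell limit can be computed directly from the number of orthogonal neighbors:

```python
import numpy as np

def max_orbs(m, n):
    """Maximum number of orbs each cell can hold: (orthogonal neighbors - 1)."""
    limits = np.zeros((m, n), dtype=int)
    for i in range(m):
        for j in range(n):
            neighbors = sum(1 for di, dj in [(-1, 0), (1, 0), (0, -1), (0, 1)]
                            if 0 <= i + di < m and 0 <= j + dj < n)
            limits[i, j] = neighbors - 1
    return limits

print(max_orbs(5, 5))
# Corners hold 1 orb, edge cells hold 2, and middle cells hold 3.
```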

But what happens when a player clicks a cell which already holds the maximum number of orbs? The cell splits, pushing all of its orbs out, one to each neighboring cell. The gif below shows the split for different kinds of orbs.

Orbs split when the number of orbs exceeds the limit in a cell

During a split, if the neighboring cell contains orbs from the other player, then those orbs change their color to that of the current player. Such a split can be seen below.

Enemy orbs in the neighboring cells change color after a split happens in the current cell

A split increases the orb count in the neighboring cells, so it can lead to further splits, starting a chain reaction of splits, which is where the game gets its name. Long chain reactions after just a single move are what make this game unpredictable towards the end.

Now we know how the game proceeds from one state to the next: there can be splits, or there can simply be an increment of orbs in a single cell. But how does a player win? The goal of the game is straightforward: a player has to eliminate all enemy orbs from the board.

A player wins when all enemy orbs are eliminated from the game

State of the game

What information do we need to store to capture any state of this game? We need two things: first, an M x N array that tells us the number and type of orbs in each of the M*N cells on the board, and second, whose turn it is, ‘red’ or ‘green’. We can use positive numbers to represent the number of red orbs and negative numbers to represent the number of green orbs. The following image shows an example of how to represent a board state.

State Representation
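Here is a minimal sketch of this representation (our own illustrative code; the article’s actual State class appears in the implementation section below). The `is_valid_move` helper is also ours, encoding the move rule from earlier, namely that a player may only click an empty cell or one of their own:

```python
import numpy as np

M, N = 5, 5
board = np.zeros((M, N), dtype=int)  # 0 = empty, +k = k red orbs, -k = k green orbs
turn = "red"                         # whose move it is

board[2, 2] = 2    # two red orbs in the center cell
board[0, 4] = -1   # one green orb in the top-right corner

def is_valid_move(board, turn, i, j):
    """A player may click an empty cell or a cell holding their own orbs."""
    if turn == "red":
        return board[i, j] >= 0
    return board[i, j] <= 0

print(is_valid_move(board, "green", 2, 2))  # False: the cell holds red orbs
```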

State Transition

Now that we know how to represent a state, we want to focus on a more important question: how can we get the next state given the current state? To get the next state, we need to know the cell where the player has clicked. Let’s call that cell the event cell.

We’ll do some processing on the event cell, which looks something like this: we add an orb to it and check whether the number of orbs exceeds the cell’s limit. If it does, then we need to split the orbs.

In case of a split, each of the neighbors of the event cell gets one orb, and then we process those neighbors, and so on. So we first process the event cell, then the neighbors of the event cell, then the neighbors of those neighbors, and so on. The neighbors at some level i can be processed in any order; the end result is the same. An example of this can be seen in the figure below.

Processing cells at the same level in two different orders gives the same final state. The reason the processing order at level i doesn’t matter is that there are two types of cells at level i: those which split and those which don’t. The cells which don’t split simply have their orb count increased by one, irrespective of the processing order. The cells which do split only add orbs to cells at level i+1. The fact that holds the whole argument together is that the sets of cells at level i and level i+1 are always disjoint, so the total number of orbs added to each cell at level i+1, summed over all the cells at level i, is the same regardless of the order.

So what we are essentially doing is a breadth-first traversal, which is why we implement the state transition with the help of a queue.
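Putting this together, here is a hedged sketch of the queue-based transition on the signed-board representation from the previous section. This is our own simplified version, not the article’s controller code, and it still needs the win check discussed in the implementation section below:

```python
from collections import deque
import numpy as np

NEIGHBORS = [(-1, 0), (1, 0), (0, -1), (0, 1)]

def cell_limit(i, j, m, n):
    """A cell can hold at most (number of orthogonal neighbors - 1) orbs."""
    return sum(1 for di, dj in NEIGHBORS if 0 <= i + di < m and 0 <= j + dj < n) - 1

def next_state(board, turn, event):
    """Return the board after the player `turn` ('red' or 'green') clicks the `event` cell."""
    board = board.copy()
    m, n = board.shape
    sign = 1 if turn == "red" else -1
    queue = deque([event])                              # breadth-first traversal of affected cells
    while queue:
        i, j = queue.popleft()
        board[i, j] = sign * (abs(board[i, j]) + 1)     # add one orb; any enemy orbs change color
        if abs(board[i, j]) > cell_limit(i, j, m, n):   # too many orbs: this cell splits
            board[i, j] = 0
            for di, dj in NEIGHBORS:
                ni, nj = i + di, j + dj
                if 0 <= ni < m and 0 <= nj < n:
                    queue.append((ni, nj))
        # Note: the actual implementation also checks for a winner inside this loop,
        # otherwise a fully captured board can keep splitting forever (see the Controller section).
    return board, ("green" if turn == "red" else "red")

# Example: red clicks the center cell of an empty 5x5 board.
board, turn = next_state(np.zeros((5, 5), dtype=int), "red", (2, 2))
```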

Implementation

Some important components for implementing the Chain Reaction Game

State

Implementing the state representation is fairly straightforward. We store the board information, i.e., the number of orbs and the color of orbs, in separate numpy arrays. The state representation also includes the player’s turn.

Gist showing code for class State
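As a rough sketch of such a class (the attribute names are our own assumptions, not necessarily those used in the gist), assuming separate numpy arrays for orb counts and orb colors plus a turn flag:

```python
import numpy as np

class State:
    """Board state: orb counts and orb colors in separate arrays, plus whose turn it is."""
    def __init__(self, m=5, n=5):
        self.orbs = np.zeros((m, n), dtype=int)      # number of orbs in each cell
        self.colors = np.zeros((m, n), dtype=int)    # 0 = empty, 1 = red, -1 = green
        self.turn = 1                                # 1 = red to move, -1 = green to move

    def signed_board(self):
        """Collapse the two arrays into the signed representation used earlier."""
        return self.orbs * self.colors
```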

Visualizer

And here is some code to draw the grid and orbs using rectangles and circles respectively.

Gist showing code for class Visualizer
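A bare-bones sketch of such a visualizer, assuming pygame as the rendering library (the article’s actual drawing code may differ), using the State class sketched above:

```python
import pygame

CELL = 80  # pixel size of one cell (our choice)

def draw(screen, state):
    """Draw the grid as rectangles and the orbs as small circles."""
    m, n = state.orbs.shape
    grid_color = (255, 0, 0) if state.turn == 1 else (0, 255, 0)  # board color shows whose turn it is
    screen.fill((0, 0, 0))
    for i in range(m):
        for j in range(n):
            rect = pygame.Rect(j * CELL, i * CELL, CELL, CELL)
            pygame.draw.rect(screen, grid_color, rect, width=1)   # cell border
            orb_color = (255, 0, 0) if state.colors[i, j] == 1 else (0, 255, 0)
            for k in range(state.orbs[i, j]):                     # one small circle per orb
                center = (j * CELL + 20 + 20 * k, i * CELL + CELL // 2)
                pygame.draw.circle(screen, orb_color, center, 8)
    pygame.display.flip()

# Usage: screen = pygame.display.set_mode((N * CELL, M * CELL)) after pygame.init(), then draw(screen, state).
```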

Controller

Here we will look at the most important code segment in the controller: the state transition, i.e., getting the next state given the current state and an event.

Gist showing code for state transition

There can be a situation in which the orbs keep on splitting while the other player’s orbs have already vanished, as shown in the figure below.

The orbs keep on splitting, and green has won the game

So we need to check whether a player has won the game inside the while loop of the breadth-first traversal. We do this by keeping track of the red and green orb counts (as myorbs and opporbs) and updating them in each iteration of the loop.

Gist showing code to track counts of orbs of both the current player and the opponent
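Here is a sketch of how that tracking could look inside the traversal loop from earlier. The myorbs/opporbs names come from the article; the rest of the implementation is our own simplified guess:

```python
from collections import deque
import numpy as np

# Reuses NEIGHBORS and cell_limit from the next_state sketch above.

def next_state_with_win_check(board, turn, event):
    """Like next_state above, but detects a win inside the traversal loop itself."""
    board = board.copy()
    m, n = board.shape
    sign = 1 if turn == "red" else -1
    myorbs = int(np.abs(board[np.sign(board) == sign]).sum())    # current player's orb count
    opporbs = int(np.abs(board[np.sign(board) == -sign]).sum())  # opponent's orb count
    opp_on_board = opporbs > 0   # the win check only makes sense once the opponent has orbs to lose
    queue = deque([event])
    winner = None
    while queue:
        i, j = queue.popleft()
        old = board[i, j]
        if np.sign(old) == -sign:            # capturing enemy orbs: they change color
            opporbs -= abs(old)
            myorbs += abs(old)
        myorbs += 1                          # the orb just added to this cell
        board[i, j] = sign * (abs(old) + 1)
        if opp_on_board and opporbs == 0:    # opponent wiped out: stop the (possibly endless) chain
            winner = turn
            break
        if abs(board[i, j]) > cell_limit(i, j, m, n):
            myorbs -= abs(board[i, j])       # these orbs move out and get re-counted at the neighbors
            board[i, j] = 0
            for di, dj in NEIGHBORS:
                ni, nj = i + di, j + dj
                if 0 <= ni < m and 0 <= nj < n:
                    queue.append((ni, nj))
    return board, ("green" if turn == "red" else "red"), winner
```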

AlphaZero

So what’s really happening in AlphaZero?

[5] is a good starting point for understanding AlphaZero. There are two kinds of thinking involved in human reasoning [6]: a slow mode and a fast mode. The fast mode is guided by intuition, while the slow mode explicitly follows rules or steps, like a traditional computer algorithm.

In AlphaZero, the fast mode, or intuition, is implemented through a neural network that takes a board state and outputs a policy (a probability distribution over actions) and a value (a score that tells how good the given board state is for the current player); the slow mode of thinking is implemented through Monte Carlo Tree Search.
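To make this concrete, here is a minimal sketch of a two-headed policy/value network in PyTorch. The architecture (a small fully connected network over the flattened board) is purely illustrative and not necessarily the one used later in this series:

```python
import torch
import torch.nn as nn

class PolicyValueNet(nn.Module):
    """Maps a board state to a policy over the M*N cells and a scalar value in [-1, 1]."""
    def __init__(self, m=5, n=5, hidden=128):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(m * n, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.policy_head = nn.Linear(hidden, m * n)   # logits over all cells
        self.value_head = nn.Linear(hidden, 1)        # how good the state is for the current player

    def forward(self, board):
        # board: (batch, M*N) tensor, e.g. the signed board from the current player's point of view
        h = self.body(board)
        policy_logits = self.policy_head(h)
        value = torch.tanh(self.value_head(h)).squeeze(-1)
        return policy_logits, value
```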

Imagine that we’re thinking about what to play in a particular board state of a perfect-information game (here we’re taking Chain Reaction as the example). We might have some intuition about which actions are better and which are not. This original intuition can be expressed as a probability distribution over actions: we assign a higher probability to good actions and a lower probability to bad actions (good actions being those that give us a winning position). If we don’t have any such intuition, we can just start with a uniform distribution.

This probability distribution over the actions is our “policy” for the given state.

A cycle of Amplification and Distillation can build expert agents

There is a way to improve this original policy: by thinking ahead about possible future moves. Starting from the current state, we can think about which moves we could play, which moves our opponent could play in response, and so on. So we are essentially talking about a tree search. This tree search can be improved by using our original intuition to evaluate the intermediate board states (getting values) and by not spending much time exploring nodes with low value. After doing this tree search, we have a better idea of what to play in the current board state; in other words, we get an improved policy. This process is known as amplification, and in AlphaZero it is done using Monte Carlo Tree Search.

Distillation means using this improved policy (obtained through the slow process) to improve our fast mode of thinking. In the case of AlphaZero, this means training the neural network to minimize the cross-entropy loss between the improved policy and the original policy. There is also another loss term between the network’s value predictions and the actual value obtained at the end of a game. We will discuss this at implementation level later.
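As a hedged sketch of what that objective looks like (using the network sketched above; the exact loss terms and weighting in the actual training code may differ):

```python
import torch
import torch.nn.functional as F

def alphazero_loss(policy_logits, value_pred, target_policy, outcome):
    """Cross-entropy between the MCTS-improved policy and the network's policy,
    plus a mean-squared error between the predicted value and the game outcome."""
    policy_loss = -(target_policy * F.log_softmax(policy_logits, dim=-1)).sum(dim=-1).mean()
    value_loss = F.mse_loss(value_pred, outcome)
    return policy_loss + value_loss
```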

Written by Aditya Rastogi. Thanks to Aayush Pandey for the help. The figures have been drawn using Excalidraw. Images for code snippets have been generated using Carbon.

References

[1] Silver, D., Hubert, T., Schrittwieser, J., Antonoglou, I., Lai, M., Guez, A., Lanctot, M., Sifre, L., Kumaran, D., Graepel, T., Lillicrap, T., Simonyan, K., & Hassabis, D. (2018). A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play. Science, 362(6419), 1140–1144. https://doi.org/10.1126/science.aar6404

[2] https://brilliant.org/wiki/chain-reaction-game/

[3] Wu, D. J. (2019). Accelerating Self-Play Learning in Go. https://www.semanticscholar.org/paper/Accelerating-Self-Play-Learning-in-Go-Wu/f244ffb549a61806d00f614e70fa1c3fbe5fffc6

[4] https://medium.com/oracledevs/lessons-from-implementing-alphazero-7e36e9054191

[5] “How to Keep Improving When You’re Better Than Any Teacher — Iterated Distillation and Amplification.” YouTube, uploaded by Robert Miles, 3 Nov. 2019, https://www.youtube.com/watch?v=v9M2Ho9I9Qo.

[6] Anthony, T., Tian, Z., & Barber, D. (2017). Thinking Fast and Slow with Deep Learning and Tree Search. https://arxiv.org/pdf/1705.08439.pdf
