Part 1: AlphaZero implementation for the game Onitama

Nicolas Maurer · Published in Analytics Vidhya · Feb 20, 2022 · 4 min read

How can AlphaZero learn to play games? Implementation for the game Onitama!

  • Part 2: Soon
  • Part 3: Soon

In this article, we’ll learn about the game Onitama and the main ideas behind AlphaZero, a deep reinforcement learning algorithm able to learn to play games without any human knowledge!

My implementation of AlphaZero for the game Onitama can be found on my GitHub.

What is Onitama?

Onitama is a two-player board game with perfect information and a random setup. The board is 5x5, and each player starts the game with 5 pieces (4 pawns and a master pawn).

The game uses a total of 16 cards, each defining the movements available to the pawns. At the start of the game, 5 of the 16 cards are drawn: 2 for each player, and one placed on the side of the board.

Source: https://www.ultraboardgames.com/onitama/game-rules.php

Turn by turn, each player moves a pawn according to the moves available on their cards.

Source: https://www.ultraboardgames.com/onitama/game-rules.php

Once a card is chosen and the move is played, the card is swapped with the one on the side of the board, and the next player begins their turn.

To win the game, you have 2 options:

  • Capture your opponent’s Master Pawn
  • Move your Master Pawn to your opponent’s temple

Now that we know how to play the game, let’s see how the AlphaZero algorithm works!

AlphaZero Main Ideas: Neural Network and MCTS

AlphaZero is a reinforcement learning algorithm that learns from self-play with no human knowledge except the rules of the game.

In 2016, DeepMind’s AlphaGo became the first program to defeat a professional human player at the game of Go. In 2017, they released a new version, AlphaGo Zero, which surpassed AlphaGo by learning everything from scratch except the rules. A couple of months later, they generalized AlphaGo Zero to play other games such as chess and Shogi, and called it AlphaZero.

Let’s go through the main ideas of this incredible algorithm. What follows is my understanding of the amazing paper “Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm” (Silver et al., 2017).

AlphaZero learns everything by itself, except for the rules. But how?

AlphaZero is composed of two main parts: a Neural Network and a Monte Carlo Tree Search (MCTS). Together, they generate self-play games, which form a training set. The training set is then used to train the Neural Network, and the cycle starts again.
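At the highest level, the cycle looks something like the sketch below. This is a minimal illustration, not my actual code: `play_self_play_game` and `train_network` are hypothetical helpers standing in for the two phases just described.

```python
def train_alphazero(network, n_iterations, n_games_per_iteration):
    """One AlphaZero cycle: generate self-play games, then fit the network."""
    for _ in range(n_iterations):
        examples = []
        for _ in range(n_games_per_iteration):
            # Each game yields (state, mcts_policy, outcome) training triples.
            examples += play_self_play_game(network)   # hypothetical helper
        network = train_network(network, examples)     # hypothetical helper
    return network
```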

To generate self-play games, the Neural Network and the MCTS work together: the Neural Network evaluates the game state, and the MCTS performs N simulations to choose the next move.

Generate self-play games.

The Neural Network has two heads: one outputs the value of the board, and the other outputs a policy over all possible moves. All illegal moves are then filtered out, and the policy is renormalized to sum to 1.
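As a concrete illustration, here is one way the masking and renormalization step could look in NumPy. The function name and array layout are my own assumptions for this sketch, not the exact code from the repository:

```python
import numpy as np

def masked_policy(policy_logits, legal_moves):
    """Zero out illegal moves and renormalize so probabilities sum to 1.

    policy_logits: raw network output, one score per possible move.
    legal_moves:   boolean array, True where the move is legal.
    """
    # Numerically stable softmax over all moves.
    p = np.exp(policy_logits - policy_logits.max())
    p /= p.sum()
    # Filter out illegal moves, then renormalize.
    p = np.where(legal_moves, p, 0.0)
    return p / p.sum()
```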

Then the MCTS chooses the next move to play. To do so, it performs 4 steps (a code sketch follows the diagram below):

  • Selection: from the root node, successively select the best child until a leaf node is reached.
  • Expansion: create a child node for each possible move from the selected node.
  • Simulation: usually the simulation consists of playing random moves until the game is over, but in our case the Neural Network predicts the value of the board instead.
  • Backpropagation: the result of the simulation is used to update the information of the parent nodes.
Source: Wikipedia
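Here is a rough sketch of what a single simulation could look like in Python. The node interface (`children`, `state`, `visit_count`, `value_sum`, `expand`) and `network.predict` are assumed names for illustration; terminal-state handling is omitted for brevity.

```python
def run_simulation(root, network):
    """One MCTS simulation: selection, expansion + evaluation, backpropagation."""
    node, path = root, [root]
    # Selection: walk down the tree, picking the best child until a leaf.
    while node.children:
        node = select_child(node)  # scored with the UCT-style formula below
        path.append(node)
    # Expansion + evaluation: the network replaces the random rollout.
    policy, value = network.predict(node.state)
    node.expand(policy)  # one child per legal move, each storing its prior
    # Backpropagation: update statistics along the path back to the root.
    for n in reversed(path):
        n.visit_count += 1
        n.value_sum += value
        value = -value  # flip the sign, since the players alternate
```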

Selection is the most important part of the MCTS: the best children are successively selected using the UCT (Upper Confidence Bound 1 applied to trees) formula. This is where the magic happens; with enough training, AlphaZero will only consider the best moves.
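AlphaZero uses a variant of UCT called PUCT, which weights the exploration bonus by the network’s prior probability for each move. A minimal sketch, reusing the hypothetical node attributes from above (the constant `c_puct` is a tunable hyperparameter):

```python
import math

def puct_score(parent, child, c_puct=1.0):
    # Exploitation term: the child's average value so far (Q).
    q = child.value_sum / child.visit_count if child.visit_count else 0.0
    # Exploration term: large for moves with a high prior and few visits (U).
    u = c_puct * child.prior * math.sqrt(parent.visit_count) / (1 + child.visit_count)
    return q + u

def select_child(node, c_puct=1.0):
    # The best child is the one maximizing Q + U.
    return max(node.children, key=lambda c: puct_score(node, c, c_puct))
```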

The main difficulty is finding the right balance between exploration and exploitation.

In the paper, they mention that each MCTS performs 800 simulations. Moreover, the search tree is not reset between moves, only between games.

Training

During training, AlphaZero selects moves in proportion to the root visit counts. Dirichlet noise is also added to the prior probabilities of the root node, scaled according to the average number of legal moves, to encourage exploration.
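Concretely, this could look like the following sketch, again using the hypothetical node attributes from earlier. The values epsilon = 0.25 and alpha = 0.3 are the ones reported for chess in the paper; for Onitama, alpha would be rescaled to its own average number of legal moves.

```python
import numpy as np

def add_dirichlet_noise(root, epsilon=0.25, alpha=0.3):
    """Mix Dirichlet noise into the root priors before the search starts."""
    noise = np.random.dirichlet([alpha] * len(root.children))
    for child, n in zip(root.children, noise):
        child.prior = (1 - epsilon) * child.prior + epsilon * n

def sample_move(root, temperature=1.0):
    """After the search: sample a move in proportion to the visit counts."""
    visits = np.array([c.visit_count for c in root.children], dtype=float)
    probs = visits ** (1.0 / temperature)
    probs /= probs.sum()
    idx = np.random.choice(len(root.children), p=probs)
    return root.children[idx]
```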

The Neural Network is trained to fit the game outcome (value head) and the move probabilities produced by the MCTS (policy head). Over time, it learns to avoid illegal and bad moves.
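The combined loss from the paper is mean squared error on the value, plus cross-entropy between the policy and the MCTS visit distribution, plus L2 regularization. A minimal PyTorch sketch, assuming the network outputs raw policy logits, `pi` is the normalized visit-count distribution, `z` is the game outcome in {-1, 0, +1}, and `c` is a hyperparameter:

```python
import torch
import torch.nn.functional as F

def alphazero_loss(value_pred, policy_logits, z, pi, model, c=1e-4):
    # Value head: mean squared error against the final game outcome z.
    value_loss = F.mse_loss(value_pred.squeeze(-1), z)
    # Policy head: cross-entropy against the MCTS visit distribution pi.
    policy_loss = -(pi * F.log_softmax(policy_logits, dim=-1)).sum(dim=-1).mean()
    # L2 regularization over all network parameters.
    l2 = sum((p ** 2).sum() for p in model.parameters())
    return value_loss + policy_loss + c * l2
```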

In conclusion:

AlphaZero is a deep reinforcement learning algorithm able to learn to play many different games through self-play, without prior human knowledge! This is possible thanks to the combination of a Neural Network and an MCTS working together.

That’s all! I hope you found this post interesting. Don’t forget to follow me if you want to be notified of the next parts!
