Published in Analytics Vidhya

How I implemented the famous DeepMind algorithm for the Tic-Tac-Toe game

Photo by Damian Patkowski on Unsplash

So, I have nothing to do while my GPU is busy with some calculations, and I find myself thinking about the time I spent with my undergrad supervisor after graduation, nine months to be precise. I remember him telling me that I could write an article about my work there, a paper implementation I had done during that time.

It was 2017 and I had planned to start my MS, but somehow it got delayed a year, so I decided to go back and work on some projects with my supervisor. Initially, I worked on several projects, including Blind Textual Deblurring, Autonomous Driving in Car Simulations and Neural Steganography. After some time my supervisor went to a conference, and while he was away I read some of his papers along with (some of) his PhD thesis. When he came back, I asked him to suggest an idea to work on, and he told me to read about “AlphaGo”.

What follows is my memory of what I did after reading.

I got to know the following facts:

In March 2016, DeepMind’s AlphaGo beat world champion Go player Lee Sedol 4–1.

On 18th October 2017, DeepMind released AlphaGo Zero, which defeated AlphaGo 100–0.

On 5th December 2017, DeepMind released another paper showing how AlphaGo Zero could be adapted to beat the world-champion programs Stockfish and Elmo at chess and shogi.

An algorithm for getting good at something without any prior knowledge of human expert strategy was born.

I was so amazed by it that I decided to implement it myself. My supervisor provided me with an article about implementing the same algorithm for Connect 4, but I was unable to grasp anything from it, so I decided to start from scratch and implement it for (3x3) Tic-Tac-Toe, since the game’s logic was easier to implement and I could check my progress in less time. The following articles really helped me throughout the process.

After going through many iterations of the articles along with the papers [1,2] from DeepMind, I got the following understanding of the algorithm.


Generate training data using the current best player through self-play to train the second best, then evaluate its performance against the best; if it wins 55% of the time, it replaces the best for the next iteration.


The goal is to predict two things:

At each point, which move to take; this is learning the policy.

At each point, what the expected outcome is; this is learning the value function.


The training loop consists of three stages:

1. Self Play

2. Network Weights Optimization

3. Network Evaluation

What follows is my translation of the building blocks required to implement the algorithm. What should the world look like? How should the different stages be represented? How should the data be stored?


Move Selection through Monte Carlo Tree Search

At each move, the following information is stored:

The game state

The search probabilities

The winner


0 represents no move has been played at that position

1 represents player 1

2 represents player 2
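A minimal sketch of this encoding in Python (the helper names are my own, not from the repository):

```python
# 0 = empty, 1 = player 1, 2 = player 2. The 3x3 board is a flat list of 9 cells.
EMPTY, P1, P2 = 0, 1, 2

def new_board():
    return [EMPTY] * 9

def legal_moves(board):
    """Indices of the cells where no move has been played yet."""
    return [i for i, cell in enumerate(board) if cell == EMPTY]

board = new_board()
board[4] = P1              # player 1 takes the centre
print(legal_moves(board))  # [0, 1, 2, 3, 5, 6, 7, 8]
```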


At any instant, the condition of the game is captured by a state, which depends on the representation chosen.

Here, it is the complete board:



Along with a turn indicator.

(Shown above is the initial state of the game under consideration.)


Possible moves from a state are considered as its children.

The children of the initial state of the game are:










A child stores five values:

N → number of times the action has been taken

W → total value of the state

Q → mean value of the state

P → policy function prediction

V → value function prediction
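These five per-child statistics could be held in a small container like the following. This is a sketch with my own naming; here Q is derived as W/N on demand rather than stored, which keeps the three back-propagation updates consistent by construction:

```python
from dataclasses import dataclass

@dataclass
class ChildStats:
    P: float      # policy-function prediction (prior) for this action
    N: int = 0    # number of times the action has been taken
    W: float = 0.0  # total value of the state
    V: float = 0.0  # value-function prediction at this node

    @property
    def Q(self):
        """Mean value of the state: W / N (0 while unvisited)."""
        return self.W / self.N if self.N else 0.0

c = ChildStats(P=0.3)
c.N, c.W = 4, 2.0
print(c.Q)  # 0.5
```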


Given a state, explore the tree until a leaf node is reached by selecting the best child.

At the leaf node, predict the probabilities of each of its children.

Back-propagate from the leaf to the root (which in this case is the input state) and update the values of N, W and Q.

Repeat this cycle for a certain time period.

Select the best action among the lot.
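The select/expand/back-propagate cycle above can be sketched as follows. This is a toy illustration, not the repository's code: `predict` stands in for the neural network (uniform priors and a random value estimate), the selection rule uses P/(1+N) so unvisited children do not divide by zero, and for brevity the sketch does not advance the board state while descending the tree, which the real implementation must do.

```python
import random

def predict(state):
    """Stand-in for the network: uniform priors over legal moves, random value."""
    moves = [i for i, c in enumerate(state) if c == 0]
    return {m: 1.0 / len(moves) for m in moves}, random.uniform(-1, 1)

def make_node(prior):
    return {"N": 0, "W": 0.0, "Q": 0.0, "P": prior, "children": {}}

def simulate(root, state, epsilon=1.0):
    node, path = root, [root]
    while node["children"]:                      # 1. select best child down to a leaf
        node = max(node["children"].values(),
                   key=lambda c: c["Q"] + epsilon * c["P"] / (1 + c["N"]))
        path.append(node)
    priors, value = predict(state)               # 2. predict priors of the leaf's children
    node["children"] = {m: make_node(p) for m, p in priors.items()}
    for n in path:                               # 3. back-propagate N, W, Q to the root
        n["N"] += 1
        n["W"] += value
        n["Q"] = n["W"] / n["N"]

random.seed(0)
root = make_node(1.0)
for _ in range(100):                             # 4. repeat for a simulation budget
    simulate(root, [0] * 9)
best = max(root["children"], key=lambda m: root["children"][m]["N"])  # 5. best action
```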


A child is selected based on the equation Q+U

Q → The mean value of the state

U → A function of P and N that increases if an action hasn’t been explored much, relative to the other actions, or if the prior probability of the action is high

U in this case is P/N

So the equation becomes (P/N)+Q

Early on in the simulation, U should dominate (more exploration), but later Q becomes more important (exploitation).

So a slight modification is needed: (epsilon*(P/N))+Q, where epsilon weights the exploration term.


During back propagation, the values are updated as follows:

N = N+1

W = W+V

Q = W/N


A game played from start to finish is an episode.


Once an episode has ended, all the steps that were performed are labelled.

All the steps taken by the winning player are assigned +1, and all other steps are assigned -1.

Similarly, every move is assigned a probability provided by the CNN.
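A sketch of the labelling step, covering only the win/loss case described above (the function name is my own):

```python
def label_episode(players_per_step, winner):
    """players_per_step: who moved at each step (1 or 2); winner: 1 or 2.
    Steps taken by the winner get +1, every other step gets -1."""
    return [1 if p == winner else -1 for p in players_per_step]

# Player 1 moved on steps 0, 2, 4 and won:
print(label_episode([1, 2, 1, 2, 1], winner=1))  # [1, -1, 1, -1, 1]
```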


It is a network with one convolutional layer and 40 residual layers, ending in a double-headed output: a value head and a policy head.

The value head gives a single prediction in the range [-1,1], trying to predict the outcome of the game.

The policy head gives 9 probabilities, trying to predict the probability distribution over the children.

For faster convergence, I divided the model into two, one for each head.

The loss function for the value head is mean squared error, while for the policy head it is softmax cross-entropy.
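The two losses can be sketched in NumPy as follows. This is a hand-rolled illustration, not the training code; in practice the deep-learning framework supplies both:

```python
import numpy as np

def value_loss(z_true, v_pred):
    """Mean squared error between game outcomes and value-head predictions."""
    return np.mean((z_true - v_pred) ** 2)

def policy_loss(pi_true, logits):
    """Softmax cross-entropy between search probabilities and the 9 policy logits."""
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))  # stable softmax
    p = e / e.sum(axis=-1, keepdims=True)
    return -np.mean(np.sum(pi_true * np.log(p + 1e-12), axis=-1))

z = np.array([1.0, -1.0])   # true outcomes of two positions
v = np.array([0.8, -0.5])   # value-head predictions
print(round(value_loss(z, v), 3))  # 0.145
```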


Once data has been generated using the current best player through self-play, it is time to retrain the second best using the data generated by the best.

Sample a mini batch from the data.

Retrain the current network on these positions.
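A sketch of the sampling-and-retraining step. `train_step` is a hypothetical stand-in for one optimizer update on the real model; the sample and batch sizes mirror the hyperparameters listed later:

```python
import random

def sample_minibatch(data, size=256):
    """Sample positions from the self-play memory without replacement."""
    return random.sample(data, min(size, len(data)))

def retrain(model, data, sample_size=512, batch_size=256):
    positions = sample_minibatch(data, sample_size)   # sample from self-play data
    for i in range(0, len(positions), batch_size):    # retrain in mini batches
        batch = positions[i:i + batch_size]
        model.train_step(batch)                       # hypothetical optimizer update
```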


After training, it is time to evaluate the performance of the retrained network against the best so far. If the retrained network wins 55% of the games, it becomes the new best, and for the next iteration the roles are swapped: the new best generates the data for training the previous best.
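The replacement rule reduces to a one-line gate (a sketch; the 55% threshold comes from the text above):

```python
def becomes_new_best(wins, games, threshold=0.55):
    """True if the retrained network's win rate meets the replacement threshold."""
    return wins / games >= threshold

print(becomes_new_best(23, 40))  # True  (57.5% >= 55%)
print(becomes_new_best(20, 40))  # False (50%  <  55%)
```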


Number of training epochs : 1

Number of training iterations : 20

Number of games during self play : 100

Number of games during evaluation : 40

Batch Size : 256

Sample Size : 512

Monte Carlo Simulations : 100

Regularization Constant : 0.0001

Learning Rate : 0.01

Momentum : 0.9

Optimizer : SGD
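The settings above, gathered into a single configuration dict (the key names are my own):

```python
# Hyperparameters as listed above.
CONFIG = {
    "training_epochs": 1,
    "training_iterations": 20,
    "self_play_games": 100,
    "evaluation_games": 40,
    "batch_size": 256,
    "sample_size": 512,
    "mcts_simulations": 100,
    "regularization": 1e-4,
    "learning_rate": 0.01,
    "momentum": 0.9,
    "optimizer": "SGD",
}
```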


After 100 games, it defeated the untrained algorithm 40–0.

After 200 games, it defeated the first best 40–0.

So far I have tried to explain the overall working of the algorithm; now I will try to explain my code structure.

Node Class

Node Class contains the following attributes

n is the number of times an action has been taken

w is the total value of the next state

q is the mean value of the next state

p is the prior probability of selecting action a

state contains the current state of the game

actions contain the possible actions from current state

p contains the move probabilities

v is the value of state (for the current player)



It returns the possible actions from the current state in the form of indices.


It returns the updated node, with a child array containing the possible actions in the form of states and a parent containing the current state.


It returns 1 if the player wins, 2 if the opponent wins, -1 if the game has not finished, and 0 if the game has ended in a draw.


It performs a complete Monte Carlo simulation and returns move probabilities.


It performs Monte Carlo Tree Search by performing multiple simulations and returns best move, current_move_probabilities.


It prints the output in the form of a board.


It plays a complete episode of the game using Monte Carlo Tree Search and returns the predictions from pi_model (the move probabilities), z_model (the value of each state), and states (the different states during the episode, each with an additional value of 1 if it is player 1’s move and 2 if it is player 2’s move).


Code is available in this repository.

So, what’s next? Well, I am working on cyclist detection and have implemented another paper as part of the process. I hope to write about it soon.


