Thank You Sir
How I implemented the famous DeepMind algorithm for the Tic-Tac-Toe game
So, I have nothing to do while my GPU is busy with some calculations, and I find myself thinking about the time I spent with my undergrad supervisor after graduation, nine months to be precise. I remember him telling me that I could write an article about my work there, which was a paper implementation I had done during that time.
It was 2017 and I had planned to start my MS, but somehow it got delayed a year and I decided to go back and work on some projects with my supervisor. Initially, I worked on several projects including Blind Textual Deblurring, Autonomous Driving in Car Simulations and Neural Steganography. After some time my supervisor went to a conference, and while he was away I read some of his papers along with parts of his PhD thesis. When he came back, I asked him to suggest an idea to work on, and he told me to read about "AlphaGo".
What follows is my memory of what I did after reading.
I got to know the following facts:
In March 2016, DeepMind's AlphaGo beat world champion Go player Lee Sedol 4–1.
On 18th October 2017, DeepMind announced AlphaGo Zero, which defeated AlphaGo 100–0.
On 5th December 2017, DeepMind released another paper showing how AlphaGo Zero could be adapted to beat the world-champion programs Stockfish and Elmo at chess and shogi.
An algorithm for getting good at something without any prior knowledge of human expert strategy was born.
I was so amazed by it that I decided to implement it myself. My supervisor gave me an article about implementing the same algorithm for Connect 4, but I was unable to grasp it, so I decided to start from scratch and implement it for (3x3) Tic-Tac-Toe, since the game's logic was easier to implement and I could check my progress in less time. The following articles really helped me throughout the process.
After going through the articles several times, along with the papers [1, 2] from DeepMind, I arrived at the following understanding of the algorithm.
OVERVIEW
Generate training data through self play using the current best player, use it to train the second best, then evaluate the retrained player against the best; if it wins 55% of the time, it replaces the best for the next iteration.
GOAL
The goal is to predict two things:
At each point, which move to take, which amounts to learning the policy.
At each point, what the expected outcome is, which amounts to learning the value function.
METHODOLOGY
Self Play
Network Weights Optimization
Network Evaluation
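Putting these three stages together, the outer loop looks roughly like the sketch below. This is only an illustration: the function and parameter names (self_play, retrain, evaluate, make_network) are placeholders of mine, not the actual code from my repository.

```python
def training_pipeline(self_play, retrain, evaluate, make_network,
                      n_iterations=20, n_selfplay_games=100, n_eval_games=40):
    """Outer loop: self play -> weights optimization -> evaluation (placeholder callables)."""
    best, challenger = make_network(), make_network()
    for _ in range(n_iterations):
        data = self_play(best, n_selfplay_games)              # 1. generate data with the best
        retrain(challenger, data)                             # 2. retrain the second best
        if evaluate(challenger, best, n_eval_games) >= 0.55:  # 3. evaluate against the best
            best, challenger = challenger, best               # challenger becomes the new best
    return best
```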
What follows is my translation of the building blocks required to implement the algorithm: what should the world look like, how to represent the different stages, and how to store the data.
SELF PLAY
Move Selection through Monte Carlo Tree Search
At each move, the following information is stored:
The game state
The search probabilities
The winner
REPRESENTATION
0 represents no move has been played at that position
1 represents player 1
2 represents player 2
STATE
At any instant, the condition of the game is a state, expressed in the above representation.
Here, it is the complete board (nine cells) along with a turn indicator appended at the end:
0000000001
0000000002
( Shown above are the two possible initial states of the game under consideration: an empty board with player 1 to move, and with player 2 to move )
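To make the encoding concrete, a state can be kept as a 10-character string: nine board cells followed by the turn indicator. The helper names below are mine, for illustration only.

```python
EMPTY_BOARD = "000000000"

def make_state(board=EMPTY_BOARD, turn=1):
    """A state is the nine board cells plus a turn indicator appended at the end."""
    return board + str(turn)

def board_of(state):
    return state[:9]       # the nine cells, row by row

def turn_of(state):
    return int(state[9])   # 1 or 2

initial_state = make_state()   # "0000000001": empty board, player 1 to move
```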
CHILDREN
The possible moves from a state are considered its children.
The children of the initial state of the game are:
100000000
010000000
001000000
000100000
000010000
000001000
000000100
000000010
000000001
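A minimal way to enumerate these children, assuming the string encoding sketched above (and keeping the turn indicator on each child, which the listing above omits):

```python
def children_of(state):
    """Place the current player's mark on every empty cell and
    pass the turn to the other player."""
    board, turn = state[:9], int(state[9])
    next_turn = 2 if turn == 1 else 1
    return [board[:i] + str(turn) + board[i + 1:] + str(next_turn)
            for i, cell in enumerate(board) if cell == "0"]

print(children_of("0000000001"))   # nine children, one per empty cell
```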
A child stores five values:
N → number of times the action has been taken
W → total value of the state
Q → mean value of the state
P → policy function prediction
V → value function prediction
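In code, these per-child statistics can be grouped in a small structure. This is only a sketch of mine, not the Node class from the repository (that one is described towards the end):

```python
from dataclasses import dataclass

@dataclass
class EdgeStats:
    n: int = 0      # N: number of times the action has been taken
    w: float = 0.0  # W: total value accumulated through this child
    q: float = 0.0  # Q: mean value, W / N
    p: float = 0.0  # P: prior probability from the policy head
    v: float = 0.0  # V: value head prediction for the child state
```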
MONTE CARLO TREE SEARCH
Given a state, explore the tree by repeatedly selecting the best child until a leaf node is reached.
At the leaf node, predict the probabilities of each of its children.
Backpropagate from the leaf to the root (which in this case is the input state) and update the values of N, W and Q.
Repeat this cycle for a fixed number of simulations.
Select the best action among the lot.
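Put together, one search looks roughly like the sketch below; is_leaf, select_child, expand and backpropagate are placeholder callables rather than names from my code, and picking the most visited child at the end is one common choice:

```python
def run_search(root, n_simulations, is_leaf, select_child, expand, backpropagate):
    """Repeat select -> expand/evaluate -> backpropagate, then pick
    the most visited child of the root as the move to play."""
    for _ in range(n_simulations):
        node, path = root, [root]
        while not is_leaf(node):          # walk down by picking the best child
            node = select_child(node)
            path.append(node)
        value = expand(node)              # network predicts P for the children and V for the leaf
        backpropagate(path, value)        # update N, W and Q along the path
    return max(root.children, key=lambda child: child.n)
```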
CHILD SELECTION
A child is selected based on the equation Q+U
Q → The mean value of the state
U → A function of P and N that increases if an action hasn’t been explored much, relative to the other actions, or if the prior probability of the action is high
U in this case is P/N
So the equation becomes (P/N)+Q
Early on in the simulation U should dominate (more exploration), while later Q becomes more important (exploitation).
So a slight modification is needed: (epsilon*(P/N))+Q
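As a small sketch of this rule, using P/(1+N) instead of P/N purely to avoid dividing by zero on unvisited children (that adjustment is my own, not from the article):

```python
def select_child(node, epsilon=1.0):
    """Pick the child with the highest Q + epsilon * U, where U = P / (1 + N).
    Unvisited children with a high prior get a large U and are explored first;
    as N grows, U shrinks and the mean value Q takes over."""
    return max(node.children, key=lambda c: c.q + epsilon * c.p / (1 + c.n))
```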
VALUE UPDATE
During backpropagation, the values are updated as follows:
N = N+1
W = W+V
Q = W/N
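In code, this could look like the following, reusing the per-child statistics sketched above and applying the update to every node on the search path:

```python
def backpropagate(path, v):
    """Update each node on the search path with the value v predicted at the leaf."""
    for node in path:
        node.n += 1                 # one more visit
        node.w += v                 # accumulate the value
        node.q = node.w / node.n    # keep the running mean
```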
EPISODE
A game played from start to finish is an episode.
DATA GENERATION
Once an episode has ended, all the steps that were performed are labelled.
All the steps taken by the winning player are assigned +1, and all other steps are assigned -1.
Similarly, every move is stored along with its search probabilities, which the network is later trained to predict.
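A minimal sketch of this labelling step, assuming each recorded step carries the state, the player who moved and the search probabilities (the field layout is my assumption):

```python
def label_episode(steps, winner):
    """steps: list of (state, player, search_probabilities) recorded during one episode.
    Returns training examples (state, search_probabilities, z), where z is +1 for the
    winner's moves and -1 for every other move."""
    examples = []
    for state, player, probs in steps:
        z = 1 if player == winner else -1
        examples.append((state, probs, z))
    return examples
```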
DEEP NEURAL NETWORK
It is a network with one convolutional layer and 40 residual layers, with a two-headed output: a value head and a policy head.
The value head gives a single prediction in the range [-1,1], trying to predict the outcome of the game.
The policy head gives 9 probabilities, trying to predict the probability distribution over the children.
For faster convergence, I split the model into two, one for each head.
The loss function for the value head is mean squared error, while for the policy head it is softmax cross-entropy.
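For reference, here is a rough Keras sketch of the two-headed version described above (my own split into two models is not shown). The filter count, the (3, 3, 3) input planes and the reduced number of residual blocks are assumptions made for readability, not values from the article or the repository:

```python
from tensorflow.keras import layers, models, regularizers

L2 = regularizers.l2(0.0001)   # regularization constant from the hyperparameters below

def residual_block(x, filters=64):
    y = layers.Conv2D(filters, 3, padding="same", kernel_regularizer=L2)(x)
    y = layers.BatchNormalization()(y)
    y = layers.ReLU()(y)
    y = layers.Conv2D(filters, 3, padding="same", kernel_regularizer=L2)(y)
    y = layers.BatchNormalization()(y)
    return layers.ReLU()(layers.Add()([x, y]))    # skip connection

inputs = layers.Input(shape=(3, 3, 3))            # assumed planes: player 1, player 2, turn
x = layers.Conv2D(64, 3, padding="same", kernel_regularizer=L2)(inputs)
x = layers.BatchNormalization()(x)
x = layers.ReLU()(x)
for _ in range(4):                                # the full network uses 40 residual layers
    x = residual_block(x)
flat = layers.Flatten()(x)

value = layers.Dense(1, activation="tanh", name="value")(flat)        # outcome in [-1, 1]
policy = layers.Dense(9, activation="softmax", name="policy")(flat)   # distribution over moves

model = models.Model(inputs, [value, policy])
model.compile(optimizer="sgd",
              loss={"value": "mse", "policy": "categorical_crossentropy"})
```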
NETWORK WEIGHTS OPTIMIZATION
Once data has been generated through self play by the current best player, it is time to retrain the second best using that data.
Sample a mini batch from the data.
Retrain the current network on these positions.
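As a sketch of these two steps, with the sample size, batch size and epochs taken from the hyperparameters below (the variable layout is my assumption):

```python
import random
import numpy as np

def retrain(model, examples, sample_size=512, batch_size=256, epochs=1):
    """Sample positions from the self-play data and fit the two-headed network:
    states -> (search probabilities pi, game outcome z)."""
    batch = random.sample(examples, min(sample_size, len(examples)))
    states, probs, zs = zip(*batch)
    model.fit(np.array(states),
              {"value": np.array(zs), "policy": np.array(probs)},
              batch_size=batch_size, epochs=epochs, verbose=0)
```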
NETWORK EVALUATION
After training, it is time to evaluate the performance of the retrained network against the best so far. If the retrained network wins 55% of the games, it becomes the new best, and for the next iteration the roles are swapped: the new best generates data for training the previous best.
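A sketch of this evaluation step, with play_game left as a placeholder that returns 1 when the challenger wins a game:

```python
def evaluate(challenger, best, play_game, n_games=40, threshold=0.55):
    """Pit the retrained network against the current best and report whether
    it won at least the required share of games."""
    wins = sum(1 for _ in range(n_games) if play_game(challenger, best) == 1)
    return wins / n_games >= threshold
```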
SOME HYPERPARAMETERS
Number of training epochs : 1
Number of training iterations : 20
Number of games during self play : 100
Number of games during evaluation : 40
Batch Size : 256
Sample Size : 512
Monte Carlo Simulations : 100
Regularization Constant : 0.0001
Learning Rate : 0.01
Momentum : 0.9
Optimizer : SGD
RESULTS
After 100 games, it defeated the untrained network 40–0.
After 200 games, it defeated the first best network 40–0.
So far I have tried to explain the overall working of the algorithm; now I will try to explain my code structure.
Node Class
The Node class contains the following attributes:
n is the number of times an action has been taken
w is the total value of the next state
q is the mean value of the next state
p is the prior probability of selecting action a
state contains the current state of the game
actions contain the possible actions from current state
p contains the move probabilities
v is the value of state (for the current player)
FUNCTIONS
actions
It returns the possible actions from the current state in the form of indices.
children
It returns the updated node, with the child array containing the possible actions in the form of states and the parent containing the current state.
conclusion
It returns 1 if the player wins, 2 if the opponent wins, -1 if the game has not finished, and 0 if the game is drawn.
simulation
It performs a complete Monte Carlo simulation and returns move probabilities.
mcst
It performs Monte Carlo Tree Search by running multiple simulations and returns the best move and current_move_probabilities.
print_maze
It prints the output in the form of a board.
episode
It plays a complete episode of the game using Monte Carlo Tree Search and returns the predictions from pi_model (providing the move probabilities), z_model (providing the value of the state), and states (the different states during the episode, with an additional value of 1 if it is player 1's move and 2 if it is player 2's move).

Code is available in this repository.
So, what’s next? Well, I am working on cyclist detection and have implemented another paper as part of the process. I hope to write about it soon.