Part 7 — This is deep. In a convoluted way.

Carsten Friedrich
5 min read · Jun 6, 2018


One trick that we haven’t tried yet is the use of Convolutional Neural Network (CNN) layers.

So far, we have always fed the board state into the Neural Network encoded as a 1D array. This, however, makes it hard for the Neural Network to exploit the inherently 2D nature of a Tic Tac Toe board.

CNN layers can slide a 2D window over a 2D input raster (+ an additional dimension for different channels, e.g. red, green, and blue colour channels in an image) and thus learn about inherently 2-dimensional patterns in the input:

Source: Daphne Cornelisse “An intuitive guide to Convolutional Neural Networks”

And indeed, many contemporary Neural Network approaches to learning how to play board games use CNN layers as part of their network topology, the most prominent example probably being AlphaGo by DeepMind.

If it’s good enough for them, it should be good enough for us. We’ll take the Neural Network from our previous part and simply pop a few CNN layers on top:
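(A sketch in the TF 1.x style used throughout this series; the filter counts, kernel sizes, and layer sizes below are illustrative choices rather than the exact values in the repo.)

    import tensorflow as tf  # TF 1.x

    # The board now comes in as a 3x3 "image" with 3 channels
    # (one plane each for CROSS, NAUGHT, and EMPTY).
    input_positions = tf.placeholder(tf.float32, shape=(None, 3, 3, 3), name='inputs')

    # Two convolutional layers in front of the fully connected part of the old network.
    conv1 = tf.layers.conv2d(inputs=input_positions, filters=64, kernel_size=3,
                             padding='same', activation=tf.nn.relu, name='conv1')
    conv2 = tf.layers.conv2d(inputs=conv1, filters=64, kernel_size=3,
                             padding='same', activation=tf.nn.relu, name='conv2')

    # Flatten and continue with the fully connected / dueling layers from the previous part.
    flattened = tf.layers.flatten(conv2)
    hidden = tf.layers.dense(flattened, units=243, activation=tf.nn.relu, name='hidden')
    # ... value and advantage heads as before ...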

We’ll also have to change our input state encoding. The 2D part will be the 3x3 Tic Tac Toe board, and the channels will be the positions for CROSS, NAUGHT, and EMPTY.

I.e. our input will go from a flat 1D vector to a stack of three 3x3 planes, one per piece type:
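(A sketch using the CROSS, NAUGHT, and EMPTY constants from the series' Board class; the board state is assumed to be a flat NumPy array of length 9, and the function names are only for illustration.)

    import numpy as np

    # Old encoding: one flat vector with one entry per (cell, piece type) combination.
    def board_state_to_nn_input_1d(state):
        return np.concatenate([(state == CROSS).astype(np.float32),
                               (state == NAUGHT).astype(np.float32),
                               (state == EMPTY).astype(np.float32)])  # shape (27,)

    # New encoding: the same information as three 3x3 planes, one per piece type,
    # transposed so that the plane (channel) index comes last.
    def board_state_to_nn_input_2d(state):
        planes = np.array([(state == CROSS).reshape(3, 3),
                           (state == NAUGHT).reshape(3, 3),
                           (state == EMPTY).reshape(3, 3)], dtype=np.float32)
        return np.transpose(planes, [1, 2, 0])  # shape (3, 3, 3), channels last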

The transpose at the end is necessary to switch the axes around and make the plane index the last index. It seems that TensorFlow likes it better this way, and when running on a CPU it actually insists on it: in theory conv2d has a data_format parameter where we could specify a different position for the channel index, but in practice, unless we run on a GPU, this results in a “Not Implemented” exception.

By adding the conv2d layers we now have a Convolutional Neural Network. Since this also increases the depth of our network to more than 1 hidden layer, we can now also call it a Deep Network.

Some other changes while we are at it

There are two more small changes we make at this stage. We add regularization loss and change some of the hyper-parameters a bit.

Regularization Loss

One problem Neural Networks can face is that weights get increasingly large. This in turn can cause numerical instabilities. A straightforward approach to dealing with this is to define a loss based on how large the weights are and add it to the total loss function of the Neural Network. This gives the training step an incentive to keep weights small.
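With the usual L2 variant of this idea, the loss we train on becomes something like

    loss_total = loss_Q + beta * sum(w_i^2)

where loss_Q is the Q learning loss from the previous parts, the sum runs over all weights in the network, and beta is a small scaling factor.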

We need to balance the regularization loss against the rest of the loss function, however, to avoid the Neural Network optimizing the regularization loss at the expense of a higher loss in our Q function.

In the code this means we add a kernel_regularizer to every layer, e.g. like this:
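(Continuing the TF 1.x sketch from above; tf.contrib.layers.l2_regularizer is one standard way to do this, and scale is left at 1.0 because the total regularization loss gets scaled by beta further down.)

    # One L2 regularizer, attached to every conv and dense layer via kernel_regularizer.
    regularizer = tf.contrib.layers.l2_regularizer(scale=1.0)

    conv1 = tf.layers.conv2d(inputs=input_positions, filters=64, kernel_size=3,
                             padding='same', activation=tf.nn.relu,
                             kernel_regularizer=regularizer, name='conv1')

    hidden = tf.layers.dense(flattened, units=243, activation=tf.nn.relu,
                             kernel_regularizer=regularizer, name='hidden')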

We then add the regularization loss to the total loss:
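(Sketch only, inside the player class; q_loss stands for the Q learning loss we already had, and the variable names are not necessarily the ones used in the repo.)

    # Sum up everything that was registered via kernel_regularizer above.
    reg_loss = tf.losses.get_regularization_loss()

    # Add it, scaled down by self.beta, to the Q loss; the optimizer then
    # minimizes total_loss instead of q_loss alone.
    total_loss = q_loss + self.beta * reg_loss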

with self.beta being a factor smaller than 1 to scale down reg_loss and prevent it from dominating the optimisation process.

Some changes in hyper-parameters

I have also changed some of the hyper-parameters.
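The values below only illustrate the direction of the changes rather than reproduce the exact settings:

    random_move_prob = 0.99        # start out (almost) fully random ...
    random_move_decrease = 0.9997  # ... and decay slowly, so random moves still have a
                                   # reasonable chance of happening for ~10,000 games
    reward_discount = 0.99         # discount future rewards less (gamma closer to 1)
    win_value = 10.0               # was 1.0
    loss_value = -10.0             # was -1.0
    draw_value = 0.0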

In particular, we do random moves for longer (with a reasonable chance of happening for the first 10,000 games), discount the reward less, and change the win and loss rewards by a factor of 10.

The different reward scale is a puzzling one. As far as I understand the theory, 1 to -1 should work better than 10 to -10. In our example, however, I found 10 to -10 to work much better. I’m not quite sure why that is.

Let’s play

Time to play some games. The player is implemented in DeepExpDoubleDuelQPlayer.

Yeah! I ran this quite a few times and it doesn’t always get to 100% in the end, but more often than not it does. Even when it does not, it gets quite close.

I think that’s a wrap!

Conclusion

How hard can it be to train a Neural Network to play Tic Tac Toe well? As we found out, harder than one would think. While even simple Neural Networks seem to be able to play better than random, mastering the game seems to be quite challenging for a Neural Net, and we had to employ quite a large set of tricks and techniques to get there in the end.

Now, unfortunately, I wasn’t particularly patient and methodical. Instead of making one small change after the other, we often added quite a few at a time. So we don’t really know what did the trick in the end, and maybe most of what we did wasn’t actually that important. I did, however, do quite a bit of experimenting with other configurations, and I couldn’t find any short-cuts that would produce reliable results. Feel encouraged to try yourself and report back if you have more luck.

The other thing is the huge number of hyper-parameters we ended up with:

  • Input (Game state) Encoding
  • Learning rate
  • Reward discount (gamma)
  • Rewards for winning, losing, and draw
  • Initial random move probability
  • Decrease rate of random move probability
  • Batch-size for experience replay
  • Number of pre-training games
  • Target network update rate (tau)
  • Regularization loss scale factor (beta)
  • Number of CNN layers
  • Kernel size and number of filters for each CNN layer
  • Number and size of fully connected layers
  • Activation functions
  • Optimizer method

All of these can be tuned, often with a significant, and sometimes surprising, effect on the outcome. For each choice of hyper-parameters we need to run many iterations to get statistically significant results, and each run takes a significant amount of time.

Other things to try

I think I’m going to leave it there, ending on a high note. There are however some other approaches that one could try:

Instead of learning the Q function, one could try to have the Neural Network learn the policy directly, similar to the approach in Simple Reinforcement Learning with Tensorflow: Part 2 — Policy-based Agents.

One could also try an Asynchronous Actor-Critic Agents (A3C) approach, which has a good chance of working even better than our Deep Q Learning approach.

I might give this a try sometime in the future, or if you want to have a go at it, let us know how you went.
