Part 4 — Neural Network Q Learning, a Tic Tac Toe player that learns — kind of

Carsten Friedrich
Jun 6, 2018


In the previous part we implemented a player that uses a table to learn the Q function. This worked quite well, in particular because Tic Tac Toe has only very few states and each state has only very few possible moves. For a more complicated game, such as Go or Chess, the tabular approach would not be feasible.

What if we could write a program that mimics the behaviour of the Q function without actually having to store the exact value for every state and action? Obviously, after a bit of practice, humans are able to spot certain patterns in a game position and deduce general rules for how to react to them. Can we teach a program to do something similar?

This is exactly the idea behind using a Neural Network. Neural Networks are generally used to do one of the following two things:

  1. Classify the input: E.g. if the input is a picture, the output could be what kind of animal is in the picture.
  2. Mimic a complex function (also called regression): Given an input value for the complex function, correctly predict what the output would be.

We will use the second use-case and train a Neural Network to mimic the Q function. Ideally, using significantly less space than the tabular Q function does.

Time for some acknowledgements:

Most of what I know about reinforcement learning with TensorFlow, I got from the excellent tutorials by Arthur Juliani.

After that I read Reinforcement Learning — An Introduction by Richard S. Sutton and Andrew G. Barto for a more in-depth understanding of the topic.

I also found the tutorial Deep reinforcement learning, battleship by Jonathan Landy quite interesting and useful. The, at least to me, rather unorthodox choice of feeding the reward into the network via the learning rate, however, I find somewhat puzzling. I suspect that mathematically it is somehow equivalent to feeding the reward in via the label, but exactly how this works is not obvious to me.

Preparations

In order to execute the code in this notebook you will have to install TensorFlow. At the time of writing this, the instructions of how to do so can be found here. If the link should no longer work, a quick Google should be able to point you in the right direction.

When installing TensorFlow, you will have to decide between two options: installing with or without GPU support.

If you do not have a modern GPU, this choice is simple: Install without GPU support.

If you do have a modern GPU, you can try to install the GPU version, but this is a much more complex and difficult process than installing the non-GPU version. In particular, you will have to install all kinds of GPU libraries before you can install TensorFlow. Only do so if you are comfortable with complex software installations and if you are able to deal with the messy fallout that inevitably happens if you get stuck in the middle of the process and have to roll back what you have done so far. Should you succeed, however, the code in this and the following notebooks will run noticeably faster than without GPU support.

A short introduction to Artificial Neural Networks

Artificial Neural Networks are made up of Nodes. A Node takes one or more inputs. It combines the inputs linearly, i.e. it multiplies each input i_x with a dedicated weight w_x for that input and adds them all up. On top of that it adds another value, the so-called bias. It then applies an activation function f_a to the result and sends the result to one or more outputs O.

Side Note: The bias effectively turns the linear function of the node into an affine linear function, i.e. a line that does not necessarily run through the point (0,0). Without it, no matter what the trained weights say, the output for input 0 would always be 0. There are other ways to achieve this, and indeed you will find them in some books, e.g. adding another input with a fixed value of 1 to every node. Mathematically this is just the same.

The activation function allows us to achieve non-linearity. Without it we would only be able to mimic linear functions. Also, without it, multi-layer networks would not make any sense: they would just be a combination of linear functions, and as any algebra textbook will tell us, a combination of linear functions is no more than a linear function itself. I.e. we could just collapse all layers into one and do the whole computation more efficiently.
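As an illustration of what a single node computes, here is a minimal sketch in Python; the function name, the example numbers, and the choice of ReLU as the activation are my own, not taken from the article's code:

    import numpy as np

    def node_output(inputs, weights, bias):
        """Output of a single node: activation(sum of input_x * weight_x, plus bias)."""
        z = np.dot(inputs, weights) + bias  # linear combination of the inputs plus the bias
        return max(0.0, z)                  # ReLU as an example activation function

    # Three inputs, their dedicated weights, and a bias
    print(node_output(np.array([1.0, 0.5, -1.0]), np.array([0.2, 0.4, 0.1]), 0.3))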

Nodes that have no connections coming in are called Input Nodes and nodes having no connections coming out are called Output Nodes. Nodes are arranged in layers, with the first layer consisting of Input Nodes and the last layer consisting of Output Nodes. The other layers are also often referred to as Hidden Layers as the user will only ever interact with the Input and Output layers.

A simple network with one hidden layer may look like this (source Wikipedia):

By setting the weights in the nodes just right, the neural network can mimic complex functions with the same number of input and output values.

The special thing about an artificial neural network is that it can be trained to learn those weights and biases. By giving the Neural Network feedback on how good its output was, it can adjust its weights and biases in a process called Gradient Descent. For networks with more than one hidden layer it also uses a process called Back-propagation to apply the gradient to layers that are further away from the output layer.

For a more detailed introduction to Artificial Neural Networks, see Wikipedia, or any number of other sources a quick Google will yield.

A short introduction to TensorFlow

TensorFlow is an Open Source Machine Learning framework from Google. It allows us to specify, train, and run Artificial Neural Networks at a very high level in Python. I will not give a detailed introduction to how it works, or how to use it, here. I will comment on the code that we use as I go, but if you get stuck and it all just doesn’t make any sense, please read some of the introductory resources at TensorFlow Get Started or similar TensorFlow tutorials and then come back here and give it another go.

A Neural Network to play Tic Tac Toe

To train a Neural Network to play Tic Tac Toe, we need to define the following things:

  • The topology of the neural network, i.e. what the input and output layers look like, and how many hidden layers we use and how big they are.
  • A loss function. The loss function will take the output of the Neural Network and return a value indicating how good that output was.
  • A training part which will try to adjust the weights in the Neural Network so as to minimize the loss function.

The basic Tic Tac Toe Q learning Graph

We will experiment with some different graphs, but the basic shape will always be as follows:

  • An input layer which takes a game state, i.e. the current board, as input.
  • One or more hidden layers.
  • An output layer which will output the Q value for all possible moves in that game state.
  • As loss function we will use Mean Squared Error which is a generic and popular loss function for regression, i.e. learning to mimic another function.
  • The inputs to the loss function will be the output of the Neural Network and our updated estimate of the Q function, obtained by applying the discounted rewards and maximum Q values of the next states. I.e. the loss will be the difference between the output of the Neural Network and our updated estimate of the Q function.
  • We will mostly use the Gradient Descent Optimizer for training — i.e. to adjust the weights in the Neural Network. There are other reasonable, or potentially even better, options as well. Feel free to experiment — report back how it went.

There are many other ways we could do this:
We could feed an action into the network together with the board state and have a single output value indicating the value of this action.
We could also have just a single output value encoding the value of the state and use it as a proxy for the state / action pairs that lead to that state.
And many more.

The input layer

There are also many options for how to encode the input to the Neural Network. We could just have a single node and feed a unique hash value of the board into it. Or we could feed in an array with each element encoding the piece on that square as an integer.

Conventional wisdom, however, seems to be that Neural Networks work best in cases like this with binary arrays as input. Blindly trusting this wisdom, our input will be an array of 27 (= 3 * 9) bits, with the first 9 bits set to 1 at the positions of the Crosses, the next 9 bits set to 1 at the positions of the Naughts, and the final 9 bits set to 1 at the empty positions:
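As a concrete sketch, this encoding could be built as follows; the piece constants and the function name are illustrative assumptions, not the article's actual code:

    import numpy as np

    CROSS, NAUGHT, EMPTY = 1, 2, 0  # illustrative piece constants

    def board_to_input(board):
        """Encode a 9-cell board as 27 bits: crosses first, then naughts, then empty cells."""
        board = np.array(board)
        return np.concatenate([(board == CROSS).astype(np.float32),
                               (board == NAUGHT).astype(np.float32),
                               (board == EMPTY).astype(np.float32)])

    # An empty board: the first 18 bits are 0, the last 9 bits are 1
    print(board_to_input([EMPTY] * 9))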

I didn’t actually try any of the other options. Feel free to give it a go and let me know how it went. In particular if you find that one of the other options actually works better.

The output layer

The output layer will have 9 nodes / output values, one for each position of the board that we could play. We will interpret the value of a node as the Q value of the corresponding move.

We will be ignoring the fact that for a particular board state some positions would already be taken and no longer be an option. The player will deal with this when choosing a move and ignore illegal moves no matter what their Q values are. That is, we do not try to teach the Neural Network which moves are legal and which are not. Again, the general advice you find is that this is the better approach — we have to be careful though to also ignore these moves when computing max_a Q(S′, a) for the Q value update.

Time to look at the code

The first version of our Neural Network Q-Learning Player can be found in the file SimpleNNQPlayer.py.

Let’s look at some selected parts of the code. The file defines two classes:

  • QNetwork which builds the TensorFlow graph.
  • NNQPlayer which implements the game playing logic and utilizes the QNetwork class to determine moves.

A closer look at QNetwork

The class QNetwork has two important methods: add_dense_layer and build_graph.

The method add_dense_layer is a utility method which simply adds a new layer of size output_size with activation function activation_fn on top of the layer input_layer. Optionally, it will also attach the name name to the layer.
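In spirit it is a thin wrapper; here is a minimal sketch, assuming the TensorFlow 1.x API the article uses (the actual code in SimpleNNQPlayer.py may differ in detail):

    import tensorflow as tf

    def add_dense_layer(input_layer, output_size, activation_fn=None, name=None):
        """Add a fully connected layer of output_size nodes on top of input_layer."""
        return tf.layers.dense(input_layer, output_size,
                               activation=activation_fn, name=name)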

The method build_graph is where the actual graph is built.

This method creates placeholders for the board input (self.input_positions) as well as for the target Q values we will supply during training (self.target_input).

It then builds the TensorFlow graph starting with the input layer. It adds one hidden layer with 9 times as many nodes as the input layer (which has BOARD_SIZE * 3 nodes), i.e. 243 nodes. The number of nodes was more or less arbitrarily chosen, despite the fact that different values will very likely perform differently. Feel encouraged to experiment and report back.

It uses ReLU as the activation function. There are other options you could try, e.g. tanh.

After that we add the output layer (self.qvalues) and attach to that our Mean Squared Error loss function and the training function based on the Gradient Descent Optimizer.

There is one more layer, self.probabilities, which converts the Q values to corresponding action probabilities using the Softmax function. If we choose an action according to its probability in self.probabilities, we will choose actions with high Q values proportionally more often than those with low Q values. We can use this for exploration when we don't always necessarily want to make the move with the single highest Q value.
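Putting the pieces described above together, a build_graph along these lines might look roughly as follows. This is a hedged sketch assuming the TensorFlow 1.x API; the learning rate, layer names, and exact structure of the real method may differ:

    import tensorflow as tf

    BOARD_SIZE = 9

    def build_graph(learning_rate=0.01):
        # Placeholders for the 27-bit board encoding and the target Q values used in training
        input_positions = tf.placeholder(tf.float32, shape=(None, BOARD_SIZE * 3), name='inputs')
        target_input = tf.placeholder(tf.float32, shape=(None, BOARD_SIZE), name='targets')

        # One hidden layer with 9 times as many nodes as the input layer (243) and ReLU activation
        hidden = tf.layers.dense(input_positions, BOARD_SIZE * 3 * 9,
                                 activation=tf.nn.relu, name='hidden')

        # Output layer: one Q value per board position
        q_values = tf.layers.dense(hidden, BOARD_SIZE, name='q_values')

        # Softmax turns the Q values into action probabilities for exploration
        probabilities = tf.nn.softmax(q_values, name='probabilities')

        # Mean Squared Error between predicted and target Q values, minimized by Gradient Descent
        loss = tf.losses.mean_squared_error(target_input, q_values)
        train_step = tf.train.GradientDescentOptimizer(learning_rate).minimize(loss)

        return input_positions, target_input, q_values, probabilities, train_step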

All in all this is a very simple network, with only one, small hidden layer:

Green Nodes are where we feed input into the Network and Orange Nodes are where we retrieve the results.

A closer look at NNQPlayer

This class implements the Player interface and provides the actual game playing logic. For each move, it feeds the current board state into its Neural Network and then plays the move with the highest Q value.

Actually, it plays the move with the highest Q value amongst all legal moves. In this implementation we do not try to teach the Network to recognise and discard illegal moves - we just ignore them when making a move.

While playing the game we record the data that we will later need to train the network. We record all board states, the Q values of all moves in those states, all moves that were made during the game, and the maximum Q value of all moves in the new state for every move (i.e. the Q value which we pass down with a discount).

After the game is over we update the Q value of every move according to the formula:

Q(S, A) = γ * max_a Q(S′, a)

with γ being the discount factor and max_a Q(S′, a) the maximum Q value of all possible moves in state S′, with S′ being the state we ended up in after doing move A in state S.

We then feed the game states and our updated Q value estimates into the Gradient Descent Optimizer and ask it to adjust the network's weights such that its Q value outputs move closer to our updated estimates.

The Q values for a board state can be computed in get_probs:
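The original listing is not reproduced here; in spirit it runs the encoded board through the network in a TensorFlow session, roughly like this sketch (all names are assumptions):

    def get_probs(sess, input_positions_ph, probabilities_op, q_values_op, board_input):
        """Run one encoded board through the network; return action probabilities and Q values."""
        probs, qvals = sess.run([probabilities_op, q_values_op],
                                feed_dict={input_positions_ph: [board_input]})
        return probs[0], qvals[0]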

The computation of the next move happens in move:
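Again as a hedged sketch of the idea rather than the actual code: given the Q values for the current board, the player simply picks the legal position with the highest Q value and ignores the rest:

    import numpy as np

    def choose_move(q_values, legal_moves):
        """Return the index of the legal move with the highest Q value."""
        masked = np.full(len(q_values), -np.inf)           # illegal moves get minus infinity
        masked[list(legal_moves)] = np.asarray(q_values)[list(legal_moves)]
        return int(np.argmax(masked))

    # Example: position 4 is already taken, so the best remaining position (1) is chosen
    print(choose_move([0.1, 0.3, 0.2, 0.0, 0.9, 0.1, 0.0, 0.0, 0.05], [0, 1, 2, 3, 5, 6, 7, 8]))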

The new Q value estimates get computed in calculate_targets:
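A sketch of how such targets could be computed from the data recorded during the game, following the formula above; the variable names and the example discount factor are my own assumptions:

    import numpy as np

    GAMMA = 0.95  # discount factor, an example value

    def calculate_targets(q_values_log, action_log, next_max_log):
        """For every recorded move, replace the Q value of the chosen action with the
        discounted maximum Q value of the state the move led to."""
        targets = []
        for q_values, action, next_max in zip(q_values_log, action_log, next_max_log):
            target = np.copy(q_values)
            target[action] = GAMMA * next_max
            targets.append(target)
        return targets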

And the Neural Network gets trained in final_result:
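The training step itself then amounts to a single run of the Gradient Descent operation, feeding in the recorded board encodings and the freshly computed targets; once more a sketch with assumed names rather than the file's actual code:

    def train_network(sess, train_step_op, input_positions_ph, target_input_ph,
                      board_inputs, targets):
        """Adjust the network weights so its Q values move towards the updated estimates."""
        sess.run(train_step_op, feed_dict={input_positions_ph: board_inputs,
                                           target_input_ph: targets})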

Time to run the code

So, how well does this network perform? Let’s have it play a couple of games and see.

Summary

How did we go overall? Here is a summary of our results with this very simple Neural Network based player:

Player       | NN Player 1st           | NN Player 2nd
-------------+-------------------------+------------------------
Random       | Not bad but not perfect | Kind of but not really
Min Max      | Mixed — All or nothing  | Mixed — All or nothing
Rnd Min Max  | Sometimes / Mixed       | Nope

Not as good as I would have hoped, given that mastering Tic Tac Toe is not a particularly hard challenge — at least for a human, even of very young age. So what's going on? Join us again next time when we try to find out why we get these rather underwhelming results, and whether we can find a way to build better networks that overcome these limitations.
