Part 3 — Tabular Q Learning, a Tic Tac Toe player that gets better and better

Carsten Friedrich
Jun 6, 2018


In this part, we will introduce our first player which actually uses a machine learning approach to playing Tic Tac Toe. The machine learning approach we will use is called Reinforcement Learning, and the particular variant we will use is called Tabular Q Learning. In the following, we will introduce all three concepts: Reinforcement Learning, the Q function, and the tabular Q function, and then put them together to create a Tabular Q-Learning Tic Tac Toe player.

What is Reinforcement Learning?

The basic idea behind reinforcement learning is quite straightforward: learn by doing and observing the results of your actions. This concept is motivated by how humans seem to learn in many cases: we do something, and if it had positive consequences we are more likely to do it again, whereas if it had negative consequences we are generally less likely to do it again.

At each step, the player (also interchangeably called agent) takes a good look at the current environment and makes an educated guess of what it thinks is the best action in the given situation. It executes the selected action and observes how that action has changed the environment. It also receives feedback on how good or bad its action was in that particular situation: if it chose a good action it will get a positive reward, if it chose a bad action it will get a negative reward. Based on that feedback, it can then adjust its educated guessing process. When it encounters the same situation again in the future, it should hopefully be more likely to choose the same action again if the feedback was positive and more likely to choose a different action if it was negative.

The following graphic illustrates this process:

In many situations, the feedback / reward that the agent gets after choosing an action will not necessarily be very strong or conclusive. E.g. when making a move at the beginning of a game of Tic Tac Toe, it will be unclear if the chosen action was good or bad and thus the reward will be neutral.

In fact, going forward we will only ever return a positive or negative reward for the final move in a game. If the player won, the last move will get a positive reward, if it lost it will get a negative reward, and in the case of a draw it will get no reward.

One fundamental challenge of Reinforcement Learning is attributing that final reward back to previous actions, to reflect the fact that those earlier moves likely contributed to the ultimate success or failure in some way. Based on the assumption that the further in the past an action happened, the less likely it is that it was instrumental in the final outcome, we will discount the final reward at each step back in time.

Let’s look at an example: assume we have played a game of Tic Tac Toe, choosing moves based on our current best guesses, and won. None of our moves got a positive or negative reward apart from the very last one, which got a 1 as it won the game. Now, it is more than likely that the victory was due not only to the quality of the last move, but also to at least some of our previous moves being good moves as well. The further in the past a move was, the less likely it is that it contributed significantly to the outcome. So, we might give the second last move a discounted reward of 0.8, the one before that a reward of 0.5, the one before that a reward of 0.1, etc.
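
Just to illustrate the mechanics of this backward discounting, here is a tiny sketch. The discount factor gamma of 0.9 is a purely hypothetical choice for this illustration, and the simple geometric discounting shown here is a simplification of the actual update rule we will derive further below:

```python
# Toy illustration: discount a final reward backwards through the moves of
# one game. gamma = 0.9 is a hypothetical choice for this example only.
gamma = 0.9
final_reward = 1.0  # we won the game

moves = ["move 1", "move 2", "move 3", "move 4", "move 5"]

reward = final_reward
for move in reversed(moves):
    print(f"{move}: discounted reward {reward:.2f}")
    reward *= gamma
# Prints 1.00 for the last move, then 0.90, 0.81, 0.73, 0.66 for earlier moves.
```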

The theory is that, over many games, moves that are particularly strong and important at the beginning of the game will still end up very likely to be chosen, as they receive positive rewards over and over again, adding up over time even though each individual reward is heavily discounted. If I remember correctly, with certain constraints on the set-up and number of iterations this can be proven to be true.

Side Note: A more detailed introduction to Reinforcement Learning can be found on Wikipedia, and the ultimate, authoritative introduction is probably the book Reinforcement Learning: An Introduction by Richard S. Sutton and Andrew G. Barto. Draft versions of this book seem to be easy to find and download on the internet, but as I’m not sure about the legality of those I will not provide links here. In the January 2018 draft version, the tabular Q-learning approach from this tutorial can be found in part 1, chapter 6.5 (“Part 1: Tabular Solution Methods -> 6 Temporal Difference Learning -> 6.5 Q-Learning: Off-policy TD Control”).

I learned most of what I know about implementing reinforcement learning with TensorFlow from the excellent tutorials by Arthur Juliani.

What is the Q function?

The next concept we need to understand is the Q function. The Q stands for Quality, and the Q function is the function that assigns a quality score to a state-action pair. I.e. given a state S and an action A, the function Q: (S, A) ↦ ℝ returns a real number reflecting the quality of taking action A in state S.

We will assume that this function exists and is deterministic, but its concrete mapping of states and actions to quality values is not necessarily known to us. Q-learning is all about learning this mapping and thus the function Q.

If you think back to our previous part about the Min-Max Algorithm, you might remember that the values we computed there for each state seem very similar to the values that the Q function returns. And indeed they are related. In the Min Max algorithm, we assign quality values to states, whereas here we assign quality values to actions that can be chosen in a given state.

In a deterministic environment like playing Tic Tac Toe against an opponent that always plays the best possible move, the value of a particular action in a particular state indeed closely reflects the value of the following state. When playing against a Random Player, however, this is not necessarily the case anymore. Depending on which move the randomly playing opponent chooses in reaction to our action, we will end up in different next states with different Q values. The Q value of the action will thus be the combination of the values of all possible next states, weighted by their respective probabilities of occurring. This makes the Q function dependent on the opponent. If the opponent is an ideal player, always choosing the best possible move, the Q function again closely reflects the Min Max scores.

What is the tabular Q function?

Tabular in this context simply means that we will store the Q function in a lookup table. I.e. we create a table where we store the Q value for each possible State and Move. It may look something like this (the values in this example are completely made up and not the real Q function for Tic Tac Toe):

Side Note: In the code we will obviously not use a picture of the board as key to the table. We will use a unique integer hash for the board instead. The class Board contains the method def hash_value(self) -> int: that returns such a unique hash value.
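
As a rough sketch of what such a table can look like in code (the actual TQPlayer implementation may differ in its details), a plain dictionary mapping a board hash to an array of Q values, one per cell, is enough:

```python
import numpy as np

class QTable:
    """Minimal sketch of a tabular Q function for Tic Tac Toe."""

    def __init__(self, q_init: float = 0.6, num_cells: int = 9):
        self.q_init = q_init        # initial Q value, discussed further below
        self.num_cells = num_cells  # one Q value per board cell / possible move
        self.table = {}             # board hash -> np.ndarray of Q values

    def get(self, board_hash: int) -> np.ndarray:
        # Lazily create a row the first time we encounter a state.
        if board_hash not in self.table:
            self.table[board_hash] = np.full(self.num_cells, self.q_init)
        return self.table[board_hash]
```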

How do we use the Q function to play Tic Tac Toe?

Assuming we know the Q function, we can play Tic Tac Toe by looking up the Q values for all possible moves in the current situation and choosing the one with the highest value. If more than one possible move shares the highest value, we choose randomly amongst them. Having the highest value means that this move is the best move in this situation.
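
A minimal sketch of this policy, reusing the hypothetical QTable from above; available_moves stands in for however the Board class exposes the list of currently empty cells:

```python
import numpy as np

def choose_move(q_table: QTable, board_hash: int, available_moves: list) -> int:
    """Pick the legal move with the highest Q value, breaking ties randomly."""
    q_values = q_table.get(board_hash)
    # Only consider moves that are actually legal in this position.
    best_q = max(q_values[m] for m in available_moves)
    best_moves = [m for m in available_moves if q_values[m] == best_q]
    return int(np.random.choice(best_moves))
```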

Going forward we will also use the term Policy to refer to strategies such as this.

How do we compute the actual Q function for Tic Tac Toe?

This is where we put all the above together:

First we create a Q table with a row for every possible state and initialise all values to the same number, i.e. initially all moves are equally likely to be picked.

Now we use Reinforcement Learning to learn the real Q function. We do this by repeatedly playing Tic Tac Toe against an opponent and at the end of each game update the Q value of all moves in the game according to the game result. For a win we will award a reward of 1 to the last move, for a loss a reward of 0 and for a draw we will give a reward of 0.5.

The final move will get the reward as its new Q value. For all the other moves in that game we will use the following formula to update their Q values:

Q(S,A) = Q(S,A) + α ∗ (γ ∗ max_a Q(S′,a) − Q(S,A))

with S being the current state, A the current action, S′ the state after doing A, α being the learning rate, γ being the discount factor, and max_a Q(S′,a) the highest Q value of any move in the next state S′, i.e. the Q value of the best move in the following state.

We can also rewrite this formula as:

Q(S,A) = (1 − α) ∗ Q(S,A) + α ∗ γ ∗ max_a Q(S′,a)

which perhaps makes it easier to see what the learning rate does: it determines how far we move our current Q value towards the discounted maximum of its successors. If we chose α to be 0, we wouldn’t change anything; if we chose α to be 1, we would completely replace the old value with the new value; and if we chose α to be 0.5, we would get the average of the old value and the new value. In the computer player below we will choose α to be 0.9 and γ to be 0.95.

Side Note: A short detour to explain this formula step by step:

Let’s start with the inner-most bit:

Q(S,A) = max_a Q(S′,a)

This basically says: If we do action A in state S we will get to state S′ and the value of doing action A in state S should be the same as the value of the best possible action in that next state S′.
This is seen as being a bit too optimistic, so we actually want to say something more like: the value of doing action A in state S should be almost as good as the value of the best possible action in that next state S′. I.e. we need to apply a discount to that value. This also reflects the fact that the further back in time an action was, the less crucial it likely was for the end result. We achieve this by multiplying with a discount factor γ with 0<γ<1:

Q(S,A) = γ ∗ max_a Q(S′,a)

Finally, we don’t want to completely replace the old Q value of our action with each update. What if we just got lucky and our opponent made a mistake, i.e. we won, but the action wasn’t actually that good? In order to account for this, we set the new Q value to a value somewhere between the old Q value and the potential new Q value. We use an additional parameter, the learning rate α, to set how much the new value will change the old value, and get the final formula as above:

Q(S,A) = (1 − α) ∗ Q(S,A) + α ∗ γ ∗ max_a Q(S′,a)

By repeatedly playing game after game and updating the Q values after each game, the Q function will converge to the true Q values for Tic Tac Toe.
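
To make the mechanics concrete, here is a hedged sketch of such an end-of-game update, again built on the hypothetical QTable above rather than on the actual TQPlayer code. move_history is assumed to be the list of (board hash, move) pairs our player recorded during the game, in playing order:

```python
ALPHA = 0.9                       # learning rate
GAMMA = 0.95                      # discount factor
WIN, DRAW, LOSS = 1.0, 0.5, 0.0   # final rewards as described above

def update_q_values(q_table: QTable, move_history: list, final_reward: float):
    """Apply the Q-learning update backwards through one finished game."""
    # The final move simply receives the reward as its new Q value.
    last_hash, last_move = move_history[-1]
    q_table.get(last_hash)[last_move] = final_reward

    # All earlier moves are pulled towards the discounted value of the best
    # move in the state that followed them (our next turn in the game).
    next_hash = last_hash
    for board_hash, move in reversed(move_history[:-1]):
        q_values = q_table.get(board_hash)
        best_next = q_table.get(next_hash).max()
        q_values[move] = (1 - ALPHA) * q_values[move] + ALPHA * GAMMA * best_next
        next_hash = board_hash
```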

How to initialise the Q Table?

The one thing we haven’t discussed yet is what value we choose to initialise the Q table with. This value does actually matter.

Pessimistic Initialisation

If we initialise all values with 0, the player will be very pessimistic, i.e. it will assume every move leads to a loss, and is likely to settle for anything that is better than losing. E.g. if the player achieves a draw before its first win, it will increase the values of the moves that led to the draw while all other move values will still be 0. In subsequent games it will favour using those moves again over trying something new with the potential chance of actually winning.

Optimistic Initialisation

On the other hand, if we initialise the values to 1, the player will be very optimistic and expect every move to lead to victory. The player is thus less likely to settle for a draw as best possible outcome and will actively explore other options first. The downside however is that it will learn quite slowly as it will exhaust every other option before settling on a strategy that leads to a draw.

Feel encouraged to try different values in the code below and compare performance. By default the player will initialise the values with 0.6, i.e. be slightly optimistic but not take too long to be dissuaded from a bad choice. In practice, as we know that a draw is the best outcome one can expect, a value of 0 should thus converge fastest.
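
With the hypothetical QTable sketch from above, trying different initial values is just a matter of passing a different q_init:

```python
pessimistic = QTable(q_init=0.0)  # assumes every move loses until proven otherwise
optimistic = QTable(q_init=1.0)   # assumes every move wins; explores a lot first
default = QTable(q_init=0.6)      # slightly optimistic, the default used below
```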

Putting it all together

The class TQPlayer implements an agent playing Tic Tac Toe and learning its Q function on the way. Let’s pit it against some of the players we have previously created and see how it goes. We will have it play several battles of 100 games against the same player and see how its performance improves.

First we define a new utility function eval_players which takes 2 players and the number of battles as input, then executes the battles and prints a plot of the results. You can also optionally pass the number of games per battle.
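
The following is only a rough sketch of what such a utility could look like; it assumes a helper battle(p1, p2, num_games), not shown here, that plays the given number of games and returns the number of wins for each player and the number of draws. The real code may differ in its details:

```python
import matplotlib.pyplot as plt

def eval_players(p1, p2, num_battles: int, games_per_battle: int = 100):
    """Play a series of battles between p1 and p2 and plot how the results evolve."""
    p1_wins, p2_wins, draws, game_counts = [], [], [], []
    games_played = 0
    for _ in range(num_battles):
        # battle is assumed to play games_per_battle games between the two
        # players and return (p1 wins, p2 wins, draws).
        w1, w2, d = battle(p1, p2, games_per_battle)
        games_played += games_per_battle
        p1_wins.append(w1)
        p2_wins.append(w2)
        draws.append(d)
        game_counts.append(games_played)

    plt.plot(game_counts, p1_wins, label="Player 1 wins")
    plt.plot(game_counts, p2_wins, label="Player 2 wins")
    plt.plot(game_counts, draws, label="Draws")
    plt.xlabel("Games played")
    plt.ylabel(f"Results per {games_per_battle} games")
    plt.legend()
    plt.show()
```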

Now we use this function to see how our players perform against each other:

Conclusions: TQPlayer vs Min Max

Both the TQPlayer and the Min Max player have a severe limitation: they are only feasible for games with a small number of states. The Min Max player needs to be able to traverse the tree of possible continuations, which is not feasible when there are as many continuations as there are in Chess or Go. The TQPlayer needs to store the Q values for every state, which is also not feasible for games like Chess or Go.

The Min Max player has the advantage that it plays perfectly from the start. No need to learn anything. The TQPlayer on the other hand needs to play a lot of games to play well.

The TQPlayer, however, is able to learn effective strategies against opponents that play less than perfectly. Min Max always assumes the opponent will make the best possible move in every situation. If that’s not the case, the Min Max player is not able to exploit this flaw in its opponent. The TQPlayer is more flexible and can learn an optimal strategy against any opponent.

Feel encouraged to experiment yourself with various different settings and see how it goes!
