Teaching agents to play tic-tac-toe using Reinforcement Learning

Kaneel Senevirathne
Aug 12, 2021


The first time I heard about reinforcement learning (RL) was in one of my neuroscience classes in grad school. Back then, I remember being fascinated by the whole idea, and today I am even more excited about recent developments in RL and its future. In this article, I am finally writing about my first exposure to RL, where I used Q-Learning (a simple RL algorithm) to teach two agents to play tic-tac-toe. First I will briefly introduce RL, then I will explain how the algorithm learns the best moves to play, and finally I will go over the results I obtained.

Reinforcement Learning

Reinforcement Learning (RL) is a learning approach inspired by how animals learn. The key concept of RL is very simple, and we use it in almost every aspect of our lives. Imagine you have never had a cup of coffee, and one day you walk past Starbucks and decide to buy one. After drinking the coffee, you realize that it tastes good and also helps you stay focused all day. Both of these benefits were rewarding to you, and now the chance of you buying a cup of coffee again the next day is high. Now let’s try to connect this example to the concepts of RL. Initially you (the agent) have never had coffee before (the agent’s state at the beginning). Then you decide to buy a cup of coffee (the agent’s action), and after drinking the coffee you realize it was tasty and gave you extra energy (the reward). The strength of the reward you received will now influence your future behavior and actions. If your action had a positive reward, you will tend to take that action more frequently in the future, whereas if the action had a negative reward, you will tend to avoid it in the future.

In the machine learning world, RL is a branch of learning algorithms in which an agent tries to maximize rewards in a given environment. Many of the most exciting applications of RL have been in games. For example, Alphabet’s AlphaGo Zero learned to play Go, one of the most complex board games in the world, from scratch using RL. Another one of my favorite examples is multi-agent hide and seek. In this simulation published by OpenAI, the agents learn how to play hide and seek using RL, and it is amazing to see the strategies they eventually come up with. Beyond games, there are several real-world applications of RL, such as advertisement recommendation systems and financial trading.

Teaching agents to play tic-tac-toe

Now let’s talk about how we can use this concept to teach an agent to play tic-tac-toe. First, we need to define what a state ‘s’ is in the game. In my version of the algorithm, I defined the state as the list of all the moves played by both agents up to that point in the game. Let’s take the following example board configuration at the end of a game.

Example board configuration at the end of a game

In this example, player 1 is ‘O’ and player 2 is ‘X’. The end state of the game is [1, 4, 5, 8, 9], and the winner is player 1. (Note that the numbers represent the cells of the board, counted left to right and row by row, with 1 being the top-left cell and 9 the bottom-right cell.) At the end of this game, player 1’s state list will be rewarded and player 2’s state list will be punished. So how do we assign rewards and punishments to each player’s actions at the end of the game? To do this, we can use the Q-Function.
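To make this encoding concrete, here is a minimal sketch in Python of how a state list like the one above could be checked for a winner, assuming the cell numbering just described (1 is the top-left cell, 9 is the bottom-right). The names WIN_LINES and winner are hypothetical helpers for illustration, not code from the original project.

```python
# Cells are numbered 1..9, left to right and top to bottom. A game state is
# the ordered list of cells played so far: since moves alternate, odd positions
# in the list belong to player 1 ('O') and even positions to player 2 ('X').

WIN_LINES = [(1, 2, 3), (4, 5, 6), (7, 8, 9),   # rows
             (1, 4, 7), (2, 5, 8), (3, 6, 9),   # columns
             (1, 5, 9), (3, 5, 7)]              # diagonals

def winner(state):
    """Return 'O', 'X', or None for an ordered list of moves."""
    o_cells = set(state[0::2])   # player 1 ('O') moved on turns 1, 3, 5, ...
    x_cells = set(state[1::2])   # player 2 ('X') moved on turns 2, 4, 6, ...
    for line in WIN_LINES:
        if set(line) <= o_cells:
            return 'O'
        if set(line) <= x_cells:
            return 'X'
    return None

# The example above: 'O' played cells 1, 5, 9 and 'X' played cells 4, 8.
print(winner([1, 4, 5, 8, 9]))   # -> 'O' (the 1-5-9 diagonal)
```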

Q-Learning and the Q-Function

Q-Learning is an RL algorithm that learns the value of taking an action in a given state. To learn these values, it uses the Q-function.

Mathematically, this can be expressed as,

Q(s, a) = Q(s, a) + α * [ R(s, a) + γ * max Q(s’, a’) - Q(s, a) ]

  • Q(s, a) - The value of state ‘s’ when you pick action ‘a’. For example, if the current state of the board is [1, 2] (player 1 has played the first cell and player 2 the second cell) and player 1 decides to play cell 9, then Q(s, a) is the value assigned to playing cell 9 while in state [1, 2].
  • R(s, a) - The reward or punishment given by the result at the end of the game. In tic-tac-toe, this is determined by whether the game ends in a win, a loss, or a draw. For example, we can give a reward of +1 for a win, -1 for a loss, and 0.5 for a draw.
  • max Q(s’, a’) - Here we look at the possible future states and pick the state-action pair with the maximum value. For instance, assume our current state is [1, 2] and in previous games we have already reached [1, 2, 4] and [1, 2, 9], with values 3 and 4 respectively. Since [1, 2, 9] has a value of 4, the maximum among the possible future states, max Q(s’, a’) here is 4.
  • γ and α - These are two constants used to fine-tune the algorithm (in the ML world we also call these hyperparameters). γ is called the discount rate and α is called the learning rate. The discount rate balances immediate and future rewards and takes a value between 0 and 1. The learning rate controls how much the algorithm weighs new information relative to old values, and it also takes a value between 0 and 1.

So basically, the Q-Function updates the value of each state after every game. Once the agent plays thousands of iterations of tic-tac-toe, these values converge toward their optimal values and the agent learns the best moves to play.
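To make the update concrete, here is a minimal sketch of a tabular Q-update in Python. It stores values in a plain dictionary keyed by (state, action) pairs and plugs in the [1, 2] example from the list above; ALPHA, GAMMA, and all of the names here are assumed for illustration and are not the author’s original code.

```python
# A minimal sketch of the tabular Q-update described above. States are stored
# as tuples of moves so they can be used as dictionary keys.

ALPHA = 0.3   # learning rate (assumed value)
GAMMA = 0.9   # discount rate (assumed value)

q_table = {}  # maps (state, action) -> estimated value, default 0

def q_value(state, action):
    return q_table.get((tuple(state), action), 0.0)

def q_update(state, action, reward, next_state, next_actions):
    """Q(s,a) <- Q(s,a) + alpha * [R(s,a) + gamma * max_a' Q(s',a') - Q(s,a)]."""
    # max over a' of Q(s', a'); a finished game (no moves left) contributes 0.
    future = max((q_value(next_state, a) for a in next_actions), default=0.0)
    old = q_value(state, action)
    q_table[(tuple(state), action)] = old + ALPHA * (reward + GAMMA * future - old)

# Worked example from the list above: in state [1, 2] the agent plays cell 9.
# Suppose a previously seen continuation from [1, 2, 9] is worth 4.
q_table[((1, 2, 9), 4)] = 4.0
q_update([1, 2], 9, reward=0.0, next_state=[1, 2, 9],
         next_actions=[3, 4, 5, 6, 7, 8])
print(q_value([1, 2], 9))   # 0 + 0.3 * (0 + 0.9 * 4 - 0) ≈ 1.08
```

During training, the end-of-game reward (+1, -1, or 0.5) enters the update as R(s, a) on the final move, and over many repeated games it propagates backwards to earlier states through the max Q(s’, a’) term.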

Results after 10,000 iterations of self-play

In order to train the algorithm, we can pit the Q-Learning agent against other agents. In my project, I trained it against a random agent as well as against another Q-Learning agent. Now let’s dive into the results. (Click here if you’d like to go to the code.)

First off, let’s see what happens when two random agents play. In this case, both players choose their moves randomly. We should expect the first player to have an advantage over the second player because of the nature of tic-tac-toe.
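This baseline is easy to reproduce. The short sketch below plays two uniformly random agents against each other for 10,000 games and tallies the outcomes; the names are illustrative rather than taken from the project code, but the tallies it prints should land close to the percentages reported below.

```python
import random

# The eight winning lines of the board, using the 1..9 cell numbering above.
WIN_LINES = [(1, 2, 3), (4, 5, 6), (7, 8, 9),
             (1, 4, 7), (2, 5, 8), (3, 6, 9),
             (1, 5, 9), (3, 5, 7)]

def play_random_game():
    """Play one game in which both players pick uniformly among free cells."""
    state, free = [], list(range(1, 10))
    while free:
        move = random.choice(free)
        free.remove(move)
        state.append(move)
        # Only the player who just moved can have completed a line.
        mover_cells = set(state[0::2]) if len(state) % 2 == 1 else set(state[1::2])
        if any(set(line) <= mover_cells for line in WIN_LINES):
            return 'O' if len(state) % 2 == 1 else 'X'
    return 'draw'

tally = {'O': 0, 'X': 0, 'draw': 0}
for _ in range(10_000):
    tally[play_random_game()] += 1
print({k: f"{100 * v / 10_000:.0f}%" for k, v in tally.items()})
```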

Here are the results after 10,000 iterations of two random players competing with each other. As we expected, Player ‘O’, who always plays first, has a higher chance of winning. The winning percentages are:

Player O winning percentage (Random agent) - 59%
Player X winning percentage (Random agent) - 29%
Percentage of draws - 12%

Now let’s see what happens when we replace Player O with an RL agent. Since the agent is still learning at the beginning, in the first few hundred games both agents win a similar number of games. However, after multiple iterations, the RL agent becomes the clear-cut winner.

As we expected, the RL agent wins almost all the games towards the end. Let’s look at the percentages:

Player O winning percentage (RL agent) - 88%
Player X winning percentage (Random agent) - 7%
Percentage of draws - 5%

We can check whether our algorithm is really learning by making the RL agent play second. From the results of the two random agents, we saw that the first player has a clear advantage. So if we let the RL agent play second, it should still learn to beat the random agent after a few hundred iterations. Let’s see what happens when the RL agent plays second.

As we expected, after about 3,000 iterations the RL agent learns to dominate the random agent. Since the random agent now has the first-move advantage, its winning percentage is higher than in the previous scenario. Eventually, however, the RL agent becomes the clear winner. Below are the percentages.

Player O winning percentage (Random agent) - 28%
Player X winning percentage (RL agent) - 58%
Percentage of draws - 14%

Now let’s look at the final scenario, where an RL agent plays another RL agent. In this case, I would expect most games to end in a draw. Since both agents are trying to maximize reward and both play increasingly well, settling for a draw becomes safer than risking a loss by forcing a win, so we should see more draws in this case.

As we expected, the draw rate is higher in this scenario. We can also see that player ‘O’ still has a slight advantage over player ‘X’ by playing first. The winning percentages for each player are below.

Player O winning percentage (RL agent) - 13%
Player X winning percentage (RL agent) - 4%
Percentage of draws - 83%

I also found the reward distribution in this scenario interesting. (Note that I used a reward of +1 for a win, -1 for a loss, and 0.5 for a draw.)

Looking at the reward distribution, we can see that both players have state values close to zero. This could be because most of the states belonged to games that ended in draws.

Conclusion

As I said earlier, this was one of my first exposures to Reinforcement Learning (RL), and it was very enjoyable! We talked about RL and Q-Learning, taught the Q-Learning agent to play tic-tac-toe against different opponents, and finally examined the results. It’s awesome to see that we can teach an artificial agent to play tic-tac-toe from scratch!

In the future, I am planning to see whether I can use a deep RL architecture similar to the one used in AlphaGo Zero by DeepMind (a deep neural network combined with Monte Carlo Tree Search), and I think it’s going to be a fun one!

Hope you enjoyed the article. Thanks for reading!
