TDS Archive

An archive of data science, data analytics, data engineering, machine learning, and artificial intelligence writing from the former Towards Data Science Medium publication.

An Introductory Reinforcement Learning Project: Learning Tic-Tac-Toe via Self-Play Tabular Q-learning


In this article, I’ll walk through an introductory project on tabular Q-learning. We’ll train a simple RL agent to evaluate tic-tac-toe positions and return the best move, by having it play many games against itself.

First, let’s import the required libraries:
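A minimal sketch of the imports this walkthrough relies on (just NumPy for the board and Q-table, and matplotlib for the plots):

import numpy as np
import matplotlib.pyplot as plt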

Note that tabular Q-learning only works for environments that can be represented by a reasonably small number of states and actions. Tic-tac-toe has 9 squares, each of which can be an X, an O, or empty, so there are at most 3⁹ = 19683 states (many of which are unreachable in actual play) and 9 actions. That gives us a table with 19683 x 9 = 177147 cells. This is not small, but it is certainly feasible for tabular Q-learning. In fact, we could exploit the fact that tic-tac-toe is unchanged by rotations and reflections of the board, so there are far fewer unique states if you consider rotations and reflections of a particular configuration to be the same. I won’t get into deep Q-learning (which is what you would reach for when the state space is too large to tabulate), because this is intended to be an introductory project.

Next, we initialize our Q-table with the aforementioned shape:
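A sketch of that initialization, assuming the table simply starts at all zeros:

N_STATES = 3 ** 9    # every assignment of {empty, X, O} to the 9 squares
N_ACTIONS = 9        # one action per square

Q = np.zeros((N_STATES, N_ACTIONS))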

Now, let’s set some hyperparameters for training:
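The exact values are a matter of tuning; the ones below are illustrative placeholders rather than the settings used in the original notebook:

alpha = 0.1             # learning rate (illustrative)
gamma = 0.9             # discount factor (illustrative)
epsilon = 1.0           # initial probability of taking a random move
epsilon_min = 0.01      # floor for epsilon (illustrative)
epsilon_decay = 0.9995  # multiplicative decay per episode (illustrative)
num_episodes = 100_000  # number of self-play games (illustrative)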

Now, we need to set up an exploration strategy. Assuming you understand the exploration-exploitation tradeoff in RL, the exploration strategy is the way we will gradually decrease epsilon (the probability of taking a random action). We need to initially play at least semi-randomly in order to properly explore the environment (the possible tic-tac-toe board configurations). But we cannot take random actions forever, because RL is an iterative process that relies on the assumption that the evaluation of future reward gets better over time. If we simply played random games forever, we would be trying to associate a random list of actions with a final game result that has no actual dependency on any particular action we took.
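One common choice is a simple exponential decay toward a small floor; the actual schedule in the original code may differ, but a sketch looks like this:

def decay_epsilon(eps):
    """Multiplicatively decay epsilon, never letting it drop below epsilon_min."""
    return max(epsilon_min, eps * epsilon_decay)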

Now, let’s create a graph of epsilon vs. episodes (number of games simulated) with matplotlib, saving the figure to an image file:
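Under the illustrative schedule above, the plot could be produced like this:

# Simulate the decay schedule over the planned number of episodes.
eps_values = []
eps = epsilon
for _ in range(num_episodes):
    eps_values.append(eps)
    eps = decay_epsilon(eps)

plt.plot(eps_values)
plt.xlabel("Episode")
plt.ylabel("Epsilon")
plt.title("Exploration schedule")
plt.savefig("epsilon_schedule.png")
plt.close()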

When we start to simulate games, we need to set some restrictions so that the agents can’t make illegal moves. In tic-tac-toe, occupied squares are no longer available, so we need a function that returns the legal moves for a given board configuration. We will represent our board as a 3x3 NumPy array, where unoccupied squares are 0, X’s are 1, and O’s are -1. We can use NumPy’s np.argwhere to retrieve the indices of the 0 elements.
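A sketch, returning actions as flat indices 0–8 so they line up directly with the columns of the Q-table:

def legal_moves(board):
    """Return the flat indices (0-8) of all empty squares on a 3x3 board."""
    return np.argwhere(board.flatten() == 0).flatten()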

We also need a helper function to convert between the 3x3 board representation and an integer state. We’re storing the future reward estimates in a Q-table, so we need to be able to index any particular board configuration with ease. My algorithm for converting the board format described above into an integer works by repeatedly partitioning the range of possible states: each of the 9 cells splits the current window into thirds. For each cell in the board (a sketch follows the list below):

  • If the cell is -1, you don’t change state
  • If the cell is 0, you change state by one-third of the window size
  • If the cell is 1, you change state by two-thirds of the window size.
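Following that scheme, the conversion might look like:

def board_to_state(board):
    """Encode a 3x3 board of {-1, 0, 1} values as an integer in [0, 3**9)."""
    state = 0
    window = 3 ** 9
    for cell in board.flatten():
        window //= 3              # each cell splits the current window into thirds
        if cell == 0:
            state += window       # shift by one-third of the previous window
        elif cell == 1:
            state += 2 * window   # shift by two-thirds of the previous window
        # a cell of -1 leaves the state unchanged
    return state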

Finally, we need one last helper function to determine when the game has reached a terminal state. This function also needs to return the result of the game if it is indeed over. My implementation checks the rows, columns, and diagonals for three consecutive 1’s or three consecutive -1’s by taking the sum of the board array along each axis. This produces three row sums and three column sums. If -3 appears among them, that line must contain all -1’s, indicating that the player corresponding to -1 won; a sum of 3 indicates the opposite. The diagonals work just the same, except there are only 2 diagonals, while there are 3 rows and 3 columns. My original implementation was a bit naive; I found a much better one online that is shorter and slightly faster.
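A sketch along those lines, assuming the helper returns both a game-over flag and the result (1, -1, or 0 for a draw or an unfinished game):

def check_result(board):
    """Return (game_over, result) for a 3x3 board of {-1, 0, 1} values."""
    sums = list(board.sum(axis=0)) + list(board.sum(axis=1))
    sums.append(board.trace())             # main diagonal
    sums.append(np.fliplr(board).trace())  # anti-diagonal
    if 3 in sums:
        return True, 1
    if -3 in sums:
        return True, -1
    if not (board == 0).any():             # board full with no winner -> draw
        return True, 0
    return False, 0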

Now, let’s initialize some lists to record training metrics.

past_results will store the result of each simulated game, with 0 representing a draw, 1 indicating that the player represented by 1 won, and -1 indicating that the player represented by -1 won.

win_probs will store a running percentage, updated after each episode: the fraction of games played so far in which either player won. draw_probs records the corresponding fraction of games that ended in a draw.
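Initializing them is straightforward:

past_results = []   # per-episode outcome: 1, -1, or 0 (draw)
win_probs = []      # running fraction of games won by either player
draw_probs = []     # running fraction of games ending in a draw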

After training, if we were to graph win_probs and draw_probs, they should demonstrate the following behavior.

  1. Early in training, the win probability will be high, while the draw probability will be low. This is because when both opponents are taking random actions in a game like tic-tac-toe, there will more often be wins than draws simply due to the existence of a larger number of win states than draw states.
  2. Mid-way through training, when the agent begins to play according to its table’s policy, the win and draw probabilities will fluctuate with symmetry across the 50% line. Once the agent starts playing competitively against itself, it will encounter more draws, as both sides are playing according to the same strategic policy. Each time the agent discovers a new offensive strategy, there will be a fluctuation in the graph, for the agent is able to trick its opponent (itself) for a short period of time.
  3. After fluctuating for a while, draw probabilities should approach 100%. If the agent was truly playing optimally against itself, it would always encounter a draw, for it is attempting to maximize reward according to a table of expected future rewards… the same table being used by the opponent (itself).

Let’s write the training script. For each episode, we begin at a non-terminal state: an empty 3x3 board filled with 0’s. At each move, with probability epsilon, the agent takes a random action from the list of available squares. Otherwise, it looks up the row of the Q-table corresponding to the current state and selects the action that maximizes the expected future reward. The integer representation of the new board state is computed, and we record the tuple (s, a, s’). Once the game ends, we need to correlate each recorded state-action pair with the final game result, so we refer back to each recorded tuple and update the corresponding cell of the Q-table according to the following:

Q(s, a) ← Q(s, a) + α [ R(s, a) + γ max_{a′} Q(s′, a′) − Q(s, a) ]

Q-learning update rule

In the above update formula, s is the integer representation of the state, a is the integer representation of the action the agent took at state s, alpha is the learning rate, R(s, a) is the reward (in our case, the end result of the game in which this pair (s, a) was observed), Q is the Q-table, and the term involving max represents the maximum expected reward for the resulting state. Say the board configuration was:

[[0, 0, 0],
[0, 0, 0],
[0, 0, 1]]

and we took action 3, corresponding to the cell at coordinate (1, 0); the resulting state would be:

[[0, 0, 0],
[-1, 0, 0],
[0, 0, 1]]

This part of the update formula refers to the maximum expected reward over all of the actions we could take from here, according to the policy defined by our current Q-table. Therefore, s′ is the second state I just described, and a′ ranges over all of the actions we could theoretically take from this state (0–8), although in reality some are illegal (which turns out not to matter here).

Every 1000 episodes, I save the list of training metrics along with a plot of them. At the end of training, I save the Q-table and the final metric lists.
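Putting the pieces together, here is a condensed sketch of the self-play training loop, using the illustrative hyperparameters and helper functions from earlier and applying the final game result as the reward for every recorded (s, a, s′) tuple; the periodic checkpointing every 1000 episodes is omitted for brevity:

for episode in range(num_episodes):
    board = np.zeros((3, 3), dtype=int)
    history = []      # (state, action, next_state) for every move in this game
    player = 1        # 1 moves first; the players alternate

    game_over, result = False, 0
    while not game_over:
        state = board_to_state(board)
        moves = legal_moves(board)

        if np.random.random() < epsilon:
            action = np.random.choice(moves)            # explore: random legal move
        else:
            action = moves[np.argmax(Q[state, moves])]  # exploit: best legal move

        board[action // 3, action % 3] = player
        history.append((state, action, board_to_state(board)))

        game_over, result = check_result(board)
        player *= -1

    # Propagate the final result back through every recorded (s, a, s') tuple.
    for state, action, next_state in history:
        target = result + gamma * np.max(Q[next_state])
        Q[state, action] += alpha * (target - Q[state, action])

    # Record the running metrics and decay epsilon.
    past_results.append(result)
    wins = sum(1 for r in past_results if r != 0)
    win_probs.append(wins / len(past_results))
    draw_probs.append(1 - win_probs[-1])
    epsilon = decay_epsilon(epsilon)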

Results

I trained mine with Google Colab’s online GPU, but you can train yours locally if you’d like; you don’t necessarily have to train all the way to convergence to see great results.

Just as I previously mentioned, the relationship between games terminating in a win/loss and those terminating in a draw should work as follows:

  • Earlier in training, an unskilled, randomly-playing agent will frequently encounter win-loss scenarios.
  • Each time the agent discovers a new strategy, there will be fluctuations.
  • Towards the end of training, near convergence, the agent will almost always encounter a draw, as it is playing optimally against itself.

Therefore, the larger fluctuations in the graph indicate moments when the agent learned to evaluate a particular board configuration very well, which temporarily allowed it to win games that would otherwise have ended in draws.

We can see this is clearly demonstrated in the resulting graph:

Win/Loss-Draw Ratio Throughout Training

Throughout the middle of training, it frequently appears as if the q-table will converge, only to quickly change entirely. These are the aforementioned moments when a significant strategy was exploited for the first time.

Also, as you can see, the fluctuations occur more rarely as training progresses. This is because there are fewer yet-to-be-discovered tactics remaining as training goes on. Theoretically, if the agent fully converged, there would never be another large fluctuation like this: draws would occur 100% of the time, and after the rapid rise in the draw percentage, it would not fall back down again.

I decided it would be a good idea to visualize the change in the Q-values over time, so I retrained the agent while recording the sum of the absolute values of the Q-table at each episode. Regardless of whether any particular Q-value is positive or negative, recording the sum of all absolute Q-values shows us when convergence is occurring: the rate at which this sum changes shrinks as we approach convergence.
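Tracking this only takes one extra line per episode inside the training loop (q_magnitudes is a hypothetical list added for this purpose):

q_magnitudes = []   # one entry per episode

# ... inside the training loop, after the Q-table update:
q_magnitudes.append(np.abs(Q).sum())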

Win/Loss-Draw Ratio Throughout Training + Sum of Absolute Value of Q-table throughout Training

You can visit the full code on Google Colab here:

Or on GitHub here:

Experimenting with the exploration strategy will influence training. You can change the parameters relating to epsilon, as well as how it is decayed, to get different results.

One final thing to note is that tic-tac-toe can be approached much more easily with simpler value iteration methods, because the transition dynamics and the rewards are fully known. This sort of epsilon-greedy Q-learning is really unnecessary for an environment like tic-tac-toe.
