Applying AI to Tic Tac Toe

Sai Coumar
14 min read · Jan 11, 2024


For those in the Artificial Intelligence (AI) space, Tic Tac Toe has long been a simple example to point to when introducing fundamental concepts. Unfortunately, the fine details of applying AI to Tic Tac Toe often get overlooked because of its simplicity. In this article, I’m going to get down and dirty with all the theory behind the agents we can apply to Tic Tac Toe and how they perform.

I won’t cover implementation details here, but if you’re interested in that, you can check it out here: https://github.com/saiccoumar/TicTacToe

If you’re interested in the follow-up evaluation article check that out here: https://medium.com/@saicoumar/tic-tac-toe-ai-tournament-analyzing-strategies-and-showcasing-superiority-2b05232eb572

Problem Statement:

Let’s quickly cover some basics. The goal of any AI is to solve problems by acting humanly, thinking humanly, acting rationally, and/or thinking rationally. All of our approaches will aim to satisfy some combination of these four principles.

Now let’s cover the problem. Tic Tac Toe is a turn-based game where players take turns placing their sign in squares on a 3 x 3 board. The first player to get 3 of their signs in a row (horizontally, vertically, or diagonally) is the winner. If no winner is decided by the time all the squares are filled, the players tie. This seems obvious, but it’s important because we need it to define Tic Tac Toe in the context of AI.

Let’s define Tic Tac Toe as an AI task using the PEAS system

Performance Metrics: Win Rate, Efficiency, and Robustness

Environment: The 3 x 3 Tic Tac Toe board and the agent’s sign

Actuators: An agent can mark any open square with their sign

Sensors: Observation of the board

The actuators and sensors were more relevant to the implementation, so we’ll focus on performance and environment while covering the theory. The takeaway is that each agent knows what its mark is, takes the board as input, and decides which position to play.

Every agent has this configuration, with varying “black boxes” where the algorithm’s logic computes the output.

There are three levels of AI agents we’ll be using: Reflexive Rule-Based agents, Adversarial Search agents, and Learning-Based models.

Reflexive Rule-Based Agents:

Reflexive rule-based agents are the simplest approach. They take predefined knowledge that human experts come up with and turn it into policies that the agent enforces. The agent considers the board, checks for conditions based on the game state, and makes an associated choice based on the conditions that are fulfilled. In the context of Tic Tac Toe, coming up with these policies is relatively easy. An easy example of this is the random number generator approach.

RNG Agent

The random number generator approach is as simple as it gets. The agent observes the board, finds the open positions, then picks one at random. Our “rule” is simply to make random moves.

It is currently X’s turn. The agent looks for open positions (green) and randomly picks the top right square
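As a minimal sketch in Python (assuming the board is a flat list of nine cells holding "X", "O", or None; the function name is mine, not necessarily the repository’s):

import random

def rng_move(board):
    # Collect the indices of all open squares and pick one uniformly at random.
    open_positions = [i for i, cell in enumerate(board) if cell is None]
    return random.choice(open_positions)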

It’s debatable if this is even “intelligent”. RNG is a strategy that has the capacity to work…just not well. Let’s use some real strategies.

Corner/Center Agent

These two agents use an actual strategy to play. Generally, as humans, we tend to take the corner or center squares since they give us more opportunities to win.

The corner agent identifies free corners and picks one to take
The center agent identifies the center space and takes it if it’s free

In both of these examples the agent now only needs one more move to win, whereas the RNG agent may have chosen the top middle square, requiring two more moves to win. When the corners or center are taken, these agents default back to RNG.
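Sketches of both rules, reusing the board representation and rng_move from above (the index constants are assumptions about how the board is flattened):

CORNERS = [0, 2, 6, 8]  # corner squares on a board flattened row by row
CENTER = 4              # the middle square

def corner_move(board):
    # Take a free corner if one exists; otherwise fall back to a random move.
    free_corners = [i for i in CORNERS if board[i] is None]
    return random.choice(free_corners) if free_corners else rng_move(board)

def center_move(board):
    # Take the center if it's free; otherwise fall back to a random move.
    return CENTER if board[CENTER] is None else rng_move(board)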

These agents are a bit better than RNG but still use an extremely naive approach and, worse, might blunder. If there is a winning game state that requires picking a square other than a corner or the center, the agents might fixate on the corners or center and lose the game because of it. Similarly, these agents have no loss-prevention methodology.

One-Step

One-Step aims to solve the problems of the previous agents by checking the board for an obvious win (or loss) before moving.

The agent finds a square that will give it a win (and prevent a loss)

One-Step’s foresight works by taking all open positions, checking whether placing its own mark there would create a board that gives it a win, and then checking whether the opponent placing their mark there would create a board that gives it a loss. It literally goes one step into the future and checks whether it will immediately win or lose.

If a winning opportunity is found, One-Step will capitalize on it. If no winning opportunity is found but an opportunity to prevent a loss is detected, it will make that move to continue the game. Otherwise it will default to RNG to make moves and continue the game.
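A sketch of that logic, assuming a helper winner(board) that returns "X", "O", or None for the current board:

def one_step_move(board, my_mark, opponent_mark):
    open_positions = [i for i, cell in enumerate(board) if cell is None]

    # 1. If placing our mark somewhere wins immediately, take that square.
    for pos in open_positions:
        board[pos] = my_mark
        wins = winner(board) == my_mark
        board[pos] = None
        if wins:
            return pos

    # 2. Otherwise, if the opponent could win on their next move, block that square.
    for pos in open_positions:
        board[pos] = opponent_mark
        loses = winner(board) == opponent_mark
        board[pos] = None
        if loses:
            return pos

    # 3. Otherwise default to a random move.
    return rng_move(board)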

Combined Rules Agent

Combining policies can help cover more possibilities, thereby increasing the win rate. The Combined Rules Agent uses several rules as a checklist for decision making. The first rule is to check for forks. Forks in Tic Tac Toe happen when a player has 2 possible ways to win and the opponent can only block one of them with their turn. The second rule is the One-Step rule to check for obvious wins and losses. Finally, the agent uses the corner and center rules to start the game with an advantage, and RNG when no other rules match from there. We check the conditions in reverse chronological order because forking and One-Step only trigger after the game is underway, and we don’t want the opening rules firing in the endgame. A sketch of the full checklist is shown below.

Combined rules can be used to lead the agents to more winning combinations. Corner/Center rules help initiate while One-Step handles the endgame.
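As a rough sketch, with fork_move and win_or_block_move as hypothetical helpers that return a square index, or None when their rule doesn’t apply (and reusing the constants from the Corner/Center sketch):

def combined_rules_move(board, my_mark, opponent_mark):
    # Rules are tried in priority order: forks, then one-step wins/blocks,
    # then the center and corner openers, then random as a last resort.
    rules = [
        lambda: fork_move(board, my_mark),                            # create (or deny) a fork
        lambda: win_or_block_move(board, my_mark, opponent_mark),     # the One-Step rule
        lambda: CENTER if board[CENTER] is None else None,            # center rule
        lambda: next((i for i in CORNERS if board[i] is None), None)  # corner rule
    ]
    for rule in rules:
        move = rule()
        if move is not None:
            return move
    return rng_move(board)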

Let’s take a moment to discuss the performance of the Combined Rules Agent in Tic Tac Toe. The previous agents could play the game, but were by no means strong. In my testing, the Combined Rules Agent was one of the strongest agents, capable of consistently beating the Monte Carlo Tree Search algorithm and drawing against minimax when it was player 1. Why is that the case?

Intermission: Game State Spaces

The game state space is the set of all possible game states. With the RNG Agent, the reachable game state space was basically all game states, since it could make choices leading to any possible board combination.

Rule-based agents increase the chances of winning by limiting the game state space. Corner/Center agents start the game with moves that tend to lead to wins, which limits the game state space to a subset with fewer states that result in a loss. One-Step checks the entire game state space one step into the future (the board after picking a position x) for winning game states and picks the position that gets it closer to one.

Now, let’s consider Tic Tac Toe. There are 9 positions on the Tic Tac Toe board each with 3 possible values (X, O, Blank).

That means there are 3⁹ = 19,683 possible ways the board can be filled in. That number might seem large, but for reference it’s estimated that there are between 10¹¹¹ and 10¹²³ possible games of chess and more than 10¹⁷⁰ legal board positions in Go.

Even without the capacity of modern computers, it’s VERY easy to make rules that cover that entire game state space and deterministically force the best available outcome. Tic Tac Toe is considered a “solved game” because of this trait.

When an agent, like my Combined Rules Agent, can make generalized rules that restrict the game state space in the opening (corners/center) as well as the mid-game (forks) and the endgame (One-Step), it can filter the game state space down to a subset that usually contains winning outcomes, which is why we see the Combined Rules Agent performing extremely well.

The only way to beat this would be to either run an exhaustive search where we check every outcome, or to make a rule-based agent with MORE restrictive rules that dictate exactly how to move to win.

Adversarial Search Algorithms

Minimax

https://en.wikipedia.org/wiki/Minimax

Minimax is an exhaustive search algorithm that searches for the best possible outcome. It works by minimizing the possible loss in the worst-case (maximum-loss) scenario, recursively alternating calls to min and max to do so.

Visual of minimax with alpha beta pruning: https://en.wikipedia.org/wiki/File:Animation_of_alpha-beta_pruning.gif

Alpha/beta pruning is an optimization strategy I included that prevents searching redundant leaf nodes. In the example above, the second min node in the 4th layer can only be minimized to 2 or lower. Since the 1st min node has already settled on a value of 3, the max between the 1st min node (3) and the 2nd min node (2 or lower) would be 3 anyway, so it doesn’t matter whether there’s a value lower than 2 under the 4th layer’s 2nd min node.

max-value(state, alpha, beta):
    v = -∞
    if state is terminal: return eval(state)
    for successor in state.successors:
        v = max(v, min-value(successor, alpha, beta))
        if v >= beta:
            return v            # prune: the min node above will never allow a value this high
        alpha = max(alpha, v)
    return v

min-value(state, alpha, beta):
    v = +∞
    if state is terminal: return eval(state)
    for successor in state.successors:
        v = min(v, max-value(successor, alpha, beta))
        if v <= alpha:
            return v            # prune: the max node above will never allow a value this low
        beta = min(beta, v)
    return v

search():
    alpha, beta = -∞, +∞
    for successor in root.successors:
        score = min-value(successor, alpha, beta)
        successor.V = score
    return root.successors.get_best_action()

Understanding minimax can be a little difficult, so seeing the pseudocode can make it much more intuitive.

The evaluation function for minimax is a heuristic function which assigns a value to a result. For Tic Tac Toe, this can be as basic as -10 for a loss, 0 for a draw, and 10 for a win, but the more nuanced your evaluation function is, the better minimax will theoretically perform. For example, I made another evaluation function which gave a reward of 20 for winning by fork, 10 for just winning, 0 for a draw, -10 for a loss, and -20 for a loss by fork. This makes minimax aim to win by forking rather than just winning. In Tic Tac Toe it’s not necessary to have the most nuanced evaluation function, but in other problems a better evaluation function can greatly increase performance.
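As a small illustration, the basic scoring scheme described above might look like this in Python, reusing the winner(board) helper assumed earlier (the fork-aware variant would simply add the ±20 cases):

def evaluate(board, my_mark, opponent_mark):
    # +10 for a win, -10 for a loss, 0 for a draw or an unfinished game.
    result = winner(board)
    if result == my_mark:
        return 10
    if result == opponent_mark:
        return -10
    return 0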

An example of minimax for tic tac toe without pruning.

Minimax works particularly well for Tic Tac Toe. Since the size of the game state space is small, the search can be completed in a reasonable time and the exhaustive nature produces consistently accurate results. Minimax consistently beat all the other agents and drew or won against the Combined Rules Agent.

The overhead was still a little long for my liking. When the board was empty, minimax often did lengthy computations and froze for a few seconds just to choose squares 3 and 6 every single time. The easy solution is to add opening rules to minimax that pick 3 and 6 at the start, but I wanted to try handling this with another algorithm.

Monte Carlo Tree Search

The Monte Carlo Tree Search (MCTS) algorithm was one I was particularly excited to implement. It’s famous for being used alongside neural networks in chess engines to make some of the best chess bots in the world.

MCTS works by gradually expanding a tree with child nodes and running simulations to estimate a value for each action in the tree. Formally, the steps are:

  1. Select a state
  2. Expand the state
  3. “Rollout” from that state
  4. Backpropagate values back to previous states

The state selection is a little intricate. If the state hasn’t been fully expanded into its children, we return the state so it will be expanded in the next step. If it has been fully expanded, then either it has terminal children, in which case we roll out the same state, or we recursively traverse down into a child and check whether that child or its children should be expanded. We pick which child to descend into based on its Upper Confidence Bound (UCB) value, defined by the following formula:
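In its standard UCB1 form (the exact exploration constant can vary between implementations), the score for a child state s is

UCB(s) = value(s) / visits(s) + c * sqrt( ln(visits(parent(s))) / visits(s) )

where value(s) is the total reward backpropagated through s, visits(s) is the number of simulations that have passed through s, and c is an exploration constant, commonly √2. The first term favors children that have produced good results so far, while the second favors children that have been explored less.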

The expansion is straightforward. If the state hasn’t been expanded we find one of its children and add it to our MCTS tree.

“Rolling out” refers to the simulation. From the current game state we play out random moves until the game ends and return a value based on whether or not our agent won.

Backpropagating involves adding the rollout’s value to the node and incrementing a counter for the number of simulations that have passed through it. We then repeat this for the parent’s state, the grandparent’s state, and so on until we reach the root node.

This creates a “running frontier” which expands over time until resources run out. If MCTS creates a full tree and reaches all terminal values, it continues sampling to make the values more and more accurate.

Example of the MCTS tree being built. Squiggly lines represent rollouts
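To make the four steps concrete, here is a condensed sketch of the whole loop. It assumes game helpers legal_moves, apply_move (which returns a new board), other (which swaps marks), is_terminal, and the winner helper from earlier; none of these names are taken from the repository, and rollouts are scored purely from our agent’s perspective as described above.

import math
import random

class Node:
    def __init__(self, board, player, parent=None, move=None):
        self.board, self.player = board, player  # state and the mark to move next
        self.parent, self.move = parent, move
        self.children = []
        self.visits, self.value = 0, 0.0

    def ucb(self, c=math.sqrt(2)):
        # Unvisited children are explored first; otherwise use the UCB1 score.
        if self.visits == 0:
            return float("inf")
        return self.value / self.visits + c * math.sqrt(math.log(self.parent.visits) / self.visits)

def select(node):
    # Walk down by UCB until we reach a node that isn't fully expanded.
    while node.children and len(node.children) == len(legal_moves(node.board)):
        node = max(node.children, key=Node.ucb)
    return node

def expand(node):
    # Add one untried child of this node to the tree (or return the node if terminal).
    tried = {child.move for child in node.children}
    untried = [m for m in legal_moves(node.board) if m not in tried]
    if not untried or is_terminal(node.board):
        return node
    move = random.choice(untried)
    child = Node(apply_move(node.board, move, node.player), other(node.player),
                 parent=node, move=move)
    node.children.append(child)
    return child

def rollout(node, agent_mark):
    # Play random moves to the end of the game and score the result for our agent.
    board, player = list(node.board), node.player
    while not is_terminal(board):
        board[random.choice(legal_moves(board))] = player
        player = other(player)
    w = winner(board)
    return 1 if w == agent_mark else 0 if w is None else -1

def backpropagate(node, reward):
    # Add the rollout value and simulation count to every ancestor up to the root.
    while node is not None:
        node.visits += 1
        node.value += reward
        node = node.parent

def mcts_move(board, agent_mark, iterations=10_000):
    root = Node(list(board), agent_mark)
    for _ in range(iterations):
        leaf = expand(select(root))
        backpropagate(leaf, rollout(leaf, agent_mark))
    return max(root.children, key=lambda c: c.visits).move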

MCTS is effective because you don’t need to search the entire game state space exhaustively and can always get a value and refine it over more iterations. Unfortunately, the Tic Tac Toe game state space is so small that the algorithm doesn’t compare well to an exhaustive search. It only required ~1,000 to ~10,000 iterations to return viable results against simple agents, but it would often blunder moves against smarter agents and humans; even at the higher iteration counts, sampling via simulations was easily outplayed by exhaustive deterministic searches.

As the game progresses past the initial moves, exhaustive searches actually perform fewer evaluations than MCTS runs iterations. With 9 open positions on the board there are 3⁹ possible future board combinations, but once only 7 positions remain open there are just 3⁷ = 2,187 possible future board combinations, which already falls below the 10,000 iterations we used for MCTS. This is, again, unique to Tic Tac Toe, and minimax remains the better algorithm in win rate AND efficiency. MCTS did not come through today, but it remains a promising candidate for more complex future tasks.

Let’s evaluate with the performance measures we established. We have agents with a high win rate, but the Combined Rules Agent lacks robustness and minimax lacks efficiency.

Applying a learning-based model like Neural Networks may help us tackle these issues.

Neural Networks

Predict Win Neural Network Agent

https://www.geeksforgeeks.org/artificial-neural-networks-and-its-applications/

When working with any learning-based agent it’s important to collect data. As someone who studies computer science, I’m lazy and will always look for pre-existing datasets. After a quick Google search, I found the 1991 UCI Tic Tac Toe dataset:

The data is formatted as a game state followed by a positive or negative label which identifies whether the game will end in a win or a loss.

x,x,x,x,o,o,x,o,o,positive
x,x,x,x,o,o,o,x,o,positive
x,x,x,x,o,o,o,o,x,positive
x,x,x,x,o,o,o,b,b,positive
x,x,x,x,o,o,b,o,b,positive

x,x,o,x,x,o,o,b,o,negative
x,x,o,x,x,o,b,o,o,negative
x,x,o,x,x,b,o,o,o,negative
x,x,o,x,o,x,o,o,b,negative
x,x,o,x,o,x,o,b,o,negative
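Before training, these symbolic rows have to be converted into tensors. A minimal encoding sketch, where the numeric mapping is my own choice and the UCI file is assumed to be saved locally as tic-tac-toe.data:

import torch

CELL = {"x": 1.0, "o": -1.0, "b": 0.0}       # one number per square
LABEL = {"positive": 1.0, "negative": 0.0}   # binary target for the win/loss prediction

def encode_row(row):
    *cells, outcome = row                    # nine cells followed by the label
    return [CELL[c] for c in cells], LABEL[outcome]

with open("tic-tac-toe.data") as f:
    rows = [line.strip().split(",") for line in f if line.strip()]

features, labels = zip(*(encode_row(r) for r in rows))
X_train = torch.tensor(features, dtype=torch.float32)             # shape (N, 9)
y_train = torch.tensor(labels, dtype=torch.float32).unsqueeze(1)  # shape (N, 1)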

This system works by training a model to predict whether an action will result in a win and then picking the choice that is predicted to win. This is the Vanilla Neural Network model we’ll be using:

import torch
import torch.nn as nn
import torch.optim as optim

class TicTacToeModel(nn.Module):
    def __init__(self, input_size):
        super(TicTacToeModel, self).__init__()
        self.fc1 = nn.Linear(input_size, 64)
        self.relu1 = nn.ReLU()
        self.fc2 = nn.Linear(64, 1)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        x = self.fc1(x)
        x = self.relu1(x)
        x = self.fc2(x)
        x = self.sigmoid(x)
        return x

# input_size is 9 (one value per square); X_train and y_train are the encoded
# boards and win/loss labels from the dataset above.
input_size = 9
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = TicTacToeModel(input_size).to(device)
X_train, y_train = X_train.to(device), y_train.to(device)

# Define loss function and optimizer
criterion = nn.BCELoss()
optimizer = optim.Adam(model.parameters(), lr=0.01)

# Training loop
num_epochs = 5000
for epoch in range(num_epochs):
    # Forward pass
    outputs = model(X_train)
    loss = criterion(outputs, y_train)
    # Backward and optimize
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
Unfortunately, as soon as I tried this approach, I realized there were significant problems:

  1. The dataset only has endgames. That means for the first 6 moves, this model doesn’t have the right knowledge to make good predictions. To handle this I had other agents make decisions until the endgame, where the NN would take over and make predictions.
  2. Simpler algorithms perform better towards the end of the game. Rule-based agents like One-Step and searches like minimax and MCTS don’t require nearly as many resources to fully play out an endgame. Therefore, the NN’s choice is redundant and it underperforms.
  3. The endgame is the worst time to make game-defining moves. Once you’re there, you’re already likely to lose or win, so no choice will be particularly better than another. The model regularly predicts the remaining options as losing, with one just slightly better than the others.

As an endgame predictor model, this performed great. Endgame predictor models are completely useless for PLAYING Tic Tac Toe.

I was still pretty set on using a neural network, but instead of trying to act like a human, I figured that maybe the answer was to try to act like the best AI agents.

Predict Position Neural Network Agent

With this approach, I wanted to train a neural network to act with the combined judgement of the Minimax, MCTS, and Combined Rules agents. To do this, I generated 30,000 random game states and had each agent predict the next move for 10,000 of them. I then combined the results into a dataset with the game state as the input and the chosen position as the output, and passed that into a neural network.
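A sketch of how that data generation might look; generate_random_board plays random moves from an empty board (reusing the winner helper assumed earlier), and minimax_agent, mcts_agent, and combined_rules_agent are stand-ins for the actual implementations, each mapping a board to the move it would play:

import random

def generate_random_board():
    # Play a random number of random moves from an empty board to reach a mid-game state.
    board, player = [None] * 9, "X"
    for _ in range(random.randint(0, 7)):
        open_positions = [i for i, cell in enumerate(board) if cell is None]
        if not open_positions or winner(board) is not None:
            break
        board[random.choice(open_positions)] = player
        player = "O" if player == "X" else "X"
    return board

agents = [minimax_agent, mcts_agent, combined_rules_agent]  # stand-ins for the three agents
dataset = []
for agent in agents:
    for _ in range(10_000):
        board = generate_random_board()
        dataset.append((board, agent(board)))  # (game state, position the agent would play)

The model itself is almost identical to the previous one, except it now produces one score per board position: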

class TicTacToeModel(nn.Module):
    def __init__(self, input_size, output_size):
        super(TicTacToeModel, self).__init__()
        self.fc1 = nn.Linear(input_size, 256)
        self.relu1 = nn.ReLU()
        self.fc2 = nn.Linear(256, output_size)
        self.tanh = nn.Tanh()

    def forward(self, x):
        x = self.fc1(x)
        x = self.relu1(x)
        x = self.fc2(x)
        x = self.tanh(x)
        return x

# input_size is again 9; output_size is also 9, one score per board position.
model = TicTacToeModel(input_size, output_size).to(device)

# Define loss function and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.01)

# Training loop
num_epochs = 5000
for epoch in range(num_epochs):
    # Forward pass
    outputs = model(X_train)
    loss = criterion(outputs, y_train)

    # Backward and optimize
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

Neural networks have a unique trait of memorizing information and then learning generalized patterns. It’s not entirely understood why this is the case, because neural networks often have low interpretability, but it allows our NN to predict a position based on the combined experience of the three agents.

If the algorithms contradicted each other, the model might struggle to generalize patterns; in practice, the algorithms make broadly similar choices, with enough variation to give the model added robustness against exploitation.

Neural networks are also essentially just a very complex function. Rather than running an inefficient exhaustive search or thousands of simulations, you evaluate one function that maps the 9 board inputs to an output move. This makes it MUCH faster than the adversarial search algorithms.
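At play time the agent runs one forward pass and masks out occupied squares. A rough sketch, where the -1/0/+1 encoding is an assumption of mine and has to match whatever encoding the model was trained on:

def nn_move(model, board, my_mark, opponent_mark):
    # Encode the board from the agent's perspective: own mark = 1, opponent = -1, blank = 0.
    encoding = [1.0 if c == my_mark else -1.0 if c == opponent_mark else 0.0 for c in board]
    with torch.no_grad():
        scores = model(torch.tensor(encoding, device=device).unsqueeze(0)).squeeze(0)
    # Only consider squares that are still open, then play the highest-scoring one.
    open_positions = [i for i, c in enumerate(board) if c is None]
    return max(open_positions, key=lambda i: scores[i].item())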

I think this approach produced the most fun AI agent. It had a high win rate (not perfect), was robust enough that it couldn’t easily be exploited, and was efficient and quick.

Key Takeaways

Let’s review some of the important things we’ve learned about AI agents for Tic Tac Toe:

  • Reflexive Rule-Based agents are strong as they can limit the game state space to winning game states and increase their likelihood of winning
  • Minimax had the best win rate because Tic Tac Toe has a small state space
  • Monte Carlo Search Trees underperform because the game state space isn’t large enough for the value of simulations to outweigh searches
  • Neural Networks with good data can result in great agents that balance our performance criteria

Tic Tac Toe is a great case study for AI algorithms, and understanding why certain algorithms perform the way they do sets up a great foundation for applying AI to more complex tasks. If you’re interested in the code or making your own agent, check out the repository for yourself. See my other projects at https://github.com/saiccoumar

In an upcoming article I’ll be putting these agents in a tournament and seeing how they stack up against each other, so keep an eye out for that!
