TDS Archive

An archive of data science, data analytics, data engineering, machine learning, and artificial intelligence writing from the former Towards Data Science Medium publication.

Playing Games with Python — Part 1: Tic Tac Toe

11 min read · Dec 15, 2018


If you happened to glance at my very first article on Ju-Jitsu, I hinted that I have a bittersweet relationship with video games. They give me an amazing ego boost when I win, but every once in a while I want to put my fist through my monitor when some kid bests me at Rocket League. Before we get into the debatably boring technical stuff (which is the majority of this article), let’s take a trip down memory lane to explain my motivation for this article:

My history with video games stretches from the Nintendo days of trying to rescue Princess Peach to the recent toxic hellhole of esports. I’ve dipped my fingers into every genre out there — MMORPGs, Shooters, Platformers, MOBAs, etc. (Google these terms if you are a muggle), but it was the MOBAs and Shooters that I kept coming back to. If you can relate to having a dogshit internet connection, you’ll know that playing online matches is a definite no-no. So I often found myself in the ‘VS bots’ part of the games, where I could hone my skills like a gladiator, completely dominate them, feel ecstatic, go online to play a ranked match, get completely dominated, watch my elo rating go down the drain, and cry (in that exact order). It didn’t take me long to realize that the so-called AI bots bore a lot of similarity to me — by being utterly useless. Hence my motivation to create better AI bots so I have more fun playing games (yes, it’s a completely selfish reason).

Let’s now get back to the topic at hand — ‘Solving Tic-Tac-Toe with a bunch of code’. A keen reader might note that I used the phrase ‘bunch of code’ simply because I didn’t want to focus only on Reinforcement Learning techniques to solve the game, but also to explore other, admittedly inefficient, techniques such as Tree Search, Genetic Algorithms, etc. I would have loved to go for a more complex game like Chess, but being a beginner in the field, I decided to swallow my ego and start with a game with one of the smallest state spaces — Tic-Tac-Toe.

Let’s take a look at our first gladiator:

1. The Minimax Algorithm

The Minimax algorithm is a decision rule formulated for two-player zero-sum games (Tic-Tac-Toe, Chess, Go, etc.). It looks a few steps ahead and puts itself in the shoes of its opponent. It keeps playing and exploring subsequent possible states until it reaches a terminal state resulting in a draw, a win, or a loss. Being in any of these terminal states has some utility for the AI — being in a ‘Win’ state is good (utility is positive), being in a ‘Loss’ state is bad (utility is negative), and being in a draw is neither good nor bad (utility is neutral).

The Minimax Algorithm

In our implementation of the Minimax algorithm for solving Tic-Tac-Toe, the algorithm visualizes all possible future states of the board and constructs them in the form of a tree. When the current board state is given to the algorithm (the root of the tree), it splits into ‘n’ branches (where n denotes the number of moves the AI can choose from, i.e. the number of empty cells where its mark can be placed). If any of these new states is a terminal state, no further splits are performed for it, and it is assigned a score the following way:

  • score = +1 (if the AI wins)
  • score = -1 (if the AI loses)
  • score = 0 (if the game is a draw)
An example Minimax game tree for Tic-Tac-Toe

If, however, a state is not terminal, it is considered a new root and gives rise to a subtree of its own. But there’s a catch — since this is a two-player game and the players take turns alternately, whenever we go one layer deeper in the tree, we need to switch the player whose mark is placed in one of the empty cells. This way we can visualize what moves the other player would make in response to ours. The algorithm evaluates the best move to take by choosing the move with the maximum score when it is the AI’s turn, and the minimum score when it is the Human’s turn.

Writing it out in code:
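Here is a minimal sketch of the idea in Python (illustrative only: the board representation and helper names below are assumptions and may not match the repository code; the full implementation is at the link below):

# Illustrative sketch: the board is a list of 9 cells holding 'X', 'O', or ' '.
WIN_LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),   # rows
             (0, 3, 6), (1, 4, 7), (2, 5, 8),   # columns
             (0, 4, 8), (2, 4, 6)]              # diagonals

def winner(board):
    # Return 'X' or 'O' if someone has completed a line, else None.
    for a, b, c in WIN_LINES:
        if board[a] != ' ' and board[a] == board[b] == board[c]:
            return board[a]
    return None

def minimax(board, ai, human, is_ai_turn):
    # Score a board assuming both players play optimally from here on.
    champ = winner(board)
    if champ == ai:
        return 1          # AI wins
    if champ == human:
        return -1         # AI loses
    if ' ' not in board:
        return 0          # draw
    scores = []
    for i, cell in enumerate(board):
        if cell == ' ':
            board[i] = ai if is_ai_turn else human
            scores.append(minimax(board, ai, human, not is_ai_turn))
            board[i] = ' '                       # undo the move
    return max(scores) if is_ai_turn else min(scores)

def best_move(board, ai='O', human='X'):
    # Pick the empty cell whose resulting state has the highest Minimax score.
    best, best_score = None, -2
    for i, cell in enumerate(board):
        if cell == ' ':
            board[i] = ai
            score = minimax(board, ai, human, is_ai_turn=False)
            board[i] = ' '
            if score > best_score:
                best, best_score = i, score
    return best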

Complete code: https://github.com/agrawal-rohit/medium-playing-games-with-python/blob/master/Tic%20Tac%20Toe/HumanvsAI_Minimax.py

Since the state space of Tic-Tac-Toe is so small, the tree can never be deeper than 9 levels, so we don’t need techniques like alpha-beta pruning here. The thing with Minimax, however, is that it assumes a particular way of playing by the opponent. For example, a Minimax player would never reach a game state from which it could lose, even if, in fact, it always won from that state because of incorrect play by the opponent.

Here is a sample game:

New Game!
----------------
| || || |
----------------
| || || |
----------------
| || || |
----------------
Choose which player goes first - X (You - the petty human) or O(The mighty AI): O
AI plays move: 2
----------------
| || O || |
----------------
| || || |
----------------
| || || |
----------------
Oye Human, your turn! Choose where to place (1 to 9): 3
----------------
| || O || X |
----------------
| || || |
----------------
| || || |
----------------
AI plays move: 9
----------------
| || O || X |
----------------
| || || |
----------------
| || || O |
----------------
Oye Human, your turn! Choose where to place (1 to 9): 5
----------------
| || O || X |
----------------
| || X || |
----------------
| || || O |
----------------
AI plays move: 7
----------------
| || O || X |
----------------
| || X || |
----------------
| O || || O |
----------------
Oye Human, your turn! Choose where to place (1 to 9): 6
----------------
| || O || X |
----------------
| || X || X |
----------------
| O || || O |
----------------
AI plays move: 4
----------------
| || O || X |
----------------
| O || X || X |
----------------
| O || || O |
----------------
Oye Human, your turn! Choose where to place (1 to 9): 1
----------------
| X || O || X |
----------------
| O || X || X |
----------------
| O || || O |
----------------
AI plays move: 8
----------------
| X || O || X |
----------------
| O || X || X |
----------------
| O || O || O |
----------------
Draw!
Wanna try again BIYTACH?(Y/N) : n

Let’s now take a look at a much more exciting algorithm. Bring in the second gladiator:

2. Reinforcement Learning

I feel like this algorithm is easier to grasp, as you’re probably using it every day without even realizing it. Let’s take a practical example:

Say you are a dog owner. You pestered your spouse for months and finally got the little fucker, but this furball (our agent) needs to be potty trained. The physical world where our agent can interact and, well, go doodoo, is our environment. The problem is simple — we want our canine buddy to do his business in the place we desire. But how do we tell him which place is good or bad without attempting to ‘talk dog’ and looking stupid? Right, we use a reward system. Whenever our agent shits on our fancy carpet, it gets a negative reward (being called a bad boy, a revoked treat, a violent boop on the nose, etc.). Whenever it shits in front of the door of our neighbor who plays Justin Bieber on a loudspeaker at 2 am, it gets a positive reward (being called the best boy, his favorite treat, belly rubs, etc.). Over time, the agent learns that whenever it wants to let loose (a particular state), it’s better off soiling the neighbor’s pavement than the precious carpet, as that leaves it in a better state, one which grants a positive reward.

Exploitation vs Exploration

One of the fundamental tradeoffs in reinforcement learning is the exploitation vs exploration tradeoff. Exploitation means choosing the action which maximizes our reward (and may lead to being stuck in a local optimum). Exploration means choosing an action regardless of the reward it provides (this helps us discover other local optima, which may lead us closer to the global optimum). Going all out on either one is harmful: pure exploitation may leave us with a suboptimal agent, while pure exploration would just give us a stupid agent that keeps taking random actions.

A widely used strategy to tackle this problem, which I have also used in my implementation, is the epsilon-decreasing strategy. It works as follows:

  1. Initialize a variable ‘epsilon’ with a value between 0 and 1 (usually around 0.3).
  2. With probability = epsilon, we explore; with probability = 1 - epsilon, we exploit.
  3. We decrease the value of epsilon over time until it becomes zero.

Using this strategy, the agent can explore better actions during the earlier stages of training, and then it exploits the best-known actions in the later stages. It’s kind of similar to how we humans function — we explore different options and seek new experiences in our early 20s (exploration), and then we decide on a career path and keep moving down it (exploitation).
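As a rough sketch (not the exact code from the repository), the action-selection step could look something like this, assuming we already have the list of legal moves and the learned value of the state each move leads to:

import random

def choose_action(possible_moves, move_values, epsilon):
    # Explore with probability epsilon, otherwise exploit the move whose
    # resulting state has the highest learned value.
    if random.random() < epsilon:
        return random.choice(possible_moves)                      # explore
    best_index = max(range(len(possible_moves)), key=lambda i: move_values[i])
    return possible_moves[best_index]                             # exploit

# After each training game, shrink epsilon a little, for example:
# epsilon = max(0.0, epsilon - decay_rate)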

Temporal Difference Learning

The method of reinforcement learning used in this implementation is called temporal difference learning. It works on the principle that each state has a value attached to it. If a state leads to the AI winning, it has a positive value (value = 1). If the AI loses in some state, it has a negative value (value = -1). All the rest of the states have a neutral value (value = 0). These are the initial state values.

Once a game is started, our agent computes all possible actions it can take in the current state and the new states that would result from each action. The values of these states are collected from a state_value vector, which contains values for all possible states in the game. The agent then either chooses the action which leads to the state with the highest value (exploitation) or chooses a random action (exploration), depending on the value of epsilon. Throughout training, we play several games; after each move, the value of the state is updated using the following rule:

V(s) ← V(s) + α * [V(s^f) − V(s)]   (Temporal Difference update rule)

where V(s) is the value of the current state of the game board, V(s^f) is the value of the new state of the board after the agent takes some action, and α (alpha) is the learning rate / step-size parameter.

Using this update rule, the states that lead to a loss acquire a negative state value as well (whose magnitude depends on the learning rate). The agent learns that being in such a state may lead to a loss down the line, so it will try to avoid landing in that state unless necessary. On the other hand, the states that lead to a win acquire a positive state value, so the agent learns that being in such a state may lead to a win down the line and is encouraged to reach it.

The code snippet for this algorithm is as follows:
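Here is a minimal sketch of the core update step (illustrative only; state_values is assumed here to be a dictionary mapping board states to their learned values, and the naming may not match the repository code):

def td_update(state_values, prev_state, new_state, alpha=0.2):
    # Temporal difference update: V(s) <- V(s) + alpha * (V(s^f) - V(s)).
    v_prev = state_values.get(prev_state, 0.0)
    v_new = state_values.get(new_state, 0.0)
    state_values[prev_state] = v_prev + alpha * (v_new - v_prev)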

There are 2 versions of the code for this algorithm:

  1. The testing code — you can play against the trained AI: https://github.com/agrawal-rohit/medium-playing-games-with-python/blob/master/Tic%20Tac%20Toe/testing_(HumanvsAI)_ReinforcementLearning.py
  2. The training code — both players are AI agents that help train each other. This one is for my lazy squad out there: if you prefer grabbing some popcorn and letting the two AIs fight it out among themselves, then here’s the code: https://github.com/agrawal-rohit/medium-playing-games-with-python/blob/master/Tic%20Tac%20Toe/training_(AIvsAI)_ReinforcementLearning.py

Below is a sample game against a bot trained for ~10000 epochs.

New Game!
----------------
| || || |
----------------
| || || |
----------------
| || || |
----------------
Choose which player goes first - X (You - the petty human) or O(The mighty AI): X
Oye Human, your turn! Choose where to place (1 to 9): 5
----------------
| || || |
----------------
| || X || |
----------------
| || || |
----------------
Possible moves = [1, 2, 3, 4, 6, 7, 8, 9]
Move values = [-0.218789, -0.236198, -0.147603, -0.256198, -0.365461, -0.221161, -0.234462, -0.179749]
AI plays move: 3
----------------
| || || O |
----------------
| || X || |
----------------
| || || |
----------------
Oye Human, your turn! Choose where to place (1 to 9): 1
----------------
| X || || O |
----------------
| || X || |
----------------
| || || |
----------------
Possible moves = [2, 4, 6, 7, 8, 9]
Move values = [-0.633001, -0.625314, -0.10769, -0.543454, -0.265536, 0.034457]
AI plays move: 9
----------------
| X || || O |
----------------
| || X || |
----------------
| || || O |
----------------
Oye Human, your turn! Choose where to place (1 to 9): 6
----------------
| X || || O |
----------------
| || X || X |
----------------
| || || O |
----------------
Possible moves = [2, 4, 7, 8]
Move values = [-0.255945, 0.003558, -0.2704, -0.25632]
AI plays move: 4
----------------
| X || || O |
----------------
| O || X || X |
----------------
| || || O |
----------------
Oye Human, your turn! Choose where to place (1 to 9): 2
----------------
| X || X || O |
----------------
| O || X || X |
----------------
| || || O |
----------------
Possible moves = [7, 8]
Move values = [0.0, 0.03941]
AI plays move: 8
----------------
| X || X || O |
----------------
| O || X || X |
----------------
| || O || O |
----------------
Oye Human, your turn! Choose where to place (1 to 9): 7
----------------
| X || X || O |
----------------
| O || X || X |
----------------
| X || O || O |
----------------
Draw!
Wanna try again BIYTACH?(Y/N) : n
Suit yourself bitch!

It’s Showtime

Now that we have our 2 champions ready, let’s throw them into the virtual Colosseum and let them fight it out while we watch in awe. Since we have only 2 of them right now, it’ll be a simple 1v1 TKO battle. This was the result:

Results(10 games):
Minimax Wins = 0
RL Agent Wins = 10

(I have summoned a monster, haven’t I?)

If you wish to see the whole match, here’s the code: https://github.com/agrawal-rohit/medium-playing-games-with-python/blob/master/Tic%20Tac%20Toe/Showdown_Minimax_vs_RL.py
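For a rough idea of what such a face-off loop could look like, here is a sketch that reuses the hypothetical winner() and best_move() helpers from the Minimax sketch above, together with the learned state_values dictionary; the actual script at the link differs in its details:

def rl_move(board, state_values, mark='X'):
    # Greedy move for the trained RL agent: pick the empty cell whose
    # resulting board state has the highest learned value (epsilon = 0).
    best, best_value = None, float('-inf')
    for i, cell in enumerate(board):
        if cell == ' ':
            board[i] = mark
            value = state_values.get(tuple(board), 0.0)
            board[i] = ' '
            if value > best_value:
                best, best_value = i, value
    return best

def play_match(state_values):
    # One game of RL agent (X) vs Minimax (O); returns 'X', 'O', or 'draw'.
    board = [' '] * 9
    turn = 'X'
    while True:
        if turn == 'X':
            move = rl_move(board, state_values, mark='X')
        else:
            move = best_move(board, ai='O', human='X')
        board[move] = turn
        if winner(board) == turn:
            return turn
        if ' ' not in board:
            return 'draw'
        turn = 'O' if turn == 'X' else 'X'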

These are the only algorithms I have tinkered with so far. I might look into some other interesting algorithms like Genetic Algorithms, but that’s for later. If you found my writing remotely helpful, or even amusing enough to make you chuckle slightly, throw a clap and drop a response below.

Thanks for reading ❤

Written by Rohit Agrawal

Software Engineer @ FordPro Charging. Passionate about Programming, Guitar, and Product development — Work with me: https://rohit.build
