Part 2 — The Min Max Algorithm

Carsten Friedrich
6 min read · Jun 6, 2018


In this Notebook, we will introduce and then use the Min-Max algorithm to create a computer player which will be able to play Tic Tac Toe.

In fact, it will be able to play the game perfectly. That is, the player will always play the best possible move in a given situation. This player will give us a good benchmark to compare the other players against.

The Min-Max algorithm

So, what is this Min-Max algorithm that we want to implement?

The long answer can be found here. We won't go into that much detail and will instead just look at the general idea:

Given a board state, we find the best move by simulating all possible continuations from this position and choosing the one that is best for us. The best move is the one with the best outcome if we assume:

  • we always make the move that is best for us (Maximizes the game value for us) and
  • our opponent always makes the move that is best for them (and thus worst for us — Minimizing the game value for us).

You can see where the algorithm gets its name from. This algorithm, and variations of it, is very popular for writing game engines that play games like Tic Tac Toe, Chess, Checkers, etc.
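In code, the idea maps onto a short recursive function. The sketch below is an illustration only: it assumes a hypothetical Board interface with is_over(), result_value(), legal_moves(), and play(move) (returning a new board), which is not necessarily the interface used in this project's code.

    def minimax(board, maximizing):
        """Value of `board` for the maximizing player, assuming both
        sides play perfectly from here on."""
        if board.is_over():
            # 1 for a win, -1 for a loss, 0 for a draw (maximizing player's view)
            return board.result_value()

        # Value of every possible continuation, with the other side to move next.
        values = [minimax(board.play(move), not maximizing)
                  for move in board.legal_moves()]

        # Our move: pick the best value. Opponent's move: assume the worst for us.
        return max(values) if maximizing else min(values)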

Side Note: For games with a large number of different board positions, like Chess, the algorithm will generally not be able to completely simulate all continuations. In those cases the algorithm will only look ahead a certain number of moves and then estimate the value of the resulting position. Also, more advanced versions of this algorithm, e.g. alpha-beta pruning, are used in those cases.
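Tic Tac Toe is small enough that we never need this, but for completeness, a depth-limited alpha-beta variant could look roughly like the following sketch; the evaluate() heuristic and the Board interface are again assumptions for illustration, not the project's actual code.

    def alphabeta(board, depth, alpha, beta, maximizing):
        """Minimax with alpha-beta pruning, cut off after `depth` plies."""
        if board.is_over() or depth == 0:
            return board.evaluate()  # heuristic estimate of the position (assumed method)

        if maximizing:
            value = float('-inf')
            for move in board.legal_moves():
                value = max(value, alphabeta(board.play(move), depth - 1, alpha, beta, False))
                alpha = max(alpha, value)
                if alpha >= beta:  # the minimizing player would never allow this line
                    break
            return value
        else:
            value = float('inf')
            for move in board.legal_moves():
                value = min(value, alphabeta(board.play(move), depth - 1, alpha, beta, True))
                beta = min(beta, value)
                if beta <= alpha:
                    break
            return value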

Let’s look at an example. Given the following board state and NAUGHT to move next:

The following continuations are possible:

That is: first NAUGHT, the maximizing player, gets to move, then CROSS, the minimizing player, gets to move and in those cases where the game has not ended at that point, NAUGHT, the maximizing player, gets one more move:

We label all final game states according to their value from the point of view of NAUGHT (a small scoring helper is sketched after the list):

  • 1 for a win
  • -1 for a loss
  • 0 for a draw
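In code, this labelling is just a tiny mapping from game result to score. The GameResult enum below is an assumed stand-in for illustration; the repository's own classes may name things differently.

    from enum import Enum

    class GameResult(Enum):
        # Assumed names for illustration; the project's actual enum may differ.
        NAUGHT_WIN = 1
        CROSS_WIN = 2
        DRAW = 3

    def final_value(result):
        """Score a finished game from NAUGHT's point of view."""
        if result == GameResult.NAUGHT_WIN:
            return 1
        if result == GameResult.CROSS_WIN:
            return -1
        return 0  # draw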

Now we can back-propagate the scores from the bottom layer to the layer above. According to the algorithm, as it is the maximizing player’s turn, we choose the move with the highest score.

Note that in this initial case, as there is only one possible move and the move thus is forced, we just propagate that value one layer up without having to choose a maximizing move:

Now we propagate up again. This time it is the minimizing player’s turn, so we propagate the smaller values for each possible move up:

Finally, we propagate one more layer up. This time it’s the maximizing player again, so we choose the highest possible value of all moves for the position:

Now, we know everything we need to know to make a move:

  • The best we can hope for, if both we and our opponent always play their best moves, is a draw (since the value of the current board position is 0).
  • We also know that there is only 1 move in the current situation that will achieve this best case for us: Putting a NAUGHT in the middle spot on the top row.

Note that there are other potential continuations that would also lead to a draw, and some that might even lead to NAUGHT winning. Unfortunately, however, we also know now that if CROSS always plays their best move, we won’t ever have a chance to reach any of these outcomes.

The Min-Max Player Code

Our code contains 2 player classes which implement the Min Max algorithm for playing Tic Tac Toe:

  • MinMaxAgent.py: Plays Tic Tac Toe using the Min Max Algorithm in a deterministic way. I.e. if there is more than one move with the same best score in a given position, this player will always choose the same move.
  • RndMinMaxAgent.py: Plays Tic Tac Toe using the Min Max Algorithm in a non-deterministic way. I.e. if there is more than one move with the same best score in a given position, this player will randomly choose between them each time (the difference in tie-breaking is sketched below).
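The difference between the two agents comes down to how they break ties between equally good moves. A minimal sketch, assuming we already have a list of (value, move) pairs for the legal moves in the current position:

    import random

    def pick_move_deterministic(scored_moves):
        """MinMaxAgent-style: always return the first move with the best score."""
        best = max(value for value, _ in scored_moves)
        return next(move for value, move in scored_moves if value == best)

    def pick_move_random(scored_moves):
        """RndMinMaxAgent-style: return a random move among those with the best score."""
        best = max(value for value, _ in scored_moves)
        return random.choice([move for value, move in scored_moves if value == best])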

In order to make things a bit more efficient, the players will also store the value of a given board position in an internal cache. This means that they only have to compute the possible continuations from each position once. It even makes the first computation more efficient, as different move orders often produce the same board position, which, thanks to the cached result, we only have to evaluate once.
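A minimal way to add such a cache to the earlier minimax sketch: key a dictionary on a hashable encoding of the position (plus whose turn it is) and only recurse on a cache miss. The hash_value() method is an assumed name for whatever encoding the board provides.

    cache = {}

    def minimax_cached(board, maximizing):
        """Minimax with a position cache, so each distinct board state
        is evaluated at most once across all games."""
        key = (board.hash_value(), maximizing)  # assumed hashable board encoding
        if key in cache:
            return cache[key]

        if board.is_over():
            value = board.result_value()
        else:
            values = [minimax_cached(board.play(move), not maximizing)
                      for move in board.legal_moves()]
            value = max(values) if maximizing else min(values)

        cache[key] = value
        return value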

Side Note: Even on a moderately fast computer this works quite well due to the small number of possible board positions in Tic Tac Toe:

A game can have something like 9! = 362,880 different possible move sequences: 9 choices for the first move, 8 for the second, 7 for the third, and so on down to 1 choice for the last move (for simplicity ignoring the fact that the game can be over before all squares are occupied, which reduces the actual number of sequences). The game can, however, only have 3^9 = 19,683 different states, as each square can only be empty, have a NAUGHT, or have a CROSS in it (again for simplicity ignoring that these include states that are impossible in a real game, and that we could reduce the number further by treating symmetric positions as the same).
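A quick sanity check of those two numbers:

    import math

    print(math.factorial(9))  # 362880 possible move orders (ignoring games that end early)
    print(3 ** 9)             # 19683 possible board states (empty / NAUGHT / CROSS per square)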

Let’s see how they perform.

First we define a small utility function which we call battle to repeatedly pit 2 players against each other for a number of games. After the "battle" has finished, the function will return statistics about who won how often.
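The actual battle function lives in the accompanying notebook; a minimal sketch of such a helper, assuming a hypothetical play_game(player1, player2) function that returns 1 for a player-one win, 2 for a player-two win, and 0 for a draw, could look like this:

    def battle(player1, player2, num_games=100):
        """Pit two players against each other and collect win/draw statistics."""
        p1_wins = p2_wins = draws = 0
        for _ in range(num_games):
            result = play_game(player1, player2)  # assumed helper, see lead-in
            if result == 1:
                p1_wins += 1
            elif result == 2:
                p2_wins += 1
            else:
                draws += 1
        return p1_wins, p2_wins, draws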

As expected, and to my relief, the Min Max player never loses.

Baseline Part 2

This gives us an additional baseline:

Matchup (P1 vs P2)  | P1 Win | P2 Win | Draw
--------------------|--------|--------|------
Min Max — Random    | 99.5%  |  0%    |  0.5%
Random — Min Max    |  0%    | 80%    | 20%
Min Max — Min Max   |  0%    |  0%    | 100%

This means, in order to be considered as playing better than pure random, a player should achieve significantly more than 20% draws against a Min-Max player when going first and significantly more than 1% draws when going second.

In the next part we will look at our first player using reinforcement learning: The Tabular Q-Learner.
