How 304 boxes never lose at tic-tac-toe, and how it’s used in creating Artificial Intelligence
Machine Learning — the study and implementation of algorithms that make computers learn, is usually seen as an overcomplicated study that is out of reach for most people. After all, how do you write code, so that the code learns stuff you didn’t write? How do you code something to learn and pick up new strategies? Artificial Intelligence is a massive and complicated field in Computer Science, however, the fundamentals ideas can be simplified and understood. So before I answer how machines learn, I will first answer how do 304 boxes master tic-tac-toe.
Tic-tac-toe, even though only having 9 cells, has exactly 255,168 different game variations. However, to simplify and lower this number, we will only need to focus on the different possible positions. This comes to just under 6,000. So doesn’t matter how you play, the position on the board will always be one of 6,000. But if you take into account board rotation and symmetry, you can shrink this number down to as low as 304 (all the patterns below will count as 1 instead of 8).
So now that we have 304 possible positions on tic-tac-toe, what’s next? This is where Matchbox Educable Noughts and Crosses Engine (MENACE) comes in. MENACE is a matchbox computer (yes you read that right), that was made by Donald Michie in the 1960s to try and make a machine that would always win in tic-tac-toe — or at least tie when winning isn’t possible. Since in the 60s sophisticated learning models weren’t available, everything had to be done mechanically. So MENACE was a computer literally made out of matchboxes.
The way it would work is pretty simple. On the side of every box, Donald Michie placed a sticker with one of the 304 combinations — so that every position corresponds to one box. Inside these boxes, he placed different colour beads, with each colour of beads representing a possible move MENACE can make in that specific position. The way MENACE makes a move is by randomly selecting a bead from that box. So during MENACEs turn, it will reach into the box that reflects the current position on the board, and grab a random bead from inside the box. Then, MENACE will make the move that the bead specifies. In the starting position, only a single bead of each colour is present in each box, but depending on the result of the game the number of beads will change. If MENACE wins, MENACE goes over all the positions it played and adds 3 of the beads it used to each box. If it ties, it adds one bead, and if loses — removes 1 (as long as a minimum of 1 bead of each colour stay in the box).
This way, if the green bead helped win with the position on box 7, three more green beads will be added to box 7. This ensures that next time the green bead will be more likely to be picked, resulting in MENACE playing the green move — which proved to be productive in the past. What ends up happening is that the first game MENACE plays it will play randomly, but the more games it plays the more beads of the correct colours are added the better it generally plays. So after setting up MENACE (with 1 bead of each colour in each box) the first game will be completely random (because there’s an equal amount of all beads in all boxes) but the more it plays, the better it will get.
This is simple. So simple, that people have made machines that learn and adapt using just boxes and beads. This can easily be converted to code, where you can store everything on your computer instead of in matchboxes. Better yet, you can also use code to train your machine. Very easily (in 3 lines of code) you can make an algorithm that clicks empty spots randomly, introducing your MENACE to every situation, and training it like that. When I built my code, my trainer would click every 10 milliseconds, training the machine extremely fast. Within minutes the machine can already master tic-tac-toe.
How fast does MENACE learn? if you were to reset all its settings and place it in front of a perfect user, one that never loses, MENACE will start playing equally as perfect after about only 35 games (in the graph below the higher the beads the more sure MENACE is of the move its about to make).
Against a random clicker, it caught on after about 70 games and stopped losing and nearly ever tied (to play around with MENACE click here). So how exactly is MENACE learning and how is it used in modern machine learning?
MENACE is learning to refine its strategy using reinforcement learning, one of the three paradigms in machine learning. MENACE was also used as one of the first examples of how reinforcement learning works. The three main ways machines learn to learn are reinforcement learning, supervised learning, and unsupervised learning. What differs reinforcement learning is that it focuses on finding a balance between exploration and exploitation. Exploration being uncharted or new territory (such as picking a bead randomly and seeing if it’ll win or not) and exploitation is using current knowledge (such as picking a bead with a higher success rate in the past).
Another main part of reinforcement learning is the reward. Good behaviours are treated with a positive rewards, while bad behaviours might be treated with negative ones. So in the beginning when the machine knows very little, or nothing at all, it will start to make random decisions. After each decision (like picking a bead) it will be rewarded. If the decision was correct, it will be rewarded positively (adding a bead), and if it was wrong then a penalty might be introduced (taking a bead away). This way machines learn to change and adapt to the environment as every attempt improves the machine little-by-little.