Write an AI in Ruby that won’t lose in Tic-Tac-Toe

Miyama Yuta
5 min read · Dec 11, 2016


On the 5th of December, OpenAI and DeepMind released their platforms for more dynamic reinforcement learning environments.
http://www.wired.co.uk/article/deepmind-labs-google-code-ai

Machine learning has long been an interest of mine, and this exciting news finally pushed me far enough to take action.

In this article, I explain the path I took to implement a simple reinforcement learning algorithm from scratch, in Ruby.

Machine Learning can be intimidating

Machine learning can be intimidating for developers because of its wide range of prerequisites.

Nonetheless, it’s a trend that shows no sign of ending, and you read exciting news about its applications every day.

I’ve been one of those guys who is excited about the field but has never dug deep enough to make the topic his own, until the news on the 5th of December finally got me moving.

What’s the news?

To be clear, OpenAI Universe and DeepMind Lab both try to provide a unified interface for AI researchers. No matter what kind of environment it is (often it takes the form of a game), researchers in the wild can now focus on their experiments rather than on setting up and “parsing” an environment.

So these platforms are for people who have already started implementing their AI strategies. It’s not that the platforms miraculously provide a “template AI” that you can sit back, watch, and tweak.

After following their “getting started” READMEs and finding this out for myself, I started looking into other resources.

Out of the more than ten introductory texts I tried, I found the following material the most useful for someone just entering the field.

The textbook suits an excited web developer like me well; the authors do not prioritize rigorous mathematical proofs over the bigger picture.

Implementing is always better than just reading

I believe the experience of implementing something sticks in your memory far longer than just reading about it, so the book’s early tic-tac-toe example was the perfect thing for me to work on.
http://webdocs.cs.ualberta.ca/~sutton/book/ebook/node10.html

So, here’s a brief explanation of what the example is trying to teach you.

First of all, the problem domain

A reinforcement learning algorithm solves the type of problem for which you don’t know the perfect answer (in other words, you can never be sure whether a perfect, fixed answer for the environment even exists).
Reinforcement learning is closely related to adaptive learning; the agent changes its behavior depending on the context it finds itself in at any given moment.

So in this example, the authors emphasize this point by stating that the agent is playing against “imperfect” players: players who sometimes make mistakes that let their opponent win (FYI: tic-tac-toe, when played right by both sides, always ends in a draw).

You can imagine tic-tac-toe being played against a nine-year-old. He’s grown up enough to make the right move most of the time, but he still leaves himself vulnerable in ways the other player can exploit to win.
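
To make “imperfect” concrete, here is a minimal sketch of such an opponent in Ruby. It is my own illustration (the class name, the `mistake_rate` parameter, and the placeholder `best_move` are assumptions, not part of the book or my published code):

```ruby
# A sketch of an "imperfect" opponent: it usually plays the move it thinks
# is best, but with probability `mistake_rate` it picks a cell at random.
class ImperfectOpponent
  def initialize(mistake_rate: 0.2)
    @mistake_rate = mistake_rate
  end

  # `available_cells` is the list of empty cell indices on the board.
  def move(available_cells)
    if rand < @mistake_rate
      available_cells.sample     # the occasional blunder a nine-year-old makes
    else
      best_move(available_cells) # "playing right", most of the time
    end
  end

  private

  # Placeholder so the sketch runs: a real opponent would evaluate the board;
  # here we simply take the first free cell.
  def best_move(available_cells)
    available_cells.first
  end
end
```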

What’s the algorithm?

Before implementing it, let’s make clear what the basic elements of this algorithm are.

There’s a situation that the agent senses, there’s an action that the agent takes upon sensing that situation, and there’s a goal against which the agent evaluates its actions. The book explains that these are the minimal set of elements needed when implementing RL (Reinforcement Learning).

Along the way, the goal is translated into a “reward”, such that meeting the goal is the ultimate reward for the agent.

This leads us to think that each state carries a certain “value”, and that these values together form a “value function”: a mapping from states to values that lets the agent estimate the reward it can expect from any given point.

An agent interacts with its environment and constructs its value function

This explanation can be translated into certain algorithms a computer can perform.

The case of TicTacToe

The goal is to win, and the complete opposite of the goal is to lose. So we lay this reward out along a single dimension (just a line):
1 is the maximum value (the ultimate goal, the winning condition of the game), and 0 is the minimum value (the losing condition of the game).

As soon as the agent understands the rules of the game, it can map these two extreme values onto the situations the game could fall into. In tic-tac-toe, winning means having your playing symbol (x or o) aligned in a row (horizontally, vertically, or diagonally).
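
For reference, that winning condition is easy to express in Ruby. The following is a small sketch of my own (cells indexed 0 through 8; the `winner` helper is not taken from my published code):

```ruby
# Winning condition on a 3x3 board whose cells are indexed 0..8.
# `board` is an array of nine entries holding :x, :o, or nil.
WIN_LINES = [
  [0, 1, 2], [3, 4, 5], [6, 7, 8], # rows
  [0, 3, 6], [1, 4, 7], [2, 5, 8], # columns
  [0, 4, 8], [2, 4, 6]             # diagonals
].freeze

def winner(board)
  WIN_LINES.each do |line|
    symbols = line.map { |i| board[i] }
    return symbols.first if symbols.first && symbols.uniq.size == 1
  end
  nil # nobody has three in a row (the game is either unfinished or a draw)
end

# winner([:x, :x, :x, nil, :o, nil, :o, nil, nil]) # => :x
```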

From there, every other situation is unknown to an infant AI. So let’s assume everything other than those two extreme cases has an equal chance of either winning or losing: on the scale we just set up, that’s 0.5.
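
One simple way to hold these values in Ruby is a Hash keyed by board state, with every unseen state defaulting to 0.5 and terminal states pinned to 1 or 0. This is an illustrative sketch (it reuses the `winner` helper sketched above and keys states by the raw board array, which my published code may not do):

```ruby
# Value function sketch: board states map to an estimated chance of winning.
# Unknown states default to 0.5; winning states are worth 1.0, losing states 0.0.
values = Hash.new(0.5)

# Terminal value of a board from the agent's point of view, or nil if nobody
# has three in a row yet (a drawn, full board is left out of this sketch).
def terminal_value(board, my_symbol)
  case winner(board)
  when my_symbol then 1.0 # the ultimate goal: we won
  when nil       then nil # no three-in-a-row on the board
  else                0.0 # the opponent won
  end
end

def value_of(values, board, my_symbol)
  terminal_value(board, my_symbol) || values[board]
end

empty_board = Array.new(9)
value_of(values, empty_board, :x) # => 0.5, nothing is known yet
```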

Learning happens during the game

Finally, the AI must learn how to adjust the values of those intermediate situations. Every step leads to the next in sequence, so one can intuitively conclude that the action at step N-1 can be evaluated based on what happened at step N; you can only evaluate your action after you observe what it led to.

So in a simple reinforcement learning, this is the meat of the learning algorithm:
V(N-1) = V(N-1) + A * (V(N) - V(N-1))
where N is the sequence number of the steps the agent has taken, and A is a trainer-chosen parameter that controls how strongly each observation adjusts the learned values.
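
In Ruby this update fits in a couple of lines. A sketch, assuming the `values` Hash from the previous section (the constant name `ALPHA` and the method name are mine):

```ruby
# Move the value of the previous state a fraction ALPHA of the way toward
# the value of the state that followed it (the "A" in the formula above).
ALPHA = 0.1

def update_value(values, previous_state, next_state)
  values[previous_state] += ALPHA * (values[next_state] - values[previous_state])
end
```

Applied after every move (or backwards over all the states visited once a game ends), this is enough for the earlier moves to gradually inherit the outcome of the later ones.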

Let’s implement it

With this algorithm in hand, the rest is a developer’s usual task: how to architect clean code.

I’ve published my first working version of the code, though I must admit it does not meet “production-ready” quality.

A few caveats:

  1. In my implementation, the program assumes the AI plays against a human sitting in front of the computer; it accepts the human player’s moves from STDIN.
  2. Once the game reaches an ending state, the agent persists its learned values to a file (see the sketch after this list). I’ve taken the easy path of dumping the values as binary data, but I would serialize them in another (safer, as well as human-readable) format when refactoring this.
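
As a sketch of that easy path, assuming Ruby’s built-in Marshal is what does the binary dumping (the file name and method names below are hypothetical, not necessarily those in my code):

```ruby
# Persist the learned values between games with a plain binary dump.
VALUES_FILE = "values.dump".freeze

def save_values(values)
  File.binwrite(VALUES_FILE, Marshal.dump(values))
end

def load_values
  return Hash.new(0.5) unless File.exist?(VALUES_FILE)
  Marshal.load(File.binread(VALUES_FILE))
end
```

A safer, human-readable refactor could swap Marshal for YAML or JSON serialization of the same Hash.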

Conclusion

In this article, I walked through one of the simplest forms of RL, implemented from scratch in Ruby.

For a game as simple as tic-tac-toe, the AI can learn not to lose after as little as a few games of play.

The applications of RL expand vastly when we combine this kind of algorithm with generalization techniques (such as neural networks). An AI can then handle environments with so many states that computing them naively is impractical (Atari games, AlphaGo: all of the recent breakthroughs came from the combination of these two).

I hope this provides yet another path for people to try out this fascinating field!
