Write an AI in Ruby that won’t lose in Tic-Tac-Toe

On the 5th of December, OpenAI and DeepMind released their platforms for more dynamic reinforcement learning environments.
http://www.wired.co.uk/article/deepmind-labs-google-code-ai

Machine learning has long been an interest of mine, and this exciting news finally pushed me far enough to take action.

In this article, I explain the path I took to implement a simple reinforcement learning algorithm from scratch, in Ruby.

Machine Learning can be intimidating

Nonetheless, it’s a never-ending trend these days, and you read exciting news about its applications every day.

I’ve been one of those people who are excited about the field but haven’t dug deep enough to make the topic personal, until, on the 5th of December, the news finally struck me hard enough to get my ass moving.

What’s the news?

So, these platforms are for people who’ve already started implementing their own AI strategies. It’s not that the platform miraculously provides you with a “template AI” that you can watch and tweak.

After following their “getting started” README and finding this out myself, I started digging into other resources.

Out of more than ten introductory texts I tried, I found the following material most useful if you’re just entering the field.

The textbook suits an excited web developer like me well; it does not prioritize rigorous mathematical proofs over the bigger picture.

Implementing is always better than just reading

So, here’s a brief explanation of what the example is trying to teach you.

First of all, the problem domain

So in this example, the author emphasizes this point by stating that the agent is playing against “imperfect” players: players who sometimes make mistakes that let the opponent win (FYI: Tic-Tac-Toe, when played perfectly by both sides, always ends in a draw).

You can imagine that Tic-Tac-Toe is being played against a nine-year-old. He’s grown up enough to make the right move “most of the time”, but he still leaves himself vulnerable in ways the other player can take advantage of to win.
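To make the idea concrete, here is a minimal sketch (my own illustration, not code from the book or my repo) of such an imperfect opponent: most of the time it plays a reasonable heuristic move, but with some probability it slips and plays a random legal move instead.

    # A sketch of an "imperfect" opponent. The board is an array of nine
    # cells holding "x", "o", or nil. With probability `mistake_rate` the
    # opponent plays a random legal move instead of its heuristic one,
    # which is exactly the opening a learning agent can exploit.
    class ImperfectOpponent
      LINES = [[0, 1, 2], [3, 4, 5], [6, 7, 8],
               [0, 3, 6], [1, 4, 7], [2, 5, 8],
               [0, 4, 8], [2, 4, 6]].freeze

      def initialize(symbol, mistake_rate: 0.1)
        @symbol = symbol
        @mistake_rate = mistake_rate
      end

      def choose_move(board)
        empty = board.each_index.select { |i| board[i].nil? }
        return empty.sample if rand < @mistake_rate # the occasional slip

        winning = empty.find { |i| wins?(board, i, @symbol) }   # take a win if available
        return winning if winning

        opponent = @symbol == "x" ? "o" : "x"
        blocking = empty.find { |i| wins?(board, i, opponent) } # otherwise block a loss
        blocking || empty.sample                                # otherwise play anywhere
      end

      private

      def wins?(board, move, symbol)
        trial = board.dup
        trial[move] = symbol
        LINES.any? { |line| line.all? { |i| trial[i] == symbol } }
      end
    end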

What’s the algorithm?

There’s a situation that an agent senses, there’s an action that the agent takes upon sensing the situation, and there’s a goal against which the agent evaluates its actions. The book explains that these are the minimal set of elements when implementing RL (Reinforcement Learning).

The goal is translated into a “reward” in the middle of the process, in such a way that meeting the goal is the ultimate reward for the agent.

This leads us to think that each state has a certain “value”, and those values altogether form a “value function”: a mapping from states to values that lets the agent estimate the reward it can expect from a certain point.

An agent interacts with its environment and constructs its value function

This explanation can be translated into certain algorithms a computer can perform.
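For instance, stripped of all game-specific detail, that sense-act-evaluate cycle might look like the following toy sketch (the names ToyEnvironment and ToyAgent are my own placeholders, not the book’s notation or my repo’s API).

    # A toy sketch of the loop: the agent senses a state, picks an action,
    # receives a reward from the environment, and learns from it.
    class ToyEnvironment
      def initialize
        @step = 0
      end

      def state
        @step
      end

      def done?
        @step >= 10
      end

      # Applying an action moves the environment forward and yields a reward.
      def apply(action)
        @step += 1
        action == :good ? 1.0 : 0.0
      end
    end

    class ToyAgent
      def act(state)
        %i[good bad].sample # sense the situation, choose an action
      end

      def learn(state, action, reward)
        # a real agent would nudge its value estimates toward the reward here
      end
    end

    env = ToyEnvironment.new
    agent = ToyAgent.new
    until env.done?
      state  = env.state
      action = agent.act(state)
      reward = env.apply(action)
      agent.learn(state, action, reward)
    end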

The case of Tic-Tac-Toe

As soon as it understands the rules of the game, the agent can map the two extreme cases, winning and losing, onto concrete situations the game could fall into. In the game of Tic-Tac-Toe, winning means having your playing symbol (x or o) aligned in a row (horizontally, vertically, or diagonally).

From there, every other situation is unknown to an infant AI. So let’s assume every state other than those two cases has an equal chance of either winning or losing: in the value vector we just set up, that’s 0.5.
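As a rough illustration (simplified, not the exact code from my repo), the value function can start out as a plain Hash from board states to numbers, where winning states are pinned to 1.0, losing states to 0.0, and everything else defaults to 0.5.

    # A sketch of the initial value function. A board is an array of nine
    # cells holding "x", "o", or nil, and the value of a state is the
    # agent's estimated chance of winning from it.
    WINNING_LINES = [[0, 1, 2], [3, 4, 5], [6, 7, 8],
                     [0, 3, 6], [1, 4, 7], [2, 5, 8],
                     [0, 4, 8], [2, 4, 6]].freeze

    def aligned?(board, symbol)
      WINNING_LINES.any? { |line| line.all? { |i| board[i] == symbol } }
    end

    def initial_value(board, my_symbol)
      their_symbol = my_symbol == "x" ? "o" : "x"
      return 1.0 if aligned?(board, my_symbol)    # three in a row: the winning case
      return 0.0 if aligned?(board, their_symbol) # the opponent aligned: the losing case
      0.5                                         # everything else: unknown, call it even
    end

    # The value function itself: a plain Hash from states to values,
    # filled in lazily with the defaults above as new states show up.
    def value_of(values, board, my_symbol = "x")
      values[board] ||= initial_value(board, my_symbol)
    end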

Learning happens during the game

So in simple reinforcement learning, this is the meat of the learning algorithm:
V(N-1) = V(N-1) + A * (V(N) - V(N-1))
where N is the index of the step the agent has just taken (so V(N-1) is the value of the previous state and V(N) the value of the current one), and A is a parameter the AI’s trainer sets to adjust how strongly learning should occur (often called the learning rate).
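In Ruby, that update is essentially a one-liner inside a loop. Here is a sketch under the same assumptions as above: values is a Hash of state => estimated value, visited_states is the sequence of states the agent passed through, and alpha plays the role of A.

    # A sketch of the update: walk back through the states the agent
    # visited and nudge each one's value toward the value of the state
    # that followed it, scaled by the learning rate.
    def back_up(values, visited_states, alpha: 0.1)
      (visited_states.size - 1).downto(1) do |n|
        previous = visited_states[n - 1]
        current  = visited_states[n]
        values[previous] += alpha * (values[current] - values[previous])
      end
    end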

Let’s implement it

I’ve published my first version of working code, though I must admit my code does not meet “production-ready” quality.

A few caveats:

  1. In my implementation, the program assumes an AI plays against a human player sitting in front of the computer; it accepts the human’s input from STDIN.
  2. Once the game reaches its ending state, the agent persists its learned values to a file. I’ve taken the easy path of dumping the values as binary data, but I would serialize them in another (safer, as well as human-readable) format when refactoring; see the sketch after this list.
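The persistence caveat boils down to something like the following (file names and method names are illustrative, not the ones in the repo): a binary Marshal dump today, with YAML as the more readable alternative after refactoring.

    # Marshal for the current binary dump, YAML as the human-readable option.
    require "yaml"

    def save_values(values, path = "values.dump")
      File.binwrite(path, Marshal.dump(values))
    end

    def load_values(path = "values.dump")
      File.exist?(path) ? Marshal.load(File.binread(path)) : {}
    end

    def save_values_yaml(values, path = "values.yml")
      File.write(path, values.to_yaml)
    end

    def load_values_yaml(path = "values.yml")
      File.exist?(path) ? YAML.safe_load(File.read(path)) : {}
    end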

Conclusion

For a game as simple as Tic-Tac-Toe, an AI can learn not to lose after only a few games of play.

Applications of RL extend vastly further when we combine this algorithm with generalization techniques (such as neural networks). An AI can then handle environments with so many states that it’s impractical to enumerate them naively (Atari games, AlphaGo: all the recent improvements came from combining these two).

I hope this provides yet another path for people to try out this fascinating field!
