# Get a taste of reinforcement learning — implement a tic tac toe agent

I got really interested into reinforcement learning after seeing the Berkeley robotic arm demo, as this could be one of the ways AI interacts with the real world. I dreamed of making robots doing kitchen chores, such as peeling potatoes and this method looks like a potential solution.

The idea behind this is called deep reinforcement learning. Similar idea has been applied to computer AI that plays video games automatically. The most famous paper in this area is this:

**[1312.5602] Playing Atari with Deep Reinforcement Learning**

*Abstract: We present the first deep learning model to successfully learn control policies directly from high…*arxiv.org

Here is a good introductory material about the reinforcement learning method used in the above paper:

**Guest Post (Part I): Demystifying Deep Reinforcement Learning**

*Two years ago, a small company in London called DeepMind uploaded their pioneering paper "Playing Atari with Deep…*www.nervanasys.com

I think the basic idea of reinforcement learning can be summarized as the follows:

To accomplish a goal, one often needs to perform several steps. Each step will receive an immediate feedback (a score for example) measuring how much this step helps achieve the final goal. In addition, this step also changes the state of the system.

To make a smart move, however, one needs to consider not only the immediate reward, but also future probability of winning.

While it is easy to calculate the immediate gain, it’s difficult to know the future probability of winning. This is where machine learning comes in. This future probability of winning is predicted by a quality function (the q function). The traditional way of getting the q function is an iterative process, where you start with a random q function, and gradually it converges to a good enough quality function.

More formally, suppose at anytime, the status of the system, be it a game or a robot, can be represented by a state S. For example, if the system you are looking at is a Super Mario game, the state S can be the lives you have, the scores, the speed and coordinates of Mario and those enemies. If the system is a robotic arm, then the state S might contain the angle and force of the joints.

Now suppose we have a function R(S), which means the total reward from the state S till the end of the game. R(S) can be represented by the following

R(S) = r + r(S+1)+r(S+2)+r(S+3)…

Where S means the current game state. The reward equals to an immediate reward r plus the reward of the next state S+1, and all the following rewards of state S+2, S+3 …

Now to make a good decision, your action at the state S should maximize the R(S). This decision takes not only the short term reward r into consideration but also the long term rewards, so that your move is not shortsighted.

Ideally, you should look at the scores of following states all the way to the end of the game. But in practice, because the system is very dynamic and noisy, looking too far ahead is unnecessary. Therefore a decay factor d should be added:

R(S) = r + d*r(S+1)+(d^2)*r(S+2)+(d^3)*r(S+3)…

This decay factor means that the further a state is away from the current state, the less it has impact on the current decision making.

Now suppose we call the maximized R(S) as Q(S), with the above decay factor, the equation can be simplified to:

Q(S) = r+ d Q(S+1)

This function equation is called the Bellman equation. The way to solve this equation is often an iterative process, where a random Q function, which is defined as a look-up table, is used at first. The function will converge if you repeatedly apply the above equation on different states and actions. There are not many materials about why it converges. So I don’t understand it at the moment. But my gut feeling is that it has something to do with back induction, where the steps that are close to the end of a game is corrected first, and the model gradually propagates to the beginning steps.

To be more clear, suppose you want to train an AI to play chess. It is very difficult to figure out which first move can improve your chance of winning. Instead, one should learn the strategy in a backward manner by looking at the last step first, for example, at a check situation. When one side is in check, one has to seek defense otherwise he immediately loses the game. Since checking gives you less options, you can easily make good decisions. Once these decision strategies are learnt, the AI will look at the second last steps to maybe avoid checking or conduct checking. And similarly it will learn the third last steps, and the forth last steps, till the beginning position. Numerically, you will see the Q function converges as it learns the strategies.

This is called reinforcement learning. Deep reinforcement learning is similar but uses a neural network to represent the Q function instead of a look-up table. The benefit of deep reinforcement learning is that it can handle high dimension data, i.e. when you have too many properties for each state and the actions you can perform are also numerous.

To prevent myself from being discouraged, I decided to first apply reinforcement learning to the simplest problem I can think of — A tic tac toe game. The code I’m using is a simple neural network implementation I made a while ago. I’m using Sigmoid activation with cross entropy cost function, which may not be ideal for the problem. But that is the only code I have.

The network is simple as hell with only one hidden layer. The input layer is 9 float numbers. Each number represents the occupation of a cell of the board. 1 represents a cross, -1 means a circle and 0 means empty. I don’t explicitly tell the neural network which side is the current side.

The hidden layer is also of size 9. And the output layer is a single number.

To train, I apply the following logic:

- I want to tell my AI very little information about the game. In fact, I just identify the winning and losing situation for each side.
- If the situation is winning for the cross, I assign a reward score 1 to the position. And if the situation is winning for the circle, I assign a reward score 0.
- I start with the empty board. Recursively I put pieces on every possible location on the board starting with the cross.
- For each position, if it is a final step (one side win, or a position without an empty location), I train the network with an immediate reward (1 or 0 depends on the winning side, or 0.5 if it is a tie). Otherwise, I calculate a reward with the current neural network. If the side is cross, I look for the highest reward possible. If the current side is circle, I look for the lowest reward.
- Basically the input of my network is a position, the output of the network is a number that indicates which side this position favors. The larger the number is, the more the position will favor the cross.

recursiveTrain(with an initial position, and the current side as inputs)

{

for each empty location of the initial position

{

put a piece of the current side at this location (and call the resulting position A)

if after putting this piece, the cross side wins

{

reward = 1;

}

else if the winning side is the circle

{

reward = 0;

}

else

{

for each empty location of the position A

{

put a piece of the opposite side at the location

calculate the reward of doing this action by running this position through the neural network.

if the current side is the cross, look for the maximum reward returned by the network.

else if the current side is the circle, look for the minimum reward.

}

}

train the neural network with (Position A and reward)

call recursiveTrain with position A as the initial position, and the opposite side

}

}

Here is the result I got so far. The cross is myself and the circle is the AI. Notice that it knows to occupy the center location first. That is exactly what a human player would do. I tried to put the initial cross at all places instead of the center location, and the computer’s decision is pretty consistent. It chooses the center location whenever possible.

And it knows to block me when I form two crosses in a row (check).

But it doesn’t seem to work for this try:

Instead of blocking me as the second step, it puts the circle at the top left corner. I found it difficult to teach defense. I need to see if the network can converge further or I can improve it by giving better reward. For example, giving reward when one side is threatening the other side. But my initial hope was that the AI could learn to know a situation is threatening and defense itself.