Practical Reinforcement Learning pt. 3

Q-Learning Algorithm

Gene Foxwell · Coinmonks · Sep 28, 2018

Get the Cheese!

This article is a continuation of my previous article on the subject. It is recommended that you read the other articles first if you have not already done so.

Q-Learning

At this point, if you’ve been following along, you should have an intuitive understanding of the main ingredients of Reinforcement Learning. There are still a few ingredients left before we can build our own Reinforcement Learning agent. Our next ingredient is Q-Learning.

Let’s start by taking a look at an example from a previous article:

We shall begin by examining the highlighted square in the above example. In the example, the value listed for going “DOWN” from the highlighted square is 0.8. For the discussion below, we will refer to this value as “Q”. Where does this value come from? Recall the following details:

  • Each non-obstacle, non-person square has a cost of -0.1.
  • We assume that each move we make maximizes the potential reward.
  • If the agent reaches an obstacle or a goal, the episode ends.

What happens when the agent goes down? Well, first it receives a reward of -0.1. Since we are trying to build a formalization here, let’s give that reward a name: “R”. Next, the agent has two choices, “LEFT” and “DOWN”. Going “LEFT” costs -0.1 and gets an additional “reward” of -1.0 for hitting the table. Going “DOWN” costs -0.1 and gets an additional reward of 1.0 for reaching the goal.

It should be clear that the maximum value that can be obtained from the position immediately below the highlighted square is 0.9 (-0.1 + 1.0). Let’s call that value Qmax. Putting it all together we have:

Q = R + Qmax  (Eq. 1)

We are getting closer, but this still isn’t general enough. To fully generalize this equation we need to introduce some new variables:

  • s: This will represent the current square the agent is in (we will refer to this as the agent’s “state” in future examples).
  • a: This will refer to the action taken by the agent — in the case above this action was “DOWN”.
  • s1: This will refer to the state the agent will be in after it takes the action a. In the example this refers to the square beneath the highlighted square.
  • a1: This refers to an action taken when the agent is in state s1.

We can now update our equation:

Q(s,a) = R + gamma * Qmax(s1, a1)   (Eq. 2)

Here gamma is the discount factor, a number between 0 and 1 that controls how heavily future rewards are weighted, and Qmax(s1, a1) refers to the maximum score obtainable from all the actions available from the state s1. So in the example this would refer to the maximum of the score obtained by going LEFT (-1.1) or DOWN (0.9). Plugging in the values from the table gives us:

Q(s, "DOWN") = -0.1 + 0.9 = 0.8

This is consistent with the table, assuming a discount factor (gamma) of 1.
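To make Eq. 2 concrete, here is a minimal Python sketch of that calculation. The variable names are my own and the reward values are simply transcribed from the worked example above; nothing here comes from a library.

# A minimal sketch of Eq. 2 using the numbers from the grid example above.
# All values are taken from the worked example; the names are just placeholders.

gamma = 1.0          # discount factor (the example assumes gamma = 1)
step_reward = -0.1   # R: the cost of making a move

# Scores obtainable from the square below the highlighted one (state s1):
q_s1 = {
    "LEFT": -0.1 + -1.0,  # move cost plus the -1.0 penalty for hitting the table
    "DOWN": -0.1 + 1.0,   # move cost plus the +1.0 reward for reaching the goal
}

q_max_s1 = max(q_s1.values())                        # Qmax(s1, a1) = 0.9
q_highlighted_down = step_reward + gamma * q_max_s1  # Eq. 2

print(q_highlighted_down)  # 0.8, matching the value in the table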

We are nearly to the point where we can formulate the so-called Q-Learning algorithm for RL. There is still one piece missing, so let’s consider another example:

Graph World

In this “Graph World” the agent starts at the blue circle on the left hand side of the world. It can then move to any adjacent circle that is connected to it by a line. There are two end goals in this world — denoted by the two green circles on the right hand side of the environment. The upper green circle provides a smaller reward than the lower green circle. As with our previous world, we will penalize each move the agent makes with a negative reward.
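If you would like to experiment with this kind of environment yourself, a graph world can be encoded with a couple of plain Python dictionaries. The node names, edges, and reward values below are made up for illustration; only the overall shape (a start node, a loop, and two goal circles with different rewards) follows the description above.

# Hypothetical encoding of a small graph world. The node names, edges, and
# reward values are illustrative stand-ins, not read off the figure.

edges = {
    "start": ["a", "b"],
    "a": ["start", "b", "upper_goal"],   # "start", "a", and "b" form a loop
    "b": ["start", "a", "lower_goal"],
    "upper_goal": [],                    # goals are terminal: no outgoing moves
    "lower_goal": [],
}

move_reward = -0.1                       # every move is penalized
goal_reward = {"upper_goal": 0.5,        # the upper goal pays less...
               "lower_goal": 1.0}        # ...than the lower goal

def step(state, action):
    """Move to an adjacent node; return (next_state, reward, done)."""
    assert action in edges[state], "can only move along a connecting line"
    reward = move_reward + goal_reward.get(action, 0.0)
    done = action in goal_reward         # the episode ends at either green circle
    return action, reward, done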

Now, let’s consider what happens if the agent gets stuck in a cycle during its exploration of the environment:

Going in Circles

This creates a problem: each time the agent travels through the loop it collects more and more negative reward, which will have a detrimental effect on the final policy. Consider the following situation:

Let’s say the agent traverses the loop in the environment once before visiting the supposedly higher-value lower green circle. Since this is a goal position, the agent ends its exploration and starts again. The next time, it happens to randomly explore the upper green circle (with its lower value) before actually completing a full loop. In this latter case, the agent will have collected less negative reward for moving, which would* result in the return for reaching the upper goal being greater than the return for reaching the higher-value lower goal!

*: If it didn’t, we could just have the agent keep going around the loop over and over again until it does. Since it’s collecting negative rewards along the way, there must eventually be a point where the return for the upper goal is greater than that for the cyclical path taken to the lower goal.

Such a situation isn’t very intuitive: we don’t want our agent’s policy to be so sensitive to the exact path taken during exploration. We can get around this, however, with a change to our approach.

Up until this point, we’ve been calculating our rewards only after the agent has had a full run through the simulation. This makes working out the return relatively straightforward, but it makes our approach susceptible to the type of problem demonstrated above. Furthermore, if we have to wait until the episode is complete to update the rewards, how would we handle environments that don’t have predefined “end states”?

Instead of waiting until the episode completes, what if we just updated our rewards a little at a time? How could we do this?

Imagine, if you will, that there is some value that represents the “true reward” for an action from any given state. Whenever we take an action, we get a new reward, say +1 in this case. Instead of waiting until the entire episode is complete, we update the score right away. The trick is that instead of completely updating the value to be +1, we simply move the score a little bit towards +1.

This can be seen as climbing a hill. Each time an action is taken, we take a step towards the top of the hill. When we reach the top, we stop climbing and place our flag.

Climbing the Mountain

This can be summarized in an equation:

Q(s,a) = Q(s,a) + step_size * (R + gamma * Qmax(s1, a1) - Q(s,a))   (Eq. 3)

Basically, this equation says that each time we take an action, we update our estimate by adding a little bit (the step size) of the difference between the value we just observed (the reward plus the discounted future value) and the value we expected. We typically refer to this step size as the “learning rate”, as it determines how quickly the system adapts its behavior to changes in the rewards.
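To see what a single one of these incremental updates looks like in code, here is a small Python sketch of Eq. 3. The specific numbers (a reward of +1, a learning rate of 0.1) are made up purely for illustration.

# One incremental update (Eq. 3), with made-up numbers for illustration.

learning_rate = 0.1   # the step size: how far we move toward the new estimate
gamma = 1.0           # discount factor

q_current = 0.0       # Q(s, a): our current estimate for this state/action pair
reward = 1.0          # R: the reward we just observed for taking action a in s
q_max_next = 0.0      # Qmax(s1, a1): the best estimate available from the next state

# Nudge the estimate a little bit toward (reward + discounted future value).
td_error = reward + gamma * q_max_next - q_current
q_current = q_current + learning_rate * td_error

print(q_current)  # 0.1, one small step toward the "true" value of +1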

Putting everything together from the last few articles, we get what is called the Q-Learning algorithm. More formally:

Q-Learning Algorithm

Given an exploration constant e
Given a learning rate l
Given a discount factor gamma
Given an Environment S composed of state, action pairs
Given a set of Terminal states T
Given an initial state s0
Given a reward function R

Let Q : S -> Real be the zero function.

while s0 is not a terminal state:
    Let p <- a random value between 0 and 1
    if p < e:
        Let a <- a random action available from state s0
    else:
        Let a <- the action that maximizes Q(s0, a)

    Let s1 <- the state obtained by performing action a at s0
    Let a1 <- the action that maximizes Q(s1, a1)
    Let r <- the reward obtained by taking action a at s0

    Q(s0, a) = Q(s0, a) + l * [r + gamma * Q(s1, a1) - Q(s0, a)]
    Let s0 <- s1
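To give a feel for where the next article is headed, here is one possible Python rendering of the pseudocode above. The environment interface (a reset() method, and a step(action) method returning the next state, the reward, and a done flag) and the dictionary-based Q-table are assumptions made for this sketch, not part of the algorithm itself.

import random
from collections import defaultdict

# A tabular Q-Learning sketch. The env.reset()/env.step() interface and the
# defaultdict Q-table are assumptions; only the update rule comes from above.

def q_learning(env, actions, episodes=500, epsilon=0.1, learning_rate=0.1, gamma=1.0):
    q = defaultdict(float)                     # Q(s, a), zero by default

    for _ in range(episodes):
        s0 = env.reset()                       # initial state s0
        done = False
        while not done:                        # until s0 is a terminal state
            if random.random() < epsilon:      # explore with probability e
                a = random.choice(actions)
            else:                              # otherwise act greedily
                a = max(actions, key=lambda act: q[(s0, act)])

            s1, r, done = env.step(a)          # next state and reward
            q_max_next = 0.0 if done else max(q[(s1, act)] for act in actions)

            # Eq. 3: nudge Q(s0, a) toward r + gamma * Qmax(s1, a1)
            q[(s0, a)] += learning_rate * (r + gamma * q_max_next - q[(s0, a)])
            s0 = s1

    return q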

Next Time

This article covered what we need to know in order to start coding the Q-Learning algorithm. The next article will take this algorithm and convert it to Python so we can get a feel for how it actually works in practice.

Until then,

Share and Enjoy!

