Practical Reinforcement Learning pt. 1

Gene Foxwell
Coinmonks
Sep 13, 2018


RL in Action

Introduction

For the next few articles I want to start exploring how RL might be applied to home robotics applications like the MARVIN bot featured in this blog. I am going to start with what I hope is an intuitive introduction to the ideas behind Reinforcement Learning and then work my way towards an approach that I feel is compatible with a practical application. Once that’s been found, I’ll implement this scheme on MARVIN itself to see how well it performs on an actual robot.

Let’s get started!

Reinforcement Learning

Before diving into the approach I wish to explore, it makes sense to provide a brief introduction to Reinforcement Learning (RL) for those who may not already be familiar with it.

Put in the broadest terms I am able, RL involves training an agent to learn to perform an action based on the rewards (or punishments) it receives from its environment. Training a dog is the best example I have seen for understanding the practical implementation of this task. When a dog is being trained (assuming one is using modern positive training techniques as opposed to force-based training) it is rewarded for each correct action with a treat of some form. For example, you ask the dog to sit, it sits, and you provide it with an immediate reward. If it does not sit, it does not get a reward.

RL works on a similar premise — a reward is provided for each correct action the agent makes based on its environment. (We can also provide negative rewards if we so choose.) The agent starts with no knowledge of which actions will lead to the best reward and is tasked with learning to maximize the amount of reward that it receives.

A concrete example can be seen in the image above, taken from Udacity’s Deep RL project. The goal here was to get the arm’s gripper to touch the green cylinder. Rewards were provided for touching the cylinder and for moving the gripper steadily towards it. The agent controlled the velocities of the arm’s joints while receiving input from a camera pointed at the scene.

Let’s take a quick look at how RL works under the hood. It’s simpler to think about the discrete case first. In the discrete case, the agent goes through the following cycle (a minimal code sketch of this loop follows the figure below):

  1. It senses or receives some kind of input from its environment.
  2. It performs an action in the environment.
  3. It receives a reward (possibly zero) from the environment.
Sense, Action, Reward
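To make the cycle concrete, here is a minimal Python sketch of that sense, act, reward loop. The `env` and `agent` objects and their methods are hypothetical placeholders for illustration, not MARVIN’s actual interfaces:

```python
# A minimal sketch of the sense -> act -> reward cycle.
# `env` and `agent` are hypothetical placeholders, not a real robot API.

def run_episode(env, agent, max_steps=100):
    history = []                      # list of (state, action, reward) triplets
    state = env.reset()               # 1. sense: observe the starting state
    for _ in range(max_steps):
        action = agent.choose(state)                  # 2. act in the environment
        next_state, reward, done = env.step(action)   # 3. receive a (possibly zero) reward
        history.append((state, action, reward))
        state = next_state
        if done:
            break
    return history
```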

For the purposes of this discussion, I’ll refer to the data that the agent collects during the “sense” phase as the agent’s state. This information includes any data that the agent has access to about both its own configuration and the configuration of the environment.

Looking at the problem from a discrete standpoint, the entire history of the agent’s experience in the environment can be thought of as a sequence of (state, action, reward) triplets. If we want to know the total amount of reward an agent has gotten so far, we simply sum all of the rewards in its history.
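For example, given a short, made-up history of triplets (the state names and rewards below are purely illustrative), the total return is a one-liner in Python:

```python
# A made-up history of (state, action, reward) triplets,
# e.g. as returned by the run_episode sketch above.
history = [("s0", "UP", 15), ("s1", "LEFT", 6), ("s2", "RIGHT", 0)]

total_return = sum(reward for _, _, reward in history)
print(total_return)  # 21
```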

A Simple Maze

How can we utilize this fact? Let’s look at a very simple toy problem — the Maze shown in the image above. Our intrepid maze solver finds themselves positioned at the diamond with two possible choices of action. The green choice will eventually lead them to the green reward, and the red choice to the red reward.

If we know the rewards beforehand, we can calculate the return for each choice. In this case the image has been annotated with the appropriate rewards, so we can see that:

  • Return for Red: 75
  • Return for Green: 21

It should be clear that if the agent wants to maximize its Return, it should choose the Red option. Looked at another way, moving up earns us 75 points, moving down earns us 21 points. We want to move up!

Some key assumptions were made here though — notably that we knew, with perfect information, not only the full structure of the environment but what every reward would be beforehand. That’s a huge advantage that we simply can’t rely on having in any kind of practical application.

In the real world we’d have to estimate the Return for each of the actions based on our previous experience. Let’s take a look at another map to see how that can change things:

A Mysterious Maze!

In our new scenario, the agent only has experience with two choices — what happens when it goes up exactly once, and what happens when it goes down. It does not know what the results of any further actions would be. Based purely on the information on hand, if the agent wants to maximize its return, it needs to exploit the fact that the downwards direction has the highest known score and move towards the green arrow.

That uncovers a new problem, however: what if finding the red dot is more valuable than all of the possible steps leading from the down action? Our current strategy of naively choosing the highest-scoring option we can see would actually lead to a subpar score. If we want to find the best option, we are going to have to perform some kind of exploration.

How do we decide when to explore and when to exploit? The most common method is to do this probabilistically. With this approach the agent starts by exploring randomly for 100% of its actions. As time goes on — based either on the number of times the agent has seen the environment, or some other measure appropriate to the task — this exploration probability is gradually lowered, with the agent making random moves some of the time and exploiting the highest available reward the rest of the time.
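This idea is commonly implemented as an “epsilon-greedy” rule: with probability epsilon the agent picks a random action, otherwise it picks the best-known one, and epsilon is decayed over time. Here is a rough sketch; the value estimates and the decay schedule are made up for illustration:

```python
import random

def epsilon_greedy(action_values, epsilon):
    """Pick a random action with probability epsilon, otherwise the best-known one.

    action_values: dict mapping action -> estimated return, e.g. {"UP": 15, "DOWN": 8}
    """
    if random.random() < epsilon:
        return random.choice(list(action_values))        # explore
    return max(action_values, key=action_values.get)     # exploit

# Start fully random (epsilon = 1.0) and decay towards mostly greedy behaviour.
epsilon = 1.0
for episode in range(1000):
    # ... run an episode, choosing each action with epsilon_greedy(...) ...
    epsilon = max(0.05, epsilon * 0.995)  # illustrative decay schedule
```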

That leaves a question, however: what exactly is the agent “learning”? What’s being learned is the relationship between the states, the rewards, and the actions. In a typical example, the agent starts with a clean slate — it has no knowledge of the rewards available in the environment.

What Rewards?

Let’s ignore the exploration/exploitation problem for the moment and suppose our agent is searching the environment systematically. It explores the grid squares above, below, to the left, and to the right of the green circle, getting the rewards shown in the following image:

Preliminary Rewards

After one iteration, then, the agent has learned the following about the “green dot” state (a short code sketch of how this might be stored follows the list):

  • Moving Up gets 15 points
  • Moving Left gets 11 points
  • Moving Down gets 8 points
  • Moving Right gets 12 points
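One simple way to store these estimates in code is as a table keyed first by state and then by action; in Python that can be just a nested dictionary. The state name below is made up, and the numbers mirror the list above:

```python
# Estimated reward for each action, keyed by state and then by action.
# "green_dot" is a made-up state name; the numbers mirror the list above.
value_table = {
    "green_dot": {"UP": 15, "LEFT": 11, "DOWN": 8, "RIGHT": 12},
}

best_action = max(value_table["green_dot"], key=value_table["green_dot"].get)
print(best_action)  # UP
```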

That’s simple enough; let’s take it one step further. Suppose that after a few more iterations the agent explores the yellow squares, getting the rewards shown in the diagram below:

Updated Rewards

We now have to answer a new question — what reward should we assign to the Move Up action based on the current information? We know that after one “move” the agent scores 15 points. What about the second move? Well, we can assume the agent is trying to maximize its score, so it would want to choose the option worth 6 points.

Adding those up, we see that the total reward for the “Move Up” action in this case should be 21 points. We could follow a similar strategy to fill in the rest of the values in the grid, always assuming that the agent will be maximizing its total reward. I’ve provided an example worked one more step into the future to drive the point home:
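As a tiny worked calculation, the estimate for “Move Up” is the immediate 15 points plus the best option available from the square we land in. Only the 6-point follow-up comes from the example above; the other follow-up values are made up:

```python
# Immediate reward for moving up, plus the best follow-up move from the new square.
reward_up = 15
follow_up_options = {"LEFT": 6, "DOWN": 4, "RIGHT": 2}  # only the 6 comes from the example

# Assume the agent keeps maximizing, so it takes the best follow-up move.
value_of_up = reward_up + max(follow_up_options.values())
print(value_of_up)  # 21
```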

New Rewards provide a change in behavior

As can be seen from the updated grid, a bit more information causes the agent to change its pathway through the world. Where previously we had a path of “UP, LEFT”, we can now see that path has morphed into “UP, UP, RIGHT”.

This mapping, where we assign an action to take at each state, is called the policy. One way to look at what RL is doing is to think of it as learning a policy for the agent to follow. Learning in this case involves constantly updating the policy whenever new information is found.

There are a few things we can say about policies:

  • We can assign them a score based on the Return: for example, given the above, the policy (UP, UP, RIGHT) is valued at 33 points, but the policy (UP, LEFT) is only worth 21 points (see the short sketch after this list).
  • We can compare policies based on their score — here we can say that the policy (UP, UP, RIGHT) > (UP, LEFT), since the former policy gives a higher return than the latter.
  • We can formulate a definition for an optimal policy — a set of (state, action) pairs whose return no other policy can beat.
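As a short sketch of the first two points, we can record each candidate policy as the sequence of actions it prescribes, score it with the return from the worked example, and compare:

```python
# Each policy is a sequence of actions; its score is the return it earned in the grid example.
policy_returns = {
    ("UP", "UP", "RIGHT"): 33,
    ("UP", "LEFT"): 21,
}

best_policy = max(policy_returns, key=policy_returns.get)
print(best_policy)  # ('UP', 'UP', 'RIGHT')
```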

We are almost at the end of part one of this series. There is, however, one more flaw in the system we haven’t addressed yet — how far ahead should we allow the policy to go?

Let’s return to the grid for a further example — what if, instead of stopping at (UP, LEFT), we extended the previously “inferior” policy further to (UP, LEFT, UP, RIGHT, RIGHT)? This new policy would then score higher than the previous one, since it was able to collect more rewards than the others. Now, we could decide that all policies *must* have the same number of actions, so that if we start at one place we always take exactly 27 moves, but that’s not very representative of the real world. In the real world, there is no pre-defined limit to how many actions we can take (at least in a practical sense).

To get around this, we can introduce the concept of a “discounted return”. The idea here is that we give preference to rewards that are “closer” to our current state over those that are further away. The image below should help illustrate this point further:

Discounting the Reward

In the above we’ve discounted the reward by 50% for each step taken by the agent. The yellow triangles represent how much discounted reward was added to the reward gained for entering a given square in the grid, while the white triangles represent the total reward earned (working backwards from the end of the policy). The 50% value here is referred to as the discount factor.

Here we can see that we are heavily valuing rewards that are close to the agent’s current position, while steeply discounting positions that take longer to reach. In fact, in this example, the value of each additional step very quickly gets close to zero.

This discount factor is typically a constant that the agent’s designer sets for the agent in the world. It allows us to model how much of an effect faraway states should have on the agent’s decision making, and it typically ranges between zero and one. A value of one means that all steps are weighted equally (which was our practice in this article before introducing the discount factor), while a value of zero means that only the immediate next step has any bearing on the agent’s decisions.
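Putting this into code, the discounted return multiplies the reward at each successive step by one more factor of the discount factor. The per-step rewards below are made up; the point is how the three choices of discount factor change the result:

```python
def discounted_return(rewards, gamma):
    """Sum the rewards, weighting the reward at step t by gamma ** t."""
    return sum(reward * (gamma ** step) for step, reward in enumerate(rewards))

rewards = [10, 8, 4, 2]                   # made-up per-step rewards along one path
print(discounted_return(rewards, 0.5))    # 10 + 4 + 1 + 0.25 = 15.25
print(discounted_return(rewards, 1.0))    # every step counts fully: 24.0
print(discounted_return(rewards, 0.0))    # only the immediate reward counts: 10.0
```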

I’ve provided yet another image to help visualize the effect that the two extreme choices of the discount factor can have on the agent’s behavior in the diagram below. On the left, the green arrows show the path the agent would take through our little grid world if the discount factor were set to zero, while on the right we see the different path that would be taken if the discount factor were set to one.

A change of Policy

That’s enough for one article — in my next article I am going to expand on what has been explained so far and start building a simple RL problem that may be of interest to a personal robot.

Until then,

Share and Enjoy!

References

Below are some good references that helped me along the way with learning the concepts behind Reinforcement Learning:

Reinforcement Learning: An Introduction, Sutton and Barto, 2017

Udacity’s Reinforcement Learning Course
