Practical Reinforcement Learning pt. 2

Introduction

Gene Foxwell · Published in Coinmonks · Sep 21, 2018

In the previous article in this series I covered some of the intuition behind Reinforcement Learning (RL) — what it is, and what it does. In this article I want to start to look into how to apply the ideas behind RL with a very simple problem inspired by a home robotics scenario — Marvin’s World.

Marvin’s World

Marvin’s world is a very simple game, starring Marvin, whom we see in the top-left corner of the grid. Marvin is carrying a beverage (or other object of interest) and has been instructed to bring it back to its owner.

Of course we could easily solve this problem with any number of off the shelf path finding algorithms — but for the purpose of this article I want to look at this from the perspective of RL. How can we design an RL system that can solve this problem?

Formulating the Problem

There are, in my view, three main components that we need to specify in order to describe an RL problem:

  • Agent: We need to know what actions it can take and what inputs it will accept. For example, a sorting robot may have full control over a complex robotic arm while accepting inputs from an on-board video camera.
  • Environment: This is where the Agent will “live”; it can be either a simulation or, if you have a physical robot, the real world. All inputs to the Agent come from the environment.
  • Rewards: These are, in essence, how we control what the Agent will try to do in the environment. For example, we could give the Agent a high reward for solving a maze and a low (possibly negative) reward for getting lost or taking too long.

How does this relate to Marvin’s world? Well, the environment is given to us in the form of the grid shown: we have Marvin in the top left, a couple of end tables that we may want to avoid crashing into, and Marvin’s owner near the bottom right of the grid.

Our agent for this problem will be the titular Marvin. Marvin has four actions it can take: UP, DOWN, LEFT, and RIGHT. Each of these moves Marvin into the square adjacent to it in the selected direction. If no such square exists, then Marvin does not move.
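To make the setup concrete, here is a minimal sketch of Marvin’s world as a small Python grid environment. The 4x4 layout, the table positions, and the owner position below are assumptions for illustration only; the grid image above defines the actual layout.

```python
# A minimal sketch of Marvin's world as a grid environment.
# The 4x4 layout, table positions, and owner position are illustrative
# assumptions -- the article's grid image is the real source of truth.

ACTIONS = {
    "UP": (-1, 0),
    "DOWN": (1, 0),
    "LEFT": (0, -1),
    "RIGHT": (0, 1),
}

GRID_SIZE = 4                # assumed 4x4 grid
START = (0, 0)               # Marvin starts in the top-left corner
TABLES = {(1, 1), (2, 2)}    # assumed end-table positions
OWNER = (3, 3)               # assumed owner position near the bottom right


def step(state, action):
    """Move Marvin one square in the chosen direction.

    If the move would take Marvin off the grid, he stays where he is.
    """
    delta_row, delta_col = ACTIONS[action]
    row, col = state[0] + delta_row, state[1] + delta_col
    if 0 <= row < GRID_SIZE and 0 <= col < GRID_SIZE:
        return (row, col)
    return state
```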

What about Rewards? This bit is tricky, and it is where a lot of the creativity and iteration come into play. For now, let’s start with the rewards shown in the image below:

Rewards!

Rewards have been assigned in the most basic manner: a positive reward is given for reaching Marvin’s owner (at which point the “game” ends and resets), and a negative reward is given for crashing into the tables (which might damage either Marvin or the table), after which the simulation again ends and resets to the original start state. Blank squares are assumed to provide a reward of zero.
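In the sketch started above, that scheme might look like the following. The +1 and -1 values come from the image; the helper names are my own.

```python
def reward(state):
    """Reward for arriving in `state` under the basic scheme."""
    if state == OWNER:
        return 1.0     # reached Marvin's owner
    if state in TABLES:
        return -1.0    # crashed into a table
    return 0.0         # blank square


def is_terminal(state):
    """The 'game' ends (and resets) at the owner or at a table."""
    return state == OWNER or state in TABLES
```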

This gives us a basic formulation of the problem; let’s see how we can actually solve it.

Searching for Rewards

Right now, Marvin does not have much information about the world; from its perspective the world is essentially a blank grid. With no other information, the only real choice is to pick some actions at random and see what happens. I’ve demonstrated one possible random path in the image below:

A “random” path

So Marvin does a bit of exploring, then runs into a table. Let’s apply the approach we learned in the previous article and see if we can figure out what the rewards should be based on this particular set of experiences. To keep things simple, I’ll assume the discount factor is one for this demonstration; we can tune this parameter later to see how it affects our eventual solution.
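As a sketch of that process in code, building on the hypothetical helpers above: roll out one random episode, then walk backwards through it, crediting each step with the return that followed it (undiscounted here, since the discount factor is one).

```python
import random


def random_episode(max_steps=50):
    """Let Marvin wander at random until he hits something or times out."""
    state, trajectory = START, []
    for _ in range(max_steps):
        action = random.choice(list(ACTIONS))
        next_state = step(state, action)
        trajectory.append((state, action, reward(next_state)))
        state = next_state
        if is_terminal(state):
            break
    return trajectory


def backtrack_returns(trajectory, gamma=1.0):
    """Walk backwards from the end of the episode, accumulating the return."""
    returns, g = [], 0.0
    for state, action, r in reversed(trajectory):
        g = r + gamma * g
        returns.append((state, action, g))
    return list(reversed(returns))
```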

An Updated Model of the World

Backtracking from the table, we get the reward values seen in the above image. Let’s try a few iterations of this idea to see what reward values show up. I’ll illustrate these in the image below.

Note: Any blank grid squares are assumed to have all zero values for their actions. Blank triangles are also assumed to provide zero reward.

Randomly Exploring the Environment

As the agent explores, it discovers more about its world and eventually encounters the various rewards (and penalties) that we’ve spread around it. This information can be used to assign a value to each action the robot can take from a given state. If you have followed the demonstration images above, you may have noticed the pattern for how this is done:

The value for an action is the maximum score that can be obtained by taking that action.
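One conventional way to write that rule down in code (a sketch only; the table name `Q` and the update below are my assumptions, not something the article defines) is to keep a table of action values and set each entry to the reward for the square the action lands on, plus the best value obtainable from there.

```python
from collections import defaultdict

# Q[state][action] holds the value of taking `action` from `state`.
Q = defaultdict(lambda: {a: 0.0 for a in ACTIONS})


def update_action_value(state, action, gamma=1.0):
    """Value of an action = reward for where it lands, plus the best value
    obtainable from that square (zero if the episode ends there)."""
    next_state = step(state, action)
    best_next = 0.0 if is_terminal(next_state) else max(Q[next_state].values())
    Q[state][action] = reward(next_state) + gamma * best_next
```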

There is another observation we can make about this situation: the two paths that lead directly to the goal both have the same score of +1. This is a direct result of how we chose the rewards. We are only rewarding or penalizing arrival at the goal and failure positions; there is nothing in our reward system that would differentiate between the lengths of the paths!

We can fix this problem in a fairly simple manner: just update the rewards! In this case it is desirable for the agent to find the shortest path to its goal, so we penalize each step it takes. The intuition is that since the agent is maximizing its reward, charging for every move encourages it to give shorter solutions a higher score. The image below demonstrates how the rewards change when we give each blank square a reward of -0.1.

New Rewards!

That’s a bit better: the Agent now has a way to differentiate between the “long way around” and the shortest path, which is exactly what we want. This also shows just how influential the reward function really is on the Agent’s behavior: the same agent, exploring the same environment, but with two different reward functions, finds two completely different solutions!
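In the reward sketch from earlier, this shaping is a one-line change. The -0.1 step penalty is the value used above, and how large to make it is a design choice rather than a special constant.

```python
STEP_PENALTY = -0.1   # small cost per move, so longer paths score lower


def shaped_reward(state):
    """Same rewards as before, but every blank square now costs a little."""
    if state == OWNER:
        return 1.0
    if state in TABLES:
        return -1.0
    return STEP_PENALTY
```

The size of the penalty matters: make it too small and a long detour still scores almost as well as the direct route; make it too large and crashing into a table can start to look cheaper than a long walk.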

I want to emphasize this point a bit more because, in my opinion, the reward function is really the key to a successful RL solution. Essentially, when RL is being applied, what we are really doing is recasting the problem into a new domain. The diagram below illustrates this a little better:

Recasting the problem

This is really what we’ve done here — we’ve taken a problem where the agent needed to find the “best” path in the environment and changed up how we find our solution. The Agent and Environment remained constant, but instead of using a traditional method like A* to find the best path, we have the robot search for rewards and choose a policy that maximizes those rewards.

Put another way: in the traditional solution we implement a path-finding method; in the RL solution we choose a reward function instead. Deciding when and how to use RL basically boils down to a question of what’s more practical: designing a reward function, or designing a traditional algorithm!
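For completeness, here is what “choose a policy that maximizes those rewards” can look like once an action-value table such as the assumed `Q` above has been filled in: in each square, simply take the highest-valued action.

```python
def greedy_policy(state):
    """Pick the action with the highest learned value in this square."""
    values = Q[state]
    return max(values, key=values.get)
```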

Next Article …

So far these last two articles have concentrated on gaining an intuitive understanding of how to use RL; we’ve yet to see any equations, and only rough sketches of code. In the next article I’ll walk through how we can translate everything we’ve learned into a simple Python program that we can use as a stepping stone towards solving more complicated RL-related problems.

Until then,

Share and Enjoy!

