Not Another RL Tutorial!

Part 1: A Whirlwind Tour of RL

Luc Gendrot
12 min read · Sep 9, 2018

“Oh no,” I hear you say as you click another post that purports to teach you reinforcement learning. “Not another one.”

Well sorry to disappoint, but welcome to yet another odyssey through the wonderful world of Reinforcement Learning. This time, though, your host is just as clueless as you!

Actual Photo of Me

I’ve been trying to learn RL — and I mean really learn RL — for months now. In an effort to curb my typical habit of watching a lecture series on YouTube, convincing myself I know the material, and then eventually realizing I don’t remember anything about it because I never did the exercises or coded a single line, I have decided to write a series of blog posts to complement my learning. After all, the greatest way to learn is to teach, right?

Also the best way to get an answer on the internet is to be wrong, so I’m hoping all of the potential glaring misconceptions I put out will be corrected by everyone’s favorite hate-mob. Have at me!

This series may not be useful — as ultimately this is an exercise intended more for me than for you, dear reader — but I write it publicly in the hopes that it may help someone.

The text basically amounts to a prettier version of the notes I took while watching lectures on YouTube and reading Sutton and Barto, the canonical textbook covering RL.

In terms of structure, I’ll be following David Silver’s Lectures nearly minute-by-minute. I will also s̶t̶e̶a̶l appropriate some (or all?) of the images used in his slides.

Forgive me, David.

My hope is that this series ends up as a decent written companion to the lectures, both for others watching Silver’s lectures and for me coming back to the material should I ever need a refresher.

As I said, I’m pretty lazy when it comes to self-study and textbook exercises, so I’m also going to force myself to update a GitHub repo full of implementations of some of the contrived examples Silver uses, and (maybe) some contrived examples of my own.

This first post will attempt to cover the material in the first lecture, which is basically just a summary of the rest of the course. If you’ve read a million of these sorts of articles, feel free to skip to the next post, where we’ll start talking about the real stuff.

So, without further ado, let’s begin.

Framing the Problem: What is RL?

Reinforcement learning (RL) is an area of machine learning, inspired by behaviorist psychology, concerned with how software agents ought to take actions in an environment so as to maximize some notion of cumulative reward.

Wikipedia

Put simply, the goal of RL is to build an agent that learns to make optimal decisions with (often incomplete) information about its environment. This, in a nutshell, is the reinforcement learning problem. But before beginning to talk about decision making, we must get on the same page about a few high-level concepts: an agent’s environment, the rewards, and the observations it receives.

Lecture 1 slides, p. 17

Rewards

The reward is arguably the single most important quantity in reinforcement learning. It’ll become clear later that rewards form part of the most important theorem in RL, but for now let’s just get a basic understanding of what they’re used for.

For every action the agent takes in the environment, it receives a scalar reward signal. Intuitively, this reward gives the agent some indication of how good each of its actions was. Attempting to maximize the total reward the agent expects to receive leads to good behavior because of something called the reward hypothesis, which states that any goal we want an agent to achieve can be described as the maximization of its expected cumulative reward, provided the reward is well designed.

The reward hypothesis is a slightly controversial claim, and in practice designing reward schemes that reflect precisely what you want your agent to do can be very complicated. Despite these challenges, the reward hypothesis is one of the central tenets of the reinforcement learning paradigm, so it’ll be treated as unassailable truth going forward.

As an example of an RL problem, consider a robot tasked with navigating a maze. Say the robot receives a reward of -1 for every step it takes where it isn’t at the goal (the end of the maze), and 0 for reaching the goal. If the agent were to obtain the maximum amount of reward possible, it would implicitly have exited the maze in as few steps as possible. Granted, the maximum reward in this case is a negative number, but that’s a fairly common occurrence.
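As a minimal sketch of that reward scheme (the function name and signature here are my own, not from the lecture):

```python
def maze_reward(position, goal):
    """Reward scheme for the maze robot: -1 for every time step spent
    anywhere other than the goal, 0 once the goal is reached."""
    return 0 if position == goal else -1
```

Summing these rewards over an episode gives minus the number of steps taken before reaching the goal, so maximizing total reward is the same as minimizing the number of steps.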

The Environment

“The Environment” is mentioned frequently in the coming sections, so it’s important to distinguish the environment from other concepts central to the RL problem.

The environment is the physical (or virtual) space in which the agent operates. Taking the famous Atari game-playing agents as an example: the environment would be the game emulator itself, with all its internal code and game logic. As another example, a self-driving car’s environment would be the real world, and all the complicated physics that implies.

The agent has no control over the environment. For all intents and purposes the environment is a black box to the agent.

Observations

Agents need to get information about the environment in order to make decisions on how to act. They do this by taking in observations of the environment.

These observations are generated by the environment, and interpreted by the agent. They can be thought of as representations of the environment: sometimes perfect (like in the contrived examples we’ll see later in the series), but most often imperfect.

In the case of a self-driving car, the agent’s observations might be numerical readouts from the various sensors fitted onto the car (LIDAR, GPS, cameras, etc.).

In the Atari example the observations might amount to screenshots of the game being played, the same as a human would see.

Agents

The agent receives observations of the environment and uses those observations to select actions, which affect the environment and ultimately cause a reward signal to be given to the agent.
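To make that interaction loop concrete, here’s a minimal toy sketch of one episode; the `Environment` and `Agent` classes (and the silly guess-a-number task) are my own stand-ins, not anything from the lecture:

```python
import random

class Environment:
    """A toy environment: the agent must guess the number 3 within 10 steps."""
    def reset(self):
        self.steps = 0
        return 0                                      # initial observation

    def step(self, action):
        self.steps += 1
        reward = 1.0 if action == 3 else -1.0         # scalar reward for the last action
        done = (action == 3) or (self.steps >= 10)    # episode ends on success or after 10 steps
        return self.steps, reward, done               # (next observation, reward, done)

class Agent:
    """A toy agent that acts randomly and ignores its feedback."""
    def act(self, observation):
        return random.randint(0, 5)                   # choose an action given the observation

    def learn(self, observation, reward):
        pass                                          # a real agent would update itself here

# One episode of agent-environment interaction.
env, agent = Environment(), Agent()
observation = env.reset()
total_reward, done = 0.0, False
while not done:
    action = agent.act(observation)                   # the agent selects an action
    observation, reward, done = env.step(action)      # the environment responds with a new observation and a reward
    agent.learn(observation, reward)                  # the agent learns from the feedback
    total_reward += reward                            # the quantity the agent tries to maximize
print("Total reward this episode:", total_reward)
```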

Since the agent and all of its components are essentially the only parts of the RL problem under the developer’s control, its makeup is particularly important for us to understand.

These four pieces (agents, environments, observations, and rewards) are just some of the pieces necessary to tackle the RL problem. Digging deeper will require a rigorous mathematical framework with which to solve the problem in general, so let’s start talking about the math.

States and the Markov Property

In the course of operation, the agent generates a “path” through the environment, which consists of a sequence of observations, rewards, and actions, one set for each step the agent takes. Grouping these items together forms what is rather intuitively called the history.

The history of an agent/environment pair includes all of the observations, rewards, and actions generated up to a particular point in time. Naturally this means the history contains all of the information available to the agent throughout its lifetime.

It’s unnecessary (and often impossible) to store the entire history of an agent, so instead it’s useful to consider the concept of state. Formally, the state is just a function of the history, S_t = f(H_t): a summary of everything that has happened up to a certain point.

The state can be thought of from both the perspective of the environment (environment state) and the agent (agent state).

The environment state is the environment’s private, internal representation of its own history. For example: the environment state of an Atari emulator would be all 1024 bits of data that control the internal logic of the game. This set of numbers, along with whatever action the agent chooses, fully characterizes the next step the environment will take, and thus what observation and reward will be provided to the agent next.

The environment state is typically not visible to the agent. In the Atari example, for instance, the agent works from screenshots and not directly from the 1024 bits inside the emulator.

In the case of a self-driving car the environment would be the entire physical world. The environment state in this example contains everything about the physical world, from the wind patterns in Minnesota to the tide level in Costa Rica. All the possible next configurations of the real world are determined by the current values of these attributes. However, for a self-driving car in San Francisco none of those attributes are relevant (not to mention nearly impossible to gather or work with), so it operates on observations built from its local surroundings by its sensors.

The agent state, in contrast, is built entirely by the developer and by the algorithm used to solve the reinforcement learning problem. Just like the environment state, the agent state summarizes the history, but instead of determining how the environment evolves, it summarizes the information the agent needs to make its next decision.

These definitions of state are rather abstract and don’t tell us much about a state’s formal properties. So a more mathematical definition of state would be helpful.

Luckily early 20th century mathematicians were pretty smart, so we don’t have to look too far.

Ideally any state, being a summary of the history, would contain all the useful information about that history. That is, the state shouldn’t lose any relevant information from step to step. Additionally, just as with the history, it shouldn’t be necessary to consider a long sequence of stored states (that would defeat the purpose of summarizing the history in the first place).

And so, to meet these conditions, we require that the states be Markov states which exhibit the Markov property:

Lecture 1 slides, p. 21

Which reads: “A state is Markov if and only if the probability distribution over next states given the current state is equal to the probability distribution over next states given all previous states.”

Put simply, this means the current state should be all that is needed to reason about the next state. The future is considered to be independent of the past given the present, making the present state “a sufficient statistic of the future”.
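Written out symbolically, the condition described above is:

$$ \mathbb{P}\left[S_{t+1} \mid S_t\right] = \mathbb{P}\left[S_{t+1} \mid S_1, \ldots, S_t\right] $$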

Despite our wish not to store the full history, the history itself could be considered a Markov state, albeit an unwieldy one. It’s sort of tautological: if the state is the full history, then the current state contains everything the previous states did, so it would only be necessary to remember the most recent state… which happens to contain the whole history.

See? Kind of tautological.

As a better example: the environment state is by definition Markov, because it’s exactly what the environment uses to decide on the next step. The 1024 bits inside an Atari emulator determine all the possible configurations of 1024 bits that can follow, given an action by the player. In other words, the emulator doesn’t need to know what set of 1024 bits it had 20 minutes ago to know how to proceed right now; it only needs the current configuration of bits and the action the player just took.

Remember, the developer doesn’t control the environment state, just the agent state, so it’s up to the developer to create agent states that exhibit the Markov property.

In a fully observable environment, the agent state is the same as the environment state. If the Atari emulator’s internal 1024 bits were used as the agent’s state instead of screenshots of the game, the environment would be fully observable, since the agent would know everything there is to know about the environment. This grants the agent state the Markov property “for free”, but full observability is impossible in most real-world scenarios.

In a partially observable environment the agent state must be somehow computed from the history, since the observations are now incomplete representations of the environment. The agent could use the full history of observations just like we discussed above, or it could use the previous N observation values, or it could use a recurrent neural network to keep a running summary. The choice is ultimately up to the developer.

Possible Representations of State in a Partially Observable Environment (Lecture 1 slides, p. 27)

Aside: The lecture already contains a great example of why the choice of agent state is important in a partially observable environment, so I recommend giving that a gander. The example lasts about 3 minutes, and starts at 47:55.
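As a toy illustration of the second option mentioned above (building the agent state from the last N observations), here’s a minimal sketch; the class and method names are my own invention, not a standard API:

```python
from collections import deque

class LastNObservations:
    """Builds an agent state for a partially observable environment
    by remembering only the most recent n observations
    (e.g. the last few frames of an Atari screen)."""

    def __init__(self, n=4):
        self.n = n
        self.frames = deque(maxlen=n)      # older observations fall off automatically

    def update(self, observation):
        """Record the newest observation and return the current agent state."""
        self.frames.append(observation)
        # Until n observations have been seen, pad with copies of the earliest one.
        while len(self.frames) < self.n:
            self.frames.appendleft(self.frames[0])
        return tuple(self.frames)          # the agent state: the last n observations
```

A recurrent neural network would instead fold each new observation into a learned running summary, but the idea is the same: the agent state is whatever function of the history the developer chooses.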

Anatomy of a Reinforcement Learning Agent

The question still remains: how does the agent use its state to make decisions? What are the internal components of an agent that make this analysis possible?

The answer is one or more of the following: a policy, a value function, or a model. An agent may consist of any combination of these three components; as long as it has at least one, it can tackle the RL problem.

Parts of an RL Agent (Lecture 1 slides, p. 25)

Policy

“What action should I take in this state?”

An agent’s policy is a function that maps states to actions.

A policy can be deterministic or stochastic: a deterministic policy outputs an explicit action given a state, while a stochastic policy outputs a probability distribution over actions (again given a state) from which an action can be sampled.
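In notation, the two flavors look like this (a deterministic policy returns an action outright, while a stochastic policy gives the probability of each action in a state):

$$ a = \pi(s) \qquad \pi(a \mid s) = \mathbb{P}\left[A_t = a \mid S_t = s\right] $$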

Value Function

“If I act this way, how good is the state I am in right now?”

“Or, if I act this way, how good is it to take this particular action in this state?”

State-Value Function
Action-Value Function

The value function can either consider the state alone, or the state plus a candidate action. In each case the value returned is an estimate of how much future reward the agent can expect to receive if it is governed by our current policy.
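Written out (in the standard notation used by Sutton and Barto and in the lectures), the two flavors are:

$$ v_\pi(s) = \mathbb{E}_\pi\left[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots \mid S_t = s\right] $$

$$ q_\pi(s, a) = \mathbb{E}_\pi\left[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots \mid S_t = s, A_t = a\right] $$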

The value function is also parameterized by a discount factor (the γ in the formulas above). This discount factor determines how far into the future to “care” about rewards: a discount factor of 1 gives future rewards as much importance as immediate rewards, while a discount factor of 0 means only immediate rewards are considered.

For a cool look at a value function in action, check out the Value Prediction section of OpenAI’s summary video on their Dota bot(s).

Model

“What’s going to happen next if I do this?”

Finally we have the agent’s model of the environment. This is the agent’s learned conception of how the environment works. Given a particular state and an action to take, the agent’s model will give it a prediction of the next state.

The model can also predict the reward given a state and action pair.
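In notation (following the lecture’s convention), the two parts of the model are a transition model P and a reward model R:

$$ \mathcal{P}^a_{ss'} = \mathbb{P}\left[S_{t+1} = s' \mid S_t = s, A_t = a\right] $$

$$ \mathcal{R}^a_s = \mathbb{E}\left[R_{t+1} \mid S_t = s, A_t = a\right] $$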

Here, P is a probability distribution over next states given the current state and an action, and R is the expected reward for a particular state-action pair.

Categorizing Agent Types

All of the above pieces make up the machine learning portion of reinforcement learning. That is, they all require training (essentially trial and error by the agent) to be useful. An agent can learn any combination of these three components, and each combination categorizes the agent differently.

The space of RL Agent types

An agent can be policy-based, learning a policy directly without learning a value function.

An agent can be value-based, where it has no explicit policy but instead learns a value function and acts on its estimates. (Technically it still has an implicit policy: choose the action with the highest estimated value.)

Or an agent can be an actor-critic, having both a learned policy and an estimated value function for that policy.

Finally, any of these agent types can be model-based or model-free: a model-based agent additionally learns a model of the environment and uses it to improve its policy or value function, while a model-free agent learns directly from experience without one.

Conclusion

This concludes our whirlwind tour of the Reinforcement Learning space. I hope you enjoyed the ride.

This was largely just a summary post, providing just a taste of all of the concepts covered in the coming posts. Don’t worry if some of these concepts or their utility aren’t totally clear yet. There will be many more opportunities to understand them as the series wears on. Too many, probably.

In the next post we’ll start working towards solving the RL problem by first considering an idealized version. We’ll build and get a feel for the theoretical framework used in RL, and we’ll start writing some code to better understand some of the concepts.
