Model-based Reinforcement Learning Part 1: Introduction

Bhairav Mehta
7 min read · Dec 4, 2017


Machine learning can be broken up into a seemingly innumerable number of research areas. Yet, while machine learning has advanced many fields, much of that progress seems irrelevant to understanding how decision-making in our own brains actually occurs. Many of the amazing advances in application areas like medicine and computer vision amount to building master pattern-finders, and while useful, these systems don’t seem to learn the way human beings do. Even “biologically-inspired” deep neural networks are extremely data-inefficient when it comes to learning patterns from training data, whereas a human requires only one experience to learn that a flame is hot and that water is wet.

Humans don’t need too many reminders that flames are hot, after the first time.

Reinforcement learning, a branch of machine learning, aims to learn by directly interacting with the environment, rather than from training data that is precollected and often prelabeled as well. Reinforcement learning offers the lure of being “evolution-inspired”: agents (the actors in an environment), like humans, use trial and error to learn about the world. While not nearly as sample-efficient as humans, agents learn over time to do the things that reward them the most, and eventually learn how to operate in their world. If this is confusing, let’s make things a bit clearer.

At each step (which can be thought of as a frame advance in a video, or a second in time), the environment asks the agent (the thing interacting with the environment) which action it would like to take, given the state the agent is currently in. The agent uses its policy to decide on an action, and then applies that action in the environment.

A visualization of how states turn into actions.

For example, if my environment was a game show like Jeopardy, then my current state might be seeing what questions are remaining, with Alex Trebek asking what action I would want to take next.

After the action is performed, the environment gives the agent a new, updated state, as well as a reward signal: a scalar that encodes how successful the previous action was. The agent’s goal is, through continued interaction with the environment, to maximize the reward it accumulates over time.
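
To make this loop concrete, here is a minimal sketch in Python using the OpenAI Gym interface; the `CartPole-v1` environment and the random actions are just stand-ins for illustration, and the exact API can vary between versions.

```python
import gym

# A minimal sketch of the agent-environment loop described above.
env = gym.make("CartPole-v1")
state = env.reset()

total_reward = 0.0
done = False
while not done:
    # A real agent would consult its policy here; a random action stands in.
    action = env.action_space.sample()
    # The environment returns a new state and a scalar reward signal.
    state, reward, done, info = env.step(action)
    total_reward += reward

print("Accumulated reward this episode:", total_reward)
```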

To Summarize in a Picture!

Right off the bat, there are a lot of questions that need answering, like the ones listed below:

  • What should we do in order to maximize accumulated reward over time?
  • How do we use our past experience in order to achieve an optimal policy?

But before we can address these questions (or any of the others under heavy study in reinforcement learning), we need to nail down a few more of the technical details.

Reinforcement learning, as we know, requires our agent to interact with the environment in order to receive a reward signal. The agent uses a policy, which can be thought of as a way to pick an action when given a state. The optimal policy is the policy that maximizes the probability-weighted sum of future rewards, and finding optimal policies is the main goal of reinforcement learning. Having an optimal policy is like trying to head north when you have a compass: no matter which way you turn, or what state you’re in, you always know the best route forward. An optimal policy tells you which action to take in any state to maximize your return. There are many challenges to finding this optimal policy, which we will discuss in more detail in subsequent posts.
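
Written in the usual notation (a sketch for the curious; the finite horizon T below is an assumption the post has not formally introduced), the “probability-weighted sum of future rewards” is just an expectation, and the optimal policy is the one that maximizes it:

```latex
\pi^{*} = \arg\max_{\pi} \; \mathbb{E}_{\pi}\!\left[ \sum_{t=0}^{T} r_t \right]
```

where the expectation is taken over both the environment’s transition probabilities and the actions the policy chooses.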

Where does the probability-weighted part come from? In many problems, there are what are called transition probabilities, written T(s, a, s’), which are interpreted as: what is the probability that we transition to next state s’, given that we take action a in state s?
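
As a concrete, made-up illustration (the states and actions here are purely hypothetical), such a transition model can be stored as a simple lookup:

```python
# Toy transition model T(s, a, s') for a hypothetical two-state problem.
# Keys are (state, action) pairs; values map each next state to its probability.
T = {
    ("A", "stay"): {"A": 0.9, "B": 0.1},
    ("A", "move"): {"A": 0.2, "B": 0.8},
    ("B", "stay"): {"A": 0.1, "B": 0.9},
    ("B", "move"): {"A": 0.7, "B": 0.3},
}

def transition_prob(s, a, s_next):
    """P(s' | s, a): probability of landing in s_next after taking action a in state s."""
    return T[(s, a)].get(s_next, 0.0)

print(transition_prob("A", "move", "B"))  # 0.8
```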

Going back to Jeopardy, the state might be the questions remaining (and maybe some details about myself, e.g. my history knowledge, the number of questions I’ve gotten correct, etc.), and the policy would be the internal decision-making process I use before I tell Alex which card I’d like next. Here, our transition probabilities are filled with only ones and zeros, because the transitions are deterministic: Alex is not going to give me a question I didn’t ask for.

A Reinforcement Learning overview.

Reinforcement learning problems are often posed as Markov Decision Processes (MDPs): problems in which the next state depends only on the current state and the current action. Environments with this property are said to have the Markov property, which is pretty useful in both real life and reinforcement learning. Intuitively, the Markov property says that all the information about what happened previously is wrapped up and summarized in the current state, and that the state you will be in next depends only on the current state and the action you end up taking. For example, to play chess you don’t need to know (or remember) every move and board configuration since the start of the game; you just need to look at where you are now and figure out what’s best to do from there.
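
Stated as an equation (standard notation, sketched here for completeness), the Markov property says that conditioning on the entire history tells you nothing beyond conditioning on the current state and action:

```latex
P(s_{t+1} \mid s_t, a_t, s_{t-1}, a_{t-1}, \ldots, s_0, a_0) = P(s_{t+1} \mid s_t, a_t)
```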

Now that we know about reinforcement learning problems and their structure, how can we use them to tackle each of the two questions posed above? Let’s take them step by step.

Question #1: What should we do in order to maximize accumulated reward over time?

Initially, our agent doesn’t have a lot of knowledge about the environment, so we have to think of a way to give it some. In model-free reinforcement learning, we simply have the agent start interacting with the environment and collecting experience. While this series will not focus on those methods, there is a lot you can learn from experience alone, using techniques like policy gradients.

In model-based reinforcement learning, we have a bit more of an advantage. This series will answer this question from a model-based reinforcement learning perspective, but we can summarize the idea here.

In model-based reinforcement learning, the model refers to a model of the transition probabilities, which can be thought of as a model of the environment. If we know how likely we are to move from state to state when taking a certain action, we can start to build up an estimate of how the environment works. We can build this model the same way model-free approaches gather data: by interacting with the environment, as sketched below.
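
One simple sketch of what “building a model from interaction” can look like: count the transitions you observe and normalize the counts into probabilities. The `experience` list of (state, action, next_state) tuples here is hypothetical.

```python
from collections import defaultdict

def estimate_transition_model(experience):
    """Estimate T(s, a, s') from observed (state, action, next_state) tuples."""
    counts = defaultdict(lambda: defaultdict(int))
    for s, a, s_next in experience:
        counts[(s, a)][s_next] += 1

    model = {}
    for (s, a), next_counts in counts.items():
        total = sum(next_counts.values())
        model[(s, a)] = {s_next: c / total for s_next, c in next_counts.items()}
    return model

# Hypothetical experience gathered by interacting with the environment.
experience = [("A", "move", "B"), ("A", "move", "B"), ("A", "move", "A")]
print(estimate_transition_model(experience))
# {('A', 'move'): {'B': 0.666..., 'A': 0.333...}}
```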

Why do we need a model? Interacting with an environment might be fine in video games, but in applications like robotics, interaction, especially with an untrained policy, can be extremely expensive in terms of time and safety. With a model of the environment, an agent can imagine how its policy would do, and iteratively improve that policy. There are many approaches to this; this series will start with some toy optimal-control environments and then move on to how model-based RL can solve real problems in areas like robotics.
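
As a rough sketch of this “imagining” (assuming a learned transition model in the dictionary format above and some reward function, both of which are illustrative assumptions rather than anything this series has defined yet), the agent can roll its policy forward inside the model instead of the real environment:

```python
import random

def imagined_return(policy, model, reward_fn, start_state, horizon=10):
    """Roll the policy forward inside the learned model and sum the rewards."""
    state, total = start_state, 0.0
    for _ in range(horizon):
        action = policy(state)
        next_dist = model.get((state, action))
        if next_dist is None:  # the model has never seen this (state, action) pair
            break
        # Sample the next state according to the learned probabilities.
        states, probs = zip(*next_dist.items())
        next_state = random.choices(states, weights=probs)[0]
        total += reward_fn(state, action, next_state)
        state = next_state
    return total
```

Averaging many such imagined rollouts gives an estimate of how a candidate policy would do without ever touching the real robot.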

Guided Policy Search, a popular model-based RL method, being used to teach robots at Google.

Question #2: How do we use our past experience in order to achieve an optimal policy?

In all of machine learning, the goal of training algorithms is generalization, and reinforcement learning is no different. Generalization is the ability to perform well on states or inputs not yet encountered. A machine learning model that generalizes should be able to tell you that a picture of a Yorkie should be labelled “dog,” even though it was only trained on pictures of Golden Retrievers and Black Labs. In reinforcement learning, generalization means finding a policy that performs well even in states the agent never encountered in its experience.

To aid generalization, the reinforcement learning community has moved away from tabular methods toward function approximators, of which by far the most popular are neural networks. Tabular methods give each state its own cell in an array, but as you can guess, this does not scale to real problems, where the state space might be massive or even infinite (i.e. continuous state spaces). Neural networks, and clever advances in their architectures, let us extend reinforcement learning to both high-dimensional state and action spaces. We will get into this more later in the series, but if your background in neural networks is lacking, Michael Nielsen’s free online book would be a fantastic refresher.
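
A tiny sketch of the difference (the sizes and the linear model below are illustrative assumptions; a neural network plays the same role as the linear model, just with far more capacity):

```python
import numpy as np

# Tabular: one value estimate per discrete state.
# Fine for a 16-state gridworld, hopeless for huge or continuous state spaces.
n_states = 16
value_table = np.zeros(n_states)

# Function approximation: a parameterized function maps any state vector to a
# value estimate, so similar states share information. Sketched as a linear
# model over a 4-dimensional continuous state (e.g. CartPole's observation).
weights = np.zeros(4)

def value_estimate(state_vector):
    return float(weights @ state_vector)
```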

In the following blog posts, we will explore reinforcement learning, especially the model-based approach: its benefits, progress in the field, and current active areas of research. If your interest has been piqued, keep reading!

This post is Part 1 of a few, in which we will try to approach what’s called Model-based Reinforcement Learning from a less-mathy perspective.

  1. Part 1: Introduction
  2. Part 2: Model-based RL
  3. Part 3: RL Formalism
