Model-based Reinforcement Learning Part 2: Model-Based RL

Bhairav Mehta
4 min read · Dec 4, 2017

--

Big Disclaimer: A great reference for this post, and for understanding model-based reinforcement learning in general, is Chelsea Finn’s slides from the Berkeley Deep RL Bootcamp. She does a great job explaining the topic, but just as before, I will do my best to keep the math out of it for as long as possible.

To recap from our last post: in a reinforcement learning task, our agent needs an environment to wander around and act in. The environment hands the agent a state and asks for an action to execute. After the action is executed, the environment gives the agent the next state, as well as a reward. In model-based reinforcement learning, the goal is not only to optimize the policy to maximize reward, but also to estimate the transition probabilities p(s’ | s, a), which tell us which state s’ our agent is likely to land in if it takes action a while in state s.
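
To make that notation concrete, here is a tiny illustration of my own (not from the post’s references): for a made-up three-state world, p(s’ | s, a) is just a distribution over possible next states for every state-action pair, and we can sample from it.

```python
# A toy illustration of p(s' | s, a): for each (state, action) pair,
# a distribution over possible next states. All numbers here are made up.
import random

transition_probs = {
    (0, "right"): {1: 0.9, 0: 0.1},
    (0, "left"):  {0: 1.0},
    (1, "right"): {2: 0.9, 1: 0.1},
    (1, "left"):  {0: 0.9, 1: 0.1},
    (2, "right"): {2: 1.0},
    (2, "left"):  {1: 0.9, 2: 0.1},
}

def sample_next_state(state, action):
    """Sample s' from p(s' | s, a)."""
    dist = transition_probs[(state, action)]
    states, probs = zip(*dist.items())
    return random.choices(states, weights=probs)[0]

# The familiar state -> action -> next state loop, driven here by sampling
# from the toy transition probabilities above.
state = 0
for _ in range(5):
    action = random.choice(["left", "right"])
    next_state = sample_next_state(state, action)
    print(f"s={state}, a={action} -> s'={next_state}")
    state = next_state
```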

Slide explaining the learning loop of a model-based reinforcement learning algorithm. From Chelsea Finn’s lecture on model-based RL at the Deep RL Bootcamp.

A natural question then becomes, why do we need to know these transition probabilities?

These transition probabilities (interchangeable with the phrase “environment model”) give an agent the ability to simulate experience, kind of like how human beings imagine what might happen if they skipped class, switched car lanes, or ate an extra slice of pizza. A good model of the environment helps an agent understand what might happen if it took certain actions in certain states, all without actually interacting with the environment itself. The better the model of the environment, the less we need to interact with the real environment to optimize our policy.
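
As a rough sketch of that “imagining” idea (everything below, the model and the plans, is a hypothetical stand-in of my own, with a hand-coded model playing the role of a learned one), we can roll candidate plans through the model and compare their imagined returns without ever touching the real environment:

```python
# An illustrative "what if" comparison using a model of the environment.
# The model below is a hypothetical hand-coded stand-in for a learned one:
# a 5-state chain where moving right eventually reaches a rewarding goal state.

def model(state, action):
    """Hypothetical learned dynamics: returns (next_state, reward)."""
    next_state = min(max(state + (1 if action == "right" else -1), 0), 4)
    reward = 1.0 if next_state == 4 else 0.0
    return next_state, reward

def imagine(plan, start_state=0):
    """Roll a sequence of actions through the model only; no real interaction."""
    state, total_reward = start_state, 0.0
    for action in plan:
        state, reward = model(state, action)
        total_reward += reward
    return total_reward

plan_a = ["right"] * 6             # keep heading toward the goal
plan_b = ["right", "left"] * 3     # dither back and forth
print("imagined returns:", imagine(plan_a), imagine(plan_b))
print("better plan in imagination:",
      "plan_a" if imagine(plan_a) >= imagine(plan_b) else "plan_b")
```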

As many reinforcement learning papers test their algorithms in environments such as Atari, Minecraft, and other simulated settings, model-free algorithms are usually a bit more common. Results like the Policy Gradient Theorem allow model-free algorithms to optimize policies based only on the return received from the environment, and since we are in simulation, there is no real harm in performing incorrect or dangerous actions.

In physical applications of reinforcement learning, specifically robotics, there is a cost to performing these kinds of “incorrect” actions. Robots in the real world can be damaged, and real-life environments are not easy to reset (especially when compared to the env.reset() interface of OpenAI Gym). With a good environment model, a robot can improve its policy in simulation without ever physically interacting with the system. When its learned policy becomes good enough, or, technically, converges to a local optimum, the robot can be deployed in the physical world and perform well despite never having been run there. Now, in real life we don’t usually have that good of an environment model, but as we’ll see in the coming posts, there are ways to iteratively build a pretty solid one.

Now that we know what these models do and how we can use them, we can move on to the fun stuff: describing how they are actually built. Before we end this post, though, there are a few more benefits of model-based reinforcement learning that are good to keep in mind.

  1. Model-based methods are much more sample-efficient.
    Reinforcement learning can be broken into quite a few different classes, but generally, there are gradient-free methods, model-free methods, and model-based methods.
    Gradient-free methods, the most famous being evolutionary strategies, use random perturbations to find good policies. Scalable, but extremely sample-inefficient.
    Model-free methods directly optimize the policy based on return. Orders of magnitude more sample-efficient (depending on the class of method used), but still not great.
    Model-based methods, which use an iterative simulate / perform cycle that we will discuss in more detail in the next post (a rough sketch follows this list), are by far the most sample-efficient.
  2. Model-based methods can transfer.
    A learned model captures the environment’s dynamics rather than any single task’s reward, so it can often be reused. As we dive further into the details of popular model-based reinforcement learning algorithms such as Guided Policy Search, we will see some applications of this.
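
Here is the rough sketch of that simulate / perform cycle promised above. Everything in it (a five-state chain world, a count-based model, a greedy policy improvement step) is a toy stand-in of my own rather than any particular published algorithm, but the three-step structure of act, fit a model, and improve the policy inside the model is the core loop:

```python
# A toy, self-contained version of the iterative cycle behind model-based RL:
# act in the real environment, fit a model to what you saw, improve the policy
# inside the model, and repeat.
import random
from collections import defaultdict

class ToyChainEnv:
    """5-state chain: action 1 moves right, action 0 moves left, 10% slip."""
    n_states, n_actions = 5, 2

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        move = 1 if action == 1 else -1
        if random.random() < 0.1:      # occasional slip in the wrong direction
            move = -move
        self.state = min(max(self.state + move, 0), self.n_states - 1)
        reward = 1.0 if self.state == self.n_states - 1 else 0.0
        return self.state, reward

def fit_model(data):
    """Count-based model: most likely next state for every (s, a) seen so far."""
    counts = defaultdict(lambda: defaultdict(int))
    for s, a, s2, r in data:
        counts[(s, a)][s2] += 1
    return {sa: max(nxt, key=nxt.get) for sa, nxt in counts.items()}

def improve_policy(model, n_actions):
    """Greedy policy inside the model: pick the action whose predicted next
    state is closest to the rewarding end of the chain."""
    def policy(s):
        return max(range(n_actions), key=lambda a: model.get((s, a), s))
    return policy

env = ToyChainEnv()
policy = lambda s: random.randint(0, 1)              # start with random exploration
data = []
for _ in range(3):                                   # the simulate / perform cycle
    state = env.reset()
    for _ in range(500):                             # 1. act in the real environment
        action = policy(state)
        next_state, reward = env.step(action)
        data.append((state, action, next_state, reward))
        state = next_state
    model = fit_model(data)                          # 2. fit a model to the data
    policy = improve_policy(model, env.n_actions)    # 3. improve the policy in the model

print("action chosen in each state:", [policy(s) for s in range(env.n_states)])
```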

In the following blog posts, we will explore the benefits of model-based reinforcement learning, progress in the field, and currently active areas of research. If your interest has been piqued, keep reading!

This post is Part 2 of a few, in which we try to approach what’s called Model-based Reinforcement Learning from a less-mathy perspective.

  1. Part 1: Introduction
  2. Part 2: Model-based RL
  3. Part 3: RL Formalism
