What is Model-Based Reinforcement Learning?
Our monthly analysis on machine learning trends
This post was originally sent as our monthly newsletter about trends in machine learning and artificial intelligence. If you’d like these analyses delivered directly to your inbox, subscribe here!
Machines learn differently than people. For instance, you probably didn’t learn the difference between a positive and a negative movie review by analyzing tens of thousands of labeled examples of each. There is, however, a specific subfield of machine learning that bears a striking resemblance to aspects of how we learn.
Reinforcement learning (RL) is a field that’s been around for a few decades. Lately, it’s been picking up steam thanks to its integration of deep neural networks (deep reinforcement learning) and the newsworthy successes it’s accumulated as a result. At its core though, RL is concerned with how to go about making decisions and taking sequential actions in a specific environment to maximize a reward. Or, to put a more personal spin on it, what steps should you take to get promoted at your job, or to improve your fitness, or to save money to buy a house? We tend to figure out an optimal approach to accomplish goals like these through some degree of trial and error, evolving our strategies based on feedback from our environment.
At a basic level, RL works in much the same way. Of course, backed by computing power, it can explore different strategies (or “policies” in the RL literature) much faster than we can, often with pretty impressive results (especially for simple environments). On the other hand, lacking the prior knowledge that humans bring to new situations and environments, RL approaches also tend to need to explore many more policies than a human would before finding an optimal one.
As reinforcement learning is a broad field, let’s focus on one specific aspect: model-based reinforcement learning. As we’ll see, model-based RL attempts to overcome the issue of a lack of prior knowledge by enabling the agent — whether this agent happens to be a robot in the real world, an avatar in a virtual one, or just a piece of software that takes actions — to construct a functional representation of its environment.
While model-based reinforcement learning may not have clear commercial applications at this stage, its potential impact is enormous. After all, as AI becomes more complex and adaptive — extending beyond a focus on classification and representation toward more human-centered capabilities — model-based RL will almost certainly play an essential role in shaping these frontiers.
“The next big step forward in AI will be systems that actually understand their worlds. The world is only accessed through the lens of experience, so to understand the world means to be able to predict and control your experience, your sense data, with some accuracy and flexibility. In other words, understanding means forming a predictive model of the world and using it to get what you want. This is model-based reinforcement learning.”
Richard S. Sutton
Primary Researcher at the Alberta Machine Intelligence Institute
To Model or Not to Model
“Model” is one of those terms that gets thrown around a lot in machine learning (and in scientific disciplines more generally), often with a relatively vague explanation of what we mean. Fortunately, in reinforcement learning, a model has a very specific meaning: it is the agent’s representation of the environment’s dynamics, capturing how actions move it from state to state and how those states and transitions lead to rewards.
Model-based RL entails constructing such a model. Model-free RL, conversely, forgoes this environmental information and only concerns itself with determining what action to take given a specific state. As a result, model-based RL tends to emphasize planning, whereas model-free RL tends to emphasize learning (that said, a lot of learning also goes on in model-based RL). The distinction between these two approaches can seem a bit abstract, so let’s consider a real-world analogy.
Imagine you’re visiting a city that you’ve never been to before and for whatever reason you don’t have access to a map. You know the general direction from your hotel to the area where most of the sights of interest are, but there are quite a number of different possible routes, some of which lead you through a slightly dangerous neighborhood.
One navigational option is to keep track of all the routes you’ve taken (and the different streets and landmarks that make up these routes) to begin to create a map of the area. This map would be incomplete (it would only rely on where you’d already walked), but would at least allow you to plan a course ahead of time to avoid that neighborhood while still optimizing for the most direct route. You could even spend time back in your hotel room drawing out the different possible itineraries on a sheet of paper and trying to gauge which one seems like the best overall option. You can think of this as a model-based approach.
Another option — especially if you’re the type of person who’s not big on planning — would simply be to keep track of the different locations you’d visited (intersections, parks, and squares for instance) and the actions you took (which way you turned), but ignore the details of the routes themselves. In this case, whenever you found yourself in a location you’d already visited, you could favor the directional choice that led to a good outcome (avoiding the dangerous neighborhood and arriving at your destination more efficiently) over the directions that led to a negative outcome. You wouldn’t specifically know the next location you’d arrive at with each decision, but you would at least have learned a simple procedure for what action to take given a specific location. This is essentially the approach that model-free RL takes.
As it relates to specific RL terms and concepts, we can say that you, the urban navigator, are the agent; that the different locations at which you need to make a directional decision are the states; and that the directions you choose to take from these states are the actions. The rewards (the feedback based on the agent’s actions) would most likely be positive anytime an action both got you closer to your destination and avoided the dangerous neighborhood, zero if you avoided the neighborhood but failed to get closer to your destination, and negative anytime you failed to avoid the neighborhood. The policy is whatever strategy you use to determine what action/direction to take based on your current state/location. Finally, the value is the expected long-term return (the sum of all your current and future rewards) based on your current state and policy.
In general, the core function of RL algorithms is to determine a policy that maximizes this long-term return, though there are a variety of different methods and algorithms to accomplish this. And again, the major difference between model-based and model-free RL is simply that the former incorporates a model of the agent’s environment, specifically one that influences how the agent’s overall policy is determined.
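To make the model-free side of this concrete, here is a minimal sketch of tabular Q-learning on an invented five-state “corridor” version of the city walk. Everything in it (the states, rewards, and hyperparameters) is made up for illustration; it is not code from any particular library.

```python
import random

# An invented 5-state "corridor": start at intersection 0, destination
# at intersection 4, with a small penalty per step to encourage the
# shortest route. All details here are assumptions for illustration.
N_STATES = 5
ACTIONS = [-1, +1]          # step "left" or "right" along the corridor
GOAL = 4

def env_step(state, action):
    nxt = min(max(state + action, 0), N_STATES - 1)
    reward = 1.0 if nxt == GOAL else -0.1
    return nxt, reward, nxt == GOAL

def train(episodes=500, alpha=0.5, gamma=0.9, eps=0.1, seed=0):
    rng = random.Random(seed)
    # One learned value per (state, action) pair -- no map of the city.
    q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
    for _ in range(episodes):
        s, done = 0, False
        while not done:
            # Epsilon-greedy: mostly exploit, occasionally explore.
            if rng.random() < eps:
                a = rng.choice(ACTIONS)
            else:
                a = max(ACTIONS, key=lambda act: q[(s, act)])
            nxt, r, done = env_step(s, a)
            best_next = max(q[(nxt, act)] for act in ACTIONS)
            q[(s, a)] += alpha * (r + gamma * best_next - q[(s, a)])
            s = nxt
    return q

q = train()
# The greedy policy: a "what to do at each location" lookup table.
policy = {s: max(ACTIONS, key=lambda act: q[(s, act)]) for s in range(N_STATES)}
```

Note what the agent ends up with: a state-to-action lookup, but no map. It cannot say which intersection a given turn leads to, only which turn has historically paid off.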
A Modest Comparison
So what are the pros and cons of the model-based vs. the model-free approach? Model-based RL has a lot going for it. For one thing, it tends to have higher sample efficiency than model-free RL, meaning it requires less data to learn a policy. In other words, by leveraging the information it’s learned about its environment, model-based RL can plan rather than just react, even simulating sequences of actions without having to directly perform them in the actual environment.
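As a sketch of what “planning rather than reacting” can look like, the toy agent below first fits a transition-and-reward model from a batch of logged random experience in an invented five-state corridor world, then runs value iteration purely on that learned model; no further steps are taken in the real environment. All names and numbers here are assumptions for this example.

```python
import random

# Invented toy problem: a 5-state corridor with the goal at state 4.
N_STATES, ACTIONS, GOAL = 5, [-1, +1], 4
GAMMA = 0.9

def env_step(s, a):                      # true dynamics, unknown to the agent
    nxt = min(max(s + a, 0), N_STATES - 1)
    return nxt, (1.0 if nxt == GOAL else -0.1)

# 1. "Experience": a batch of random transitions, as if logged earlier.
rng = random.Random(0)
model = {}                               # (state, action) -> (next_state, reward)
for _ in range(200):
    s, a = rng.randrange(N_STATES), rng.choice(ACTIONS)
    model[(s, a)] = env_step(s, a)       # deterministic world: one sample suffices

# 2. Planning: value iteration run entirely on the learned model.
V = [0.0] * N_STATES
for _ in range(100):
    for s in range(N_STATES):
        if s == GOAL:
            continue                     # terminal state keeps value 0
        V[s] = max(r + GAMMA * V[nxt]
                   for nxt, r in (model[(s, a)] for a in ACTIONS))

# Extract the plan implied by the learned model (terminal state excluded).
policy = {s: max(ACTIONS, key=lambda a: model[(s, a)][1] + GAMMA * V[model[(s, a)][0]])
          for s in range(N_STATES - 1)}
```

The sample-efficiency point shows up in step 2: once the model is fitted, the agent can evaluate as many simulated action sequences as it likes without spending another real environment interaction.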
A related benefit is that by virtue of the modeling process, model-based RL has the potential to be transferable to other goals and tasks. While learning a single policy is good for one task, if you can predict the dynamics of the environment, you can generalize those insights to multiple tasks. Finally, having a model means you can determine some degree of model uncertainty, so that you can gauge how confident you should be about the resulting decision process.
Moving to the cons of model-based RL (or the pros of model-free RL), one of the biggest ones is simply that by having to learn a policy (the overall strategy to maximize the reward) as well as a model, you’re compounding the degree of potential error. In other words, there are two different sources of approximation error in model-based RL, whereas in model-free RL there’s only one. For similar reasons, model-based approaches tend to be far more computationally demanding than model-free ones, which by definition simplify the learning process.
It’s worth noting that this doesn’t necessarily need to be a binary decision. Some of the most effective recent approaches have combined model-based and model-free strategies. Perhaps this isn’t so surprising given the evidence that, as one paper states, “the [human] brain employs both model-free and model-based decision-making strategies in parallel, with each dominating in different circumstances.”
Recent Research and Frameworks
A few years ago, OpenAI released an open source RL toolkit — OpenAI Gym — that provides a variety of flexible environments for algorithm experimentation and development. A team at Google has now followed suit, releasing its own TensorFlow-based framework this past month. Beyond just aiding experimentation, the Google team has also stated its explicit goal of making RL research more reproducible (and, by extension, more accountable to the larger community). Rather than providing an environment, the TensorFlow framework instead offers a codebase of agents (four in all) as well as their training data, giving users — even those with limited RL experience — a convenient entry point into experimental research.
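Gym environments are driven through a small reset()/step() loop. So that the sketch below runs without the gym package installed, it uses a hand-rolled, invented toy environment that mimics that interface; only the interface shape (observation, reward, done, info) reflects Gym’s classic API.

```python
import random

# Invented toy environment mimicking the classic Gym reset()/step()
# interface. The coin-flip task itself is made up purely to show the
# agent-environment interaction loop.
class CoinFlipEnv:
    """Reward +1 for correctly guessing a biased coin, for `horizon` flips."""
    def __init__(self, p_heads=0.7, horizon=10, seed=0):
        self.p_heads, self.horizon = p_heads, horizon
        self.rng = random.Random(seed)
        self.t = 0

    def reset(self):
        self.t = 0
        return 0                          # dummy observation

    def step(self, action):               # action: 1 = guess heads, 0 = tails
        flip = 1 if self.rng.random() < self.p_heads else 0
        self.t += 1
        reward = 1.0 if action == flip else 0.0
        done = self.t >= self.horizon
        return 0, reward, done, {}        # obs, reward, done, info

# The canonical interaction loop.
env = CoinFlipEnv()
obs, done, total = env.reset(), False, 0.0
while not done:
    obs, reward, done, info = env.step(1)  # fixed policy: always guess heads
    total += reward
```

Because every Gym-style environment exposes this same loop, the same agent code can be pointed at anything from CartPole to an Atari game with minimal changes.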
Speaking of research, a number of recent papers have demonstrated the practical and theoretical possibilities of model-based RL. One of the takeaways from these papers is just how broad the current applications of RL are. They include everything from robotics to natural language processing to a seemingly endless range of simulated environments.
This broad applicability certainly isn’t limited solely to model-based approaches. Still, on a more philosophical level, what makes model-based RL so compelling is partly the fact that it more faithfully mirrors the procedures we use to learn. We take it for granted, after all, that knowing more about our environment and its likely responses to our actions is useful for making good decisions. In a sense, model-based RL has simply figured out a way to mathematically formalize this basic human insight.
What This Means For You
Model-based RL isn’t quite ready for primetime in production contexts just yet. Still, it’s already become integral for developing algorithms that can handle complex sequences of events in both real and simulated environments. This is one of those areas of machine learning that’s worth keeping track of in the coming years: when model-based RL finally breaks into commercial settings, it’s going to make some serious waves.