Model-based Reinforcement Learning Part 3: RL Formalism

Bhairav Mehta
7 min read · Dec 5, 2017


Before we can get to model-based reinforcement learning, we will need to formalize some reinforcement learning concepts in mathematics.

Reinforcement Learning, Formally

Reinforcement learning is often used to solve Markov Decision Processes (MDPs): problems in environments that satisfy the Markov Property, which we defined in Part One as:

Problems that can be formulated as having the next state dependent only on the current state and current action.

At every timestep in the environment, we are posed with a decision: given our state, which action should we take? After taking an action, we receive a reward and a next state. Each timestep therefore gives us a piece of experience, which can be thought of as a tuple.

Experience tuple: $(s_t, a_t, r_t, s_{t+1})$

Now, assuming that we have a lot of these experience tuples, we want to achieve a simple goal: maximize the expected reward over time. Mathematically, this is written as follows.

Goal of reinforcement learning, formally: $\max \; \mathbb{E}\left[ \sum_{t=0}^{\infty} \gamma^t \, R(s_t, a_t) \right]$

Here, we are summing over all timesteps, from the initial time to infinity. Inside the summation, gamma is the discount factor (if we set it equal to one, that term disappears, which is exactly what we will do!), and R is the reward function, which formalizes the question from the previous paragraph: what reward will I get if I take this action in this state? Our goal for the MDP is to maximize this reward over time, and now that we have introduced experience tuples, we can learn how to do just that.
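To make the goal concrete, here is a minimal Python sketch; the names `Experience`, `discounted_return`, and the toy trajectory are my own illustrations, not from the original post. It simply stores experience tuples and computes the discounted sum of rewards we are trying to maximize.

```python
from collections import namedtuple

# One step of experience: what we saw, what we did, what we got, where we ended up.
Experience = namedtuple("Experience", ["state", "action", "reward", "next_state"])

def discounted_return(experiences, gamma=1.0):
    """Sum of gamma^t * r_t over a trajectory of experience tuples."""
    return sum(gamma ** t * e.reward for t, e in enumerate(experiences))

# A toy three-step trajectory.
trajectory = [
    Experience(state=0, action=1, reward=0.0, next_state=1),
    Experience(state=1, action=0, reward=0.0, next_state=2),
    Experience(state=2, action=1, reward=1.0, next_state=3),
]
print(discounted_return(trajectory, gamma=1.0))  # 1.0
```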

Q Learning

In order to maximize this reward, we will first look at a popular algorithm called Q-Learning, a reinforcement learning algorithm that learns the action-value function: the value of taking a particular action in a given state. Our Q function has two inputs, a state and an action we would like to evaluate, and it returns the real-valued expected return of taking that action (and of all the actions after it; we'll see how this is incorporated below). Intuitively, it tells us how good a particular action is in a particular state.

Q-value function: given a particular state, it tells how good a particular action is in terms of reward. The Q-Learning update rule (from Wikipedia: https://en.wikipedia.org/wiki/Q-learning) is:

$Q(s_t, a_t) \leftarrow (1 - \alpha)\, Q(s_t, a_t) + \alpha \left( r_t + \gamma \max_{a} Q(s_{t+1}, a) \right)$

We can learn a Q-function by starting with the same Q-value for every action in every state. As we explore our environment, we update our Q estimate using the equation above, which is a weighted average of our old estimate and a new sample: the reward we get by taking the action plus our estimate of the optimal future value (this is how all subsequent actions get incorporated). Using this simple formula, we can operate in our environment, gain experience, and update our estimates as we go.
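As a rough sketch of this update (the dictionary-backed table and the name `q_update` are my own choices, not from the post), one step of tabular Q-Learning might look like this in Python:

```python
from collections import defaultdict

# Q-table: maps (state, action) pairs to estimated values; every entry starts at 0.
Q = defaultdict(float)

def q_update(Q, state, action, reward, next_state, actions, alpha=0.1, gamma=1.0):
    """One Q-Learning update: a weighted average of the old estimate and the
    new sample (immediate reward plus the best estimated future value)."""
    best_next = max(Q[(next_state, a)] for a in actions)
    Q[(state, action)] = (1 - alpha) * Q[(state, action)] + alpha * (reward + gamma * best_next)

# Example: two actions, one observed transition.
actions = [0, 1]
q_update(Q, state=0, action=1, reward=1.0, next_state=1, actions=actions)
print(Q[(0, 1)])  # 0.1
```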

But how can we pick actions? We can use our Q-values!

By picking our actions according to a simple formula most of the time, we can actually learn good estimates of each action in each state. Our simple formula can be written as:

$a_t = \arg\max_{a} Q(s_t, a)$, which means: pick the best action in any state according to our current estimate.

But there's a problem: what if our estimates are wrong? If we only pick the best action, we will never see anything else and never know whether that action truly was the best, or in reinforcement learning terms:

Our agent is not able to explore with a greedy policy.

To fix our exploration problem, we introduce a small, user-set number called epsilon, which denotes the probability of taking a random action in any state. This means that in any state, with probability one minus epsilon we take the best estimated action, and with probability epsilon we pick an action at random.
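Putting the greedy rule and epsilon together gives epsilon-greedy action selection; here is a minimal sketch (the function name `epsilon_greedy` and the toy Q-table are illustrative, not from the post):

```python
import random

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    """With probability epsilon pick a random action; otherwise pick the action
    with the highest current Q estimate in this state."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q.get((state, a), 0.0))

# Example with a tiny hand-filled Q-table.
Q = {(0, 0): 0.2, (0, 1): 0.5}
print(epsilon_greedy(Q, state=0, actions=[0, 1], epsilon=0.1))  # usually 1
```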

What we have just defined is an implicit policy, or a policy that is derived from a state-value or action-value function. Remember, a policy is a mapping from states to actions: basically, a function that tells us how to act in each state. Before we continue on to model-based reinforcement learning, we are going to touch on another way of defining a policy: by learning it.

Policy Learning

As we saw, a policy helps us figure out what actions to take in what states. Using some math called policy gradients, we can use our experience to directly optimize our policy. Policy learning methods can have better convergence properties than their implicit-policy counterparts, and they are effective in high-dimensional or continuous action spaces (hence their widespread use in robotics). A policy can be any function, and like any function, it can be represented (or parameterized) by a set of parameters. In our examples, the parameters will be called theta, and the function will be a neural network.

If that doesn’t make too much sense, take a look at this link. And then, this one.

Our policy will tell us what action to take in a state or, more precisely, the probabilities associated with each action for a given input state.

Policy $\pi_\theta(a \mid s)$: the policy pi, parameterized by theta

Our goal in policy learning is, given a policy parameterized by theta, to find the best theta. "Best theta" is a vague phrase, but we can make it concrete: we formulate policy search as an optimization problem and then use gradient ascent to maximize an objective.

Formally, we define a policy objective function, J, which is the expected return given by following a policy pi parameterized by theta.

Policy objective: $J(\theta) = \sum_{s} d^{\pi_\theta}(s) \sum_{a} \pi_\theta(a \mid s) \, R(s, a)$

The first term is the stationary distribution of states, or the distribution over states we encounter when following the policy in the environment. The second term is the policy, and the third term is the reward.

The policy gradient, seen below, maximizes this function by searching for a local maximum and ascending the gradient with respect to the parameters.

$\Delta\theta = \alpha \, \nabla_\theta J(\theta)$ (from David Silver's lecture on the subject)
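To give a flavor of what ascending this gradient looks like in code, here is a hedged, REINFORCE-style sketch with a linear softmax policy; the feature sizes, the toy episode, and names like `policy_probs` and `reinforce_update` are placeholders I chose for illustration, not the post's method:

```python
import numpy as np

n_features, n_actions = 4, 2
theta = np.zeros((n_features, n_actions))   # policy parameters

def policy_probs(theta, state):
    """Softmax over linear action scores: pi_theta(a | s)."""
    logits = state @ theta
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

def reinforce_update(theta, episode, alpha=0.01, gamma=1.0):
    """One REINFORCE step: for each visited (s, a), ascend
    grad log pi_theta(a | s) scaled by the return that followed."""
    for t, (state, action, _) in enumerate(episode):
        G = sum(gamma ** (k - t) * r for k, (_, _, r) in enumerate(episode) if k >= t)
        probs = policy_probs(theta, state)
        grad_log = -np.outer(state, probs)   # d log pi / d theta for all actions...
        grad_log[:, action] += state         # ...plus the extra term for the taken action
        theta += alpha * G * grad_log
    return theta

# One fake episode of (state_features, action, reward) triples.
episode = [(np.array([1.0, 0.0, 0.0, 0.0]), 1, 1.0)]
theta = reinforce_update(theta, episode)
print(policy_probs(theta, np.array([1.0, 0.0, 0.0, 0.0])))  # action 1 now slightly preferred
```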

Well, What’s the Problem?

It seems like the problem should be solved; both Q-Learning and Policy Learning offer ways to find a good policy in any environment, so why do we need model-based reinforcement learning?

The problem stems from the update equations of both learning methods. Among other things, both methods are extremely sample-inefficient and take a long time to converge to a good policy. On a physical robot, this could mean exploring with random actions for quite some time before any meaningful reward is achieved from which we can start learning. On real systems, this kind of trial and error can be expensive, and is almost never tolerated.

Maybe an untrained neural network, although it does seem to know what it wants to do…

Model-based reinforcement learning offers the ability to model the transition function between states. This way, we can use the model internally to estimate the return of taking actions and optimize our policy using simulated experience. Then, we can run a better (and safer) policy (or, oftentimes, another experience gatherer; we'll see what this means in Part 4) to gather more experience tuples, better model our environment, and simulate experience to optimize our policy further.
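In skeleton form, the loop described above might be wired up like this; every callable here (`gather_experience`, `fit_model`, `simulate_experience`, `improve_policy`) is a stand-in for whatever learner you plug in, not an API from the post:

```python
def model_based_rl_loop(gather_experience, fit_model, simulate_experience,
                        improve_policy, policy, n_iterations=10):
    """Alternate between gathering real experience with the current policy,
    fitting a model to it, and improving the policy on simulated experience."""
    real_experience = []
    for _ in range(n_iterations):
        real_experience += gather_experience(policy)     # run the (safer) current policy
        model = fit_model(real_experience)               # learn the transition model
        simulated = simulate_experience(model, policy)   # cheap rollouts inside the model
        policy = improve_policy(policy, simulated)       # e.g. Q-Learning or policy gradients
    return policy

# The callables are stubbed out here just to show the wiring.
policy = model_based_rl_loop(
    gather_experience=lambda pi: [("s", "a", 0.0, "s'")],
    fit_model=lambda data: {"dummy": "model"},
    simulate_experience=lambda model, pi: [("s", "a", 0.0, "s'")],
    improve_policy=lambda pi, sim: pi,
    policy="initial-policy",
)
```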

Model Based Reinforcement Learning

Now that we see the benefits of model-based reinforcement learning, how can we actually do it? Luckily, we still have our experience tuples, which makes a lot of things much easier. Note that our example in this post concerns the tabular case, but we will soon see how these ideas generalize to problems where tabular solutions cannot be applied.

If we want to use model-based Q-Learning, we can do so pretty easily. In our tabular setting, we can rewrite our Q-Learning problem like so:

$Q(s, a) = R(s, a) + \gamma \sum_{s'} P(s' \mid s, a) \, \max_{a'} Q(s', a')$

Here, the probability term inside the summation acts as our model: the probability of moving to each possible next state after taking an action in a starting state.

If we set gamma to one again, we can see that the Q-value of a state-action pair is the reward we'd get by taking that action in that state, plus the max Q-value of each possible next state weighted by the probability of actually getting there. In this formulation, we can learn a model pretty easily using a few update rules, similar to the ones we defined in the Q-Learning section.

For the state actually transitioned to: $\hat{P}(s' \mid s, a) \leftarrow (1 - \alpha)\, \hat{P}(s' \mid s, a) + \alpha$
For any other state: $\hat{P}(s' \mid s, a) \leftarrow (1 - \alpha)\, \hat{P}(s' \mid s, a)$
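Under the assumption that the updates above amount to running averages of observed transitions, here is a minimal tabular sketch (the class and function names are mine, not the post's). It estimates $P(s' \mid s, a)$ and $R(s, a)$ from counts, which converges to the same estimates as the running-average updates, and then backs Q-values up through the learned model:

```python
from collections import defaultdict

class TabularModel:
    """Counts transitions and rewards to estimate P(s'|s,a) and R(s,a)."""
    def __init__(self):
        self.transition_counts = defaultdict(lambda: defaultdict(int))
        self.reward_sums = defaultdict(float)
        self.visits = defaultdict(int)

    def update(self, s, a, r, s_next):
        self.transition_counts[(s, a)][s_next] += 1
        self.reward_sums[(s, a)] += r
        self.visits[(s, a)] += 1

    def p(self, s, a, s_next):
        """Estimated probability of landing in s_next after taking a in s."""
        if self.visits[(s, a)] == 0:
            return 0.0
        return self.transition_counts[(s, a)][s_next] / self.visits[(s, a)]

    def r(self, s, a):
        """Estimated immediate reward for (s, a)."""
        if self.visits[(s, a)] == 0:
            return 0.0
        return self.reward_sums[(s, a)] / self.visits[(s, a)]

def backup_q(model, states, actions, gamma=1.0, sweeps=50):
    """Repeatedly apply Q(s,a) = R(s,a) + gamma * sum_s' P(s'|s,a) max_a' Q(s',a')."""
    Q = {(s, a): 0.0 for s in states for a in actions}
    for _ in range(sweeps):
        for s in states:
            for a in actions:
                Q[(s, a)] = model.r(s, a) + gamma * sum(
                    model.p(s, a, s2) * max(Q[(s2, a2)] for a2 in actions)
                    for s2 in states
                )
    return Q

# Toy usage: one deterministic transition 0 -(action 1)-> 1 with reward 1.
model = TabularModel()
model.update(0, 1, 1.0, 1)
Q = backup_q(model, states=[0, 1], actions=[0, 1], gamma=0.9)
print(Q[(0, 1)])  # 1.0, since state 1 yields no further modeled reward
```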

Here, given enough experience, our model-based learner and the model-free learner we defined in the Q-Learning section will converge to the same answer.

If the value of a model is not necessarily clear from this simple example, keep reading. In the next post, we will learn about some model-based reinforcement learning algorithms that are applied to physical robots, where mistakes can be costly.

This post is Part 3 of a series in which we try to approach what's called model-based reinforcement learning from a less math-heavy perspective.
