Not Another RL Tutorial!

Part 2: Chains, Rewards, and Decision Processes, Oh My!

Luc Gendrot
Sep 9, 2018 · 22 min read

In the last post we took a whirlwind tour of RL and all its myriad parts. In this post we’ll begin to describe how some of those parts are used to form the formal framework we’ll use from here on out: the Markov Decision Process.

Again I should note that this series of posts is based almost entirely on the lecture series by David Silver, and all images have been graciously (and shamelessly) stolen from his lecture slides.

All the code used in this post can be found in the Github repo.

Building up to MDPs

The MDP, or Markov Decision Process, is the main formalism used in Reinforcement Learning. That is, almost all RL problems can be represented as MDPs, and the properties of an MDP allow us to reason precisely about RL problems.

Specifically, MDPs allow us to reason about the things that make a good agent: How to decide between several possible actions, how to decide which state we’d rather be in next, and how to find the best possible way to behave in all states.

For now we will only consider fully observable MDPs. As discussed in Part 1, that means the agent state is the environment state, so the agent knows all there is to know about the environment.

As David describes fully observable MDPs:

“One way to think of [fully observable MDPs] is that the current state that is given to the agent completely characterizes [the environment]. So the way in which the environment unfolds depends on some state, and we are told that state.”

— David Silver, Lecture 2

Considering only fully observable MDPs may seem quite limiting, since in the real world the agent will very seldom have access to the full environment state. However, partially observable problems (POMDPs) can be converted into MDPs, for example by treating the agent's history or a belief over states as the state itself, so they can be handled in largely the same way as the fully observable case.

In order to build up a better intuition of the math behind MDPs we’ll first start with simpler versions of MDPs and keep adding complexity until we arrive at a full description of an MDP. Starting with Markov Chains, then Markov Reward Processes, and finally Markov Decision Processes.

Markov Chains

As a brief refresher: the central idea underpinning all types of Markov processes is the previously discussed Markov property. The Markov property states:
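
In symbols, using the same notation as the lecture slides:

P[ S_{t+1} | S_t ] = P[ S_{t+1} | S_1, …, S_t ]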

Essentially: the next state depends only on the current state, not on all of the states that came before it. We don't need to know the full history of states to know what will happen next, just the current one.

So the probability of transitioning into any next state s’ in a Markov process is dependent only on the current state s. We call this the state transition probability.

Probability of transitioning from state S into State S’
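
Written out, that probability is:

P_{ss'} = P[ S_{t+1} = s' | S_t = s ]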

If we gather up the state transition probabilities for all possible states into a square matrix, we get what’s known as the state transition matrix:

For example: the position (1, 3) (or row 1 column 3) corresponds to the probability that the agent will transition from state 1 to state 3.

Where each entry corresponds to the probability of transitioning from the current state (row) to a successor state (column). Since a single row describes all the probabilities for a single current state, each row should sum to 1.

This transition matrix is actually all we need to describe a Markov Chain, the first step on our journey to full Markov Decision Processes:

A Markov Chain has no rewards or actions yet, it simply defines a random process that generates a set of states, each with the Markov property.

With the state transition matrix, we can implement a very simple Markov chain in Python, based on the student example from David Silver's lectures. With this implementation we can generate random sequences of states, sampled according to the state transition matrix.
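
As a rough sketch of what that implementation might look like (the real version lives in the repo; the class name, constructor, and `sample_pretty` method here are just illustrative stand-ins that mirror the usage below):

import numpy as np

class MarkovChain:
    """A Markov chain defined by named states and a state transition matrix."""
    def __init__(self, state_names, p_matrix, terminal_states):
        self.state_names = state_names
        self.P = np.array(p_matrix)
        self.terminal_states = set(terminal_states)

    def sample_pretty(self, start_name):
        """Sample a trajectory of state names until a terminal state is reached."""
        state = self.state_names.index(start_name)
        trajectory = [start_name]
        while self.state_names[state] not in self.terminal_states:
            state = np.random.choice(len(self.state_names), p=self.P[state])
            trajectory.append(self.state_names[state])
        return trajectory

# The student Markov chain from the lecture
state_names = ["C1", "C2", "C3", "Pass", "Pub", "FB", "Sleep"]
p_matrix = [[0, .5, 0, 0, 0, .5, 0],
            [0, 0, .8, 0, 0, 0, .2],
            [0, 0, 0, .6, .4, 0, 0],
            [0, 0, 0, 0, 0, 0, 1],
            [.2, .4, .4, 0, 0, 0, 0],
            [.1, 0, 0, 0, 0, .9, 0],
            [0, 0, 0, 0, 0, 0, 0]]  # "Sleep" is terminal, so its row is never sampled
chain = MarkovChain(state_names, p_matrix, terminal_states=["Sleep"])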

>>> chain.sample_pretty("C1")
['C1', 'C2', 'C3', 'Pub', 'C1', 'C2', 'Sleep']
>>> chain.sample_pretty("C1")
['C1', 'C2', 'C3', 'Pass', 'Sleep']
>>> chain.sample_pretty("C1")
['C1', 'C2', 'C3', 'Pub', 'C2', 'C3', 'Pub', 'C3', 'Pass', 'Sleep']
>>> chain.sample_pretty("C1")
['C1', 'FB', 'FB', 'FB', 'FB', 'FB', 'FB', 'FB', 'FB', 'FB', 'FB', 'C1', 'FB', 'FB', 'FB', 'FB', 'FB', 'FB', 'FB', 'FB', 'FB', 'FB', 'C1', 'FB', 'C1', 'C2', 'C3', 'Pub', 'C2', 'C3', 'Pass', 'Sleep']

Remember that the agent isn’t capable of making decisions yet, it is totally at the mercy of chance. Building on the Markov chain towards an agent that can make decisions first requires the inclusion of rewards. As we’ll see, these rewards ultimately allow us to grade the “goodness” of states.

Markov Reward Processes

A Markov reward process is exactly like a Markov chain, except each step taken generates a reward.

Our Markov chain must now be augmented to include rewards for each step.

Rewards are given every time the agent leaves a state. And now that rewards are included, trajectories we sample will have an associated return value.
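
For reference, the return G_t is the total discounted reward collected from time step t onward:

G_t = R_{t+1} + γ R_{t+2} + γ² R_{t+3} + … = Σ_{k=0}^{∞} γ^k R_{t+k+1}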

Note that the return value is discounted according to a variable gamma between 0 and 1. Gamma determines how important future rewards are to the agent.

A gamma close to 1 makes the agent far-sighted, valuing future rewards almost as much as immediate ones, while a gamma close to 0 makes it myopic, caring almost exclusively about the immediate reward. We'll use a gamma value of 1 for the rest of this post, to make the math convenient.

If we add rewards to our previous implementation of Markov chains, we get a Markov reward process. It generates trajectories just as before, but each trajectory now has a return value determined by gamma.
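
A minimal sketch of the return calculation (shown here as a standalone function rather than the `G` method used below; the per-state rewards are the ones from the student example: -2 for each class, +10 for passing, +1 for the pub, -1 for Facebook, and 0 for sleeping):

# Reward received when leaving each state of the student MRP
rewards = {"C1": -2, "C2": -2, "C3": -2, "Pass": 10, "Pub": 1, "FB": -1, "Sleep": 0}

def G(path, gamma=1.0):
    """Discounted return of a sampled trajectory of state names."""
    return sum(rewards[state] * gamma**k for k, state in enumerate(path))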

path = chain.sample("C1")
print(path)
>>> ["C1", "C2", "C3", "Pass", "Sleep"]
chain.G(path, gamma=1)
>>> 4
chain.G(path, gamma=.9)
>>> 1.87

Since the return is a random value that varies from sample to sample, it makes sense to ask what the expected return is for a state. Or rather, what is the average return we can expect to get from this state forward?

This expectation over future rewards has a special name: the value function.
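
For an MRP it is defined as the expected return starting from state s:

v(s) = E[ G_t | S_t = s ]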

Intuitively, it is preferable to be in a state that has a higher state value, because the agent expects to get more reward from that point onward compared to a state with a lower value.

Thanks to the law of large numbers, a rough estimate of the true value function for a state can be achieved by sampling many trajectories starting from that state, and averaging the sampled returns:

rewards = []
for i in range(0, 20000):
    # Start in "Class 3"
    path = chain.sample(2)
    reward = chain.G(path, 1)
    rewards.append(reward)

# Estimate of the expected reward for "Class 3"
print('{0:.1f}'.format(np.mean(rewards)))
>>> 4.3

Sampling 20,000 trajectories from “Class 3” with a gamma of 1 takes about 5 seconds on my 2012 MacBook Pro. The resulting average return for “Class 3” is ~4.3, and the following graph, which shows the true state-values, confirms that this is an accurate estimate. For now simply trust that these are the correct values, but know that much of the rest of the series will be devoted to methods for computing them.

The Bellman Equation

Discerning readers might notice that the expanded equation for the value function provided above can be represented in terms of itself:

Since gamma is a constant, a factor of it can be pulled out of every term that comes after the immediate reward:

Recognize the expression in parentheses? It's the return at the next time step, t+1.

And finally, thanks to the law of iterated expectations (don’t worry, I only vaguely understand this last step myself, just take my word for it):
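
For anyone following along without the slides, the chain of steps being described is:

v(s) = E[ G_t | S_t = s ]
     = E[ R_{t+1} + γ R_{t+2} + γ² R_{t+3} + … | S_t = s ]
     = E[ R_{t+1} + γ (R_{t+2} + γ R_{t+3} + …) | S_t = s ]
     = E[ R_{t+1} + γ G_{t+1} | S_t = s ]
     = E[ R_{t+1} + γ v(S_{t+1}) | S_t = s ]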

In other words, the value of a state s is the expectation over the immediate reward plus the discounted expected future reward.

Which makes sense, right? It’s sort of trivial when you spell it out, but the amount of reward we’ll receive from our current state is simply the reward we receive right now plus the expected reward from the next state.

But how do we get that expected value of the next state, v(St+1)?

With our pre-solved state-values from above and the transition probabilities from our state transition matrix, it's easy to calculate the expected value of the next state as the sum of all state-values weighted by their probability of occurring.

Immediate reward + Expected reward in the next state
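
Written with the transition probabilities made explicit, the sum looks like:

v(s) = R_s + γ Σ_{s' ∈ S} P_{ss'} v(s')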

Of course many of the values in this sum will be zero, because you can only transition to some states at each step.

Notice that the immediate reward is no longer expressed as an expectation: for a given state it's a fixed quantity, and the expectation of a constant is just that constant, so we can pull it out of the expectation.

This formula — which expresses the state-value in terms of itself — is known as the Bellman equation.

The structure of the Bellman equation can be concisely represented with a simple 1-step look-ahead diagram:

Nodes b1 and b2 are possible next states reachable from node a

In essence we’re taking the immediate reward r and then averaging over the values of all possible successor states.

As a sort of sanity-check, we can use our pre-solved values from above and check that they’re self-consistent with the Bellman Equation.

Perhaps unsurprisingly, it works out.

Bellman Equation solved for “Class 3”
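
Working it through with the numbers from the diagram: leaving “Class 3” earns -2, and from there the agent moves to “Pass” (value 10) with probability 0.6 or to “Pub” (value 0.8) with probability 0.4, so

v(C3) = -2 + (0.6 × 10 + 0.4 × 0.8) = -2 + 6.32 ≈ 4.3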

Solving an MRP Analytically

I alluded earlier to there being different ways of solving for the “true” state-values. One such way is to solve the Bellman Equation directly.

Conceptually this solution requires some limited knowledge of linear algebra, but it should be noted that it is totally infeasible for large problems due to its computational complexity (O(n³) where n is the number of states).

First, vectorize the Bellman Equation:

And then do a bit of algebra:

And voila, the solved value function v.
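
Spelled out, those steps are:

v = R + γ P v
v − γ P v = R
(I − γ P) v = R
v = (I − γ P)^{-1} R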

We can implement the above equation in python to get the true values for our student MRP:

import numpy as np

state_names = ["C1", "C2", "C3", "Pass", "Pub", "FB", "Sleep"]
_rewards = [-2, -2, -2, 10, 1, -1, 0]
p_matrix = [[0, .5, 0, 0, 0, .5, 0],
            [0, 0, .8, 0, 0, 0, .2],
            [0, 0, 0, .6, .4, 0, 0],
            [0, 0, 0, 0, 0, 0, 1],
            [.2, .4, .4, 0, 0, 0, 0],
            [.1, 0, 0, 0, 0, .9, 0],
            [0, 0, 0, 0, 0, 0, 0]]
gamma = 1

R = np.array(_rewards)
P = np.matrix(p_matrix)
I = np.identity(len(p_matrix))

solution = np.dot(np.linalg.inv((I - gamma * P)), R)
solution = solution.tolist()[0]

for state in range(len(state_names)):
    print(state_names[state], solution[state])
>>> C1 -12.543209876543214
>>> C2 1.4567901234567908
>>> C3 4.320987654320986
>>> Pass 10.0
>>> Pub 0.8024691358024683
>>> FB -22.543209876543223
>>> Sleep 0.0

So while the values shown in the diagram were rounded, the numbers here confirm that they are indeed accurate. Try plugging these exact numbers into the Bellman Equation just like before to prove to yourself that they are self-consistent.

It’s like magic, except it’s math.

The only problem is that most real-world problems are far too large to solve this way before the heat death of the universe. Later on we'll look at some iterative solutions to the Bellman Equation that scale much more realistically.

Markov Decision Processes

Now, after much ado we arrive at the final iteration of Markov processes (at least for our purposes), and the primary formalism used in reinforcement learning: The Markov Decision Process (MDP).

An MDP differs fundamentally in just one way from MRPs: It introduces actions.

The inclusion of actions means that the transition probabilities as well as the rewards must now be conditioned on a chosen action. We can express this in the same way as before, except there is now a separate state transition matrix and reward for every action.

Pub is no longer a state, but an action that introduces stochasticity. “Pass” is no longer needed either, as we get rewards for taking actions, not just leaving states.

For this particular MDP, every action save for “Pub” is deterministic. If you choose to study while in “Class 1”, you will transition from “Class 1” to “Class 2” with 100% certainty.

In the case of the “Pub” action, the agent no longer enters into the “Pub” state, but instead transitions with some probability into one of the 3 “Class” states.

Why reduce the “Pub” state to an action instead? To show that actions can have both deterministic and stochastic effects on where the agent ends up in the environment. As one student points out in the lecture: you can imagine a similar black dot in between each of the other state transitions but with a probability of 1 instead of 0.4 or 0.2.

Notice that we also don't need the “Pass” state anymore. Whereas in the MC and MRP examples we only got rewards for exiting states, we now get rewards after taking actions, so we can get the +10 reward for studying in Class 3 without needing to transition into a “Pass” state that immediately takes us to “Sleep”.

Policies

Whereas previously fate alone determined how the agent traversed the environment, giving our agent real agency requires it to choose between different actions. We need a function that tells the agent what action to take depending on the state. We call this function the agent's policy.

A policy takes in a state and maps it to an action (or rather a probability distribution over actions).
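
Formally, a policy π is a distribution over actions given states:

π(a | s) = P[ A_t = a | S_t = s ]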

Given the agent’s current state, the policy outputs a choice of action in the form of a probability distribution from which to sample. Of course, the policy can be made deterministic by simply giving a 100% probability to the action it thinks is best, but we’ll see in the future how stochastic policies can be useful.

Next up we’ll see how choosing a policy affects the value function, and how calculating the value function under different policies allows us to compare policies and ultimately choose better ones.

But first, a quick backtrack into the world of MRPs.

Sidebar: Converting MDPs to MRPs

An interesting consequence of having a defined policy is that we can use the policy to reformulate any MDP into an MRP, which we already know how to solve analytically.

Imagine we have a policy that chooses all actions randomly with equal probability. In the Student MDP, this just means that no matter what state the agent is in, each of its two available actions has a 50% probability of being chosen.

We can convert our MDP under this uniform-random policy into an MRP by changing our action-dependent rewards and transition probabilities to be action-independent.

The rewards in the MDP are conditioned on which action is taken, so we need to average over all actions from a single state to get a single action-independent reward value.

The MRP reward for leaving a particular state under a given policy

That is, the reward for leaving a particular state in our Student MDP-turned-MRP becomes the expected reward from following our policy.

Likewise, the transition probabilities need to be given the same treatment:

The MRP transition probabilities for each state
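
In symbols, the two averaging steps are (the sums run over the actions a available in state s):

R_s^π = Σ_a π(a | s) R_s^a        P_{ss'}^π = Σ_a π(a | s) P_{ss'}^a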

Applying these formulas to every state in our Student MDP, and using the same MRP implementation from before, we get:

state_names = ["C1", "C2", "C3", "FB", "Sleep"]

# Probabilities changed to reflect the uniform random policy.
# Notice the Class 3 row reflects the possible "Pub" choice:
# the probability of picking Pub (.5) AND then being sent back
# to Class 1 (.2) gives a total probability of (.5 * .2) = .1
# of ending up back in C1 from C3.
p_matrix = [[0, .5, 0, .5, 0],
            [0, 0, .5, 0, .5],
            [.1, .2, .2, 0, .5],
            [.5, 0, 0, .5, 0],
            [0, 0, 0, 0, 0]]

# Action rewards are also weighted by their probability of
# occurring and summed, e.g. 5.5 = (.5 * 10) + (.5 * 1)
_rewards = [-1.5, -1, 5.5, -.5, 0]
gamma = 1

R = np.array(_rewards)
P = np.matrix(p_matrix)
I = np.identity(len(p_matrix))

solution = np.dot(np.linalg.inv((I - gamma * P)), R)
solution = solution.tolist()[0]

for state in range(len(state_names)):
    print(state_names[state], solution[state])
>>> C1 -1.3076923076923084
>>> C2 2.6923076923076925
>>> C3 7.384615384615385
>>> FB -2.3076923076923075
>>> Sleep 0.0

Any MDP with a defined policy can be converted into an MRP like this.

MDP Value Functions

Just like with Markov reward processes, the principal unit of measurement that we care about in our Markov decision process is the value function. However, introducing actions into the mix makes things slightly more complicated.

Namely, the value function for an MDP can now be looked at from the perspective of the state or of the action, and in either case it is conditioned on the agent's chosen policy.

The state-value function tells us how much reward we can expect from the current state onward, if we follow the chosen policy.
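
Formally, for a policy π:

v_π(s) = E_π[ G_t | S_t = s ]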

We can get decent estimates for our state-values by sampling from the MDP many times and averaging the results, just like with the MRP.
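
As a sketch of what that sampling might look like under the uniform random policy (this is not the repo's implementation; the student MDP here is encoded as a dictionary of state → action → (reward, transition probabilities), with the rewards and probabilities read off the diagram above):

import numpy as np

# The student MDP: state -> action -> (reward, {next_state: probability})
student_mdp = {
    "C1": {"Study": (-2, {"C2": 1.0}),    "Facebook": (-1, {"FB": 1.0})},
    "C2": {"Study": (-2, {"C3": 1.0}),    "Sleep": (0, {"Sleep": 1.0})},
    "C3": {"Study": (10, {"Sleep": 1.0}), "Pub": (1, {"C1": .2, "C2": .4, "C3": .4})},
    "FB": {"Facebook": (-1, {"FB": 1.0}), "Quit": (0, {"C1": 1.0})},
    "Sleep": {},  # terminal: no actions available
}

def sample_return(start, gamma=1.0):
    """Sample one trajectory under the uniform random policy; return its discounted return."""
    state, total, discount = start, 0.0, 1.0
    while student_mdp[state]:
        action = np.random.choice(list(student_mdp[state]))  # uniform random policy
        reward, transitions = student_mdp[state][action]
        total += discount * reward
        discount *= gamma
        state = np.random.choice(list(transitions), p=list(transitions.values()))
    return total

# Monte Carlo estimate of the state-value of "C3" under the uniform random policy
print(np.mean([sample_return("C3") for _ in range(20000)]))  # comes out around 7.4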

The action-value function tells us how much reward we can expect from this state after we have first taken a specified action, then following the current policy from then onward. Later we’ll see how this measure allows us to compare different actions based on their action-value and ultimately select the best one.
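
In symbols:

q_π(s, a) = E_π[ G_t | S_t = s, A_t = a ]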

Again, just like the state-value function, we can use an average over many samples to estimate the action-value function.

MDP Bellman Equation

Just like in the MRP case, we can decompose both value functions into the immediate reward plus the discounted value of the successor state (or state-action pair) we end up in, giving us two forms of the Bellman Equation:

Where At+1 is chosen by the policy
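
The two decompositions are:

v_π(s) = E_π[ R_{t+1} + γ v_π(S_{t+1}) | S_t = s ]
q_π(s, a) = E_π[ R_{t+1} + γ q_π(S_{t+1}, A_{t+1}) | S_t = s, A_t = a ]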

In either case they represent the same underlying value: the expected future reward.

In order to understand the relationship between these two forms of the Bellman Equation it’s useful once again to consider a one-step look ahead diagram, which this time includes both states and actions:

States = Open Circles; Actions = Closed Circles

The state-value function is the average of all the available action-values, weighted by how likely we are to choose them under our current policy.

Similarly, the action-value function is the average value of the states the agent might transition to given the chosen action, but weighted instead by their transition probabilities:

S = possible next states; P = transition probabilities

So the action-value is the immediate reward for taking that action, plus the expected value of the next state, over all possible successor states.
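
In symbols, the two one-step relationships are:

v_π(s) = Σ_a π(a | s) q_π(s, a)
q_π(s, a) = R_s^a + γ Σ_{s'} P_{ss'}^a v_π(s')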

We can expand this diagram to two steps in order to get the MDP Bellman Equations:

And we can substitute in the expanded action-value formula from above:

In parentheses: The action-value function

We can also start from an action node:

And similarly expand the action-value function by substituting in the state-value function:

In either case we get the value function in terms of itself, giving us the MDP Bellman Equation.
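
Substituting one relationship into the other gives the two recursive forms:

v_π(s) = Σ_a π(a | s) ( R_s^a + γ Σ_{s'} P_{ss'}^a v_π(s') )
q_π(s, a) = R_s^a + γ Σ_{s'} P_{ss'}^a Σ_{a'} π(a' | s') q_π(s', a')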

We can use the above formulas to do a quick sanity check on the results of our analytical solution to the Student MDP. Remember that our gamma is 1 in this example:

State value function again for reference
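
For example, checking “Class 3” under the uniform random policy (using the values -1.3, 2.7, and 7.4 computed above):

v_π(C3) = 0.5 × (10 + 0) + 0.5 × (1 + 0.2 × (-1.3) + 0.4 × 2.7 + 0.4 × 7.4)
        = 5 + 0.5 × 4.78
        ≈ 7.4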

Again unsurprisingly the numbers come out to be self-consistent.

An implementation of the student MDP, along with value function estimates and conversion into an MRP, can be found in the Github repo.

Optimal Value Functions

Evaluating the value functions of this dummy uniform-random policy is all well and good, but it’s obviously not the best policy we could pick. What we really want is to work towards the best policy, the one which gets us the maximum possible reward.

In order to find this “best policy”, we’ll first need to define what that means, and to start we’ll define the concept of a maximum expected reward over all policies:

The optimal state-value function tells us the maximum amount of reward we can extract from the system starting in a specified state over all possible policies. So out of all the policies we could possibly think of, all the myriad ways in which we could traverse our MDP, the optimal state-value function tells us the maximum amount of reward we could ever expect to receive.

The optimal action-value function tells us the maximum future reward we can expect after we’ve committed to a chosen action first, and again over all possible policies.
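
Formally, both are defined as a maximum over all possible policies:

v*(s) = max_π v_π(s)        q*(s, a) = max_π q_π(s, a)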

We define “solving” an MDP as finding the best possible policy, which would have the agent choosing actions such that it receives the maximum possible reward. Period.

So if we find the optimal action-value function, we have essentially solved the MDP: if we had the optimal action-values, we could always choose the action with the highest expected future reward, which implicitly gives us the optimal policy.

Think of it this way: If you were in some state and had two options “go left” or “go right” and you knew that the absolute maximum amount of reward you could ever get after choosing “left” was 10, and the maximum you could ever get after choosing “right” was 50, which action would you choose?

Well duh. You’d choose to “go right” of course, because you’d want to go down the path that leads to the most future reward.

Let’s look at the (pre-solved) optimal state-values for the student MDP:

The student MDP is pretty simple, so it’s fairly intuitive to figure out the optimal values just by eyeballing it. If the agent started in “Class 3”, the absolute maximum amount of reward it could expect would be +10 from choosing to “Study”.

Starting in “Class 2” we know that taking the “Sleep” action will net the agent precisely 0 reward from now and forever into the future, but the “Study” action will give us a negative reward of -2, and land us in a state with an already known maximum value of +10, for a maximum total amount of reward possible from “Class 2” of +8.

And so on and so forth.

Let’s also look at the optimal action-values.

Once again starting in “Class 3” if the agent were to commit to the “study” action, the most reward we could get out of that in the future is an immediate reward of +10, and a future reward of 0, for an optimal action-value for “study” of +10.

Looking at “Class 2”, if the agent committed to the “study” action it would get an immediate reward of -2, and end up in a state which would yield a maximum of +10 reward in the future, for an optimal action-value of +8. If instead the agent chose to take the “sleep” action, it’d get an immediate reward of 0, and end up in a state with a maximum possible future reward of 0, for a maximum total reward of 0 for taking the “sleep” action.

Optimal Policies

Great, so we have some understanding of what it means to have an optimal value function, but we still haven't figured out what it means to have the optimal policy.

In order to make progress towards that goal we need to be able to say when one policy is better than another.

Partial ordering over policies

One policy is better than (or equal to) another policy if its value function is greater than or equal to the other policy’s value function for all states. In other words one policy is better than another if, no matter the state, you get at least as much reward as in the other policy.

This means that to be a better policy, a policy can’t be worse than another policy in any state. It has to get at least as much reward in all states to say it’s an equally good policy, and more reward in at least one state to say it’s a better policy.

It turns out that for any MDP there always exists at least one policy (not necessarily unique) which is better than or equal to all other policies. We'll never be stuck with a set of policies that are merely incomparable, each better in some states and worse in others.

Put another way: there is always at least one optimal policy for any MDP.

Additionally, any optimal policy achieves the optimal state-value and action-value functions. Everything works out as intuitively as you'd expect: the best possible policy gets the best possible rewards.

It also turns out that, as we saw on a small scale in the “left-right” example earlier, one way to get the optimal policy is to simply use the optimal action-value function to choose the best actions:
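
One such optimal policy simply puts all of its probability on the action with the highest optimal action-value in each state:

π*(a | s) = 1 if a = argmax_{a'} q*(s, a'), and 0 otherwise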

Below is our student MDP, again labeled with the true optimal values, but this time also with the optimal policy in red.

The optimal policy is in red

Just as we’d expect, the optimal policy is the one with the highest optimal action-value associated with its actions. For example in “Class 1” the value of the “Facebook” action is 5, and the value of “Study” is 6, meaning we can expect a higher future reward after taking the “Study” action, so that’s what the optimal policy chooses.

Finding The Optimal Action-Value Function

So we want to find the optimal action-value function, and use that to generate the optimal policy, but how do we actually find it?

Just like with the non-optimal case that we’ve been working with so far, it turns out that to solve for the optimal state-value and action-value, we need to expand their equations out so they’re written in terms of themselves.

And again just like with the non-optimal case, we can do so with the lookahead diagrams we’ve grown to love so much:

The arc between action nodes represents the MAX function

This time, instead of taking the expectation over the values of the next actions under our policy, we take the MAX over the possible actions, because we want to select the best action rather than simply calculate an expectation under a policy.

Likewise, we can find the optimal action-value by averaging the optimal values of whatever states the environment might push us into after taking the action:

Why the average instead of the maximum in this case? Because as we’ve seen, the agent has no control over where the environment blows it after taking a specified action. So the expectation of future reward has to be based on the transition probabilities.
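
The two one-step relationships are therefore:

v*(s) = max_a q*(s, a)
q*(s, a) = R_s^a + γ Σ_{s'} P_{ss'}^a v*(s')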

With this information we can do a sanity check with our pre-solved student MDP if we want:

The optimal state-value of “Class 3” is +10, because it's the max over the two action-values available there: +10 for committing to “Study” and +9.4 for committing to “Pub”.

10 = MAX(10, 9.4)

We can also check the action-value of the “Pub” action. The immediate reward is +1, and adding the probability-weighted average of the next optimal state-values gives:

1 + (0.2*6 + 0.4*8 + 0.4*10) = 9.4

Doing this for all the states and actions, we find that the pre-solved answers are once again self-consistent.

And of course to get these equations in terms of themselves we can stack the two equations and get the recursive relationship we saw with the regular value functions. Starting from either a state node or an action node:

Starting from a State node
Starting from an action node
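
Stacked together, these give the Bellman Optimality Equations:

v*(s) = max_a ( R_s^a + γ Σ_{s'} P_{ss'}^a v*(s') )
q*(s, a) = R_s^a + γ Σ_{s'} P_{ss'}^a max_{a'} q*(s', a')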

In principle, getting these self-referential optimal value functions (called the Bellman Optimality Equations) should allow us to solve for the optimal action-value and state-values.

Unfortunately, unlike the earlier case where we were solving for the value function under a specific policy, solving for the optimal value functions has no closed-form solution due to the non-linear MAX function present in both versions. No magic linear-algebra to save us this time.

There are, however, many iterative solutions to the Bellman Optimality Equation, which will be the subject of the next few posts…

Conclusion

In this post we built up our understanding of Markov Chains, Markov Reward Processes, and finally Markov Decision Processes. In the process we were introduced to several forms of the most fundamental equation in RL: The Bellman Equation.

We saw with Markov Chains how to generate sequences of random states from a set of transition probabilities.

Then with Markov Reward Processes, we saw how adding in rewards allows for the calculation of state-values, which measure how much expected future reward an agent will get from a certain state onward.

With the inclusion of actions and policies in Markov Decision Processes, we learned that the value functions of an MDP can take two forms: the familiar state-value, and the newly introduced action-value, which represents the expected reward an agent will get after committing to a particular action and then following its current policy from then onward.

Policies, the functions which map states to actions, tell the agent how to behave and affect the value functions accordingly, allowing different ways of acting to be evaluated by calculating the value functions they produce.

Finally, by defining a partial ordering over policies, we can formally compare different policies with the ultimate goal of finding the optimal policy.

The rest of the posts in this series will detail the different ways of iteratively improving value functions and policies such that they eventually converge on their optimal forms, starting with dynamic programming for situations where we know the transition dynamics of the system we're working in, and moving on to model-free methods for cases when we don't know exactly how our actions affect the transition from state to state.

Stay tuned!
