Markov decision process: policy iteration with code implementation

Nan · 16 min read · Dec 19, 2021

In today’s story we focus on policy iteration for MDPs. We are still using the grid world example from the book Artificial Intelligence: A Modern Approach by Stuart Russell and Peter Norvig.

The code in this story is part of our MAD from scratch project where MAD stands for machine learning, artificial intelligence, and data science. The complete code used in this story can be found in this repo: https://github.com/clumsyhandyman/mad-from-scratch

Table of Contents:

  1. Policy as a path of a tree
  2. Utility of states
  3. Discounted rewards
  4. Add back stochasticity
  5. Policy evaluation
  6. Policy improvement
  7. Policy iteration
  8. Effects of the discount factor
  9. Pseudo-code of policy iteration
  10. Implement policy iteration in Python

Before we start, if you are not sure what a state, a reward, a policy, or an MDP is, please check out our first MDP story, where we explain everything in detail.

In our first MDP story, we showed that not all policies are equal. Statistically, some are better than others. For instance, we examined three randomly generated policies by running each of them 10,000 times. On average, the total reward achieved by each policy per run is -1.7, -1.2, and 0.7, respectively. Clearly, the third policy is better than the other two. Our question is: how can we find the best policy among all possible ones?

In our grid world example, we have 9 states and 4 actions, resulting in a total of 4⁹ possible policies. Among all these policies, how can we search for the best one? Clearly it is impractical to try all of them out. Instead of brute force, we can use policy iteration to find the optimal policy.

Policy as a path of a tree

In our grid world, a normal state has a reward of -0.04, a good green ending state has a reward of +1, and a bad red ending state has a reward of -1.

Now imagine we are standing in our initial state s = 8 and we need to decide which action to take: either go right to s = 9 or go up to s = 4. The state s = 9 gives us a reward of -0.04, and so does the state s = 4. They appear to be the same to us. However, by intuition, we know that they are not: whichever state is closer to our good ending state should be better than the other one.

Our issue here is that we are too shortsighted. We only compare the immediate reward of each state and thus we cannot know which one leads to a better path in the long run. What we need is to look farther down the path and compare which one can lead us to a better future.

In some sense, we can think of our choices of actions as a tree where each state is a node and each action is a branch. In our example, the initial state s = 8 is the root node and the 4 possible actions are the 4 branches. For simplicity, let’s neglect the stochasticity of the MDP for a moment. Our reward for choosing the action UP is -0.04, and so are the rewards for choosing RIGHT, DOWN, or LEFT. So, if we only look at the first level of our tree, choosing any action results in the same reward.

View MDP as a tree

The real differences happen somewhere down the tree, maybe several levels below the first. Because an MDP is sequential, different states at the current level result in different possible states at the lower levels. To make better choices, we need to look farther down the tree and pick a more preferable leaf node. It is very similar to playing chess: to make a better move, you need to think through the outcomes of several successive moves in the future rather than focusing only on the immediate outcome of a single move.

Go deeper into the tree

Here we are facing a backup problem: we need to propagate the rewards of nodes deep in the tree back up along the branches, so we can choose the branches that lead to these preferable nodes. Let’s see how we can do that.

Utility of states

To perform backup, we introduce a concept called utility or value. Each state has a utility that represents how good it is in terms of leading to a good future path.

In some way, reward is like the profit of a company in the current year. In contrast, utility represents a score we give to each company to reflect how successful it will be in the future. Two companies may have similar annual profits right now, but one may become more successful than the other in ten years because it is in a more promising market sector. Therefore, two companies may have the same reward but different utilities, and we should definitely invest in the company with the higher utility, provided our estimate of utility is well reasoned.

States and rewards

Now our question becomes how to assign a utility to each state. For now, let’s continue to neglect the stochasticity of the MDP. Imagine we are standing in s = 8. If we go to s = 9, we receive a reward of -0.04; if we go to s = 4, we also receive a reward of -0.04. However, if we choose s = 9, it may lead us along the path [9, 10, 11, 7] with rewards [-0.04, -0.04, -0.04, -1]. Since an MDP has the property of additive rewards, the total reward along this path is -1.12. In contrast, if we choose s = 4, it may lead us along the path [4, 0, 1, 2, 3] with rewards [-0.04, -0.04, -0.04, -0.04, +1] and a total reward of 0.84. In this simplified view, we can choose s = 9 or s = 4 as our next state based on whether -1.12 or 0.84 is preferable to us. Here -1.12 is the utility of state s = 9 while 0.84 is the utility of state s = 4. We usually use v to denote the utility of states.
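To make this comparison concrete, here is a tiny Python sketch that simply adds up the rewards along the two paths. The rewards dictionary below only lists the states that appear on these two paths; it is an illustration, not the environment code from the repo.

```python
# Additive rewards along two deterministic paths branching off the initial state s = 8.
rewards = {0: -0.04, 1: -0.04, 2: -0.04, 3: 1.0,
           4: -0.04, 7: -1.0, 9: -0.04, 10: -0.04, 11: -0.04}

path_via_9 = [9, 10, 11, 7]    # ends in the bad red state
path_via_4 = [4, 0, 1, 2, 3]   # ends in the good green state

print(round(sum(rewards[s] for s in path_via_9), 2))   # -1.12, the utility of s = 9 in this simplified view
print(round(sum(rewards[s] for s in path_via_4), 2))   # 0.84, the utility of s = 4 in this simplified view
```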

Of course, finding the utility of states is not this easy, since different actions lead to different states. As a result, we are not sure which path we will follow after a given state. For instance, we may move along the path [9, 10, 11, 7] or [9, 10, 6, 2, 3] or even [9, 10, 6, 2, 1, 0, 4, …] after s = 9. To calculate the utility of s = 9, which of all these possible paths should we use?

Given policy

To overcome this difficulty, let’s simplify the problem by saying that we are looking for the utility of a state under a given policy. For now, let’s use the first random policy from our previous MDP story. For simplicity, we still assume there is no stochasticity.

Under these assumptions, for any state, the utility is the sum of its own reward and the rewards of all the future states reached by following that policy. For instance, if we look at s = 6 and we follow the policy, we have the path [6, 10, 11, 7] in front of us. Therefore, the utility of s = 6 is the sum of r(6), r(10), r(11), and r(7).

For any state s at time t = 0, its utility can be written as

v(s) = r(s₀) + r(s₁) + r(s₂) + r(s₃) + ⋯, with s₀ = s,

where s₁, s₂, s₃, … are the successor states visited by following the policy.

Now we have v(6) = r(6) + r(10) + r(11) + r(7). By the same token, the utility of s = 10 is v(10) = r(10) + r(11) + r(7). Combining these two, we have v(6) = r(6) + v(10). This relation holds for any state and its successor state.

In other words, the utility of a state is the immediate reward of that state plus the utility of its successor state following the given policy.

Discounted rewards

Now let’s look at another state, s = 0. Following the policy, the future states are [0, 1, 2, 1, 2, 1, 2, 1, 2, …], which loops between 1 and 2 forever. If we follow our definition of utility, v(0) = r(0) + r(1) + r(2) + r(1) + r(2) + …, we would get negative infinity.

Even if there is no loop in our path, we may still get infinite utility values. For instance, if the rules of our game change so that we can continue after reaching an ending state, then for every state the utility is either positive or negative infinity. To address this issue, we introduce the concept of discounted rewards.

The key idea of discounted rewards is that a reward received in the future is less valuable than a reward received right now. It is similar to the economy in real life: because of the interest rate, one dollar received right now is more valuable than one dollar received 10 years in the future. In MDPs, we use a discount factor γ to mimic the interest rate.

Before, we had additive rewards:

v(s) = r(s₀) + r(s₁) + r(s₂) + r(s₃) + ⋯

Now, using discounted rewards, we have

v(s) = r(s₀) + γ r(s₁) + γ² r(s₂) + γ³ r(s₃) + ⋯

Using a discount factor γ between 0 and 1, our utility values are guaranteed to be bounded. Because we have a finite number of states, we have a finite number of rewards, so there is a maximum reward r_max among them. The utility of any state is then bounded by

v(s) ≤ r_max + γ r_max + γ² r_max + ⋯ = r_max / (1 − γ)

With the discount factor, the relation between the utility of a state and the utility of its successor state becomes

v(s) = r(s) + γ v(s′)

Add back stochasticity

We have defined utility in a deterministic environment. Now it’s time to add stochasticity back into our game. In the deterministic environment, when we look at s = 6 we have a determined path [6, 10, 11, 7] and thus v(6) = r(6) + r(10) + r(11) + r(7). In a stochastic environment, even if we follow the policy, we are not guaranteed to move along the path [6, 10, 11, 7]. To be more specific, we only have a probability of 0.8³ = 0.512 of actually moving along this path.

Looking at other paths, we have a probability of 0.1 of moving along the path [6, 7], a probability of 0.04096 of moving along [6, 10, 11, 10, 11, 7], and so on for all other possible paths. The utility of state s = 6 is the expected total reward over the probability distribution of all these possible paths.

Generalizing this, we can define the utility in a stochastic environment as

v(s) = E[ r(s₀) + γ r(s₁) + γ² r(s₂) + ⋯ | s₀ = s ]

where E is the expectation with respect to the probability distribution over the possible paths following a certain policy.

With stochasticity, the utility of a state under a certain policy can still be represented as its immediate reward plus the utility of its successor state, now weighted, of course, by a probability distribution.

For instance, if we look at s = 6 and follow our policy, which is to go DOWN, we have a 0.8 probability of reaching s′ = 10, a 0.1 probability of reaching s′ = 7, and a 0.1 probability of staying in s′ = 6. Therefore, the utility of s = 6 is

v(6) = r(6) + γ [0.8 v(10) + 0.1 v(7) + 0.1 v(6)]
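As a quick sanity check on this formula, here is a minimal Python sketch of a single backup. The transitions list and the utility container V are illustrative placeholders, not the environment code from the repo.

```python
def backup(r_s, gamma, transitions, V):
    """One backup: immediate reward plus the discounted expected utility of the successors.

    transitions: list of (probability, next_state) pairs for the action the policy picks.
    V:           current utility estimates, indexed by state.
    """
    return r_s + gamma * sum(p * V[s_next] for p, s_next in transitions)

# s = 6 under the policy DOWN: 0.8 -> s' = 10, 0.1 -> s' = 7, 0.1 -> stay in s' = 6
# v6 = backup(-0.04, gamma, [(0.8, 10), (0.1, 7), (0.1, 6)], V)
```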

Policy evaluation

For our given policy, we are now ready to determine the utility values of each state. We already have

v(6) = r(6) + γ [0.8 v(10) + 0.1 v(7) + 0.1 v(6)]

Similarly, we can write this equation for each state:

Since we know the reward of each state, and γ is a factor between 0 and 1 that we choose, we have 11 equations and 11 unknowns. Problem solved: we can determine the utility of each state by directly solving this system of linear equations. This process is called policy evaluation: for a given policy, we evaluate that policy by determining the utility of each state.
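For a small problem like ours, solving the system directly is straightforward. Here is a minimal sketch, assuming the transition probabilities under the fixed policy have been collected into a matrix P_pi; these array names are our own shorthand, not the interface of the repo.

```python
import numpy as np

def evaluate_policy_exactly(P_pi, R, gamma):
    """Solve v = R + gamma * P_pi @ v, i.e. (I - gamma * P_pi) v = R.

    P_pi: (S, S) matrix with P_pi[s, s2] = probability of moving from s to s2 under the policy.
    R:    (S,) immediate reward of each state.
    """
    n_states = len(R)
    return np.linalg.solve(np.eye(n_states) - gamma * P_pi, R)
```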

Alternatively, instead of directly solving the linear equations, we can use a dynamic programming approach called iterative policy evaluation, or sweeping. We first initialize the utility of each state to zero, then we loop through the states. For each state, we update the utility using

v(s) ← r(s) + γ Σ_s′ P(s′ | s, π(s)) v(s′)

where the sum runs over the possible successor states s′, and P(s′ | s, π(s)) is the probability of reaching s′ when taking the policy’s action in state s.

For instance, if we set γ to 0.5 (more about that later in this story) and loop through the states from s = 0 to s = 11, we have

This is called an in-place sweep. In-place means we use the updated utilities immediately within the same sweep. For instance, we updated v(1) to -0.04 and then immediately used that value as v(1) when calculating v(2).

We continue with another sweep:

We repeat sweeps like this until the changes in utility values between consecutive sweeps are marginal. Here is the utility we get after 11 sweeps.

Utility after 11 sweeps

These utility values are approximate solutions of the linear equations we listed above. For instance, let’s check state s = 6, whose utility should satisfy

v(6) = r(6) + γ [0.8 v(10) + 0.1 v(7) + 0.1 v(6)]

Plugging in the utilities obtained after 11 sweeps, the two sides indeed agree closely.

In summary, our sweep approach is an alternative method of solving linear equations. We can get the utility values for a given policy using either method.

Policy improvement

After we get the utility under this policy, it is time to improve the policy. As we discussed before, utility represents how good a state is. When choosing actions for a state, we should prefer successor states with higher utility. For instance, if we look at state s = 2, the potential successor states are 1, 3, and 6. Clearly s = 3 is the best choice, as it has the highest utility of 1.999. As another example, if we look at state s = 8, the potential successor states are 4 and 9, with utilities of -0.0814 and -0.1110, respectively. As such, s = 4 is preferable to s = 9.

Since our MDP is stochastic, selecting a preferable successor state does not guarantee that we will reach it. Rather than successor states, what we should compare are actions. For state s = 6, we have four possible actions. If we choose the action UP, the expected value is

r(6) + γ [0.8 v(2) + 0.1 v(6) + 0.1 v(7)]

Similarly, if we choose to go LEFT (bumping into the wall, so with probability 0.8 we stay in s = 6), we have

r(6) + γ [0.8 v(6) + 0.1 v(2) + 0.1 v(10)]

If we go DOWN, we have

r(6) + γ [0.8 v(10) + 0.1 v(6) + 0.1 v(7)]

If we go RIGHT, we have

r(6) + γ [0.8 v(7) + 0.1 v(2) + 0.1 v(10)]

Comparing these four action values, UP is clearly the best choice. Therefore, we should update the policy at state s = 6 to UP.

We perform this process for each state, updating its action to the one with the highest expected value:

π(s) ← argmax_a [ r(s) + γ Σ_s′ P(s′ | s, a) v(s′) ]

In our example, after policy improvement, our policy looks like this:

Policy after improvement

Policy iteration

We are not done yet. We do get a policy better than our initial one, but it is not necessarily the best one. What we need is to repeat this whole process of policy evaluation followed by policy improvement again and again. When we perform policy evaluation again, we do not need to re-initialize the utilities to zero. Instead, we simply continue from the utility values we already have, which helps the evaluation converge faster.

Visualization of policy iteration: iteratively alternating policy evaluation and policy improvement

What policy evaluation does is update the utility values while keeping the policy constant (from π[i], V[i - 1] to π[i], V[i]). What policy improvement does is update the policy while keeping the utility values constant (from π[i], V[i] to π[i + 1], V[i]). When π[i] is the same as π[i + 1], there is no need to continue, and we say the process has converged. In this iterative process, we start from a randomly generated policy π[1] and eventually converge to the optimal policy π[6].

Effects of the discount factor

Now it is time to talk about the discount factor. Previously we arbitrarily set it to 0.5. What does that imply? What if we choose 0.1 or 0.9?

The discount factor γ is used to discount the rewards of future states when adding them together.

If we choose γ = 0.1, we have

v(s) = r(s₀) + 0.1 r(s₁) + 0.01 r(s₂) + 0.001 r(s₃) + ⋯

In contrast, if we choose γ = 0.9, we have

v(s) = r(s₀) + 0.9 r(s₁) + 0.81 r(s₂) + 0.729 r(s₃) + ⋯

In the case of γ = 0.1, rewards received in the future have little effect on the utility of a state. For instance, the reward received at t = 3 has a factor of only 0.001; changes in that reward barely make any difference in the utility. On the other hand, when γ = 0.9, the reward at t = 3 has a factor of 0.729, and changes in that reward can considerably change the utility.

In summary, a smaller γ emphasizes the immediate reward of the current state, while a larger γ emphasizes long-term rewards in the future. If we use investors as an analogy, an investor with a smaller γ prioritizes companies with the highest current annual profit, regardless of what their future looks like. In contrast, an investor with a larger γ prioritizes companies with a brighter future, even if they have low annual profits right now.

Let’s see the effects of the discount factor in our grid world example by comparing three γ values: 0.1, 0.5, and 0.9. After policy iteration, the utility and optimal policy obtained with each γ are shown below.

Utility and policy using three different discount factors

When γ is 0.1, the utility values tend to be dominated by the immediate reward of a state. For instance, if we stand at s = 8 and look around, the two neighboring states s = 4 and s = 9 both have a utility very close to -0.0444, so it is not at all clear which one is better. This is the shortsightedness issue again: when we ignore the future and only look at the immediate reward, these two states appear to be equally good. In contrast, when γ is 0.9, the utility values incorporate states far away in the future. Now the utility of s = 4 is 5.4953 while the utility of s = 9 is 4.0850, and it is crystal clear that s = 4 is the better successor state.

Although the utilities obtained with these different γ values look quite different, the optimal policies they produce are actually very similar. This shows one important characteristic of MDPs: the absolute utility values of states do not matter that much; what matters is the relative ranking of utilities between states.

When we determine the policy, we perform an argmax, whose output depends only on the ranking, not on the absolute values. No matter the magnitude of these values, as long as the correct choice has the largest value, we obtain the same optimal policy.

Using the same Monte Carlo approach to assess the optimal policies corresponding to these different γ values, we find that their performance is close.

Rewards of 10,000 repeated runs using different discount factors

For γ of 0.1, 0.5, and 0.9, after 10,000 repeated runs, the mean total rewards are 0.63, 0.68, and 0.70, respectively, within a difference of around 11%. However, the minimum total rewards are quite different: -6.04, -3.68, and -1.6. This means the worst-case scenario of γ = 0.1 is much worse than that of γ = 0.9. Therefore, a larger γ is preferable in this problem, as it achieves both a higher mean and a higher minimum total reward.
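For reference, here is a minimal sketch of such a Monte Carlo assessment. It assumes the same array conventions as in the sketches below (per-state rewards R, transition probabilities P[s, a, s′], and a policy of action indices), plus the start state 8 and ending states 3 and 7 from our grid world; whether the start state’s reward is counted is a modeling choice.

```python
import numpy as np

def evaluate_by_rollouts(P, R, policy, start_state=8, terminal_states=(3, 7),
                         n_runs=10_000, rng=None):
    """Estimate a policy's total (undiscounted) reward by repeated simulation."""
    rng = np.random.default_rng() if rng is None else rng
    n_states = len(R)
    totals = []
    for _ in range(n_runs):
        s, total = start_state, R[start_state]   # counting the start state's reward here
        while s not in terminal_states:
            s = rng.choice(n_states, p=P[s, policy[s]])
            total += R[s]
        totals.append(total)
    return np.mean(totals), np.min(totals)
```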

Nevertheless, everything has a price. A larger γ achieves better results in this problem, but at the cost of more computational effort. In these three cases, although all of them require around 4 to 5 policy iterations, γ = 0.9 requires as many as 60 sweeps within one iteration, while γ = 0.1 requires fewer than 4 sweeps per iteration. A larger γ requires more sweeps and thus more computational time.

Number of sweeps using different discount factors

Moreover, a larger γ works better in this problem and tends to work better in many problems. However, this does not mean it works better in every problem. It depends on the type of problem and should be analyzed case by case.

Pseudo-code of policy iteration

To implement policy iteration, first we need functions for both policy evaluation and policy improvement.

For policy evaluation, we use a threshold θ as the stopping criterion. When the change in utility values between two sweeps is smaller than this threshold, we declare convergence.

Pseudo code of policy evaluation
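The figure above shows the pseudo-code; as a companion, here is a minimal Python sketch of in-place iterative policy evaluation. The array conventions (R for per-state rewards, P[s, a, s′] for transition probabilities, a policy of action indices) are our own shorthand for this story, not necessarily the interface used in the repo. Terminal states are treated as absorbing states that loop back to themselves, which is consistent with the utility of 1.999 reported for s = 3 above when γ = 0.5.

```python
import numpy as np

def policy_evaluation(P, R, policy, V, gamma=0.5, theta=1e-6):
    """In-place iterative policy evaluation (sweeps) for a fixed policy.

    P:      (S, A, S) transition probabilities, P[s, a, s2]
    R:      (S,) immediate reward of each state
    policy: (S,) action index chosen in each state
    V:      (S,) utility estimate, updated in place sweep by sweep
    """
    n_states = len(R)
    while True:
        delta = 0.0
        for s in range(n_states):
            v_old = V[s]
            # v(s) = r(s) + gamma * sum over s' of P(s' | s, pi(s)) * v(s')
            V[s] = R[s] + gamma * P[s, policy[s]] @ V
            delta = max(delta, abs(v_old - V[s]))
        if delta < theta:   # changes between consecutive sweeps are marginal
            break
    return V
```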

For policy improvement, we include in the output a binary variable indicating whether any state’s action changed when updating the policy. If the policy remains the same after the improvement step, this binary variable is returned as false.

Pseudo code of policy improvement
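In the same spirit, here is a minimal Python sketch of greedy policy improvement, returning both the updated policy and the binary flag described above (same assumed array conventions as the evaluation sketch).

```python
import numpy as np

def policy_improvement(P, R, V, policy, gamma=0.5):
    """Make the policy greedy with respect to the current utilities V.

    Returns the (possibly updated) policy and a flag telling whether anything changed.
    """
    n_states, n_actions, _ = P.shape
    changed = False
    for s in range(n_states):
        # value of each action: r(s) + gamma * sum over s' of P(s' | s, a) * v(s')
        action_values = R[s] + gamma * P[s] @ V    # shape (A,)
        best_action = int(np.argmax(action_values))
        if best_action != policy[s]:
            policy[s] = best_action
            changed = True
    return policy, changed
```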

By iteratively alternating policy evaluation and policy improvement, we get policy iteration. Whenever the binary output of policy improvement is false, we know the policy has converged, and the converged policy is the optimal policy.

Pseudo code of policy iteration
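Putting the two together gives the loop below, again as a hedged sketch that reuses the policy_evaluation and policy_improvement functions from the previous snippets; the random initial policy and the reuse of V across iterations mirror what is described above.

```python
import numpy as np

def policy_iteration(P, R, gamma=0.5, theta=1e-6, max_iter=100, rng=None):
    """Alternate evaluation and improvement until the policy stops changing."""
    rng = np.random.default_rng() if rng is None else rng
    n_states, n_actions, _ = P.shape
    policy = rng.integers(n_actions, size=n_states)   # random initial policy
    V = np.zeros(n_states)                            # kept between iterations to converge faster
    for _ in range(max_iter):
        V = policy_evaluation(P, R, policy, V, gamma, theta)
        policy, changed = policy_improvement(P, R, V, policy, gamma)
        if not changed:   # policy is stable, i.e. optimal for this gamma
            break
    return policy, V
```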

Implement policy iteration in Python

For the purpose of learning, we incorporate plots of learning curves (number of sweeps in policy evaluation and number of updates in policy improvement) into the code. We also set an arbitrary upper limit on the number of policy iterations. The code generates a random policy if an initial policy is not provided.

We can solve our grid world problem with our code as in the following example. After running the policy iteration solver, we can plot the utility and policy, as well as evaluate the optimal policy by running it repeatedly.
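The exact interface lives in the repo linked at the top of this story; as a stand-in, here is how the hypothetical helpers sketched above could be wired together. Building the arrays P and R for the grid world is omitted here and left to the environment code.

```python
# P has shape (n_states, n_actions, n_states) and R has shape (n_states,);
# constructing them for the grid world is left to the environment code in the repo.
optimal_policy, V = policy_iteration(P, R, gamma=0.9)

print(V)                # utility of each state under the converged policy
print(optimal_policy)   # greedy action index chosen in each state

mean_reward, min_reward = evaluate_by_rollouts(P, R, optimal_policy)
print(f"mean total reward: {mean_reward:.2f}, worst run: {min_reward:.2f}")
```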

With this code, we can now use policy iteration to find the best policy among all possible policies for an MDP problem. However, policy iteration is not the only method. Next time, we will talk about another method for solving MDPs: value iteration.
