Markov decision process: value iteration with code implementation

Nan · Dec 20, 2021

In today’s story, we focus on value iteration for MDPs, using the grid world example from the book Artificial Intelligence: A Modern Approach by Stuart Russell and Peter Norvig.

The code in this story is part of our MAD from scratch project where MAD stands for machine learning, artificial intelligence, and data science. The complete code used in this story can be found in this repo: https://github.com/clumsyhandyman/mad-from-scratch

Table of contents:

  1. Revisit policy iteration
  2. Bellman equation
  3. Value iteration
  4. Get optimal policy
  5. Effects of the discount factor
  6. Pseudo-code of value iteration
  7. Implement value iteration in Python
  8. From MDP to reinforcement learning

Before we start, if you are not sure what an MDP or policy iteration is, please check out our previous stories, where we explain everything in detail. You can find the basics of MDP in our first MDP story and policy iteration in our second MDP story.

Revisit policy iteration

In policy iteration, we iteratively alternate between policy evaluation and policy improvement. In policy evaluation, we keep the policy constant and update the utility based on that policy.

In policy improvement, we keep the utility constant and update the policy based on that utility.
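
In symbols (with R(s) the immediate reward of state s, P(s′|s, a) the transition model, γ the discount factor, and every Σ below summing over all successor states s′), the two steps look roughly like this. Policy evaluation repeatedly applies

v(s) ← R(s) + γ · Σ P(s′|s, π(s)) · v(s′)

with the policy π held fixed, while policy improvement sets

π(s) ← argmaxₐ Σ P(s′|s, a) · v(s′)

with the utility v held fixed.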

The whole process of policy iteration is then

π₁ → v₁ → π₂ → v₂ → … → π*

We start with an arbitrary policy π₁, get its utility v₁ by policy evaluation, get a new policy π₂ from v₁ by policy improvement, get the utility v₂ of π₂ by policy evaluation, and so on, until we converge on the optimal policy π*.

Here every utility is defined as the utility of a certain policy. For instance, policy π₁ has its associated utility v₁, policy π₂ has its associated utility v₂, …, and policy πᵢ has its associated utility vᵢ.

What if we instead define utility as the utility of the optimal policy?

In other words, can we determine the utility of π* directly, in a single process, without alternating between evaluation and improvement?

Bellman equation

First, let’s see what it implies if the utility is defined as the utility of the optimal policy. No matter what policy the utility is associated with, we know for sure that the utility of a state is the sum of its immediate reward and the discounted utility of its successor state.
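
In symbols, if a is the action taken in state s,

v(s) = R(s) + γ · Σ P(s′|s, a) · v(s′)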

Since the utility is now defined as the utility of the optimal policy, v(s′) is already the best we can get after we reach state s′. In other words, if we choose to go to s′, there may be multiple possible paths following s′, but v(s′) is defined as the outcome of the best of all these paths.

Now we are standing at state s, and we need to choose an action from [a₁, a₂, …]. Let’s see what happens if we choose a₁. With a₁ in mind, we know the possible outcomes and thus the expected utility of this choice.
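
Since the immediate reward R(s) of our grid world does not depend on the chosen action, the utility of s given that we commit to a₁ is

v(s | a = a₁) = R(s) + γ · Σ P(s′|s, a₁) · v(s′)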

This is valid for choosing any possible action aᵢ.

Among all these possible actions, which one should we choose? We should choose the action that gives us the highest utility for the state. In other words, among all possible aᵢ, we should choose the aᵢ with the highest v(s | a = aᵢ). By definition, this utility value is then v(s), because it is the best outcome we can get after we reach state s.

Therefore,
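
v(s) = maxᵢ v(s | a = aᵢ) = R(s) + γ · maxₐ Σ P(s′|s, a) · v(s′)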

This is called the Bellman equation, after Richard Bellman, and it is the key to solving MDPs. In other words, to solve an MDP is to solve the Bellman equation. Policy iteration, which we discussed in the previous story, is one method of solving it: alternating evaluation and improvement. Another method is value iteration, which computes the optimal utility directly. Let’s see how we can implement value iteration in our grid world example.

Value iteration

Unlike policy evaluation, where the equations are linear and can be solved directly, in value iteration the equations are no longer linear because of the max operation. As a result, we have to solve them with an iterative procedure.

As we did in policy iteration, we start by initializing the utility of every state to zero, and we set γ to 0.5. Then we loop through the states, applying the Bellman equation as an update rule.

Let’s start from s = 0,
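
v(0) = R(0) + γ · maxₐ Σ P(s′|0, a) · v(s′) = −0.04 + 0.5 × 0 = −0.04

because every utility still sits at its initialized value of zero, and the immediate reward of this non-terminal state is −0.04.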

Again, we are using an in-place procedure, which means that from now on, whenever we see v(0), its value is −0.04 instead of 0. Moving on to s = 1, we apply the same update, using the updated value of v(0) where needed.

After we repeat this for states 2, 3, … all the way to 11, we get the following utilities.

Utility after the first iteration

Now it is time to iterate again, sweeping from s = 0 to s = 11.

Utility after the second iteration

And iterate again,

Utility after the third iteration

We repeat these iterations until the change in utility between two consecutive iterations is marginal. After 11 iterations, the change in the utility of every state is smaller than 0.001. We stop here, and the utility we get is the utility associated with the optimal policy.

Utility after 11 iterations

Compared with policy iteration, value iteration works because it builds the max operation directly into the value updates. Since we choose the maximum expected utility in every iteration, we implicitly perform the argmax operation, excluding suboptimal actions and converging to the optimal ones.

Get optimal policy

Using value iteration, we have determined the utility of the optimal policy. As in policy iteration, we can then get the optimal policy itself by applying the following equation to each state.
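
π*(s) = argmaxₐ Σ P(s′|s, a) · v(s′)

This is exactly the argmax used in the policy-improvement step, applied once to the converged utilities.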

If we compare the utilities obtained using value iteration with those obtained using policy iteration, we find that they are very close. As we discussed before, these utilities are solutions of the Bellman equations, and policy iteration and value iteration are just two alternative methods of solving them. Therefore, for the same MDP with the same Bellman equations, we should get the same results regardless of the method. In practice, because of differences such as the stopping criteria of the two algorithms, we get slightly different results.

Since the policy is determined by the relative ranking of utility values, not their absolute values, slightly different utilities usually do not affect the choice of policy. As our example shows, policy iteration and value iteration yield identical policies.

Effects of the discount factor

Changing the discount factor does not change the fact that the two methods still solve the same Bellman equations. As with γ of 0.5, when γ is 0.1 or 0.9 the utilities from policy iteration and value iteration are slightly different while the policies are identical.

As with the number of sweeps in policy evaluation during policy iteration, a larger γ requires more iterations of value iteration. In our example, when γ is 0.1 it takes only 4 iterations for the change in utility values, denoted Δ, to fall below 0.001. In contrast, it requires 11 iterations when γ is 0.5 and 67 iterations when γ is 0.9. As in policy iteration, a larger γ tends to produce better results but at the price of more computation.

Pseudo-code of value iteration

As in policy iteration, we use a threshold θ as the stopping criterion. Unlike policy iteration, we do not need to initialize a policy; in fact, we do not need to consider the policy at all until the very end. After the utilities have converged, we derive a policy from them, and this is the optimal policy.

Pseudo-code of value iteration
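
In Python-like form, the procedure reads roughly as follows. This is only a sketch: the data structures, with P[s][a] as a list of (probability, next state) pairs and R as a per-state reward list, are assumptions made here and not necessarily the layout used in the repo.

```python
import numpy as np

def value_iteration(P, R, gamma=0.5, theta=1e-3):
    """Sweep Bellman updates until the largest utility change falls below theta.

    P: transition model; P[s][a] is a list of (probability, next_state) pairs,
       and a terminal state is modelled as a state with no outgoing actions.
    R: per-state immediate rewards, e.g. -0.04 for ordinary grid-world states.
    """
    n_states = len(R)
    v = np.zeros(n_states)                # initialize every utility to zero
    while True:
        delta = 0.0
        for s in range(n_states):         # in-place sweep over all states
            if P[s]:                      # non-terminal state: Bellman update
                best = max(sum(p * v[s2] for p, s2 in P[s][a]) for a in P[s])
                new_v = R[s] + gamma * best
            else:                         # terminal state: utility is its reward
                new_v = R[s]
            delta = max(delta, abs(new_v - v[s]))
            v[s] = new_v
        if delta < theta:                 # largest change below theta: stop
            break
    # only now do we derive a policy: argmax over the expected utility of actions
    policy = {s: max(P[s], key=lambda a: sum(p * v[s2] for p, s2 in P[s][a]))
              for s in range(n_states) if P[s]}
    return v, policy
```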

Implement value iteration in Python

As in policy iteration, for the purpose of learning we include plots of the learning curve, visualizing how the utilities change over the iterations. When determining the optimal policy, if there is a tie between actions, we randomly choose one of them as the optimal action.
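
One simple way to implement that random tie-breaking (an illustrative sketch, not necessarily the exact code in the repo) is to collect every action whose expected utility is within a small tolerance of the maximum and pick one of them uniformly at random:

```python
import numpy as np

def greedy_action(q_values, rng=None):
    """Return the index of a maximal value, breaking near-ties uniformly at random."""
    if rng is None:
        rng = np.random.default_rng()
    q_values = np.asarray(q_values, dtype=float)
    best = np.flatnonzero(np.isclose(q_values, q_values.max()))  # all near-maximal actions
    return int(rng.choice(best))
```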

The following example shows how to solve a grid world problem using our value iteration code. After running the value iteration solver, we can plot the utility and policy, as well as evaluate the optimal policy by running it repeatedly.
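
The full solver, including the plotting and policy-evaluation utilities, lives in the repo linked above. As a minimal, self-contained illustration of the same workflow, the toy two-state MDP below (not the grid world from this story) is fed through the value_iteration sketch from the previous section:

```python
# A tiny two-state toy MDP, purely to illustrate the call signature of the
# value_iteration sketch above; it is not the grid world from this story.
R = [-0.04, 1.0]                        # state 1 is a rewarding terminal state
P = {
    0: {                                # from state 0 ...
        'stay':  [(1.0, 0)],            # ... 'stay' loops back to state 0
        'right': [(0.8, 1), (0.2, 0)],  # ... 'right' usually reaches state 1
    },
    1: {},                              # terminal state: no outgoing actions
}

v, policy = value_iteration(P, R, gamma=0.5, theta=1e-3)
print(v)        # converged utilities of states 0 and 1
print(policy)   # expected to print {0: 'right'}
```

Plugging in the grid world transition model and rewards instead should reproduce the utilities and the policy discussed above.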

From MDP to reinforcement learning

In today’s story and the previous stories on MDPs, we explained in detail how to solve an MDP using either policy iteration or value iteration. Now it is time to step back and think about how to apply MDPs in real life.

At first glance, MDPs seem extremely useful in many aspects of real life. Not only simple games like Pac-Man but also complex systems like the stock market may be represented as MDPs (for instance, prices as states and buy/hold/sell as actions). However, there is a big obstacle: we do not know the reward function or the transition model. After all, if we somehow knew the reward function of the MDP representing the stock market, we could become millionaires or billionaires very quickly. In most real-life MDPs, we have access to neither the reward function nor the transition model.

Using the Pac-Man game from our first MDP story as an analogy, real life is like playing Pac-Man without knowing the rules or the map of the game. It is like standing in a mist.

In miserable real life, we do not know where the diamond is, where the poison is, where the walls are, how big the map is, what the probability is that the robot accurately executes the intended action, or what the robot will do when it does not… All we know is that we choose an action, reach a new state, receive −0.04 (pay a penalty of 0.04), choose another action, reach another state, receive −0.04 again…

In other words, the MDPs we have discussed here are fully specified: we know the reward function and the transition model, which is exactly what methods such as policy iteration and value iteration require. In contrast, when the reward function and transition model are not known, that is where machine learning fits in: we need to somehow learn them, or learn good behavior directly, using methods such as reinforcement learning. We will move on to reinforcement learning in future stories. Stay tuned!
