Reinforcement Learning Chapter 4: Dynamic Programming (Part 3 — Value Iteration)

Numfor Tiapo
3 min read · Mar 6, 2023


Chapter 4 Series:

Code: https://github.com/nums11/rl

In the previous articles, we learned about the Policy Iteration algorithm and saw how to implement it and use it on Grid World. In this article, we’ll look at an alternative to policy iteration.

Policy Iteration Problems

Recall the policy iteration algorithm: starting from an arbitrary policy, we repeatedly alternate between Policy Evaluation (step 2) and Policy Improvement (step 3) until the policy stops changing.

The main drawback of policy iteration is that every iteration requires a full Policy Evaluation loop (step 2), which is itself an iterative process. This makes the policy iteration algorithm quite computationally expensive.
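
To make that cost concrete, here is a minimal sketch of the two alternating steps (not the article's actual code, which lives in the linked repository). It assumes a finite MDP described by a list of states, a list of actions, and a hypothetical transitions(s, a) function returning (probability, next_state, reward) triples; note the full evaluation loop nested inside every outer iteration.

```python
def policy_evaluation(policy, states, transitions, V, gamma=0.9, theta=1e-6):
    """Step 2: sweep over all states until the value estimates stop changing."""
    while True:
        delta = 0.0
        for s in states:
            a = policy[s]
            v_new = sum(p * (r + gamma * V[ns]) for p, ns, r in transitions(s, a))
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            return V

def policy_improvement(policy, states, actions, transitions, V, gamma=0.9):
    """Step 3: make the policy greedy with respect to the current values."""
    stable = True
    for s in states:
        old_action = policy[s]
        policy[s] = max(actions, key=lambda a: sum(
            p * (r + gamma * V[ns]) for p, ns, r in transitions(s, a)))
        if policy[s] != old_action:
            stable = False
    return policy, stable

def policy_iteration(states, actions, transitions, gamma=0.9):
    V = {s: 0.0 for s in states}
    policy = {s: actions[0] for s in states}
    while True:
        # Every outer iteration pays for an entire inner evaluation loop.
        V = policy_evaluation(policy, states, transitions, V, gamma)
        policy, stable = policy_improvement(policy, states, actions, transitions, V, gamma)
        if stable:
            return policy, V
```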

Value Iteration

What if we combined the Iterative Policy Evaluation (step 2) and Policy Improvement (step 3) steps into a single step? We can do exactly that with an algorithm called value iteration:

  • You can think of it as similar to policy evaluation: we repeatedly update the state-value function v one state at a time, and each full sweep over all the states produces a new version of v (v₀, v₁, …).
  • In regular policy evaluation, we update the value of a state to an expectation over the actions the current policy would take and the next states that could follow (weighted by how likely they are). In value iteration, however, we update the value of a state to the value of the single action with the maximal expected return, rather than an average weighted by the policy.
  • Value iteration takes the greedy maximization (the argmaxₐ from policy improvement) and combines it directly with the state-value updates. By doing this, each iteration of the value function gets closer to the optimal value function v* (a sketch of this combined update follows the list).
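
Concretely, the standard value iteration backup replaces the policy-weighted average with a max over actions: vₖ₊₁(s) = maxₐ Σ p(s′, r | s, a)[r + γ vₖ(s′)], summing over next states s′ and rewards r. Below is a minimal sketch of that update using the same hypothetical MDP interface as the policy iteration sketch above; the greedy policy is extracted only once, after the values have converged.

```python
def value_iteration(states, actions, transitions, gamma=0.9, theta=1e-6):
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            # Evaluation and improvement fused into one update: back up the
            # value of the best action instead of a policy-weighted average.
            v_new = max(sum(p * (r + gamma * V[ns]) for p, ns, r in transitions(s, a))
                        for a in actions)
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            break
    # Derive the greedy policy from the (near-)optimal value function.
    policy = {s: max(actions, key=lambda a: sum(
        p * (r + gamma * V[ns]) for p, ns, r in transitions(s, a))) for s in states}
    return policy, V
```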

Implementation

The full implementation of the previous Policy Iteration agent, along with an additional Value Iteration agent, for the Grid World described in the previous article is available in the repository linked above.
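
As a simplified stand-in, here is a hypothetical Grid World wired up to the two sketches above; the grid size, action set, and -1-per-step reward scheme are illustrative assumptions, not necessarily the article's exact environment.

```python
GRID_SIZE = 4
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right
STATES = [(r, c) for r in range(GRID_SIZE) for c in range(GRID_SIZE)]
TERMINAL = (GRID_SIZE - 1, GRID_SIZE - 1)

def transitions(state, action):
    """Deterministic moves, -1 reward per step, absorbing terminal state."""
    if state == TERMINAL:
        return [(1.0, state, 0.0)]
    row = min(max(state[0] + action[0], 0), GRID_SIZE - 1)
    col = min(max(state[1] + action[1], 0), GRID_SIZE - 1)
    return [(1.0, (row, col), -1.0)]

pi_policy, pi_values = policy_iteration(STATES, ACTIONS, transitions)
vi_policy, vi_values = value_iteration(STATES, ACTIONS, transitions)

# Both methods should converge to (numerically) the same optimal state values.
print(pi_values[(0, 0)], vi_values[(0, 0)])
```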

Testing

  • You can see that after 4 iterations the agent is able to estimate the optimal value function and derive the optimal policy. This is faster than the 5 iterations that policy iteration took, and each of those policy iteration steps contains an entire iterative policy evaluation loop of its own.

In the next and final article for the chapter, we’ll summarize our learnings and discuss the role that these Dynamic Programming methods play in RL at large.
