Key Takeaways 2 (DP and GPI) From Sutton and Barto RL textbook

What you should know about dynamic programming and policy iteration

James Koh, PhD
MITB For All
10 min read · Sep 1, 2023


This article covers Chapter 4 of the RL textbook by Sutton and Barto. You will learn how Policy Evaluation and Policy Improvement work, and the difference between Policy/Value Iteration and GPI.

This is the second part of the series on the Reinforcement Learning (RL) textbook by Sutton and Barto. For background context, please refer to the first article. Again, this is a supplement to, not a replacement for, the textbook.

Dynamic Programming (Chapter 4)

Without further ado, let us dive right into the technical details!

As a quick refresher, DP is a programming technique for breaking a problem up into smaller parts. We solve the trivial parts first, then combine them to solve the other parts that depend on the parts already solved.

A very simple example is using DP, rather than naive recursion, to compute the Fibonacci sequence.
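As a minimal sketch (my own illustration, not from the textbook), compare the naive recursive solution against a bottom-up DP one. The recursive version recomputes the same subproblems over and over, while the DP version solves each subproblem once, starting from the trivial cases, and reuses the stored answers.

def fib_recursive(n):
    # naive recursion: fib(n-1) and fib(n-2) recompute the same subproblems,
    # giving exponential running time
    if n < 2:
        return n
    return fib_recursive(n - 1) + fib_recursive(n - 2)

def fib_dp(n):
    # bottom-up DP: solve the trivial parts first, then build up in O(n)
    values = [0, 1]
    for i in range(2, n + 1):
        values.append(values[i - 1] + values[i - 2])
    return values[n]

print(fib_recursive(10), fib_dp(10))  # both print 55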

Pre-requisites

Make sure you have a robust understanding of what a function is before proceeding. For those who are less mathematically inclined and prefer to visualize with examples, this may help. If the independent variable is one-dimensional, you have a line or curve, such as y(x) = sin(2x). If there are two one-dimensional independent variables, you have a surface, such as f(x,y) = sin(x+y).

On the left is a curve corresponding to sin(2x). On the right is a surface corresponding to sin(x+y). Image by author using matplotlib.

Now, suppose we have a variable s which represents a point in two-dimensional space. Do you agree that the function f(s) is also a surface? That surface is exactly what we are after when we talk about solving for v_{\pi}(s)!

Of course, the state s could well be of much higher dimensions. It is the same; we simply get a hypersurface in hyperspace.

Policy Evaluation

Section 4.1 talks about computing the state-value function v_{\pi} for an arbitrary policy π. Here, we are not talking about any sort of intelligent choosing of actions. It is merely a matter of evaluating the expected return G under some given policy π, which could even be random.

Sutton and Barto also noted that if the environment’s dynamics are completely known, equation (4.4) is a system of linear equations, where the number of equations is equal to the number of unique states. (For ease of representation moving forward, let’s denote $\left|\mathcal{S}\right|$ as n.)

Why is this so? Because v_{\pi}(s) is essentially a curve or surface/hypersurface comprising n unknown points. We can write equation (4.4) in n ways, substituting each unique state into the left-hand side of its own equation. With p(s',r|s,a) known, the right-hand side can be expanded: for each action, we sum the immediate reward plus the discounted value of every possible successor state, weighted by the transition probability (which could well be zero if it is impossible to get from s to s' with reward r by performing action a), and then weight that sum by the action probability under the given policy π.
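For reference, equation (4.4) is the Bellman expectation equation with the dynamics written out explicitly:

v_{\pi}(s) = \sum_{a} \pi(a|s) \sum_{s', r} p(s', r|s, a) \left[ r + \gamma v_{\pi}(s') \right], \quad \text{for all } s \in \mathcal{S}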

When these n equations are solved simultaneously, we get the solution directly, without any iteration. [See the math equations and code for Example 3.8 in key takeaways 1 (Bandits, MDPs, and Notations), where the exact solution was obtained analytically.]
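To make the direct approach concrete, here is a rough sketch (my own, not code from the textbook or the previous article), assuming we have the dynamics stored in hypothetical arrays P[s, a, s'] (transition probabilities) and R[s, a] (expected immediate rewards), together with a policy pi[s, a]. Solving for v_{\pi} is then a single call to a linear solver:

import numpy as np

# Hypothetical small MDP: n states, m actions, known dynamics, fixed random policy
n, m, gamma = 5, 2, 0.9
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n), size=(n, m))  # P[s, a, s'] = probability of landing in s'
R = rng.normal(size=(n, m))                 # R[s, a] = expected immediate reward
pi = np.full((n, m), 1.0 / m)               # uniform random policy pi(a|s)

# Collapse the dynamics under the policy
P_pi = np.einsum('sa,sax->sx', pi, P)       # P_pi[s, s']
r_pi = np.einsum('sa,sa->s', pi, R)         # r_pi[s]

# Bellman expectation: v = r_pi + gamma * P_pi v  =>  (I - gamma * P_pi) v = r_pi
v_pi = np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)
print(v_pi)

For Example 3.8, n would be 25 (a 5-by-5 grid) with 4 actions, and the same one-shot solve gives the exact state-values.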

You will see Sutton and Barto talk about iterative policy evaluation. Applied to Example 3.8, we would have something like the following.

import numpy as np
from copy import deepcopy
from tqdm import tqdm
import matplotlib.pyplot as plt
import seaborn as sns

class Env:
    # The 5x5 gridworld of Example 3.8
    def __init__(self):
        self.min_row = 0
        self.max_row = 4
        self.min_col = 0
        self.max_col = 4

    def transition(self, state, action):
        if state == [0,1]:      # special state A: jump to A'
            state = [4,1]
            reward = 10
        elif state == [0,3]:    # special state B: jump to B'
            state = [2,3]
            reward = 5
        else:
            if action == 0:
                state[1] += 1   # move right one column
            elif action == 1:
                state[0] += 1   # move down one row
            elif action == 2:
                state[1] -= 1   # move left one column
            elif action == 3:
                state[0] -= 1   # move up one row
            else:
                assert False, "Invalid action"

            if (state[0] < self.min_row) or (state[1] < self.min_col) \
                    or (state[0] > self.max_row) or (state[1] > self.max_col):
                reward = -1     # attempted to move off the grid
            else:
                reward = 0

        next_state = np.clip(state, [0,0], [4,4]).tolist()
        return reward, next_state

class Explorer:
    def __init__(self, Env):
        self.Env = Env
        self.state = [0,0]  # list indicating row and column
        self.policy = lambda s: np.random.choice(4)
        self.v_s = np.zeros((self.Env.max_row+1, self.Env.max_col+1))
        self.gamma = 0.9
        self.lr = 0.1

    def step(self, learn=True):
        # action = np.random.choice(4) # 0:right, 1:down, 2:left, 3:up
        action = self.policy(self.state)

        # reward, self.state = self.Env.transition(self.state, action)
        cur_state = self.state
        reward, next_state = self.Env.transition(
            deepcopy(cur_state), action
        )

        if learn is True:
            self.lr *= 0.99999
            self.learn_value_fn(cur_state, reward, next_state)

        self.state = next_state
        return reward

    def learn_value_fn(self, cur_state, reward, next_state):
        self.v_s[cur_state[0], cur_state[1]] += self.lr * (
            reward
            + self.gamma * self.v_s[next_state[0], next_state[1]]
            - self.v_s[cur_state[0], cur_state[1]]
        )
        # note that I used bootstrapping here, so as not to reveal
        # the answer to the assignment question which I will set.

    def run(self, n_iter=10000):
        for _ in tqdm(range(n_iter)):
            self.step()

env = Env()
agent = Explorer(env)

agent.run(500000)
sns.heatmap(
    agent.v_s, annot=True, cmap='coolwarm', fmt='.1f'
)
plt.show()

We see the update rule in equation (4.5). Do not confuse v_{k+1}(s) with v_{\pi}(s). Here, v_{k+1}(s) refers to the value function after k+1 iterations, for the policy π. It is computed from v_{k}(s) of the other states, which themselves may still be subject to change upon further iterations. The textbook authors describe repeating this procedure until convergence.
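For reference, the update rule of equation (4.5) is

v_{k+1}(s) = \sum_{a} \pi(a|s) \sum_{s', r} p(s', r|s, a) \left[ r + \gamma v_{k}(s') \right],

which is simply equation (4.4) with v_{\pi} on the right-hand side replaced by the current estimate v_{k}.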

Example 4.1, and the results shown in Figure 4.2 (page 93), deserve further attention.

Replication of a portion of Figure 4.2. Image by author using matplotlib and seaborn. Just leave a note in the comments if you want the code for this.

Notice the column under ‘v_{k} for the Random Policy’. At k = ∞, the state-values are shown to range from -22 to 0, and to the right is a greedy policy associated with this value function. This ‘convergence’ at k = ∞ actually corresponds just to the random policy, and each corresponding policy to the right is simply the argmax with respect to the computed state values.

There is no policy iteration here, only policy evaluation of the random policy. It is merely a coincidence of this example that acting greedily on the converged value function immediately gives the optimal policy in the next iteration.

Policy Improvement

In section 4.2, we learn to do something meaningful after a policy has been evaluated.

Equation (4.6) is simply a restatement of equation (3.15) in page 75, just that we have q_{\pi} instead of q_{*} since it is now for some policy π.
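Written out, that is

q_{\pi}(s, a) = \mathbb{E}\left[ R_{t+1} + \gamma v_{\pi}(S_{t+1}) \mid S_t = s, A_t = a \right] = \sum_{s', r} p(s', r|s, a) \left[ r + \gamma v_{\pi}(s') \right].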

This brings us to the policy improvement theorem. Equation (4.7) considers the case where we select an action a, chosen from policy π′, and thereafter follow policy π. If this results in the action-value function q_{\pi} being better than (or as good as) v_{\pi} for all states, then the state-value function of π′ must be better than (or as good as) that of π. Applying a strict inequality leads to the conclusion that if we select an appropriate action a in some specific state s (and thereafter keep to the original π) such that G (the expected cumulative discounted reward) increases, then we have improved the policy.
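In symbols, the theorem states that if

q_{\pi}(s, \pi'(s)) \geq v_{\pi}(s) \quad \text{for all } s \in \mathcal{S},

then

v_{\pi'}(s) \geq v_{\pi}(s) \quad \text{for all } s \in \mathcal{S}.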

By extension, if the greedy action is selected at all states, the preceding paragraph holds. Meanwhile, if the original policy π and the new greedy policy π′ have the same state-values for all s, it implies that both policies are already optimal. Putting this together, we conclude that acting greedily necessarily improves a non-optimal policy.

Wait a minute. Doesn’t this sound misleading? Are we saying that simply choosing a greedy action is the best? No!

Let’s go back to the technical definition, and pay attention to the details. Under the policy improvement theorem, we have q_{\pi}(s,\pi’(s)) \geq v_{\pi}(s). The expected cumulative reward is computed under policy π, i.e. it is what we get if we act according to policy π thereafter. Think of the policy improvement step this way: “Given what I know now (i.e., the value function of my current policy), what is the best thing to do at each state, if I were to evaluate based on my existing knowledge?” Well, it is to take the greedy action. However, the existing knowledge itself may be imperfect, which is why we also need to explore.

Return to Figure 4.2 to check your understanding (I paste it below for your convenience). The value functions on the left correspond to the random policy, and choosing a greedy policy with respect to those evaluated value functions necessarily leads to an improvement over the original policy (which was random in this example). It just so happens that from k=3 onwards, an optimal policy is obtained (you can verify this by observation).

Replication of a portion of Figure 4.2. Same figure as above, pasted here for your convenience

This leads us to the next part of the chapter, which is to iteratively improve both the policy and the computation of our value function.

Policy and Value Iteration

In section 4.3, we see an alternating sequence of policy evaluation and policy improvement. What this means is that we can start off with any policy \pi_{0}, evaluate v_{\pi_{0}} (see section 4.1), obtain an improved policy \pi_{1} (see section 4.2), and repeat this process until the optimal policy \pi_{*} is obtained.

Look at the pseudo-code in Figure 4.3. ‘Step 2’ is itself an iterative process (run until convergence). Policy iteration involves performing step 2 and step 3 repeatedly until the policy is stable, i.e. when there is no change to the predicted actions across the state-space.
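To make the loop concrete, here is a minimal sketch of policy iteration (my own illustration of Figure 4.3, not the textbook's pseudo-code), reusing the hypothetical arrays P[s, a, s'] and R[s, a] from the earlier snippet:

import numpy as np

def policy_iteration(P, R, gamma=0.9, theta=1e-8):
    n = R.shape[0]
    policy = np.zeros(n, dtype=int)   # deterministic policy: one action per state
    v = np.zeros(n)
    while True:
        # Step 2: iterative policy evaluation, swept until convergence
        while True:
            q = R + gamma * P @ v                # one-step lookahead, shape (n, m)
            v_new = q[np.arange(n), policy]      # value of the action the policy takes
            delta = np.max(np.abs(v_new - v))
            v = v_new
            if delta < theta:
                break
        # Step 3: greedy policy improvement
        new_policy = np.argmax(R + gamma * P @ v, axis=1)
        if np.array_equal(new_policy, policy):   # policy stable, hence done
            return policy, v
        policy = new_policy

With the P and R defined earlier, policy_iteration(P, R) returns a stable greedy policy together with its value function.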

In section 4.4, Sutton and Barto define value iteration as the special case in which policy evaluation is stopped after just one sweep. In terms of implementation, only one thing changes: the policy evaluation step is truncated. Here, we perform just one round of iteration in step 2 (instead of waiting for convergence), moving from v_{k}(s) to v_{k+1}(s) for all states s. You may visualize this as having an updated curve or surface/hypersurface. In step 3, the policy is updated and the probability of taking each action changes. Note that p(s',r|s,a) is a property of the environment and remains unchanged. The subscript k in equation (4.10) hence corresponds to the number of times we have iterated through steps 2 and 3.
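Continuing the same sketch (and the same hypothetical arrays), value iteration keeps everything except that the inner evaluation loop becomes a single sweep with the max over actions folded in, which is exactly the shape of equation (4.10):

def value_iteration(P, R, gamma=0.9, theta=1e-8):
    n = R.shape[0]
    v = np.zeros(n)
    while True:
        # one truncated evaluation sweep, greedy max folded in (cf. equation 4.10)
        v_new = np.max(R + gamma * P @ v, axis=1)
        delta = np.max(np.abs(v_new - v))
        v = v_new
        if delta < theta:
            break
    greedy_policy = np.argmax(R + gamma * P @ v, axis=1)
    return greedy_policy, v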

Why only one sweep in each policy evaluation step? Because it is not necessary to compute v_{\infty} each time. Look back at Figure 4.2. The value function at k=3 already gives us a very good improved policy when we act greedily; further iterations towards k=∞ are just a waste of computational resources. Even though the absolute values differ substantially, it is the relative values that determine the greedy policy.

Try coding the solution to Example 4.2 yourself, and see if the results are similar to those shown in Figure 4.4. I deem this a good checkpoint to evaluate your understanding of the concepts, and it should not be skipped.

As I will be doing a code walkthrough of this in class, the code will only be updated in this article later this month for you to check.

Asynchronous DP

Up till now, the updates in policy iteration or value iteration involve a sweep through the entire state-space. This may be infeasible when the state space is very large. Moreover, it may be superfluous.

Sutton and Barto also stated that “Some states may not need their values backed up as often as others”, particularly those that are not relevant to the optimal behaviour.

I do not want to go into the technical details of ‘asynchronous’. Nor is it meaningful to go down this path, as it may cause us to lose focus on the key takeaway of section 4.5. It is not about parallel or distributed computing.

The important idea here is that “some states may be backed up several times before the values of others are backed up once.” For example, this would be the case if backup is done on the states as and when they are visited. Notice that this actually goes hand-in-hand with the idea of online learning.
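As a tiny sketch of that idea (again my own, continuing the hypothetical P and R from above), a backup of just the state the agent currently occupies needs no sweep at all:

def backup_visited_state(P, R, v, s, gamma=0.9):
    # in-place expected update of a single visited state s (no full sweep)
    v[s] = np.max(R[s] + gamma * P[s] @ v)
    return v

Calling this on states in whatever order they happen to be visited is one way that some states end up backed up many times before others are backed up once.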

In DQN and Actor-Critic (which I consider the RL equivalent of ResNet in computer vision: big names that have been around for many years and, while not cutting edge, are reliable stalwarts for a first-cut solution), this concept is naturally applied.

Here are some small disclaimers for completeness. The contents discussed up till section 4.4 are model-based, where we have knowledge of p(s',r|s,a), whereas DQN and Actor-Critic are model-free and work with samples. We have also not talked about the use of function approximation (think neural networks). Furthermore, when we use experience replay, we may back up states that were encountered many episodes ago, not “as and when visited”. Just keep these at the back of your mind.

Generalized Policy Iteration

Sutton and Barto described generalized policy iteration (GPI) as “the general idea of letting policy evaluation and policy improvement processes interact, independent of the granularity and other details of the two processes.”

Policy iteration (section 4.3) and value iteration (section 4.4) can be seen as special cases of GPI. In the former, the value function is updated only after convergence, while in the latter it is updated after one sweep across all states. Instead of doing either, it is possible to adjust the value function by as little as updating a single state, and to improve the policy by acting greedily with respect to this function. It is very possible that there is no change to the policy. No harm is done, because we simply proceed to update the value function again, accounting for the visit to another state.
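A very fine-grained sketch of GPI, in the same hypothetical setting as above, could interleave the two processes one state at a time: back up the value of the state just visited under the current policy, then re-derive the greedy action at that state. Often the greedy action does not change, and that is fine.

def gpi_step(P, R, v, policy, s, gamma=0.9):
    # evaluation: one expected backup of state s under the current policy
    a = policy[s]
    v[s] = R[s, a] + gamma * P[s, a] @ v
    # improvement: act greedily at state s with respect to the updated values
    policy[s] = np.argmax(R[s] + gamma * P[s] @ v)
    return v, policy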

As for section 4.7, I deem it as a commentary that extends section 4.5.

Conclusion

Originally, I had intended to cover chapters 4 and 5 together in this article. However, this is already 10 minutes of read time!

I have been somewhat verbose, as there are many subtle details which might easily be overlooked and warrant a more detailed discussion. After all, rushing through things would lead to a shaky foundation, and that is not good.

As usual, I will be happy to clarify doubts which you may have.

In my third article for this series, I will be writing about Chapter 5, which covers Monte Carlo methods. I aim to submit it this Sunday (glad that today is a public holiday, and I have the weekend to work on it).

Disclaimer: All opinions and interpretations are that of the writer, and not of MITB. I declare that I have full rights to use the contents published here, and nothing is plagiarized. I declare that this article is written by me and not with any generative AI tool such as ChatGPT. I declare that no data privacy policy is breached, and that any data associated with the contents here are obtained legitimately to the best of my knowledge. I agree not to make any changes without first seeking the editors’ approval. Any violations may lead to this article being retracted from the publication.
