Demystifying Value and Policy Iteration Algorithms in Reinforcement Learning (Part 2)

Ayodele Benjamin Esan 🌱⚡️
6 min read · Aug 31, 2023


In the previous article on Value Iteration (link here), we reviewed value iteration and its working dynamics in detail, wrapping up with a practical modified grid world problem. This article focuses on Policy Iteration (PI), the second of the two model-based RL algorithms and an alternative to the Value Iteration (VI) method. While both methods converge to the same optimal policy, PI typically gets there in significantly fewer iterations. Let’s explore how PI does this.

The core objective of any RL agent is to discover the optimal policy or strategy to adopt in each state, ensuring the maximization or minimization of its objective. In essence, the absolute ‘value’ of being in a state is not of primary concern; rather, what matters most is identifying the best action to take while in that state. This fundamental distinction is where PI sets itself apart.

In PI, a random policy is initially generated and assigned to each state in the environment. The agent then executes these actions and subsequently ‘learns’ from the outcomes. Similar to how a toddler learns from touching a lit candlestick and getting burnt, the agent internalizes the repercussions of its actions, associating negative rewards with undesirable outcomes. As the agent acts on the instantiated random policy, it computes the value of being in each state while following that policy, denoted V𝜋(s). It performs this computation for every state in the environment, using a refined version of the Bellman equation from the VI article.

Recall the Bellman equation from the VI article, eqn. (1):

V(s) = R(s) + γ max_a ∑_s′ P(s′|s, a) V(s′)   (1)

Given an initial random policy 𝜋, the refined Bellman equation is eqn. (2):

V𝜋(s) = R(s) + γ ∑_s′ P(s′|s, 𝜋(s)) V𝜋(s′)   (2)

Did you catch the difference here? Notice that in equation (2) the ‘max’ operator is gone. Because the policy already tells the agent which action to take in each state, there is no need to evaluate every possible action. This makes eqn. (2) linear, and therefore much faster to solve than eqn. (1), which is non-linear and, depending on the environment, may take considerably longer. By adopting a more directed approach to learning the optimal policy, PI converges faster, making it an appealing choice in certain RL scenarios. Its ability to rapidly determine the most effective strategy lets agents navigate complex environments more swiftly, opening up new possibilities for practical RL implementations across various domains.
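
To make the difference concrete, here is a minimal sketch for a single state with two actions (all the numbers, and names like V_next, are made up for illustration): the eqn. (1) backup has to search over every action with a max, while the eqn. (2) backup simply follows the action the policy prescribes.

import numpy as np

gamma = 0.9
R_s = -0.04                       # immediate reward in state s (illustrative)
V_next = np.array([0.5, 1.0])     # current value estimates for the two reachable states

P = {                             # P(s'|s, a) for each available action (illustrative)
    "up":    np.array([0.8, 0.2]),
    "right": np.array([0.1, 0.9]),
}

# Eqn (1): Value Iteration backup, a max over all actions (non-linear).
v_vi = R_s + gamma * max(P[a] @ V_next for a in P)

# Eqn (2): policy evaluation backup, the policy fixes the action, so no max (linear).
pi_s = "up"                       # the action the random policy prescribes in s
v_pi = R_s + gamma * (P[pi_s] @ V_next)

print(round(v_vi, 3), round(v_pi, 3))   # v_vi is never smaller than v_pi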

Building Blocks to Optimal Policy

Photo by Nick Page on Unsplash

But how do we know whether the random policy we instantiated is the best policy? Well, we can’t, unless we take one more step: a look-ahead computation of the value of each state in the environment, V(s), using all possible actions. Once computed, this value can be compared with the one obtained under the random policy, V𝜋(s).

Okay, that last bit sounds confusing. I would know, because I was also confused the first time I came across it. Let’s work through it again, slowly this time.

Think of this like building with Lego blocks: let’s break the process down into four distinct blocks to understand how PI achieves faster convergence.

Block 1 [Initialization]:

The agent starts by laying the foundation, i.e., initializing a random policy (𝜋_random) for each state in the environment. At this point, the agent takes exactly one action in each state, the one prescribed by the random policy, regardless of whether it is optimal.
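
A rough sketch of Block 1 might look as follows. The grid layout and the names (n_rows, n_cols, terminals, walls, actions) are assumptions chosen to mirror the modified grid world from the VI article; the exact layout in the full code may differ.

import random

n_rows, n_cols = 3, 4
actions = ["up", "down", "left", "right"]
terminals = [(0, 3), (1, 3)]      # assumed goal and trap cells
walls = [(1, 1)]                  # assumed blocked cell

random.seed(42)                   # for a reproducible "random" policy
initial_policy = [
    [None if (r, c) in terminals or (r, c) in walls else random.choice(actions)
     for c in range(n_cols)]
    for r in range(n_rows)
]

for row in initial_policy:        # one arbitrary action per (non-terminal) state
    print(row)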

Block 2 [Policy Evaluation]:

Being a smart agent (or at least trying to be), it doesn’t settle for the initial random policy. It undertakes policy evaluation to critique the current policy for each state. To effectively critique, the agent needs a benchmark, a policy to compare against, just like your teacher needs a marking guide to grade your assignment accurately.

Remember when you were in school and your teacher gave you an assignment that you had to submit the next day? How do you think your teacher was able to grade your assignment and provide you with feedback? Well, some of them were too busy and probably assigned the grading to their Teaching Assistants (TAs). However, a good professor will always provide the marking guide or rubrics to the TA so that he/she can effectively critique your assignment.

This is precisely what the policy evaluation step entails: the agent needs an authentic benchmark against which it can critique the current policy and then correct it to a better one. But it needs to compare apples to apples, right? It already has a random (your trial) policy, yes, but it does not have a benchmark ‘policy’ (marking guide) to critique against; it only has the set of possible actions it can take in each state. Equations (1) and (2) tell us something useful: we can compute the value of being in a state when we have a set of actions to choose from (eqn. 1), and also when a policy 𝜋 is given instead (eqn. 2). Boom! Now our agent can do the apples-to-apples comparison.
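
One way to see the benefit of having a policy in hand: because eqn. (2) has no max, it is a plain system of linear equations, so the benchmark values V𝜋 can even be solved for exactly. Below is a minimal sketch on a made-up two-state MDP (the grid world code later in this article computes the same quantity iteratively instead).

import numpy as np

gamma = 0.9
R = np.array([-0.04, 1.0])        # R(s) for two states s0, s1 (illustrative)
P_pi = np.array([                 # P_pi[s, s'] = P(s'|s, pi(s)) under the current policy
    [0.8, 0.2],
    [0.0, 1.0],
])

# Eqn (2) in matrix form: V = R + gamma * P_pi @ V, i.e. (I - gamma * P_pi) V = R
V_pi = np.linalg.solve(np.eye(2) - gamma * P_pi, R)
print(V_pi)                       # the benchmark values under the current policy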

Block 3 [Comparison]:

Remember, we are building up the blocks. The agent computes V𝜋(s) for each state (our submitted assignment). Next, it computes V(s) for that state using all possible actions, following the Bellman equation in (1).

In the end, we have V𝜋(s) and V(s). The verdict of the agent’s critique (a small code sketch of this verdict follows after the note below) is thus:

If V(s) > V𝜋(s): the agent updates the policy at this state to the action that maximized V(s).

Note: Technically speaking, V(s) cannot be less than V𝜋(s), because of how the two Bellman computations relate:

1) the action used to compute V𝜋(s) in equation (2) is one of the actions considered when computing V(s), and

2) V(s) is the max of the values computed over all actions in that state.
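
Here is the Block 3 verdict as a small sketch for a single state (again with made-up numbers): compute the look-ahead value of every action, take the best one as V(s), and update the policy’s action whenever it beats V𝜋(s).

import numpy as np

gamma = 0.9
R_s = -0.04
V_next = np.array([0.2, 0.7, 1.0])          # value estimates of the reachable states (illustrative)
P = {                                        # P(s'|s, a) for each action (illustrative)
    "up":    np.array([0.8, 0.1, 0.1]),
    "down":  np.array([0.1, 0.8, 0.1]),
    "right": np.array([0.1, 0.1, 0.8]),
}
pi_s = "up"                                  # current policy's action in this state

v_pi = R_s + gamma * (P[pi_s] @ V_next)                    # V_pi(s), via eqn (2)
q = {a: R_s + gamma * (P[a] @ V_next) for a in P}          # look-ahead value of each action
best_action, v_s = max(q.items(), key=lambda kv: kv[1])    # V(s), via eqn (1)

if v_s > v_pi:                               # the agent's verdict
    pi_s = best_action                       # policy improvement at this state
print(pi_s, round(float(v_s), 3), round(float(v_pi), 3))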

The final block is the policy improvement stage. What happens when the student receives his/her graded assessment from the professor? Well, a good student would want to improve on the weak areas and do better next time.

Block 4 [Policy Improvement]:

If the critique reveals that the current policy is not optimal, the agent updates the policy using equation (3), i.e., it keeps, for each state, the action with the highest look-ahead value:

𝜋_new(s) = argmax_a [ R(s) + γ ∑_s′ P(s′|s, a) V𝜋(s′) ]   (3)

This process is known as “policy improvement.” The agent then resets all state values to zero and starts afresh from Block 1, except that the initial random policy is now replaced by the improved policy obtained via equation (3).

The agent continues this iterative process until V(s) is within some tolerance (e.g., 1e-6) of V𝜋(s) and the maximizing action agrees with the policy’s action at every state. At that point, the agent is confident it has found the best action to take in any state of the environment, which constitutes the optimal policy.
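
Putting the four blocks together, below is a compact, self-contained sketch of the whole evaluate, critique, and improve cycle on a tiny abstract two-state MDP (all numbers are illustrative; the grid world version follows in the next section).

import numpy as np

gamma, tol = 0.9, 1e-6
states, actions = [0, 1], [0, 1]
R = np.array([-0.04, 1.0])                   # R(s) (illustrative)
P = {                                        # P[a][s, s'] = P(s'|s, a) (illustrative)
    0: np.array([[0.9, 0.1], [0.0, 1.0]]),
    1: np.array([[0.2, 0.8], [0.0, 1.0]]),
}

policy = np.zeros(len(states), dtype=int)    # Block 1: arbitrary initial policy
while True:
    # Block 2: policy evaluation, sweeping eqn (2) until the values settle
    V = np.zeros(len(states))
    while True:
        V_new = np.array([R[s] + gamma * P[policy[s]][s] @ V for s in states])
        delta = np.max(np.abs(V_new - V))
        V = V_new
        if delta < tol:
            break
    # Blocks 3 and 4: one-step look-ahead critique and greedy improvement
    new_policy = np.array([max(actions, key=lambda a: R[s] + gamma * P[a][s] @ V)
                           for s in states])
    if np.array_equal(new_policy, policy):   # no state changed its action: optimal
        break
    policy = new_policy

print("optimal policy:", policy, "state values:", V)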

[Source: Author] Policy Iteration Algorithm Sample Workflow

Following the modified grid world example explained in the VI article, a code snippet for the PI algorithm (the policy evaluation phase) is shown below, with the full code on my GitHub page.

# Initialize random policies for the environment.
# Policy Evaluation Phase

def P_eval(initial_policy, v):
    for k in range(1, 16):                   # fixed number of evaluation sweeps over the grid
        for row in range(n_rows):
            for col in range(n_cols):
                if (row, col) == terminals[0]:
                    v[row, col] = 1
                    continue

                elif (row, col) == terminals[1]:
                    v[row, col] = -1
                    continue

                elif (row, col) in walls:
                    continue

                else:
                    p_next_states = []  # Initialize list for plausible next states
                    v_next_states = []  # Initialize list for the values of the plausible next states

                    for i, a in enumerate(actions):
                        next_row = row + row_change[i]
                        next_col = col + col_change[i]
                        # Check that the boundaries of the environment are not violated.
                        if (next_row < 0 or next_row >= n_rows or
                                next_col < 0 or next_col >= n_cols or
                                (next_row, next_col) in walls):
                            next_state = (row, col)
                        else:
                            next_state = (next_row, next_col)
                        p_next_states.append(next_state)
                        v_next_states.append(v[next_state])

                    rand_action = initial_policy[row][col]

                    act_prob = list(motion_probs[rand_action].values())
                    temp_res = [act_prob[i] * v_next_states[i] for i in range(len(v_next_states))]  # P(s'|s,a) * V(s')
                    sum_temp_res = sum(temp_res)  # ∑ P(s'|s,a) * V(s')
                    v[row, col] = R[row, col] + gamma * sum_temp_res  # R(s) + γ ∑ P(s'|s,a) * V(s')
    return v
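
For completeness, here is a sketch of what the policy improvement phase could look like alongside P_eval. It is not the exact code from the repository; it assumes the same names the snippet above relies on (n_rows, n_cols, terminals, walls, actions, row_change, col_change, motion_probs, R, gamma) and simply applies the greedy one-step look-ahead of equation (3) to every state.

# Policy Improvement Phase (sketch only; assumes the same globals as P_eval above)

def P_improve(policy, v):
    stable = True                            # becomes False if any state changes its action
    for row in range(n_rows):
        for col in range(n_cols):
            if (row, col) in terminals or (row, col) in walls:
                continue
            best_action, best_value = None, float("-inf")
            for a in actions:                # critique: look ahead with every possible action
                act_prob = list(motion_probs[a].values())
                exp_value = 0.0
                for i, _ in enumerate(actions):
                    next_row = row + row_change[i]
                    next_col = col + col_change[i]
                    if (next_row < 0 or next_row >= n_rows or
                            next_col < 0 or next_col >= n_cols or
                            (next_row, next_col) in walls):
                        next_state = (row, col)          # bounced back to the same cell
                    else:
                        next_state = (next_row, next_col)
                    exp_value += act_prob[i] * v[next_state]
                value = R[row, col] + gamma * exp_value  # look-ahead value of action a
                if value > best_value:
                    best_action, best_value = a, value
            if best_action != policy[row][col]:
                policy[row][col] = best_action           # eqn (3): keep the greedy action
                stable = False
    return policy, stable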

Acknowledgment

A huge shoutout and my gratitude to Dr. Abderrahmane Lakas, a Professor in the Department of Information Technology at United Arab Emirates University, for his immense help in understanding the concepts of reinforcement learning.

I hope you enjoyed reading this piece as much as I enjoyed writing it. Please leave a few claps for me and hit the follow button to receive more content like this. Until next time, happy learning 😊 !!


