Demystifying Value and Policy Iteration Algorithms in Reinforcement Learning (Part 1)

Ayodele Benjamin Esan 🌱⚡️
12 min read · Aug 18, 2023


For the longest time, the concepts of value iteration and policy iteration in reinforcement learning left me utterly perplexed. Although both methods lead to the same result, they employ different approaches to achieve this outcome. In this new blog series, I aim to simplify these concepts based on my understanding, hoping that it will help you grasp them as well. Before diving into the details, I recommend watching this insightful video by Alice Gao on YouTube: [Link to the Video].

Now, let’s take a step back and provide some essential background and context. First, let’s discuss Reinforcement Learning (RL):

Reinforcement Learning (RL):

For readers less familiar with the field, reinforcement learning is one of the three broad categories of machine learning, alongside supervised and unsupervised learning. Concisely defined, RL is the process of teaching a machine (an agent, or agents) how to act intelligently within an environment, whether static or dynamic, through trial and error. Over time, the agent becomes capable of autonomously making decisions within the environment to reach a specific goal, strategically selecting the actions that yield the highest rewards or the lowest penalties. To put this into perspective, imagine yourself driving a car with the sole objective of minimizing fuel consumption while navigating from point A to point B. You are faced with three different routes, as depicted below:

[Source: Author] Travel Paths for Destination ‘B’ from ‘A’.
  • Path A: Longer path, but safe route. (Path A goes over the bridge)
  • Path B: Shorter route, but 90% chance of being robbed under the bridge, just a few kilometers before reaching destination B. (Path B goes under the bridge)
  • Path C: Shortest and straightest route, but the road is untarred and muddy, causing ten times more friction on the car's tires.

When choosing among these three paths, with the sole objective (or reward) of reducing fuel costs, it may seem intuitive to opt for the slower and steadier option (Path A). However, this choice does not align with our stated objective of minimizing fuel consumption. In that narrow sense, the optimal strategy would be to take Path B, even with a 90% chance of encountering robbers, because Path B lets the car burn the least fuel, making it the smarter choice in the context of fuel efficiency alone.

At first glance, this decision might sound illogical, and you might think, “This guy is crazy!” Well, in a way, you're right; I am crazy. But machines or agents in RL can be even crazier if they lack proper guidance on what is relevant and what is not. This highlights the importance of crafting a well-defined objective or reward function for our agent. Such a function should not only prioritize reducing fuel costs but also ensure that the humans in the car arrive safely at their destination. After all, what's the point of the car reaching its endpoint if no one in it arrives safely?

RL offers a powerful framework for teaching machines to act intelligently within an environment. The agent learns to autonomously make decisions, selecting actions that lead to the highest payoff, all while achieving a specific goal. Before delving deeper into the methods of value iteration and policy iteration, let’s touch on the fundamental backbone of RL — the Markov Decision Process (MDP). This framework comprises four key elements: the agent, state, action(s), and reward. Each element plays a crucial role in shaping the learning process. We’ll explore this further as we dive into the workings of value iteration.

The MDP framework serves as the bedrock of RL, and RL rests on these four building blocks. When the transitions between states are fully known, the environment is entirely observable by the agent and is known as a fully observable MDP (FOMDP). However, when some or all transition probabilities are unknown, it becomes a partially observable MDP (POMDP), which presents its own set of challenges. With these foundational concepts established, we can now focus on the main subject of this blog post: the Value Iteration (VI) and Policy Iteration (PI) algorithms. Both algorithms apply to FOMDPs, where the transition probabilities between states are completely known. Let's explore each of these algorithms in detail, one at a time.

Value Iteration — Understanding the Concept of ‘Value’ in Value Iteration

To grasp the essence of Value Iteration (VI), let’s begin by exploring what we mean by ‘Value’ in the context of RL. When something is considered valuable to us, our behavior tends to exhibit certain characteristics:

  1. We attribute significant worth to it
  2. We strive to preserve or acquire it
  3. We expect it to retain its worth over time

In the field of RL, the concept of ‘value’ is central to the understanding of VI. Specifically, the value of a state signifies the expected sum of discounted rewards the agent anticipates obtaining while progressing toward its defined goal or objective. Mathematically, this is represented by the equations below:
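
Reconstructed here from the definitions that follow, equation (1) is the expected sum of discounted rewards:

V(s) = 𝔼[ R(t) + 𝞬·R(t+1) + 𝞬²·R(t+2) + 𝞬³·R(t+3) + … ]    … (1)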

Where V(s) denotes the value of the agent being in state ‘s’, R(t) is the immediate reward obtained at time step ‘t’, and 𝞬 is the discount factor, which lies between 0 and 1, i.e., 𝞬 ∈ [0, 1], and represents the relative importance an agent places on immediate versus long-term rewards. When 𝞬 = 0, the agent considers only the immediate reward, since all other terms in equation (1) become zero. When 𝞬 = 1, the agent places equal importance on all future rewards; in other words, a $10 reward at R(t+1), R(t+2), etc. is valued exactly the same as $10 received now. Except in rare scenarios, this is not what we want: you would prefer I give you $1000 today rather than wait until next year to receive that same $1000. This is why, in practice, 𝞬 is typically set between 0.7 and 0.99, letting the agent weigh the trade-off between immediate and future rewards. From the law of expectation, eqn. (1) can be expanded as shown below:
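
One way to carry out this expansion (a reconstruction consistent with the discussion that follows) is to factor out 𝞬 and group all future rewards into the value of the next state s’:

V(s) = 𝔼[ R(t) + 𝞬·( R(t+1) + 𝞬·R(t+2) + … ) ]    … (2)

V(s) = R(s) + 𝞬·𝔼[ V(s’) ]    … (3)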

The expectation ‘𝔼’ can be re-written in terms of P(s’|s, a), i.e., the probability of arriving at the next state s’ when action ‘a’ is taken in state s.

By understanding the notion of ‘value’ in this context, we lay a strong foundation for comprehending how VI unfolds and why it serves as a crucial method in RL. As we progress, we'll delve into the mechanics of VI and its practical implications for training intelligent agents to make informed decisions in dynamic environments. But first, let's circle back and relate RL concepts to a real-life scenario; this will help explain the intricacies of the value iteration algorithm.

Relating RL Concepts to Real-Life Scenarios

Understanding RL becomes even more relatable when we draw parallels to real-life scenarios. Let’s consider the experience of dining at a restaurant as an analogy to grasp the concepts further.

Imagine you visit a restaurant with the goal of feeling satisfied after your meal. We can break down this scenario using RL terminology:

  1. State: Your initial state is “hungry.” It represents your current condition or situation.
  2. Actions: The actions available to you in this situation are to either “eat the meal” or “stare and admire the meal without eating it.”
  3. Rewards: You associate a particular reward with each action. If you choose to “eat the meal,” the reward might be high, as it satisfies your hunger. However, if you decide to “stare and admire the meal without eating it,” the reward would likely be low, as it doesn’t address your hunger.

Now consider this: if you remain in the “hungry” state and merely stare at the meal for 15 minutes without eating it, the probability of feeling full after that time would be close to zero. It’s a humorous example, but it highlights the importance of selecting the right actions to achieve your objective (feeling satisfied). In RL, similar to your dining experience, the agent (machine) makes decisions to reach a specific goal, taking actions based on its current state, aiming to maximize the expected cumulative rewards. By relating RL concepts to relatable real-life scenarios, we can develop a more intuitive understanding of the algorithms and methods employed in this exciting field. The insights gained from our dining analogy will help us appreciate the significance of the equations and their implications for training intelligent agents effectively.

Returning to eqn. (3), it can be re-written as:
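
Writing the expectation as an explicit sum over the possible next states s’ (again reconstructed in the same notation):

V(s) = R(s) + 𝞬 · ∑ P(s’|s, a)·V(s’)    … (4)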

Remember that the expectation ‘𝔼’ was re-written as P(s’|s, a). Indeed, what we have just rewritten (eqn. 4) is known as Bellman’s equation — a pivotal concept in RL. However, we need to be mindful of a crucial detail that can significantly impact the value of each state.

Here’s the catch: Depending on the actions taken in any given state, the resulting value of that state will vary. Consequently, to determine the most accurate value of a state, we must consider the action that yields the maximum value in that particular state.

In light of this insight, the real and more precise version of Equation (4) can be expressed as follows:
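
Making the choice of action explicit with a maximization (reconstructed in the same notation), the equation becomes:

V(s) = R(s) + 𝞬 · maxₐ ∑ P(s’|s, a)·V(s’)    … (5)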

The real Bellman’s equation (eqn. 5) takes into account the maximum value obtained from the available actions in a state, ensuring that the agent selects the most favorable course of action to optimize its expected rewards. This refined equation highlights the significance of careful decision-making in RL. By considering all possible actions and their respective outcomes, the agent can learn to make the best choices, achieving its goals more efficiently within the dynamic environment. It’s important to note that eqn. (4) is a particular case of the Bellman equation when only one (1) action is possible.

Understanding Value Iteration through the Grid-World Problem

Now that we have clarified the importance of the real Bellman’s equation, let’s delve into the inner workings of Value Iteration and explore a modified classic example commonly used in Reinforcement Learning: the Grid-World problem.

[Source: Author] Modified Grid-World Problem

In the context of the Grid-World problem, imagine an agent navigating a grid-like world, where each grid cell represents a state. The agent can move up, down, left, or right from any given state, encountering rewards and penalties along the way. For this problem, let’s walk through the MDP framework, considering the key components:

  1. State: The problem has a total of 15 states, represented by tuples of indices or coordinates such as (1, 1), (1, 2), …, (3, 5). States (1, 5) and (3, 4) are the terminal states, while (1, 4) and (2, 2) are impossible states (walls). Note that this description uses 1-indexed coordinates, whereas the code below is 0-indexed, so state (1, 5) appears as (0, 4) in the code.
  2. Actions: In this grid-world example, the agent can take 4 possible actions in each state: up, down, left, and right. However, there’s an element of stochasticity introduced, where the intended action may veer off with a 20% probability due to external factors. For example, if the agent intends to go up, there’s a 10% chance it goes right and a 10% chance it goes left instead.
  3. Reward: The reward function of states (1, 5) and (3,4) are given as R(1,5) = +1 and R(3,4) = -1 respectively. For all other empty states, the reward function defaults to R(s) = -0.1. This implies that the agent incurs a penalty for taking longer to reach the goal. Therefore, the agent’s objective in this problem case is to start at state (2,1) and reach state (1,5) with the maximum reward.
  4. Agent: The agent is the software on which the Reinforcement Learning algorithm is implemented, specifically the Value Iteration algorithm. As a parameter, the discount factor (𝞬) is selected as 0.8. The Value Iteration algorithm empowers the agent to achieve two primary objectives:

Objective 1:

Find the Maximum Value: At each iteration, the agent evaluates the 4 possible actions in each state and computes the value of each using Bellman's equation. It then selects the maximum, signifying the most valuable outcome for that state, and uses it as the state's value for the next iteration. This process is repeated until convergence, i.e., until the new value is approximately equal to the current value to within a predefined threshold (e.g., 1e-6).

The code snippet for the preparatory step is shown in the code block below. The complete solution using the value iteration algorithm can be found on my GitHub page here.

import numpy as np

# Define the rows and columns
n_rows = 3
n_cols = 5

# Define the terminal states and the walls/impossible states
terminals = [(0, 4), (2, 3)]
walls = [(0, 3), (1, 1)]

# Define the possible actions at each state
actions = ["UP", "RIGHT", "DOWN", "LEFT"]

# Define the row and column changes from each state to new states depending on actions taken
row_change = [-1, 0, 1, 0] # up, right, down, left
col_change = [0, 1, 0, -1] # up, right, down, left

# Define the motion model
motion_probs = {
    'UP':    {'UP': 0.8, 'RIGHT': 0.1, 'DOWN': 0.0, 'LEFT': 0.1},
    'RIGHT': {'UP': 0.1, 'RIGHT': 0.8, 'DOWN': 0.1, 'LEFT': 0.0},
    'DOWN':  {'UP': 0.0, 'RIGHT': 0.1, 'DOWN': 0.8, 'LEFT': 0.1},
    'LEFT':  {'UP': 0.1, 'RIGHT': 0.0, 'DOWN': 0.1, 'LEFT': 0.8}
}

# Definition and initialization of the reward function
R = np.full((n_rows, n_cols), -0.1)
R[0, 4] = 1 # Reward at terminal state I (Goal state)
R[2, 3] = -1 # Reward at terminal state II (Bad state)

# Define the number of episodes (or iterations)
episode = 100

# Define the discount factor
gamma = 0.8

# Define threshold
threshold = 1e-6

# Definition and initialization of the value for each state.
v = np.zeros((n_rows, n_cols))
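
Putting Objective 1 into code, below is a minimal sketch of the value-update loop written against the variables defined above. It is only an illustration of the update described here; the complete, tested implementation is the one linked on my GitHub page.

# Minimal sketch of the value-update loop (Objective 1), using the variables defined above
for ep in range(episode):
    v_new = v.copy()
    for row in range(n_rows):
        for col in range(n_cols):
            # Terminal states keep their reward as their value; walls are skipped
            if (row, col) in terminals:
                v_new[row, col] = R[row, col]
                continue
            if (row, col) in walls:
                continue
            # Value of the state reached by each nominal move
            # (moves off the grid or into a wall leave the agent in place)
            v_next_states = []
            for i, a in enumerate(actions):
                nr, nc = row + row_change[i], col + col_change[i]
                if nr < 0 or nr >= n_rows or nc < 0 or nc >= n_cols or (nr, nc) in walls:
                    v_next_states.append(v[row, col])
                else:
                    v_next_states.append(v[nr, nc])
            # Expected next-state value for each intended action: ∑ P(s'|s,a) * V(s')
            v_check = []
            for a in actions:
                act_prob = list(motion_probs[a].values())
                v_check.append(sum(p * val for p, val in zip(act_prob, v_next_states)))
            # Bellman update: V(s) = R(s) + gamma * max_a ∑ P(s'|s,a) * V(s')
            v_new[row, col] = R[row, col] + gamma * max(v_check)
    # Stop once the largest change across all states falls below the threshold
    if np.max(np.abs(v_new - v)) < threshold:
        v = v_new
        break
    v = v_new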

Objective 2:

Extract the Policy: After the agent obtains the “converged” value for each state, it proceeds to extract the policy or action that led to the maximum value in each state. The policy is the strategy that guides the agent on what action to take in each state to achieve its objectives efficiently.

The ultimate goal is to derive the optimal policy that instructs the agent to take the most rewarding actions at each state, allowing it to navigate the grid world effectively and reach its goal in the shortest possible time. By employing the MDP framework and the Value Iteration algorithm, the agent can make informed decisions, optimizing its behavior in complex environments, and showcasing the power and potential of RL in real-world applications.

We see from these two objectives that while Objective 1 may take a number of iterations to converge, Objective 2 requires only a single pass.

The code snippet for this step is shown below, also the complete code can be found on my GitHub page here.

# Deriving the optimal policy

# Initialize opt_policy with placeholder text for every state
opt_policy = [['undecided', 'undecided', 'undecided', 'None', 'Terminal'],
              ['undecided', 'None', 'undecided', 'undecided', 'undecided'],
              ['undecided', 'undecided', 'undecided', 'Terminal', 'undecided']]

for row in range(n_rows):
    for col in range(n_cols):
        if (row, col) == terminals[0]:
            v[row, col] = 1
            continue
        elif (row, col) == terminals[1]:
            v[row, col] = -1
            continue
        elif (row, col) in walls:
            continue
        else:
            p_next_states = []
            v_next_states = []
            for i, a in enumerate(actions):
                next_row = row + row_change[i]
                next_col = col + col_change[i]
                if next_row < 0 or next_row >= n_rows or next_col < 0 or next_col >= n_cols or (next_row, next_col) in walls:
                    # Moves that leave the grid or hit a wall keep the agent in place
                    next_state = (row, col)
                    p_next_states.append(next_state)
                    v_next_states.append(v[next_state])
                else:
                    next_state = (next_row, next_col)
                    p_next_states.append(next_state)
                    v_next_states.append(v[next_state])
            v_check = []
            for a in actions:
                act_prob = list(motion_probs[a].values())
                temp_res = [v_next_states[i] * act_prob[i] for i in range(len(v_next_states))]
                sum_temp_res = sum(temp_res)
                v_check.append(sum_temp_res)  # ∑ P(s'|s,a) * V(s')
            max_index = np.argmax(v_check)  # Index of the best action: argmax_a ∑ P(s'|s,a) * V(s')
            opt_policy[row][col] = actions[max_index]  # Update the policy for this state

print("The optimal policy for each state is as shown below: \n\n")
opt_policy

After running the algorithm for about sixteen (16) episodes, the value of each state converged (Objective 1), and the optimal policy was then extracted (Objective 2), as shown below.

[Source: Author] Optimal Policy for Modified Grid-World Problem

The next article discusses policy iteration and its dynamics. Although both VI and PI yield essentially the same results, policy iteration requires far fewer iterations to reach a satisfactory solution.

Acknowledgment

A huge shoutout and gratitude to Dr. Abderrahmane Lakas, a Professor in the Department of Information Technology at United Arab Emirates University, for his immense help in understanding the concepts of reinforcement learning.

I’m sure you got some expected value 😊 from reading this article. Please leave a few claps and hit the follow button to receive more content like this!


Ayodele Benjamin Esan 🌱⚡️

I'm passionate about the application (and impact) of AI on energy systems & markets. Join me in exploring the limitless potential of AI across energy landscapes 🌱⚡️.