Find an Optimal Policy with a Finite Markov Decision Process, Part 1: Dynamic Programming

Yuki Minai
Nov 20, 2023

In this series of blogs, we will delve into various methods for finding an optimal policy within the context of Finite Markov Decision Processes (MDPs).

Series Overview

The series is structured into three parts, each focusing on a different approach for solving MDPs:

Part 1: Dynamic Programming (This blog)

Part 2: Monte Carlo Methods

Part 3: TD Learning

Let’s begin our journey with Part 1: Dynamic Programming.

Part 1: Dynamic Programming

What is a Markov Decision Process (MDP)?

In the field of Reinforcement Learning (RL), Markov Decision Processes (MDPs) are fundamental mathematical models used for decision-making in dynamic environments. An MDP consists of several key components:

  • States (S): Possible situations or configurations within the environment.
  • Actions (A): A set of choices or decisions that an agent can make.
  • Transitions (T): Rules or probabilities governing how the environment moves from one state to another after taking a specific action.
  • Rewards (R): Immediate numerical values that indicate the desirability of an agent’s actions.
  • Discount Factor (γ): A factor that balances immediate rewards against future rewards.

The key characteristic of an MDP is the Markov property: the next state and reward depend only on the current state and action, not on the history of earlier states and actions.

In this blog, we will explore the foundational concepts and methods required to identify an optimal strategy for maximizing rewards using Dynamic Programming.

Preparing the Frozen Lake environment

In this blog, we will be working with the Frozen Lake environment provided by the Gymnasium framework.

Learn more about the Frozen Lake environment

The objective of this environment is for an agent to navigate through a grid world, starting from the initial cell and reaching the goal cell. Here, we are using a 4x4 grid map, and each cell falls into one of four different categories:

  • S (Start): This cell is where our agent begins its journey.
  • F (Frozen): These cells are safe for the agent to walk on.
  • H (Hole): These are hazardous cells, and if the agent falls into one, the episode terminates with a reward of 0.
  • G (Goal): Reaching this cell yields a reward of +1 for the agent.

From the starting cell, the agent has the option to move in four directions: up, left, down, or right. The agent’s task is to explore the grid world, making decisions at each time step to eventually reach the goal cell and collect a reward of +1.

In the code below, we can see an example of the agent randomly exploring this environment over 100 time steps.
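The snippet below is a minimal sketch of such a random exploration using the Gymnasium API, not the blog's original code. The 4x4 map, the deterministic `is_slippery=False` setting, and the seed are assumptions and may differ from the exact configuration used in the original notebook.

```python
# Minimal sketch: random exploration of Frozen Lake with Gymnasium.
# Assumptions: 4x4 map, deterministic transitions (is_slippery=False), seed 0.
import gymnasium as gym

env = gym.make("FrozenLake-v1", map_name="4x4", is_slippery=False)
obs, info = env.reset(seed=0)

for t in range(100):
    action = env.action_space.sample()  # 0 = left, 1 = down, 2 = right, 3 = up
    obs, reward, terminated, truncated, info = env.step(action)
    if terminated or truncated:         # fell into a hole, reached the goal, or hit the step limit
        obs, info = env.reset()

env.close()
```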

In the above exploration, we observed that random actions rarely lead the agent to the goal state. Now, let's consider a scenario where we have complete knowledge of the environment. Specifically, we have access to the probability of transitioning from one state to another (the transition probability) and the rewards associated with each state (the reward function). With this knowledge in hand, how can we determine the optimal policy to guide the agent to its goal? This is the question we aim to address.

Before delving into the algorithms for finding the optimal policy, let's consider whether this environment can be classified as a Markov Decision Process (MDP). The Frozen Lake environment is indeed an MDP because the transition probabilities and reward for each state depend only on the current state and action, without any consideration of past history. For instance, no matter how the agent reached a particular cell (whether it arrived from the cell on its left or the cell on its right), the transition probabilities from that cell remain the same; they are not influenced by the agent's previous actions.

Now, you may wonder what characterizes a non-MDP environment. In a non-MDP setting, the transition or reward probabilities for a given state could change depending on the states visited or the actions taken in previous steps. In such environments, it is necessary to maintain a history of states and actions to accurately learn the value of each state. Since the Frozen Lake environment is an MDP, however, we can focus on the state (or cell) we are currently in and make decisions and learn based on the current state alone.

Policy Evaluation: Finding the Optimal Actions

When we have knowledge of both the transition probability and the reward function, our objective is to learn the optimal action to take at each state to maximize our cumulative reward. In the case of Frozen Lake, this means finding the shortest path to the goal while avoiding pitfalls. (The shortest path has the largest value because of the discount factor.)

One approach to achieve this is to explore the state space incrementally, learning in which states we can expect high rewards. In RL, the value of each state is represented by the state value, denoted V(s). How can we calculate the state value?

Here, we learn a fundamental equation in RL used for computing the state value.

Bellman Expectation Equation

The Bellman Expectation Equation is used to calculate the expected future reward at a particular state under a specific policy. It allows us to compute the state value of each state, and it is defined as follows:
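$$V^{\pi}(s) = \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\left[r + \gamma V^{\pi}(s')\right]$$

where π(a | s) is the probability of choosing action a in state s under policy π, p(s', r | s, a) is the probability of transitioning to state s' with reward r after taking action a in state s, and γ is the discount factor.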

In our current setup, we have knowledge of the transition probability p(s’, r | s, a). With a specific policy π (e.g., randomly select one of four actions at each state), we can apply an iterative update approach to calculate the state value for each state using the Bellman Expectation Equation. This iterative process is known as policy evaluation, and the pseudocode for this approach is presented in the section below.

Policy Evaluation with the Bellman Expectation Equation

Ref: Sutton, R. S., & Barto, A. G. (2018). Reinforcement learning: An introduction. MIT press.

In words, the above pseudocode works as follows:

1. Initialize the state value V(s) with some initial values.
2. Iterate through all states, applying the Bellman Expectation Equation, which uses the transition probability, to update the state value V(s).
3. Track the largest change between the updated and previous state values across all states in the sweep.
4. Terminate the iteration once this largest change is smaller than a specified threshold θ; otherwise, repeat from step 2.

In the case of finite Markov Decision Processes (MDPs), this iterative approach is guaranteed to converge to the true state value under the given policy, denoted V^π, by the Contraction Mapping Theorem (the details of which are beyond the scope of this blog).

Now, let’s explore the implementation of policy evaluation using the Bellman Expectation Equation.
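Below is a sketch of iterative policy evaluation for Frozen Lake. It assumes the environment's model is available through `env.unwrapped.P` (which Gymnasium's toy-text environments expose), a deterministic 4x4 map, and a discount factor of γ = 0.9; the function and variable names are mine and may differ from the original notebook.

```python
# Sketch of iterative policy evaluation (Sutton & Barto, Chapter 4).
# Assumptions: deterministic 4x4 Frozen Lake, gamma = 0.9, model accessed via env.unwrapped.P.
import numpy as np
import gymnasium as gym

def policy_evaluation(env, policy, gamma=0.9, theta=1e-8):
    """policy[s, a] is the probability of taking action a in state s."""
    n_states = env.observation_space.n
    n_actions = env.action_space.n
    V = np.zeros(n_states)
    while True:
        delta = 0.0
        for s in range(n_states):
            v_new = 0.0
            for a in range(n_actions):
                # env.unwrapped.P[s][a] is a list of (prob, next_state, reward, done) tuples
                for prob, s_next, reward, done in env.unwrapped.P[s][a]:
                    v_new += policy[s, a] * prob * (reward + gamma * V[s_next] * (not done))
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new                 # in-place (Gauss-Seidel style) update
        if delta < theta:                # stop once the largest change in a sweep is small enough
            break
    return V

env = gym.make("FrozenLake-v1", map_name="4x4", is_slippery=False)
n_s, n_a = env.observation_space.n, env.action_space.n
random_policy = np.ones((n_s, n_a)) / n_a        # uniformly random policy
print(policy_evaluation(env, random_policy).reshape(4, 4))
```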

Now, let’s calculate the state values for three different policies:

1. Moving left at all states
2. Moving right at all states
3. Optimal actions

With the first policy of always moving left, we expect the agent never to receive any reward, since a left action does not lead to the goal from any cell. Because the agent never collects a reward, the state value is zero for every state under this policy.

Conversely, when we use the second policy of always moving right at all states, we can anticipate receiving a reward at the cell adjacent to the goal state. Consequently, some of the state values will have positive values.

Let’s examine how these values differ when we transition to the optimal actions as our policy.
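Continuing the sketch above (and reusing `env` and `policy_evaluation` from it), the snippet below evaluates the three policies. The specific table of optimal actions is one valid choice for the default deterministic 4x4 map and may not match the one used in the blog.

```python
# Evaluate three hand-crafted policies (reuses env, np, and policy_evaluation from above).
LEFT, DOWN, RIGHT, UP = 0, 1, 2, 3

def deterministic_policy(actions, n_states=16, n_actions=4):
    """Build a one-hot policy table from a list of one action per state."""
    policy = np.zeros((n_states, n_actions))
    policy[np.arange(n_states), actions] = 1.0
    return policy

always_left  = deterministic_policy([LEFT] * 16)
always_right = deterministic_policy([RIGHT] * 16)
# One optimal action per cell for the default 4x4 map (entries for holes and the goal are arbitrary).
optimal = deterministic_policy([DOWN,  RIGHT, DOWN,  LEFT,
                                DOWN,  LEFT,  DOWN,  LEFT,
                                RIGHT, DOWN,  DOWN,  LEFT,
                                LEFT,  RIGHT, RIGHT, LEFT])

for name, pol in [("always left", always_left), ("always right", always_right), ("optimal", optimal)]:
    print(name)
    print(policy_evaluation(env, pol).reshape(4, 4))
```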

As expected, the state values for the first policy are all zero, while the state values for the second policy contain some positive values. But why does the cell (2,1) also have a positive value in addition to the cell (3,1)? Remember, our state value equation is as follows:
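$$V^{\pi}(s) = \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\left[r + \gamma V^{\pi}(s')\right]$$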

The state value is determined by the summation of the immediate reward r and the state value at the next state V^π(s’). So the positive value at cell (2,1) is a result of the positive state value at cell (3,1).

Furthermore, we notice that the optimal policy yields significantly higher state values. These values gradually decrease as we move farther from the goal state because of the discount factor γ. For instance, the state value at the starting state is 0.59: the reward of 1 is collected on the sixth step, so it is discounted five times, and with γ = 0.9 this gives γ⁵ ≈ 0.59. Consequently, states closer to the actual reward are valued more than those farther from it.

We have successfully computed the state values for each state. However, we still do not know how to find the optimal policy that maximizes the total reward; in the section above we simply wrote it down by hand. Using the state value function computed above, we can extract a better policy than the original one by taking, from each cell, the action that leads to the state with the highest value. Let's put this to the test.

The code below extracts an improved policy from the state values estimated under the "move right" policy.
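A sketch of this greedy policy extraction is shown below. It reuses `env.unwrapped.P`, `policy_evaluation`, and the `always_right` policy from the earlier snippets, and again assumes γ = 0.9.

```python
# Sketch of greedy policy extraction from a state-value function
# (reuses env, np, policy_evaluation, and always_right from the snippets above).
def extract_greedy_policy(env, V, gamma=0.9):
    n_states = env.observation_space.n
    n_actions = env.action_space.n
    policy = np.zeros((n_states, n_actions))
    for s in range(n_states):
        q = np.zeros(n_actions)
        for a in range(n_actions):
            # One-step lookahead: expected return of taking action a, then following V afterwards.
            for prob, s_next, reward, done in env.unwrapped.P[s][a]:
                q[a] += prob * (reward + gamma * V[s_next] * (not done))
        best = np.flatnonzero(np.isclose(q, q.max()))   # all actions tied for the maximum
        policy[s, np.random.choice(best)] = 1.0         # break ties randomly
    return policy

V_right = policy_evaluation(env, always_right)
improved_policy = extract_greedy_policy(env, V_right)
```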

Using the state values estimated under the "move right" policy, we extracted an improved policy by choosing, at each state, the action that maximizes the value. When several actions are equally good, one of them is chosen at random.

As we observed in the above comparison of the state value estimates for three different policies, the state values vary depending on the policy in use. Thus, the values under this updated policy will differ from the values under the "move right" policy. By iteratively performing policy evaluation and policy improvement, we gradually approach the optimal policy. This technique is known as policy iteration.

Below, you’ll find a pseudocode outlining the steps of policy iteration.

Policy Iteration

Ref: Sutton, R. S., & Barto, A. G. (2018). Reinforcement learning: An introduction. MIT press.

Policy evaluation follows the same principles as what we learned earlier. In words, the above pseudocode works as follows:

1. Run policy evaluation to completion to obtain the state values under the current policy.
2. Policy improvement: iterate through each state and update the action taken there to the one that maximizes the expected return, using a one-step lookahead with the current state values.
3. If the improved policy is identical to the previous one, stop: the policy is stable.
4. Otherwise, perform policy evaluation again with the updated policy and repeat the process.

Through these iterative policy evaluation and policy improvement steps, we progressively improve our policy. As with policy evaluation, convergence to the optimal policy is guaranteed by the Contraction Mapping Theorem.

Now, let’s implement the policy iteration algorithm.
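The sketch below combines the two routines defined earlier (`policy_evaluation` and `extract_greedy_policy`). Because ties are broken randomly during improvement, it checks convergence on the value function rather than on the policy itself; as before, γ = 0.9 is an assumption, and the always-left initialization follows the text.

```python
# Sketch of policy iteration (reuses env, np, policy_evaluation, and extract_greedy_policy from above).
def policy_iteration(env, gamma=0.9, theta=1e-8):
    n_states = env.observation_space.n
    n_actions = env.action_space.n
    policy = np.zeros((n_states, n_actions))
    policy[:, 0] = 1.0                       # initialize to "always move left"
    V_old = None
    while True:
        V = policy_evaluation(env, policy, gamma=gamma, theta=theta)
        # Stop when improvement no longer changes the value function
        # (comparing values avoids oscillation caused by random tie-breaking).
        if V_old is not None and np.max(np.abs(V - V_old)) < theta:
            return V, policy
        V_old = V
        policy = extract_greedy_policy(env, V, gamma=gamma)

V_pi, pi_learned = policy_iteration(env)
print(V_pi.reshape(4, 4))
```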

Within the above policy iteration function, the value function is initialized to all zeros, and the policy is initialized to take the left action in every state.
Let's see whether this iterative approach can find an optimal policy.

The algorithm converged to the same value function as the one we found under the hand-crafted optimal policy. The learned policy also matches the optimal policy we defined before, except for differences caused by random tie-breaking among equally good actions.

Value iteration

In RL, there is another crucial equation for computing state values, known as the Bellman Optimality Equation.

Bellman Optimality Equation

Unlike the Bellman Expectation Equation, which is based on a specific policy π, the Bellman Optimality Equation calculates values under the optimal policy. The value iteration algorithm offers an iterative approach to discover the optimal value function and policy using the Bellman Optimality Equation.
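In standard notation (Sutton & Barto, 2018), the Bellman Optimality Equation for the state value is:

$$V^{*}(s) = \max_{a} \sum_{s', r} p(s', r \mid s, a)\left[r + \gamma V^{*}(s')\right]$$

Value iteration turns this equation directly into an update rule: each sweep replaces V(s) with the right-hand side, so evaluation and improvement happen in a single step.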

Below is a pseudocode outlining the value iteration process:

Ref: Sutton, R. S., & Barto, A. G. (2018). Reinforcement learning: An introduction. MIT press.

Let’s implement and test this algorithm.
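A sketch of value iteration under the same assumptions as the earlier snippets (model access via `env.unwrapped.P`, γ = 0.9) is shown below; it reuses `extract_greedy_policy` to read a greedy policy out of the converged value function.

```python
# Sketch of value iteration (reuses env, np, and extract_greedy_policy from above).
def value_iteration(env, gamma=0.9, theta=1e-8):
    n_states = env.observation_space.n
    n_actions = env.action_space.n
    V = np.zeros(n_states)
    while True:
        delta = 0.0
        for s in range(n_states):
            q = np.zeros(n_actions)
            for a in range(n_actions):
                for prob, s_next, reward, done in env.unwrapped.P[s][a]:
                    q[a] += prob * (reward + gamma * V[s_next] * (not done))
            delta = max(delta, abs(q.max() - V[s]))
            V[s] = q.max()               # Bellman optimality backup
        if delta < theta:
            break
    return V, extract_greedy_policy(env, V, gamma=gamma)

V_star, pi_star = value_iteration(env)
print(V_star.reshape(4, 4))
```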

Both approaches converge to the same value function and policy. However, the choice between the two algorithms depends on the specific scenario. Value iteration tends to be cheaper per sweep because it folds evaluation and improvement into a single update of the value function, rather than running a full policy evaluation to convergence inside every iteration. Consequently, when dealing with large state or action spaces, value iteration is often the preferred choice.

Conversely, policy iteration often needs far fewer iterations to reach the optimal value function and policy, since each improvement step is based on a fully evaluated value function. When each iteration is affordable and you want precise estimates, policy iteration can be a good option.

Summary

In this blog, we covered the concepts of policy evaluation, policy iteration, and value iteration, all of which fall under the umbrella of Dynamic Programming. These iterative algorithms rely on having knowledge of the environment’s dynamics, specifically the transition function and reward function. However, in real-world scenarios, we often lack precise information about the environment’s dynamics. Instead, we aim to learn these dynamics through the agent’s interactions and experiences.

In the next blog, we will explore methods for learning the value function and policy directly from the agent’s experiences.

Part 2: Monte Carlo Methods

Part 3: TD-learning

Code

Reference

  • Sutton, R. S., & Barto, A. G. (2018). Reinforcement learning: An introduction. MIT Press.


Yuki Minai

Ph.D. student in Neural Computation and Machine Learning at Carnegie Mellon University, Personal webpage: http://yukiminai.com