Find an optimal policy with a Finite Markov Decision Process: Part 2, Monte Carlo Methods

Yuki Minai
14 min read · Nov 20, 2023

In this series of blogs, we will delve into various methods for finding an optimal policy within the context of Finite Markov Decision Processes (MDPs).

Series Overview

The series is structured into three parts, each focusing on a different approach for solving MDPs:

Part 1: Dynamic Programming

Part 2: Monte Carlo Methods (This blog)

Part 3: TD Learning

In Part 1, I introduced a method for determining the optimal policy using Dynamic Programming. That approach iteratively updates the value function and policy with the Bellman equation. However, these updates rely on having knowledge of the environment’s dynamics.

In reality, we often lack complete knowledge of these dynamics. In this blog, we will explore a method for solving finite Markov Decision Processes (MDPs) without requiring a model of the environment. Instead, we will learn a value function and policy from experience, by collecting sequences of states, actions, and rewards obtained from interactions with the environment. Specifically, we will study Monte Carlo methods, which solve finite MDPs by averaging sample returns.

Prepare Frozen Lake environment

As in the last blog, we will use the frozen lake environment in gymnasium.

Learn more about the Frozen Lake environment

The objective of this environment is for an agent to navigate through a grid world, starting from the initial cell and reaching the goal cell. Here, we are using a 4x4 grid map, and each cell falls into one of four different categories:

  • S (Start): This cell is where our agent begins its journey.
  • F (Frozen): These cells are safe for the agent to walk on.
  • H (Hole): These are hazardous cells, and if the agent falls into one, the episode terminates with a reward of 0.
  • G (Goal): Reaching this cell yields a reward of +1 for the agent.

From the starting cell, the agent has the option to move in four directions: up, left, down, or right. The agent’s task is to explore the grid world, making decisions at each time step to eventually reach the goal cell and collect a reward of +1.

The code below shows an example of the agent randomly exploring this environment over 100 time steps.
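The original notebook is not reproduced here, but a minimal sketch of such a random rollout with the gymnasium API might look like this (the is_slippery=False setting, the seed, and the printout are assumptions made to keep the example simple):

```python
# A minimal sketch of a random agent exploring Frozen Lake for 100 time steps.
import gymnasium as gym

# A deterministic (non-slippery) 4x4 map is assumed here for simplicity;
# the original post may use the slippery default.
env = gym.make("FrozenLake-v1", map_name="4x4", is_slippery=False)

obs, info = env.reset(seed=0)
for t in range(100):
    action = env.action_space.sample()           # 0=left, 1=down, 2=right, 3=up
    obs, reward, terminated, truncated, info = env.step(action)
    print(f"t={t}, state={obs}, reward={reward}")
    if terminated or truncated:                  # fell into a hole, reached the goal, or timed out
        obs, info = env.reset()
env.close()
```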

Estimating State Values with Monte Carlo Methods

In this section, we’ll explore how Monte Carlo (MC) methods can be used to estimate state values in an MDP when we don’t have prior knowledge of the environment’s dynamics.

Let’s assume that our policy is a random action policy, meaning it selects actions uniformly at random rather than following any specific strategy. We still assume that we lack knowledge of the environment’s dynamics. However, we can collect experience by repeatedly executing actions according to this random policy. Through these experiences, we can identify which states are associated with receiving rewards.

Definition of Return

In this context, we define the “return” as the total discounted reward from the current time step onward. Mathematically, the return is

G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}

Here, G_t represents the return at time t, R_{t+1} is the reward received one step after time t, and γ is the discount factor. It accounts for the fact that future rewards are worth less than immediate rewards.

Monte Carlo Methods for Value Estimation

MC Methods utilize the concept of return to compute the value of each state. States that lead to higher total discounted rewards in the long run will be assigned greater state values.

More precisely, MC methods calculate the state value by taking the average of the observed returns obtained after visiting each state. As more returns are sampled, this average tends to converge to the expected value of the return that can be obtained from the state.

Below is the pseudocode for estimating the state value under a specific policy π using MC methods:

Ref: Sutton, R. S., & Barto, A. G. (2018). Reinforcement learning: An introduction. MIT press.

The general idea can be summarized in the following steps:

1. Generate a Policy π: Choose or generate a policy π (e.g., random exploration, ε-greedy policy).

2. Initialization: Initialize state values and return variables for each state.

3. Episode Generation: In each iteration, generate an episode.

4. Backward Return Calculation: Iterate through the time steps from the end of the episode backward and compute the return G at each time step t. Working backward makes the return computation simple and incremental: G ← γG + R_{t+1}.

5. Update State Values: If time step t marks the first visit to a state s, append the current return to a list of returns for state s, and update the state value V(S_t). Here, S_t is the state visited at time step t. The update is done by computing the average return of state s over episodes.

6. Repeat: After iterating through all time steps in an episode, return to step 3 and continue the process.

In summary, the MC method estimates state values by averaging the returns observed after visits to state s over many episodes.

While the method outlined above relies on the first visit to a state for updating state values (first-visit MC), there’s an alternative approach known as every-visit MC. In the every-visit MC method, state values are updated based on every visit to the state, rather than just the first visit. The first-visit MC method estimates V_π(s) as the average of the returns following the first visits to state s, whereas the every-visit MC method averages the returns following all visits to state s. In this blog, we will primarily focus on the first-visit method, which has been the most widely studied.

With many episodes, the state value estimate V_π(s) converges to the true value by the Law of Large Numbers. Let’s see the implementation of this method. For the policy π to evaluate, we will use an ε-greedy policy built around manually chosen optimal actions: we exploit the optimal action with probability 1-ε and explore a random action with probability ε.
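A minimal sketch of this first-visit MC prediction loop is shown below. The hand-picked optimal_action table, the hyperparameters (ε = 0.1, γ = 0.9, 5,000 episodes), and the non-slippery map are assumptions rather than the exact settings of the original notebook.

```python
import numpy as np
import gymnasium as gym

env = gym.make("FrozenLake-v1", map_name="4x4", is_slippery=False)
n_states, n_actions = env.observation_space.n, env.action_space.n
rng = np.random.default_rng(0)

# Manually chosen "optimal" action for each of the 16 cells
# (0=left, 1=down, 2=right, 3=up); holes and the goal get a dummy action.
optimal_action = [1, 2, 1, 0,
                  1, 0, 1, 0,
                  2, 1, 1, 0,
                  0, 2, 2, 0]

def policy(state, eps=0.1):
    """Epsilon-greedy policy around the hand-picked optimal actions."""
    if rng.random() < eps:
        return int(rng.integers(n_actions))   # explore
    return optimal_action[state]              # exploit

gamma = 0.9
V = np.zeros(n_states)
returns = [[] for _ in range(n_states)]       # observed returns per state

for episode in range(5000):
    # Generate an episode following the policy to evaluate.
    trajectory = []                           # (state S_t, reward R_{t+1}) pairs
    state, _ = env.reset()
    done = False
    while not done:
        action = policy(state)
        next_state, reward, terminated, truncated, _ = env.step(action)
        trajectory.append((state, reward))
        state = next_state
        done = terminated or truncated

    # Walk backward through the episode, accumulating the return G.
    G = 0.0
    for t in reversed(range(len(trajectory))):
        s, r = trajectory[t]
        G = gamma * G + r                     # G_t = R_{t+1} + gamma * G_{t+1}
        # First-visit check: update only at the earliest visit to s.
        if all(s != s_prev for s_prev, _ in trajectory[:t]):
            returns[s].append(G)
            V[s] = np.mean(returns[s])

print(V.reshape(4, 4))                        # state values laid out on the 4x4 grid
```

Storing every return and re-averaging, as done here, mirrors the pseudocode; an incremental mean would avoid keeping the full lists.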

From the learned state values, we observe that cells closer to the goal exhibit higher values than those farther from it, which makes sense given the discount factor: the reward is discounted less when the goal is nearer. So we can indeed estimate state values using MC methods.

Estimating Action Values with Monte Carlo Methods

Now, the key question is how we can learn the optimal policy without knowledge of the dynamics. While state values indicate which states have higher values, they don’t provide information about the actions required to reach these high-value states. To illustrate this, consider a chessboard where we know the value of each state, suggesting good states we aim to reach. However, in the absence of knowledge about the dynamics, we lack guidance on which actions to take to attain those desirable board states.

In scenarios where we do know the dynamics, we can simulate the environment by taking actions and observing the outcomes (i.e., looking ahead), enabling us to learn which actions lead to favorable results. In other words, with access to the dynamics, state values alone suffice to determine a policy: we look one step ahead and select the action that leads to the highest-value state.

However, in the absence of a model, state values alone are insufficient. It becomes essential to explicitly estimate the value of each action to make these values meaningful in guiding policy decisions. Thus, when we lack a model of the dynamics, estimating action values, instead of state values, becomes a valuable approach.

Action value, denoted as Q(s, a), quantifies the value of taking a specific action a in state s.

In the following code, we’ll rearrange the previous implementation to estimate action values using the Monte Carlo method. This process involves some minor modifications. Our objective is to estimate q_π(s, a), representing the expected return when starting in state s, taking action a, and subsequently following policy π.
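A sketch of the corresponding action-value version is shown below. It continues from the previous sketch (reusing env, policy, n_states, n_actions, and gamma) and only changes the bookkeeping: returns are recorded per state-action pair, and the first-visit check now applies to the pair (s, a).

```python
# Continues from the previous sketch: env, policy, n_states, n_actions, and
# gamma are reused; only the bookkeeping changes to state-action pairs.
Q = np.zeros((n_states, n_actions))
returns_sa = [[[] for _ in range(n_actions)] for _ in range(n_states)]

for episode in range(5000):
    # Generate an episode, now also recording the action taken at each step.
    trajectory = []                           # (state, action, reward) triples
    state, _ = env.reset()
    done = False
    while not done:
        action = policy(state)
        next_state, reward, terminated, truncated, _ = env.step(action)
        trajectory.append((state, action, reward))
        state = next_state
        done = terminated or truncated

    G = 0.0
    for t in reversed(range(len(trajectory))):
        s, a, r = trajectory[t]
        G = gamma * G + r
        # First visit of the pair (s, a) in this episode?
        if all((s, a) != (s_p, a_p) for s_p, a_p, _ in trajectory[:t]):
            returns_sa[s][a].append(G)
            Q[s, a] = np.mean(returns_sa[s][a])

# Example: action values for the cell discussed in the text. The mapping from
# the text's "(3,2)" label to a state index depends on its coordinate
# convention; index 10 (a frozen cell whose right neighbor is a hole) is an
# assumption.
print(Q[10])                                  # order: left, down, right, up
```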

In the above code, we plot the learned action values for state (3,2) as an example. In this cell, the optimal action is to move downward. Moving right is the least favorable action due to the presence of a hole, and moving up or left takes the agent away from the goal, making those actions suboptimal. The learned action values reflect these expectations.

For action value estimation, a ‘visit’ is defined as a combination of both a state and an action. The average return is computed for each such state-action pair. Similar to state value estimation, as a result of the law of large numbers, the estimate converges to the true value as the number of visits to each state-action pair approaches infinity.

It’s important to note that in order to estimate all state-action values accurately, all possible state and action pairs must be explored a sufficient number of times. To determine a good policy, it’s crucial to compare all potential actions at each state. Consequently, exploration of all state-action pairs and obtaining reliable value estimates are essential.

One approach to ensure exploration of all pairs is to randomize the initial state-action pair, a technique referred to as ‘exploring starts’. These random initial steps guarantee the exploration of all state-action pairs, provided there are a sufficient number of trials.

Another commonly employed solution, and the one used in the implementation above, is to use a policy with built-in stochasticity, i.e., one that selects every available action in each state with nonzero probability. For example, the ε-greedy policy, which combines exploitation (with probability 1-ε) and exploration (with probability ε), satisfies this condition. The previous implementation employed an ε-greedy policy, so its estimates converge to the true action values as the number of episodes goes to infinity.

On-policy MC control

We learned how to estimate action values for each state-action pair using the MC method. Now, let’s explore how we can identify the optimal policy by utilizing the estimated action values obtained through the MC method. The pseudo-code for this process is presented below.

Ref: Sutton, R. S., & Barto, A. G. (2018). Reinforcement learning: An introduction. MIT press.

The above algorithm is similar to the Dynamic Programming approach we covered in the previous blog. However, there are two key distinctions:

  • Instead of utilizing state value estimates, we will work with action value estimates.
  • To ensure comprehensive exploration of states, we will employ an ε-greedy policy, which guarantees visits to all states with non-zero probability.

The following summarizes the main steps outlined in the pseudocode above:

1. Initialize a policy π, action values Q(s,a), and return values for each state-action pair Returns(s,a).
2. Generate an episode following policy π (continue until the episode terminates).
3. Iterate through the time steps from the end of the episode, updating the total return G at each time step.
4. If the time step t represents the first visit to the state-action pair (S_t, A_t), append the return G at that time to Returns(S_t, A_t).
5. Calculate the average of all Returns for this state-action pair and update the Q value with the average.
6. Determine the best action to take at state S_t by taking the argmax of Q(S_t, a) over actions a.
7. Update the policy at state S_t following an ε-greedy policy.

With this understanding, let’s implement this algorithm.
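A minimal sketch of this on-policy first-visit MC control loop is shown below; the hyperparameters (ε = 0.1, γ = 0.9, 20,000 episodes) and the uniform initial policy are assumptions, not necessarily the settings used in the original implementation.

```python
import numpy as np
import gymnasium as gym

env = gym.make("FrozenLake-v1", map_name="4x4", is_slippery=False)
n_states, n_actions = env.observation_space.n, env.action_space.n
rng = np.random.default_rng(0)

eps, gamma = 0.1, 0.9
Q = np.zeros((n_states, n_actions))
returns_sa = [[[] for _ in range(n_actions)] for _ in range(n_states)]
# Policy stored as action probabilities per state; start uniform (epsilon-soft).
pi = np.full((n_states, n_actions), 1.0 / n_actions)

for episode in range(20000):
    # Generate an episode following the current policy pi.
    trajectory = []
    state, _ = env.reset()
    done = False
    while not done:
        action = int(rng.choice(n_actions, p=pi[state]))
        next_state, reward, terminated, truncated, _ = env.step(action)
        trajectory.append((state, action, reward))
        state = next_state
        done = terminated or truncated

    # Backward pass: update Q and improve pi at each first-visited pair.
    G = 0.0
    for t in reversed(range(len(trajectory))):
        s, a, r = trajectory[t]
        G = gamma * G + r
        if all((s, a) != (s_p, a_p) for s_p, a_p, _ in trajectory[:t]):
            returns_sa[s][a].append(G)
            Q[s, a] = np.mean(returns_sa[s][a])
            # Policy improvement: epsilon-greedy with respect to the updated Q.
            best = int(np.argmax(Q[s]))
            pi[s, :] = eps / n_actions
            pi[s, best] = 1 - eps + eps / n_actions

print(np.argmax(Q, axis=1).reshape(4, 4))     # greedy action in each cell
```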

Because of the randomness introduced by the ε-greedy policy, the learned policy fluctuates from run to run, but each learned policy represents one of the shortest paths from the start to the goal.

Off-policy MC prediction for estimating action values

In this section, we will learn Off-Policy MC methods, a different approach to reinforcement learning compared to the On-Policy MC methods we’ve explored so far. Off-Policy methods aim to separate the exploration policy from the target policy, allowing us to efficiently learn an optimal policy and deploy it for optimal behavior. Let’s explore how Off-Policy MC works.

On-Policy vs. Off-Policy

We’ve studied On-Policy MC methods, where the same policy is used for both collecting samples and updating the value function. However, this approach can be inefficient. The dilemma is that we seek to learn action values under optimal behavior, yet we need to behave sub-optimally because exploration of all state-action pairs is required.

For example, if we use an ε-greedy policy for both collecting and updating the value function, the agent continues to explore with a probability of ε even after fully learning the optimal actions. This exploration is necessary for accurately estimating the value function, which is why we employ a soft-policy like the ε-greedy policy.

Ideally, we want to distinguish between:

1. The policy for exploring the environment, ensuring that all state-action pairs have a non-zero probability of being visited (Behavior policy, b).
2. The policy we consider as the optimal action for each state (Target policy, π).

By doing this, we can deploy the target policy for optimal behavior once the learning process is complete. Off-Policy MC methods help us achieve this separation.

The Challenge of Off-Policy Estimation

With Off-Policy MC methods, we have two policies: a behavior policy (b) and a target policy (π). Our goal is to estimate action values under the target policy (π), but the samples we have are collected using the behavior policy (b), which may have a different distribution. To ensure convergence to the true value function under π, we need to explore all state-action pairs that π would visit using b. This requirement is known as the assumption of coverage.

The technique used to address this distribution mismatch is called importance sampling. Importance sampling is a general method for estimating expected values under one distribution given samples from another. It works by re-weighting each sample so that the weighted data match the distribution induced by the target policy π.

The importance-sampling ratio for the portion of a trajectory from time t up to its termination time T is defined as

\rho_{t:T-1} = \prod_{k=t}^{T-1} \frac{\pi(A_k \mid S_k)}{b(A_k \mid S_k)}

This ratio is the relative probability of the trajectory under the target policy versus the behavior policy. Weighting the sampled returns by it gives them the correct expected value under the policy π.
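As a toy illustration with made-up numbers (not taken from the Frozen Lake experiments), the ratio is simply the product of the per-step probability ratios of the actions that were actually taken:

```python
# Toy example (made-up numbers): a three-step trajectory whose actions the
# target policy pi would take with probability 1, collected by a uniform-random
# behavior policy b over four actions.
pi_probs = [1.0, 1.0, 1.0]      # pi(A_k | S_k) for the actions actually taken
b_probs = [0.25, 0.25, 0.25]    # b(A_k | S_k) under uniform random exploration

rho = 1.0
for p_pi, p_b in zip(pi_probs, b_probs):
    rho *= p_pi / p_b           # multiply the per-step probability ratios
print(rho)                      # 64.0: this return is up-weighted by a factor of 64
```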

Action Value Estimation with Off-Policy MC and Importance Sampling

Using the importance-sampling ratio and MC methods, we can estimate values under policy π from the data collected with policy b. Written for state values (the action-value estimator has the same form, with visits counted per state-action pair), the estimate is

V(s) = \frac{\sum_{t \in \tau(s)} \rho_{t:T(t)-1} G_t}{|\tau(s)|}

where τ(s) is the set of time steps at which state s is visited, T(t) is the termination time of the episode containing time step t, and |τ(s)| is the number of such visits. This method is called ordinary importance sampling.

There is also another version, known as weighted importance sampling, which normalizes by the sum of the ratios instead of the number of visits:

V(s) = \frac{\sum_{t \in \tau(s)} \rho_{t:T(t)-1} G_t}{\sum_{t \in \tau(s)} \rho_{t:T(t)-1}}

Ordinary Importance Sampling vs Weighted Importance Sampling

The main difference between ordinary importance sampling and weighted importance sampling lies in their biases and variances. Ordinary importance sampling is unbiased, while weighted importance sampling is biased (though the bias converges asymptotically to zero). On the other hand, the variance of ordinary importance sampling can be unbounded, as the ratios may become very large. In contrast, the weighted estimator ensures that the largest weight on any single return is one, which typically results in lower variance.

In practice, the weighted estimator is often preferred due to its lower variance, and every-visit methods are commonly used as they simplify tracking visited states and are more amenable to approximation techniques.
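The toy computation below (again with made-up ratios and returns, not experimental results) shows how a single large ratio, such as the 64 from the earlier example, can blow up the ordinary estimate, while the weighted estimate stays within the range of the observed returns because it is a convex combination of them:

```python
import numpy as np

# Made-up samples for a single state: one trajectory matched the target policy
# closely (ratio 64, as above) and returned 1; the others have ratio 1 or 0.
ratios = np.array([64.0, 1.0, 0.0, 0.0])   # rho for each sampled return
returns = np.array([1.0, 0.0, 1.0, 0.0])   # corresponding returns G

ordinary = (ratios * returns).sum() / len(returns)   # unbiased, can blow up
weighted = (ratios * returns).sum() / ratios.sum()   # biased, bounded by the returns
print(ordinary, weighted)                            # 16.0 vs. ~0.985
```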

For a more detailed explanation, you can refer to page 105 of the reference (Sutton & Barto, 2018).

The section below provides pseudocode for off-policy Monte Carlo (MC) prediction with every-visit updates and weighted importance sampling.

Ref: Sutton, R. S., & Barto, A. G. (2018). Reinforcement learning: An introduction. MIT press.

This method tracks the cumulative weight for each state-action pair in the variable C and uses it to update the value function through a weighted average. As we discussed, the key to this off-policy approach is the simultaneous use of both the behavior policy b and the target policy π.

We generate episodes using behavior policy b, and then we adjust the sampled data to match the distribution of the target policy π. This adjusted data is used to update the action value function. Let’s implement this algorithm.
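A minimal sketch of this prediction algorithm is shown below. The uniform-random behavior policy matches the description that follows; the choice of an ε-greedy target policy built around the hand-picked optimal actions from the earlier sketch, and all hyperparameters, are assumptions about the original setup.

```python
import numpy as np
import gymnasium as gym

env = gym.make("FrozenLake-v1", map_name="4x4", is_slippery=False)
n_states, n_actions = env.observation_space.n, env.action_space.n
rng = np.random.default_rng(0)
eps, gamma = 0.1, 0.9

# Target policy pi: epsilon-greedy around hand-picked optimal actions (assumed).
optimal_action = [1, 2, 1, 0, 1, 0, 1, 0, 2, 1, 1, 0, 0, 2, 2, 0]

def pi_prob(state, action):
    """Probability that the epsilon-greedy target policy picks `action` in `state`."""
    p = eps / n_actions
    if action == optimal_action[state]:
        p += 1 - eps
    return p

b_prob = 1.0 / n_actions              # behavior policy b: uniform random exploration

Q = np.zeros((n_states, n_actions))
C = np.zeros((n_states, n_actions))   # cumulative sum of weights per (s, a)

for episode in range(50000):
    # Generate an episode with the behavior policy b.
    trajectory = []
    state, _ = env.reset()
    done = False
    while not done:
        action = int(rng.integers(n_actions))
        next_state, reward, terminated, truncated, _ = env.step(action)
        trajectory.append((state, action, reward))
        state = next_state
        done = terminated or truncated

    # Backward pass with incremental weighted-importance-sampling updates.
    G, W = 0.0, 1.0
    for s, a, r in reversed(trajectory):
        G = gamma * G + r
        C[s, a] += W                               # accumulate the weight
        Q[s, a] += (W / C[s, a]) * (G - Q[s, a])   # weighted running average
        W *= pi_prob(s, a) / b_prob                # extend the importance ratio

print(Q[10])   # action values for the example cell (state index is an assumption)
```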

In the above implementation, we use random exploration as the behavior policy and an ε-greedy policy as the target policy.
We plot the learned action values for state (3,2) as an example. As discussed, in this cell the optimal action is to move downward. Moving right is the least favorable action due to the presence of a hole, and moving up or left takes the agent away from the goal, making those actions suboptimal. The learned action values reflect these expectations, although the result fluctuates slightly from run to run.

Off-policy MC control for finding the optimal policy

Lastly, let’s think about the control problem with Off-policy MC methods.

As discussed earlier, Off-Policy MC methods allow us to use different behavior and target policies. In the pseudocode below, we employ weighted importance sampling and every-visit updates. It’s crucial to note again that, to accurately estimate the value function, we must visit all state-action pairs with our behavior policy, so the behavior policy must be a soft policy. The target policy, in contrast, is the greedy policy, which always selects the action that maximizes the estimated return.

Ref: Sutton, R. S., & Barto, A. G. (2018). Reinforcement learning: An introduction. MIT press.

It’s important to note that the numerator of the weight update is always 1 because our target policy is greedy: the updated policy π assigns probability 1 to the greedy action (and probability 0 to every other action, which is why the inner loop exits as soon as the behavior action differs from the greedy one).

Let’s implement this algorithm.
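A minimal sketch of this off-policy MC control loop is shown below. Consistent with the note further down, the behavior policy here is ε-greedy around the current Q rather than purely random (its exploration rate is an assumption), and the incremental weighted-average update via the cumulative weight C follows the pseudocode above.

```python
import numpy as np
import gymnasium as gym

env = gym.make("FrozenLake-v1", map_name="4x4", is_slippery=False)
n_states, n_actions = env.observation_space.n, env.action_space.n
rng = np.random.default_rng(0)
gamma = 0.9
behavior_eps = 0.3       # exploration rate of the behavior policy (an assumption)

Q = np.zeros((n_states, n_actions))   # see the [Caution] note below on initialization
C = np.zeros((n_states, n_actions))   # cumulative weights per state-action pair

for episode in range(50000):
    # Behavior policy b: epsilon-greedy around the current Q.
    trajectory = []
    state, _ = env.reset()
    done = False
    while not done:
        greedy_a = int(np.argmax(Q[state]))
        action = int(rng.integers(n_actions)) if rng.random() < behavior_eps else greedy_a
        # Probability with which b selected this action (needed for the weight).
        b_p = behavior_eps / n_actions + (1 - behavior_eps if action == greedy_a else 0.0)
        next_state, reward, terminated, truncated, _ = env.step(action)
        trajectory.append((state, action, reward, b_p))
        state = next_state
        done = terminated or truncated

    # Backward pass: greedy target policy with weighted importance sampling.
    G, W = 0.0, 1.0
    for s, a, r, b_p in reversed(trajectory):
        G = gamma * G + r
        C[s, a] += W
        Q[s, a] += (W / C[s, a]) * (G - Q[s, a])
        if a != int(np.argmax(Q[s])):
            break                     # pi(a|s) = 0, so all earlier weights would be 0
        W *= 1.0 / b_p                # numerator is 1 because pi is greedy

print(np.argmax(Q, axis=1).reshape(4, 4))   # learned greedy (target) policy
```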

[Caution] If the value function is initialized with values larger than any return the algorithm can actually receive, it cannot properly update the policy.

An inherent limitation of this approach is that it learns only from the tails of episodes, specifically the portion after the last time the behavior policy deviated from the greedy target policy (i.e., when all remaining actions are greedy). This can lead to slow learning, especially for states encountered early in long episodes.

For instance, in the implementation described above, if a purely random exploration policy is used as the behavior policy instead of an ε-greedy one, learning becomes considerably slower.

To mitigate this issue, one promising solution is to incorporate temporal-difference (TD) learning. TD learning, which we explore in the next blog, addresses this challenge by letting the agent update its value estimates at every step from other value estimates, rather than waiting for complete episode returns, even when non-greedy actions are prevalent.

Summary

In this blog, we reviewed a range of Monte Carlo methods for solving Markov Decision Processes (MDPs). As a summary, I provide a comparison of off-policy versus on-policy methods, and a comparison of Monte Carlo methods with Dynamic Programming.

Off-policy methods vs On-policy methods

When considering Off-policy methods versus On-policy methods:

  • On-policy methods are generally simpler to grasp and implement. They are well-suited for straightforward policy evaluation.
  • Off-policy methods, on the other hand, introduce additional concepts and notation, which can make them more complex. They often exhibit higher variance and slower convergence since the data is collected under a different policy.
  • However, Off-policy methods are more powerful and versatile, offering a broader range of applications. It’s worth noting that On-policy methods are a specific case where the target and behavior policies are identical.

Advantages of Monte Carlo over Dynamic Programming

Monte Carlo (MC) methods provide distinct advantages when compared to Dynamic Programming:

  • MC methods enable the learning of optimal behavior directly through interactions with the environment, without requiring a model of the environment’s dynamics.
  • These methods can be applied with both real interactions and simulated models, providing flexibility in various scenarios.
  • MC methods are particularly efficient when focusing on a specific subset of states, allowing accurate evaluations of regions of interest without the need for precise value estimations across the entire state space.
  • Notably, MC methods are often less affected by violations of the Markov property. This is because they do not update value estimates based on the value estimates of successor states, thus minimizing reliance on bootstrapping.

In the next blog, we will explore another valuable approach to learning value functions and policies, known as TD Learning.

Part 1: Dynamic Programming

Part 3: TD-learning

Codes

Reference

  • Sutton, R. S., & Barto, A. G. (2018). Reinforcement learning: An introduction. MIT press.


Yuki Minai

Ph.D. student in Neural Computation and Machine Learning at Carnegie Mellon University, Personal webpage: http://yukiminai.com