Find an optimal policy with Finite Markov Decision Process: Part 3 TD-learning

Yuki Minai
9 min read · Nov 20, 2023

In this series of blogs, we will delve into various methods for finding an optimal policy within the context of Finite Markov Decision Processes (MDPs).

Series Overview

The series is structured into three parts, each focusing on a different approach for solving MDPs:

Part 1: Dynamic Programming

Part 2: Monte Carlo Methods

Part 3: TD Learning (This blog)

In Part 2, we learned how Monte Carlo (MC) methods solve the prediction problem, where we estimate the value function associated with a specific policy, and the control problem, where we aim to discover an optimal policy, through interactions with the environment. A distinguishing characteristic of MC methods is their reliance on sampled returns observed during these interactions to update the value function. As a consequence, each episode must reach its terminal state before any learning can take place.

However, do we really have to wait until the end of the episode? Are there more efficient ways to learn the value function and policy? In particular, can we learn during an episode, before it terminates? In this blog, we will explore the answers to these questions by introducing the concept of Temporal Difference (TD) learning.

Prepare Frozen Lake environment

As in the last blog, we will use the Frozen Lake environment from gymnasium.

Learn more about the Frozen Lake environment

The objective of this environment is for an agent to navigate through a grid world, starting from the initial cell and reaching the goal cell. Here, we are using a 4x4 grid map, and each cell falls into one of four different categories:

  • S (Start): This cell is where our agent begins its journey.
  • F (Frozen): These cells are safe for the agent to walk on.
  • H (Hole): These are hazardous cells, and if the agent falls into one, the episode terminates with a reward of 0.
  • G (Goal): Reaching this cell yields a reward of +1 for the agent.

From the starting cell, the agent has the option to move in four directions: up, left, down, or right. The agent’s task is to explore the grid world, making decisions at each time step to eventually reach the goal cell and collect a reward of +1.

The code below shows an example of the agent randomly exploring this environment over 100 time steps.
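A minimal sketch of such a random rollout, assuming gymnasium’s FrozenLake-v1 with the default 4x4 map (the exact settings are illustrative):

```python
import gymnasium as gym

# Create the 4x4 Frozen Lake environment (slippery transitions are the default).
env = gym.make("FrozenLake-v1", map_name="4x4")

obs, info = env.reset(seed=0)
for t in range(100):
    action = env.action_space.sample()  # random action: 0=left, 1=down, 2=right, 3=up
    obs, reward, terminated, truncated, info = env.step(action)
    if terminated or truncated:
        # The agent fell into a hole, reached the goal, or hit the step limit.
        obs, info = env.reset()
env.close()
```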

TD(0)

As we discussed briefly in the introduction, TD-learning is a method that learns state/action values from experience. Instead of relying on sampled returns, it uses estimated state/action values of future states to update the current state/action values. This technique, known as bootstrapping, is achieved through the following update formula:
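V(S_t) ← V(S_t) + α [R_{t+1} + γ V(S_{t+1}) − V(S_t)]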

Here, V(S_t) represents the state value of the current state S_t, α denotes the step size or learning rate, R_{t+1} stands for the reward received at the next time step, γ is the discount factor for future rewards, and V(S_{t+1}) represents the state value of the subsequent state S_{t+1}.

The term R_{t+1} + γ V(S_{t+1}) estimates the value of the current state as the sum of the immediate reward and the discounted state value of the next state (i.e., the expected total reward from state S_{t+1} onward).

R_{t+1} + γ V(S_{t+1}) − V(S_t) is referred to as the TD error, which quantifies the difference between the current estimate V(S_t) and the target R_{t+1} + γ V(S_{t+1}).

It’s important to note that, in contrast to MC methods, which average the actual sampled returns G observed after visiting a state, TD-learning incorporates the estimated state value of the next state in the update of the current state value.

Now, let’s learn the details of the algorithm. Below, you can find the pseudo-code for TD-learning, which is employed to estimate the state value function under a policy π.

Ref: Sutton, R. S., & Barto, A. G. (2018). Reinforcement learning: An introduction. MIT press.

It’s worth noting that, since the update uses the sampled reward R at the next time step, it occurs after taking a step from state S and observing the next state S’ and reward R.

This TD-learning method is referred to as TD(0) or one-step TD because its update target uses only the immediate reward at the next step, denoted as R, and the value of the very next state. Two-step TD, on the other hand, incorporates not only the immediate reward at the next state but also the reward one step further, resulting in the following update formula:
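V(S_t) ← V(S_t) + α [R_{t+1} + γ R_{t+2} + γ² V(S_{t+2}) − V(S_t)]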

To execute this update, one must wait until the next two rewards have been observed. Likewise, as the number of steps grows, the update requires waiting longer and longer; in the limit, it waits until the end of the episode and no longer uses any estimated next-state value. This limit corresponds to the Monte Carlo (MC) method, which relies on the entire sequence of sampled rewards for its state/action value updates.

From a mathematical perspective, we can express the following relationship for the value function estimate:
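v_π(s) = E_π[G_t | S_t = s]
       = E_π[R_{t+1} + γ G_{t+1} | S_t = s]
       = E_π[R_{t+1} + γ v_π(S_{t+1}) | S_t = s]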

MC methods use an estimate of the first expression, with G_t as the target, whereas TD learning (like Dynamic Programming) uses an estimate of the last expression, with R_{t+1} + γ v_π(S_{t+1}) as the target.

Similar to MC methods, for any fixed policy π, TD(0) converges in the mean to v_π when a constant step-size parameter is sufficiently small, and it converges with probability 1 when the step-size parameter decreases according to the usual stochastic approximation conditions.

Now, let’s see the implementation of the above TD(0) for estimating the state values.
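A minimal sketch of tabular TD(0) prediction, assuming a Frozen-Lake-style gymnasium environment and a `policy` function that maps a state to an action (the hyperparameters below are illustrative):

```python
import numpy as np

def td0_prediction(env, policy, n_episodes=5000, alpha=0.1, gamma=0.99):
    """Estimate the state values V under `policy` with tabular TD(0)."""
    V = np.zeros(env.observation_space.n)
    for _ in range(n_episodes):
        state, _ = env.reset()
        done = False
        while not done:
            action = policy(state)
            next_state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated
            # Bootstrapped target: R + gamma * V(S'); V of a terminal state is 0.
            target = reward + gamma * V[next_state] * (not terminated)
            # Move V(S) a step of size alpha toward the target (TD(0) update).
            V[state] += alpha * (target - V[state])
            state = next_state
    return V
```

For example, `td0_prediction(env, policy=lambda s: env.action_space.sample())` would evaluate the uniform random policy.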

Let’s check the estimated state values under an optimal ε-greedy policy.

The learned state values show that cells closer to the goal have higher values than those farther away, which matches our expectations because of the discount factor. Thus, we can effectively estimate state values using TD-learning.

On-policy TD control (SARSA)

Now, let’s consider how to learn the optimal policy using TD-learning. As we discussed in the last blog, there are two fundamental approaches for learning the optimal policy through interactions with the environment: on-policy and off-policy methods.

The on-policy method employs a single policy for both behaving in the environment and defining the optimal policy. In contrast, the off-policy method utilizes two distinct policies. It employs a behavior policy to interact with the environment and learns a target policy to optimize behavior.

TD-learning can be employed with both of these approaches. Let’s begin by examining the on-policy TD control scenario, particularly focusing on SARSA (State-Action-Reward-State-Action).

It’s essential to remember that when the environment’s dynamics are unknown and the agent must learn through interaction with the environment, learning state values alone isn’t sufficient to determine the optimal policy. Even if we know the state values, we cannot tell which action leads to a high-value state without knowing the dynamics. Instead, our objective is to estimate action values. By learning the action values, we can define our policy as taking the action with the maximum value at each state.

The update formula for action values with SARSA closely resembles that for state values, but we replace the state value estimate, represented as V, with the action value estimate, denoted as Q:
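Q(S_t, A_t) ← Q(S_t, A_t) + α [R_{t+1} + γ Q(S_{t+1}, A_{t+1}) − Q(S_t, A_t)]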

If S_{t+1} is a terminal state, then Q(S_{t+1}, A_{t+1}) is defined as zero.

Below is the pseudocode for implementing SARSA to estimate action values.

Ref: Sutton, R. S., & Barto, A. G. (2018). Reinforcement learning: An introduction. MIT press.

Since SARSA is an on-policy method, it uses a single policy derived from the action value function Q. This policy is used to make action choices and update the Q-values.

Unlike Monte Carlo (MC) methods, SARSA (a TD-learning method) updates action values without waiting for the episode to terminate. For instance, if a policy were to trap the agent in the same state, the episode would never end and, with MC methods, no value update would ever happen. TD-learning methods like SARSA don’t encounter this issue: they learn during the episode, recognize that such a policy is suboptimal, and switch to alternative strategies.

Let’s see the implementation of SARSA.
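A sketch of tabular SARSA along these lines; the ε-greedy helper and the hyperparameters are illustrative choices:

```python
import numpy as np

def epsilon_greedy(Q, state, epsilon, rng):
    """With probability epsilon pick a random action, otherwise a greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))
    return int(np.argmax(Q[state]))

def sarsa(env, n_episodes=10000, alpha=0.1, gamma=0.99, epsilon=0.1, seed=0):
    rng = np.random.default_rng(seed)
    Q = np.zeros((env.observation_space.n, env.action_space.n))
    for _ in range(n_episodes):
        state, _ = env.reset()
        action = epsilon_greedy(Q, state, epsilon, rng)
        done = False
        while not done:
            next_state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated
            # On-policy: the next action is chosen by the same epsilon-greedy policy
            # and is also the action used in the update target.
            next_action = epsilon_greedy(Q, next_state, epsilon, rng)
            target = reward + gamma * Q[next_state, next_action] * (not terminated)
            Q[state, action] += alpha * (target - Q[state, action])
            state, action = next_state, next_action
    return Q
```

Under the row-major indexing that Frozen Lake uses for its observations, `Q[row * 4 + col]` gives the four learned action values for a particular cell of the 4x4 grid.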

In the above code, we plot the learned action values for state (3,2) as an example. In this cell, the optimal action is to move downward. Moving right is the least favorable action due to the presence of a hole, while moving up or left takes the agent away from the goal, making those choices suboptimal. The learned action values reflect these expectations.

Q-learning

Next, we turn to off-policy TD-learning, a method known as Q-learning. Q-learning employs two distinct policies: a behavior policy and a target policy.

Just as a reminder, as we discussed in the last blog, in order to learn the optimal policy through interaction with the environment, we must visit all states many times. This requires our behavior policy to keep exploring, which results in an opportunity loss: we continue exploring actions even after identifying the optimal action for each state.

Off-policy methods address this challenge by employing two different policies.

Below is the pseudocode of Q-learning.

Ref: Sutton, R. S., & Barto, A. G. (2018). Reinforcement learning: An introduction. MIT press.

In this pseudocode, the behavior policy is a soft policy such as ε-greedy, while the target policy is the greedy policy, as reflected by the max in the Q-value update formula: the target uses the action with the maximum action value at the next state S’. The update equation is otherwise the same as SARSA’s, and it is this max term that makes Q-learning an off-policy learning method.

SARSA: Q(S_t, A_t) ← Q(S_t, A_t) + α [R_{t+1} + γ Q(S_{t+1}, A_{t+1}) − Q(S_t, A_t)]

Q-learning: Q(S_t, A_t) ← Q(S_t, A_t) + α [R_{t+1} + γ max_a Q(S_{t+1}, a) − Q(S_t, A_t)]

Let’s see the implementation of Q-learning with this update equation.
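A corresponding Q-learning sketch, reusing the ε-greedy helper from the SARSA sketch above (hyperparameters again illustrative):

```python
import numpy as np

def q_learning(env, n_episodes=10000, alpha=0.1, gamma=0.99, epsilon=0.1, seed=0):
    rng = np.random.default_rng(seed)
    Q = np.zeros((env.observation_space.n, env.action_space.n))
    for _ in range(n_episodes):
        state, _ = env.reset()
        done = False
        while not done:
            # Behavior policy: epsilon-greedy with respect to the current Q.
            action = epsilon_greedy(Q, state, epsilon, rng)
            next_state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated
            # Target policy: greedy, hence the max over actions at the next state.
            target = reward + gamma * np.max(Q[next_state]) * (not terminated)
            Q[state, action] += alpha * (target - Q[state, action])
            state = next_state
    return Q
```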

The above code is exactly the same as the SARSA implementation except for the update equation. The action values learned by Q-learning also match our expectations.

Summary

In this blog, we reviewed TD-learning for solving Markov Decision Processes (MDPs). The key difference between MC methods and TD-learning is the value update strategy: while MC methods use the average of sampled returns to estimate the value function, TD-learning uses the estimated value of the next state to estimate the value of the current state. TD-learning’s advantage lies in its ability to learn during an episode without waiting for its termination, since it doesn’t hinge on sampled returns. However, this agility comes at a cost: the updates bootstrap on current value estimates, which may themselves be inaccurate, so the estimates and the resulting decisions can change quickly.

Added:
While I was reviewing this topic again, I wondered, “Why does off-policy MC learning require importance sampling, while Q-learning (off-policy TD-learning) does not?”
This link helped me understand the answer.

Part 1: Dynamic Programming

Part 2: Monte Carlo Methods

Codes

Reference

  • Sutton, R. S., & Barto, A. G. (2018). Reinforcement learning: An introduction. MIT press.


Yuki Minai

Ph.D. student in Neural Computation and Machine Learning at Carnegie Mellon University, Personal webpage: http://yukiminai.com