Intro to Reinforcement Learning: Monte Carlo to Policy Gradient

Henry Wu
16 min read · Feb 15, 2024


This post is an introduction to reinforcement learning, covering in particular Monte Carlo methods, Temporal Difference learning, Deep Q-learning, Policy Gradient methods, and Advantage Actor-Critic. It draws primarily from the Sutton & Barto book (Reinforcement Learning: An Introduction), supplemented with several RL papers.

Figure 1: The agent-environment interaction in a Markov decision process. [1]

The main idea of reinforcement learning is that we have an agent that interacts with the environment by performing an action (A_t) at each time step t. The environment then responds to the action with a new state (S_t+1) and a reward (R_t+1). The agent takes the new state and reward into account when selecting its next action. This process repeats until the episode ends, or it can continue indefinitely, depending on the type of task. The agent’s goal is to perform actions that maximize the total reward it receives over the long run.

Return

How do we define the total reward or return received over the long run?

After some time steps of agent-environment interaction, we obtain a sequence (trajectory) of states, actions, and rewards.

In general, we aim to maximize the expected return, where the return, G_t, is defined as a specific function of the discounted reward sequence with a discount rate γ (0 ≤ γ ≤ 1). Rewards are discounted because we typically want to give greater weight to immediate rewards than to those received further in the future.
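
Concretely, a trajectory looks like S_t, A_t, R_t+1, S_t+1, A_t+1, R_t+2, …, and the standard definition of the discounted return (following Sutton & Barto) is:

G_t = R_t+1 + γ R_t+2 + γ² R_t+3 + … = Σ_{k=0}^{∞} γ^k R_t+k+1

With γ = 0 the agent is myopic and cares only about the immediate reward; as γ approaches 1, future rewards count nearly as much as immediate ones.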

Now that we know the agent wants to maximize the expected return, let's discuss how it accomplishes this.

Policy

The agent uses a policy to select actions given states. A policy is a mapping from states to the probabilities of selecting each possible action. The policy is what we want to learn and optimize so that it maximizes the expected return over the long run.
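
In the standard notation, a (stochastic) policy is written π(a|s) = Pr{A_t = a | S_t = s}, the probability of selecting action a when in state s.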

How do we find a policy that will accumulate a lot of reward? Let’s first look at value functions.

Value Functions

Value functions estimate the expected return for being in a given state, or for performing a given action in a given state.

There are two types of value functions:

State-Value Function

The state-value function represents the expected return when starting in state s and following policy π thereafter.
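
In symbols (the standard Sutton & Barto definition):

v_π(s) = E_π[ G_t | S_t = s ]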

Action-Value Function

The action-value function represents the expected return when starting in state s, taking action a, and following policy π thereafter.
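
In symbols:

q_π(s, a) = E_π[ G_t | S_t = s, A_t = a ]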

Optimal Policies and Value Functions

The goal of a reinforcement learning task is to find a policy that accumulates a lot of reward over the long run. Specifically, we want an optimal policy π*, a policy that achieves the highest possible expected return for all states. It turns out that optimal policies can be characterized by their value functions:

All optimal policies π* share the same value functions, called the optimal value functions:

Optimal State-Value Function: v*

Optimal Action-Value Function: q*

All these value functions obey special self-consistency equations known as Bellman equations. The Bellman equation expresses the relationship between the value of a state and the values of its successor states, which forms the basis for approximating the value functions.

Bellman Optimality Equation for v*

The Bellman optimality equation for v* expresses the fact that the value of a state under an optimal policy must equal the expected return for the best action from that state.
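
Written out, with p(s′, r | s, a) denoting the environment dynamics, the equation takes the standard form:

v*(s) = max_a Σ_{s′, r} p(s′, r | s, a) [ r + γ v*(s′) ]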

With the Bellman optimality equation for v*, we can easily determine the optimal policy π*, since for each state s, there will be one or more actions at which the maximum is obtained.

Bellman Optimality Equation for q*

With the Bellman optimality equation for q*, it is even easier to determine the optimal policy π*, since for each state s, one can simply find any action that maximizes q*(s, a).
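
In the same notation:

q*(s, a) = Σ_{s′, r} p(s′, r | s, a) [ r + γ max_{a′} q*(s′, a′) ]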

Explicitly solving the Bellman optimality equation is one way of finding an optimal policy, but it is rarely practical since it requires an exhaustive search of all possibilities, computing their probabilities of occurrence and their desirabilities in terms of expected rewards. Most methods instead approximate a solution to the Bellman optimality equation, using experienced transitions in place of knowledge of the expected transitions.

Different RL Approaches

Let’s explore various RL approaches before diving into specific algorithms.

[source]

Model-based vs. Model-free

  • Model-based methods: The agent has access to, or learns, a model of the environment and then uses this model to make decisions. A model of the environment refers to a function that can predict state transitions and rewards.
  • Model-free methods: The agent learns directly from experience (trial and error) and uses this feedback to update its internal policy or value functions.

Policy-based vs Value-based

  • Policy-based methods: The agent learns the policy directly.
  • Value-based methods: The agent learns a value function that gives the expected return of being in a given state or performing a given action in a given state. The policy can then be derived from the learned value function.

On-policy vs. Off-policy methods

In RL, the agent generates experience by interacting with the environment under a certain policy and then learns the policy from those experiences.

  • On-policy methods: The agent attempts to learn the policy that is also used to generate the experience.
  • Off-policy methods: The agent learns a policy that is different from the one used to generate experiences.

Generalized Policy Iteration (GPI)

Almost all RL methods adhere to the GPI paradigm. Policy iteration is a method to find the optimal policy by iteratively doing two general steps:

  • Policy Evaluation (E): We learn the value function under the current policy.
  • Policy Improvement (I): We use the evaluated value function to create a slightly improved policy.

GPI refers to the general idea of letting a policy π and a value function v improve upon each other iteratively until there is not much improvement to be made.

Monte Carlo Methods

Monte Carlo methods are ways of estimating value functions through experiences (sample sequence of states, actions, and rewards). Recall that the value of a state is the expected return (expected cumulative future discounted reward) starting from that state. One way to estimate the expected return is simply to average the returns observed after visits to that state. As more returns are observed, by the law of large numbers, the average should converge to the expected return. This idea underlies all Monte Carlo methods.
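
To make this concrete, here is a minimal first-visit Monte Carlo prediction sketch in Python. It assumes a classic Gym-style environment (reset()/step() returning a 4-tuple) and a fixed policy callable; the names are illustrative and not from this post.

```python
from collections import defaultdict

def mc_prediction(env, policy, num_episodes, gamma=0.99):
    """Estimate V(s) for a fixed policy by averaging first-visit returns."""
    returns_sum = defaultdict(float)   # sum of returns observed for each state
    returns_count = defaultdict(int)   # number of first visits to each state
    V = defaultdict(float)             # current value estimates

    for _ in range(num_episodes):
        # Generate one episode under the given policy.
        episode = []                   # list of (state, reward) pairs
        state = env.reset()
        done = False
        while not done:
            action = policy(state)
            next_state, reward, done, _ = env.step(action)
            episode.append((state, reward))
            state = next_state

        # Walk backwards through the episode, accumulating the discounted return G.
        G = 0.0
        for t in reversed(range(len(episode))):
            s, r = episode[t]
            G = gamma * G + r
            # First-visit check: only update V(s) at the earliest occurrence of s.
            if s not in (step[0] for step in episode[:t]):
                returns_sum[s] += G
                returns_count[s] += 1
                V[s] = returns_sum[s] / returns_count[s]
    return V
```

Each estimate V(s) is simply the running average of the returns observed after first visits to s, which converges to v_π(s) by the law of large numbers.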

Let’s consider Monte Carlo methods for approximating optimal policies π*.

On-policy MC

  1. On-policy MC uses an ε-soft policy. A soft policy is any policy that assigns a nonzero probability to every possible action in every state, which ensures continued exploration of alternative actions.
  2. After each episode, the observed returns are used to learn the action-value function, and then the policy is improved based on the learned value function for all the states visited in the episode.

Off-policy MC

In on-policy MC, one policy plays two roles: generating trajectories through exploration and learning the optimal policy. On-policy MC is therefore a compromise: it learns action values not for the optimal policy, but for a near-optimal policy that still explores.

Off-policy MC uses two policies: one for learning the optimal policy, called the target policy, and the other for exploration and generating trajectories, called the behavior policy. It follows the behavior policy while learning about and improving the target policy.

The complete algorithm:

  1. Off-policy MC uses a policy b, which can be any ε-soft policy, as the behavior policy to explore and generate trajectories. Note that we wish to estimate expected returns under the target policy π, but all we have are returns generated by the behavior policy b, which give us the wrong expectation. Through importance sampling, we can estimate expected values under one distribution given samples from another.
  2. The algorithm uses weighted importance sampling, which computes a weighted average of the observed returns according to the relative probability of their trajectories occurring under the target and behavior policies, as written out below.
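
Concretely (in Sutton & Barto’s notation), the importance-sampling ratio for the part of a trajectory from time t up to the episode’s end at time T is

ρ_t:T−1 = Π_{k=t}^{T−1} π(A_k | S_k) / b(A_k | S_k)

and the weighted importance-sampling estimate of a state’s value is the ρ-weighted average of the observed returns:

V(s) = Σ ρ_t:T−1 G_t / Σ ρ_t:T−1, with the sums taken over the time steps at which s was visited.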

With Monte Carlo methods, one must wait until the end of an episode before updating values. If episodes are long, learning can become slow.

Temporal Difference Learning (TD)

Both TD and Monte Carlo (MC) methods use experience to solve the prediction problem. In MC, an episode must be completed before values can be updated, which can be inefficient if episodes are long. TD methods, in contrast, need to wait only until the next time step or the next n-steps, before values can be updated. For simplicity, we will focus on one-step TD.

Difference between MC and TD updates:

MC methods use the actual return, G_t, following time t, meaning they must wait until the end of the episode to determine the update to V(S_t). TD methods only need to wait until the next time step: at time t+1, they immediately form a target and make a useful update using the observed reward R_t+1 and the current estimate V(S_t+1). The one-step (bootstrapped) return typically has much lower variance than the actual return, although it introduces bias.
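
In update form, with step size α:

MC: V(S_t) ← V(S_t) + α [ G_t − V(S_t) ]
TD(0): V(S_t) ← V(S_t) + α [ R_t+1 + γ V(S_t+1) − V(S_t) ]

The bracketed quantity in the TD update, R_t+1 + γ V(S_t+1) − V(S_t), is known as the TD error.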

TD methods update their estimates based in part on other estimates, a process you can view as learning a guess from a guess. It has been proven to converge asymptotically to the correct predictions.

Sarsa: On-policy TD

In Sarsa, we learn an action-value function Q that estimates q_π for the current behavior policy and, through policy improvement, gradually approaches q*. Here is what the Q update looks like:
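
Q(S_t, A_t) ← Q(S_t, A_t) + α [ R_t+1 + γ Q(S_t+1, A_t+1) − Q(S_t, A_t) ]

This is the standard one-step Sarsa update from Sutton & Barto; note that the target uses A_t+1, the action actually selected by the current ε-greedy policy in the next state.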

The complete algorithm:

  1. In Sarsa, an action is selected using an ε-greedy policy derived from the current estimates of the action-value function Q.
  2. During Q-update, it forms the target Q value using the action selected by the ε-greedy policy for the next state. The update is performed online after every state transition within each episode.
  3. It continually estimates the action-value function q_π under the behavior policy π (ε-greedy), while simultaneously adjusting π to more closely align with the greedily chosen actions based on the current Q-values.

Q-learning: Off-policy TD

[paper]

In Q-learning, the learned action-value function Q directly approximates q*, independent of the policy being followed. It is considered off-policy because the max operator in the target corresponds to a greedy policy, which differs from the ε-greedy behavior policy used to select actions. Here is what the Q update looks like:
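
Q(S_t, A_t) ← Q(S_t, A_t) + α [ R_t+1 + γ max_a Q(S_t+1, a) − Q(S_t, A_t) ]

Here the target bootstraps from the greedy (max) action in the next state, regardless of which action the behavior policy actually takes there.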

The complete algorithm:

  1. In Q-learning, an action is selected using an ε-greedy policy derived from Q.
  2. During Q-update, it forms the target using the max Q-value over all possible actions in the next state, which is a different policy from the one used to select actions.

In Q-learning, the algorithm updates the Q-values by selecting the maximum estimated Q-value among all possible actions in the next state to represent the expected return. It uses the same samples both to determine the maximizing action and to estimate its value, which can lead to a significant maximization bias.

Double Q-learning

[paper]

Double Q-learning aims to mitigate maximization bias by learning two independent estimates, Q_A and Q_B, where each Q function is updated using a value from the other Q function for the next state. The two Q functions learn from separate sets of experiences, but one can use both value functions to select an action.

The complete algorithm:

  1. In Double Q-learning, two independent action-value functions, Q_A and Q_B, are used to estimate q*.
  2. It selects an action a from state s using a policy that is based on both action-value functions.
  3. During updates, it randomly alternates the roles of the two Q estimates to stabilize learning: one Q function selects the best action (based on its current estimate), while the other provides the value estimate for that action, as in the update rule below.
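
For example, when Q_A is the one being updated (the Q_B case is symmetric), the update rule is:

Q_A(s, a) ← Q_A(s, a) + α [ r + γ Q_B(s′, argmax_a′ Q_A(s′, a′)) − Q_A(s, a) ]

Q_A picks the best next action while Q_B supplies the value estimate for it, decoupling action selection from action evaluation.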

So far we have described methods where the state and action spaces are small enough for the approximate value functions to be represented in tables. For example, Q-learning uses a Q table to store the Q value for each state-action pair. However, these approaches can become impractical in environments with a very large number of states or actions.

Approximate Solution Methods

In the remainder of this post, we will look at methods applicable to problems with arbitrarily large state spaces (e.g., images). In many such tasks, almost every state encountered will be unique, never having been seen before. To make sensible decisions in these novel states, the agent must generalize from past experience with states that are, in some sense, similar to the current one.

Our challenge lies in generalization. How can experience with a limited subset of the state space be usefully generalized to produce a good approximation across a much larger subset? In reinforcement learning, the generalization technique we need is often called function approximation. This technique takes examples from a desired function (e.g., a value function) and attempts to generalize from them to construct an approximation of the entire function.

Deep Q-Learning (DQN)

[paper]

As we learned from Q-learning, maintaining and updating a Q-table becomes impractical in large state space environments. This leads us to Deep Q-learning. Instead of using a Q-table, Deep Q-Learning uses a neural network that takes a state and approximates Q-values for each action based on the given state.

The goal is to use a function approximator to estimate the action-value function:
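
Q(s, a; θ) ≈ q*(s, a), where θ are the weights of the Q-network.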

We train a Q-network, represented as a neural network, by minimizing a sequence of loss functions L_i(θ_i) that change with each iteration i, where y_i represents our target and p(.) is the behavior distribution:
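
In the form given in the DQN paper (writing the behavior distribution as p(·)):

L_i(θ_i) = E_{s,a ~ p(·)} [ ( y_i − Q(s, a; θ_i) )² ],  where  y_i = E[ r + γ max_a′ Q(s′, a′; θ_i−1) | s, a ]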

Differentiating the loss function with respect to the weights θ_i gives the following gradient (θ_i-1 are held fixed):

∇_θ_i L_i(θ_i) = E_{s,a ~ p(·); s′} [ ( r + γ max_a′ Q(s′, a′; θ_i−1) − Q(s, a; θ_i) ) ∇_θ_i Q(s, a; θ_i) ]

The complete algorithm:

  1. DQN uses a Q-network to approximate the Q-values for each possible action, alongside a replay memory D for experience replay.
  2. Experience replay stores each step of the experience into the replay memory. This memory can then later be used for multiple weight updates.
  3. During training, it selects a random minibatch of sample transitions from the replay memory for Q-learning updates.
  4. Actions are selected using an ε-greedy policy, which balances exploration and exploitation.

Experience replay with random minibatches allows for greater data efficiency, breaks the strong correlations between consecutive samples, and therefore reduces the variance of the updates. DQN is considered an off-policy method since it learns from sample transitions generated by older policies.
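
As a concrete illustration, here is a minimal replay-memory sketch in Python; it is an illustrative sketch, not the implementation from the DQN paper, and the class and method names are my own.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity replay memory holding (s, a, r, s', done) transitions."""

    def __init__(self, capacity=100_000):
        # A deque with maxlen silently drops the oldest transitions when full.
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform random minibatches break the temporal correlation between samples.
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.buffer)
```

During training, each environment step is pushed into the buffer, and a random minibatch is sampled for every Q-network update, which is the experience-replay scheme described above.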

Double Deep Q-learning (Double DQN)

[paper]

Deep Q-learning suffers from maximization bias as it uses the same values both to select and evaluate an action. The issue is addressed by applying the principles of double Q-learning, which uses a separate target action-value function to estimate target values. It separates the max operator in the target into two parts: action selection and action evaluation.
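
Following the Double DQN paper, with θ the online-network weights and θ⁻ the target-network weights, the target becomes:

Y_t = R_t+1 + γ Q( S_t+1, argmax_a Q(S_t+1, a; θ_t); θ⁻_t )

The online network selects the action; the target network evaluates it.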

The complete algorithm:

  1. Double DQN uses two networks: the online network Q, which selects actions (both for behavior and for choosing the argmax action in the target), and a target network Q^, which evaluates the selected action to estimate the target value.
  2. During the Q-update, the target network Q^ is used to estimate the target value.
  3. The target network weights are periodically (every C steps) updated with the weights from the online network. This periodic update ensures that the target values gradually adjust to reflect the latest policy while maintaining enough stability to reduce overestimations.

Policy Gradient

In this section, we explore methods that directly learn a parameterized policy π that can select actions without consulting a value function. Policy gradient methods aim to directly optimize the cumulative reward by optimizing parametrized policies with respect to expected returns. The goal is to learn the policy parameter based on the policy gradient, which generally takes the following forms:

[source]
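
These estimators share the common form

g = E [ Σ_t Ψ_t ∇_θ log π_θ(a_t | s_t) ]

where Ψ_t can be, among other choices, the total return of the trajectory, the return following action a_t (with or without a baseline), the action-value function, the advantage function, or the TD residual.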

These choices of Ψ all lead to the same expected value for the policy gradient, despite having different variances. The advantage function gives the least variance and is extremely common in policy gradient methods. It measures how much better an action is compared to the average in a given state.
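
In symbols, the advantage is A_π(s, a) = q_π(s, a) − v_π(s).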

However, in practice, the advantage function is not known and must be estimated. Later, we will discuss methods that utilize advantage functions.

REINFORCE: Monte Carlo Policy Gradient

[paper]

REINFORCE is a Monte Carlo method that uses the complete return of the entire episode under its current policy π. It updates the policy parameter, θ, using the sampled return through stochastic gradient ascent.
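
The Sutton & Barto form of the update is

θ ← θ + α γ^t G_t ∇_θ ln π(A_t | S_t, θ)

where G_t is the full return from time step t until the end of the episode.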

The complete algorithm:

  1. REINFORCE generates sample trajectories, and thus returns, by following the parameterized policy π.
  2. It updates the policy parameter θ using the sampled return through stochastic gradient ascent.

As a Monte Carlo method, REINFORCE has a high variance given the stochasticity of the environment and the policy.

REINFORCE with Baseline

REINFORCE with Baseline includes a comparison of the return to a baseline b(s). The baseline, in general, leaves the expected value of the update unchanged, but can significantly reduce variance, speed up learning, and do so without introducing bias.
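
The update becomes

θ ← θ + α γ^t ( G_t − b(S_t) ) ∇_θ ln π(A_t | S_t, θ)

where the baseline b(S_t) is commonly a learned estimate of the state value, v^(S_t, w).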

The complete algorithm:

  1. REINFORCE with baseline is initialized with a parameterized policy π and a parameterized state-value function v^ as the baseline.
  2. The target now includes a comparison of the return against the baseline.
  3. The state-value function (baseline) is learned through SGD.

Actor-Critic Methods

Actor-critic methods aim to reduce variance by combining the benefits of policy-only and value-only approaches. The critic estimates a value function, which is then used to adjust the actor’s policy parameters. When the value function is used to assess actions in this manner, it is called a critic, and the overall policy-gradient method is called an actor–critic method.

REINFORCE with baseline can be turned into a one-step actor-critic method by replacing the full return with the one-step TD return.

It is important to note that REINFORCE with baseline itself is not an actor-critic method as the value function is not used to evaluate the action and guide the policy update.
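
In the one-step case, both updates are driven by the TD error:

δ_t = R_t+1 + γ v^(S_t+1, w) − v^(S_t, w)
w ← w + α_w δ_t ∇_w v^(S_t, w)
θ ← θ + α_θ γ^t δ_t ∇_θ ln π(A_t | S_t, θ)

so the critic’s TD error replaces the full Monte Carlo return in the policy update.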

Example algorithm:

  1. One-step actor-critic uses a parameterized state-value function as the critic to evaluate the policy’s performance and as a baseline.
  2. The state-value function is learned through semi-gradient methods. Semi-gradient methods update the parameters of the state-value function by taking into account the gradient of the function with respect to its parameters, but not the gradient with respect to the target.

Asynchronous Advantage Actor-Critic (A3C)

[paper]

A3C introduces a framework for deep reinforcement learning that uses asynchronous gradient descent to optimize deep neural networks. It addresses the problem of non-stationary data and highly correlated updates in online learning by asynchronously executing multiple agents in parallel. The framework relies on parallel actors exploring different policies to stabilize learning and to decorrelate the agents’ data.

For the setup, A3C eliminates the need for separate machines and a central parameter server by using multiple CPU threads on a single machine, thereby reducing communication costs of sending gradients and parameters.

The complete algorithm:

  1. A3C maintains a policy π(a|s;θ) and an estimate of the state-value function V(s;θ) that are shared across all threads. Each actor-learner (thread) has a copy of the environment and different exploration policies (e.g., by varying ε) for diversity.
  2. It uses n-step returns to update both the policy and the state-value function after every t_max time steps or when reaching a terminal state. It also shares some layers between the policy and the state-value function.
  3. During the update, it uses the advantage function, A(a, s), which is estimated by:
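
     A(s_t, a_t) ≈ Σ_{i=0}^{k−1} γ^i r_{t+i} + γ^k V(s_t+k; θ_v) − V(s_t; θ_v)

     where k varies from state to state and is bounded above by t_max.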

  4. It uses parallel actor-learners and accumulated updates to improve training stability.

Advantage Actor-Critic (A2C)

[blog]

A2C is a synchronous, deterministic implementation of A3C. It waits for each actor to finish its segment of experience before updating the parameters collectively. This allows more effective use of GPUs, which perform best with large batch sizes and complex policies. A2C has been observed to outperform the asynchronous implementation, with no evidence that the noise introduced by asynchrony provides any performance benefit.

References

[1] Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction (2nd ed.). MIT Press.