Policy gradient methods: From REINFORCE to Actor Critic

Yuki Minai
13 min read · Dec 15, 2023

The reinforcement learning methods we covered in previous articles, such as Monte Carlo Methods, TD-learning, and Deep Q-learning, learn value functions and use them to derive an optimal policy. There is another category of methods that directly optimizes the policy rather than relying on the estimation of a value function. These methods are known as Policy Gradient Methods.

Policy Gradient Methods are especially useful, for example, when the policy is a simpler function to approximate than the value function. In such cases, a policy-based method will typically learn faster and yield superior asymptotic performance. In the opposite case, where the value function is simpler, estimating the value function might work better. Another important advantage of policy gradient methods is that we can impose our prior knowledge on the policy explicitly to construct the desired form of the policy.

In this article, we will learn three basic policy gradient methods: REINFORCE, REINFORCE with baseline, and Actor Critic. Once we understand the concept of REINFORCE, we can understand the other two just by adding a few improvements to it. We first go over some theoretical concepts to develop an understanding of policy gradient methods and explore the fundamentals of REINFORCE.

You can find all the code used in this article here.

Prepare environment

In this article, we utilize the CartPole-v0 environment in Gymnasium.

Learn more about the CartPole-v0 environment

The goal of this environment is to balance a pole by applying forces in the left and right directions on the cart. It has a discrete action space:
- 0: Push cart to the left
- 1: Push cart to the right

Upon taking an action, either left or right, an agent observes a 4-dimensional state consisting of:
- Cart Position
- Cart Velocity
- Pole Angle
- Pole Angular Velocity

A reward of +1 is granted to the agent at each step while the pole is kept upright. The maximum reward an agent can earn in a single episode is 200.

The episode ends under the following conditions:
- Termination: Pole Angle is greater than ±12°
- Termination: Cart Position is greater than ±2.4 (center of the cart reaches the edge of the display)
- Truncation: Episode length exceeds 200 steps

In the code below, I provide an example of the agent randomly exploring this environment over 20 time steps.
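Here is a minimal sketch of such a random rollout using the Gymnasium API (the full code is in the linked notebook, and details there may differ):

```python
import gymnasium as gym

# Create the environment and take 20 random actions, resetting whenever an episode ends.
env = gym.make("CartPole-v0")
state, info = env.reset(seed=42)
for t in range(20):
    action = env.action_space.sample()  # 0: push cart to the left, 1: push cart to the right
    state, reward, terminated, truncated, info = env.step(action)
    print(f"step={t}, action={action}, reward={reward}, state={state}")
    if terminated or truncated:
        state, info = env.reset()
env.close()
```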

Policy Gradient Methods Theory

Policy Gradient Methods directly learn a parameterized policy that can select actions without estimating a value function. In mathematical expression, this means to learn the policy’s parameter vector θ in the below expression
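π(a|s, θ) = Pr{ Aₜ = a | Sₜ = s, θₜ = θ }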

Here, π(a|s,θ) represents the probability of taking action a at time t given that the environment is in state s at that time, with parameter θ. Since our ultimate objective is to learn the optimal policy, it makes sense to directly optimize and learn it. But how do we achieve this?

A common approach is to use a parametric model, such as a neural network, as a policy estimator. This model outputs an action probability distribution given a state. But what would be the objective function to optimize the parameter vector θ?

For example, with Deep Q-learning (one of the value-function estimation methods), we use the Temporal Difference (TD) error to update the value function:
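In its simplest form (ignoring the target network), the loss is the squared TD error between the bootstrapped target and the current Q-value estimate:

L(θ) = ( Rₜ₊₁ + γ maxₐ Q(Sₜ₊₁, a; θ) − Q(Sₜ, Aₜ; θ) )²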

By running backpropagation with this objective function, we can optimize the model estimating a value function. For more details about Deep Q-learning, you can refer to this article.

For policy gradient methods, we can take the objective to be the discounted return gained by following the policy, and we run an optimization process to maximize this objective function.
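In the episodic case, this objective is commonly written as J(θ) ≐ v_π(s₀), the value (expected discounted return) of the start state s₀ when following the parameterized policy π.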

Can we compute the gradient of this return with respect to θ? Not directly: we don’t exactly know the form of the return or reward function, so we cannot differentiate it analytically. But the Policy Gradient Theorem offers a way to compute the gradient of this objective.

Policy Gradient Theorem

The Policy Gradient Theorem provides an analytic expression for the gradient of an objective (such as the return) with respect to the policy parameters. Let’s denote the objective function we would like to maximize as J(θ).
Then, the policy gradient theorem gives the gradient of this function as follows:
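∇J(θ) ∝ ∑ₛ μ(s) ∑ₐ q_π(s, a) ∇π(a|s, θ)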

where ∇J(θ) represents the gradient with respect to θ, π denotes the policy corresponding to parameter vector θ, q_π(s,a) denotes the value of a state-action pair under policy π, and μ(s) is the on-policy distribution over states under π.
A detailed proof is given on p. 325 of the reference (Sutton & Barto, 2018) or at this link.

Let’s think about what this expression means.
Remember that our goal is to maximize the objective function J(θ), so we would like to perform gradient ascent with respect to θ.
When the value of action a in state s is large (i.e., when q_π(s,a) is large), we would like to make this action more likely, because it yields a large return. The theorem achieves this by multiplying the gradient term ∇π(a|s, θ) by this large action value, so the parameters take a large step in the ascent direction that makes the action more probable.
On the other hand, when the value of action a in state s is small (or even negative), the step in that direction is small (or points away from it), so the probability of taking this action increases only slightly (or decreases).

The outer term ∑ₛ μ(s) controls the weight of each state: states that are visited more often contribute more to the update of θ than states that are rarely visited.

By iterating this update of θ, the algorithm gradually learns which directions to ascend, i.e., which actions’ probabilities to increase in order to maximize the objective function (the return).

By the way, the proportionality (not equality) is all that is needed to perform a policy gradient update because any constant of proportionality can be absorbed into the step size α, which is otherwise arbitrary.

REINFORCE

Now, let’s learn about the most basic policy gradient algorithm: REINFORCE.
The Policy Gradient Theorem has the term ∑ₛ μ(s) to account for how often each state occurs under the target policy π. If policy π is followed, states will be encountered in exactly these proportions μ(s). Thus, we can rewrite the above expression as an expectation over our sampled experience Sₜ:
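∇J(θ) ∝ E_π[ ∑ₐ q_π(Sₜ, a) ∇π(a|Sₜ, θ) ]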

With this expression, we can write our gradient ascent with sampled experience as follows:
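θₜ₊₁ = θₜ + α ∑ₐ q̃_π(Sₜ, a) ∇π(a|Sₜ, θₜ)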

where q̃_π is some learned approximation of q_π. Different algorithms use different estimates of this value function; REINFORCE uses the discounted return Gₜ as an estimate of q_π, as in Monte Carlo Methods.

To compute this update based on the discounted return Gₜ, we want to make one more change. The above equation involves a sum over all actions, but we would like to replace it with the single action Aₜ sampled at time step t. Then we can approximate q_π(Sₜ, Aₜ) with the observed discounted return after visiting the state-action pair (Sₜ, Aₜ). How can we make this change?
With just a few lines of math, we can rewrite the above gradient as follows:
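∇J(θ) ∝ E_π[ ∑ₐ π(a|Sₜ, θ) q_π(Sₜ, a) ∇π(a|Sₜ, θ) / π(a|Sₜ, θ) ]
      = E_π[ q_π(Sₜ, Aₜ) ∇π(Aₜ|Sₜ, θ) / π(Aₜ|Sₜ, θ) ]    (replacing the sum over a by the sampled action Aₜ ~ π(·|Sₜ, θ))
      = E_π[ Gₜ ∇π(Aₜ|Sₜ, θ) / π(Aₜ|Sₜ, θ) ]    (since E_π[ Gₜ | Sₜ, Aₜ ] = q_π(Sₜ, Aₜ))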

Here, just as we replaced the summation over all states with an expectation in the previous section, we use a similar trick (multiplying and dividing by π(a|Sₜ, θ)) to replace the summation over all actions with an expectation. At each step, we observe Gₜ by following π; by collecting many steps, the average of the observed Gₜ converges to the true value q_π(s, a).

In summary, the gradient equation we use for REINFORCE is defined as follows:
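θₜ₊₁ = θₜ + α Gₜ ∇π(Aₜ|Sₜ, θₜ) / π(Aₜ|Sₜ, θₜ) = θₜ + α Gₜ ∇ln π(Aₜ|Sₜ, θₜ)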

Let’s interpret this equation a little more. The update moves the parameter vector in a direction proportional to the return and inversely proportional to the action probability.
The former moves the parameters in directions that favor actions yielding higher returns. The latter prevents frequently selected actions from gaining an unfair advantage: updates occur more often in their direction, so without this normalization, frequently observed actions might win out even if they do not yield the highest return.

Note that because REINFORCE uses the return Gₜ, the update has to wait until the end of the episode; as with Monte Carlo Methods, no update can happen before the episode is complete.

Below is the pseudo-code of REINFORCE (a.k.a. Monte Carlo Policy Gradient Algorithm).

Let’s implement this algorithm. First, we define our model to estimate the policy. In this tutorial, we use a simple neural network model as a policy estimator.
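Below is a minimal PyTorch sketch of such a policy network; the class name PolicyNet, the hidden-layer size, and the layer choices are illustrative and may differ from the linked code:

```python
import torch
import torch.nn as nn

class PolicyNet(nn.Module):
    """Maps a 4-dimensional CartPole state to a probability distribution over the 2 actions."""
    def __init__(self, n_state=4, n_action=2, n_hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_state, n_hidden),
            nn.ReLU(),
            nn.Linear(n_hidden, n_action),
            nn.Softmax(dim=-1),  # output action probabilities π(a|s, θ)
        )

    def forward(self, state):
        return self.net(state)
```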

Next, we prepare helper functions to generate an episode following the current policy model and evaluate the performance of the current policy.
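A sketch of what these helpers might look like, reusing the PolicyNet above (the function names generate_episode and evaluate_policy are illustrative):

```python
def generate_episode(env, policy_net):
    """Roll out one episode with the current policy; collect states, rewards, and ln π(Aₜ|Sₜ, θ)."""
    states, rewards, log_probs = [], [], []
    state, _ = env.reset()
    done = False
    while not done:
        probs = policy_net(torch.as_tensor(state, dtype=torch.float32))
        dist = torch.distributions.Categorical(probs)
        action = dist.sample()
        next_state, reward, terminated, truncated, _ = env.step(action.item())
        states.append(state)
        rewards.append(reward)
        log_probs.append(dist.log_prob(action))
        state = next_state
        done = terminated or truncated
    return states, rewards, log_probs

def evaluate_policy(env, policy_net, n_episodes=20):
    """Average undiscounted return of the current policy over n_episodes, without learning."""
    with torch.no_grad():
        returns = [sum(generate_episode(env, policy_net)[1]) for _ in range(n_episodes)]
    return sum(returns) / n_episodes
```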

Lastly, we implement a function to train the policy neural network.
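A minimal sketch of the REINFORCE training loop, building on the helpers above; it accumulates the loss −Σₜ Gₜ ln π(Aₜ|Sₜ, θ) over an episode so that minimizing the loss performs gradient ascent on J(θ):

```python
def train_reinforce(env, policy_net, n_episodes=3500, gamma=0.99, lr=1e-3):
    """REINFORCE: after each episode, update θ in the direction of Gₜ ∇ln π(Aₜ|Sₜ, θ)."""
    optimizer = torch.optim.Adam(policy_net.parameters(), lr=lr)
    for _ in range(n_episodes):
        _, rewards, log_probs = generate_episode(env, policy_net)
        # Discounted return G_t for every time step, computed with a backwards pass.
        returns, G = [], 0.0
        for r in reversed(rewards):
            G = r + gamma * G
            returns.insert(0, G)
        returns = torch.as_tensor(returns, dtype=torch.float32)
        # Minimizing -Σ_t G_t * ln π(A_t|S_t, θ) performs gradient ascent on the objective J(θ).
        loss = -(returns * torch.stack(log_probs)).sum()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```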

Let’s test whether REINFORCE can learn a good policy to solve the cart pole environment. In the below code, we train the policy network for 3,500 episodes using 5 different random seeds. To monitor learning progress, we freeze the network weights every 100 episodes and perform 20 episodes with the current policy, evaluating the test performance of the policy at each training phase.
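A sketch of such an experiment loop and the corresponding plot, building on the functions above (seeding and plotting details in the linked notebook may differ):

```python
import numpy as np
import matplotlib.pyplot as plt

seeds = [0, 1, 2, 3, 4]
eval_returns = []                                   # one learning curve per seed
for seed in seeds:
    torch.manual_seed(seed)
    np.random.seed(seed)
    env = gym.make("CartPole-v0")
    policy_net = PolicyNet()
    curve = []
    for phase in range(35):                         # 35 phases x 100 episodes = 3,500 episodes
        train_reinforce(env, policy_net, n_episodes=100)
        curve.append(evaluate_policy(env, policy_net, n_episodes=20))  # test with frozen weights
    eval_returns.append(curve)

curves = np.array(eval_returns)                     # shape: (n_seeds, n_evaluation_points)
episodes = np.arange(1, curves.shape[1] + 1) * 100
plt.plot(episodes, curves.mean(axis=0))                                        # mean over seeds
plt.fill_between(episodes, curves.min(axis=0), curves.max(axis=0), alpha=0.3)  # min-max range
plt.xlabel("Training episodes")
plt.ylabel("Average test return")
plt.show()
```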

In the above plot, the line shows the average return over 5 seeds and the shading shows the range between the min and max of returns for 5 seeds.

The learned policy does not stably achieve the maximum return (200) on average within 3,500 episodes, and there is large variance in performance (i.e., learning is unstable). But we can see that the policy improves as REINFORCE collects more episodes.

REINFORCE with baseline

One problem with REINFORCE is its high variance, which results in slow learning. The REINFORCE algorithm estimates the gradient of the objective function using returns from sampled trajectories. Because it relies on sampled trajectories, the gradient estimates can have high variance, leading to noisy updates. This high variance can cause large fluctuations in the policy parameters during training.

One idea to reduce this fluctuation is to include a comparison of the value to an arbitrary baseline b(Sₜ):
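θₜ₊₁ = θₜ + α (Gₜ − b(Sₜ)) ∇ln π(Aₜ|Sₜ, θₜ)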

The (Gₜ − b(Sₜ)) term is called the advantage. The advantage represents how much better or worse the return from the current action is compared to the expected return at state Sₜ. The baseline effectively reduces the scale of the return term in the policy gradient update. Since the advantage is multiplied by the gradient of the log probability of the action, a smaller advantage term results in smaller updates to the policy, leading to more stable policy gradient updates.

But is this a valid operation? In other words, even after subtracting the baseline, can the above equation still be used as an estimate of the gradient of the objective function J(θ)? Yes: subtracting the baseline does not change the expected value of the gradient estimate, as follows:
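Because b(s) does not depend on the action a,

∑ₐ b(s) ∇π(a|s, θ) = b(s) ∇ ∑ₐ π(a|s, θ) = b(s) ∇1 = 0,

so the subtracted term has zero expectation and the gradient estimate remains unchanged in expectation; only its variance is affected.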

The baseline can be any function, as long as it does not vary with the action a, because then the gradient update remains valid even after subtracting b(s). For example, we can use a constant baseline for all states, or a regression model that estimates the average value of each state s.

When the baseline is uniformly zero, this update is the same as REINFORCE. Thus, REINFORCE with baseline is a generalization of REINFORCE (in other words, REINFORCE is a special case of REINFORCE with baseline).

Below is the pseudo-code of this update. In the pseudo-code, the baseline state value is estimated by a parametric model with parameter w.

In the pseudo-code, while we update the policy network parameters θ, we also update the parameters w of the state-value prediction model using the advantage. The state-value update uses a mean squared loss to minimize the prediction error between the predicted baseline value and the observed return.
Note that this algorithm has a separate learning rate for each parameter vector, α^θ and αʷ, both of which are hyperparameters of the algorithm.

The below code shows the implementation of the train function for REINFORCE + baseline.
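A hedged sketch of this training function, reusing PolicyNet and generate_episode from above and adding an illustrative ValueNet as the learned baseline b(s):

```python
import torch.nn.functional as F

class ValueNet(nn.Module):
    """Baseline b(s): predicts the state value used to form the advantage Gₜ - v(Sₜ, w)."""
    def __init__(self, n_state=4, n_hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(n_state, n_hidden), nn.ReLU(), nn.Linear(n_hidden, 1))

    def forward(self, state):
        return self.net(state).squeeze(-1)

def train_reinforce_baseline(env, policy_net, value_net, n_episodes=3500,
                             gamma=0.99, lr_policy=1e-3, lr_value=1e-3):
    policy_opt = torch.optim.Adam(policy_net.parameters(), lr=lr_policy)  # learning rate α^θ
    value_opt = torch.optim.Adam(value_net.parameters(), lr=lr_value)     # learning rate αʷ
    for _ in range(n_episodes):
        states, rewards, log_probs = generate_episode(env, policy_net)
        returns, G = [], 0.0
        for r in reversed(rewards):
            G = r + gamma * G
            returns.insert(0, G)
        returns = torch.as_tensor(returns, dtype=torch.float32)
        values = value_net(torch.stack([torch.as_tensor(s, dtype=torch.float32) for s in states]))
        advantages = returns - values.detach()                 # Gₜ - b(Sₜ)
        policy_loss = -(advantages * torch.stack(log_probs)).sum()
        value_loss = F.mse_loss(values, returns)               # fit the baseline to observed returns
        policy_opt.zero_grad()
        policy_loss.backward()
        policy_opt.step()
        value_opt.zero_grad()
        value_loss.backward()
        value_opt.step()
```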

Let’s test whether REINFORCE with baseline can learn a good policy and reduce the variance in learning.
As we did for REINFORCE, in the below code, we train the policy network for 3,500 episodes using 5 different random seeds. We freeze the weights of the network every 100 episodes and perform 20 episodes with the current policy to evaluate the test performance of the policy at each training phase.

REINFORCE with baseline converges to the best performance at around the 1,000th episode (although the result may vary slightly across runs) and achieves much smaller variance than REINFORCE, as expected. The use of a baseline offered a significant performance improvement in this case.

Actor Critic

One disadvantage of REINFORCE and REINFORCE with baseline is that they have to wait until the end of the episode to perform an update, because the update depends on the return G. Another disadvantage is that, while introducing a baseline reduces the large variance, the variance often remains high because the algorithm relies on the noisy sampled return G.
These disadvantages mirror those of Monte Carlo Methods for solving finite Markov Decision Processes (MDPs). In that setting, we overcome the limitation with TD-learning, which uses a value estimate to perform online updates (i.e., bootstrapping). This also helps reduce variance by removing the dependence on the sampled return G.

Can we use a similar strategy in the case of policy gradients? Yes, we can. Let’s think about how we can replace the return term G in the REINFORCE with baseline update equation with a value estimate. The simplest thing we can do is to estimate the return as the sum of the immediate reward and the discounted value estimate of the next state, i.e.

Gₜ:ₜ₊₁ ≐ Rₜ₊₁ + γ v̂(Sₜ₊₁, w)

as we do with the TD(0) method (you can learn more about TD(0) in this article), where v̂(s, w) is the critic’s learned estimate of the state value with parameter w. Then, using this bootstrapped return and v̂(Sₜ, w) as the baseline, we can write down the policy parameter update with the advantage as follows:
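θₜ₊₁ = θₜ + α ( Rₜ₊₁ + γ v̂(Sₜ₊₁, w) − v̂(Sₜ, w) ) ∇ln π(Aₜ|Sₜ, θₜ)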

This algorithm, which uses the sum of the immediate reward and the value estimate of the next state, is called a one-step actor-critic method.

In general, methods that learn approximations to both policy and value functions are called actor-critic methods, where the actor refers to the learned policy and the critic refers to the learned value function.

Below is the pseudo-code of the N-step Actor-Critic algorithm, a generalization of the one-step actor-critic method that uses the sum of N steps of immediate rewards plus the discounted state-value estimate of the state reached N steps later.

We can see that the return term (Gₜ) now incorporates the value estimate through bootstrapping, a term we did not have in REINFORCE with baseline.

Let’s see the implementation of this N-step Actor-Critic method.
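Below is a hedged sketch of an N-step actor-critic training function built on the same helpers. For simplicity, it still processes one full episode at a time rather than updating fully online, but each target now bootstraps from the critic after n_steps rewards instead of using the full Monte Carlo return. The function name and hyperparameters are illustrative:

```python
def train_actor_critic(env, policy_net, value_net, n_episodes=3500, n_steps=5,
                       gamma=0.99, lr_policy=1e-3, lr_value=1e-3):
    """N-step actor-critic: the target bootstraps from the critic after n_steps rewards."""
    policy_opt = torch.optim.Adam(policy_net.parameters(), lr=lr_policy)
    value_opt = torch.optim.Adam(value_net.parameters(), lr=lr_value)
    for _ in range(n_episodes):
        states, rewards, log_probs = generate_episode(env, policy_net)
        values = value_net(torch.stack([torch.as_tensor(s, dtype=torch.float32) for s in states]))
        T = len(rewards)
        policy_loss, value_loss = 0.0, 0.0
        for t in range(T):
            end = min(t + n_steps, T)
            # N-step return: up to n_steps rewards, plus the critic's value of the state reached after them.
            G = sum(gamma ** (k - t) * rewards[k] for k in range(t, end))
            if end < T:
                G = G + gamma ** (end - t) * values[end].detach()
            advantage = G - values[t].detach()                 # actor update uses the advantage
            policy_loss = policy_loss - advantage * log_probs[t]
            value_loss = value_loss + (G - values[t]) ** 2     # critic regression toward the N-step target
        policy_opt.zero_grad()
        policy_loss.backward()
        policy_opt.step()
        value_opt.zero_grad()
        value_loss.backward()
        value_opt.step()
```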

In the CartPole-v0 environment with the specified hyperparameters, the actor-critic method does not outperform REINFORCE with a baseline. It takes a longer time to converge to the best score, and the variance around the performance is greater compared to REINFORCE with a baseline. This could be attributed to suboptimal hyperparameters (such as learning rate, the number of immediate reward steps, and discount rate) used in this code. However, this outcome also highlights the inherent challenges of actor-critic methods, which involve training two networks, as opposed to the simplicity of tuning a single value network in REINFORCE with a baseline.

Another possible explanation is that the value function in this environment is relatively straightforward, and the use of bootstrapping in actor-critic does not have a significant advantage over REINFORCE with a baseline. In environments with relatively simple state and action spaces, where complexity is not a significant factor, REINFORCE with a baseline might be more effective. Actor-critic methods, with their additional complexity of maintaining a value function, might introduce unnecessary complications in such scenarios.

Summary

In this article, we learned three fundamental policy gradient methods: REINFORCE, REINFORCE with baseline, and Actor Critic. The Policy Gradient Theorem provides the foundation of the policy gradient methods by offering a way to estimate the gradient of the objective function. REINFORCE uses the sampled return to learn the optimal policy. REINFORCE with baseline introduces a baseline estimate of the value at each state to stabilize and accelerate the learning. Actor Critic uses bootstrapping to further improve the learning performance (although this was not very effective for this article's case).
Depending on the problem setting, it is important to carefully consider which approach is more effective: methods based on value function estimates or policy gradient methods.

Codes

Reference

- Sutton, R. S., & Barto, A. G. (2018). Reinforcement learning: An introduction. MIT press.

