Actor-Critic: Value Function Approximations

Cheng Xi Tsou · Published in Geek Culture · Jul 23, 2021

In my previous post, I discussed a way to reduce variance by using the generalized policy update equation derived from the policy gradient theorem and adding a learned value function as a baseline. In this post, I want to elaborate further on that value function, the Critic, and how we can use different kinds of value function approximations to update its parameters.

Introduction

Taken from FreeCodeCamp

First of all, what is an Actor-Critic algorithm? According to the NIPS paper Actor-Critic Algorithms (Konda & Tsitsiklis, 1999), the vast majority of reinforcement learning algorithms at the time fell into two categories: Actor-only methods and Critic-only methods.

An Actor-only method works with a parameterized policy. The gradient of the performance measure is estimated directly by simulating the policy and updating the parameters in the direction of improvement. A downside is that these gradient estimates can have large variance.

A Critic-only method relies on approximating a value function and aims to approximate a solution to the Bellman equation (see below); the learned value function is then used to derive a policy. We can think of these as learning the policy indirectly: they learn the optimal value function and build the policy from it. A downside is that a good approximation of the value function does not guarantee that the resulting policy is close to optimal.

Bellman equation for estimating the value function in a non-deterministic environment
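The referenced figure isn't reproduced here; a standard way to write the Bellman equation for the state-value function in a stochastic (non-deterministic) environment, following Sutton & Barto's notation, is:

$$v_\pi(s) = \sum_a \pi(a \mid s) \sum_{s',\, r} p(s', r \mid s, a)\,\big[\, r + \gamma\, v_\pi(s') \,\big]$$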

The paper then presents a new kind of method, Actor-Critic algorithms, which combine the strong points of both actor-only and critic-only methods. The critic uses an approximation architecture to learn a value function, which is then used to update the actor's policy parameters.

Now, let's first take a look at the REINFORCE with baseline algorithm as an example. Although Sutton & Barto do not consider this an Actor-Critic method, since the value function is used as a baseline rather than as a critic, other sources such as David Silver's lectures and the A3C paper do treat it as a kind of Actor-Critic. Nonetheless, we'll use it as an example since we've already seen this algorithm in my previous post.

Taken from Sutton&Barto 2017

In this pseudocode, we can see the update equations for learning the Actor parameters θ (the policy) and the Critic parameters w (the state-value function). I've already derived the Actor's update equation from the policy gradient theorem, so I will focus on the Critic in this post.

The Critic

The Critic is a value function that serves as a baseline for updating the Actor's parameters. It is updated by minimizing a loss function that measures the difference between a target value and the current approximate value; as the loss shrinks, the approximation moves toward the target, our stand-in for the true value.

In the example, the Critic is a state-value function whose loss is the difference between Gₜ and v(Sₜ, w), which is used to update its parameters. We can generalize this error δ into the objective J(w) shown below:

Taken from David Silver’s UCL lecture 6 (COMPM050)
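In standard notation, this objective is the mean-squared error between the true value v_π(S) and the approximation v̂(S, w):

$$J(w) = \mathbb{E}_\pi\!\left[\big(v_\pi(S) - \hat{v}(S, w)\big)^2\right]$$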

Then, we can drop the expectation because we are sampling the gradient (stochastic gradient descent). Applying the chain rule once brings the exponent down as a factor of 2, which is absorbed into the learning rate.

Taken from David Silver’s UCL lecture 6 (COMPM050)
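Written out, the resulting stochastic-gradient update is (with the factor of 2 folded into the learning rate α, and v_π(Sₜ) later replaced by whichever sampled target we choose):

$$\Delta w = \alpha\,\big(v_\pi(S_t) - \hat{v}(S_t, w)\big)\,\nabla_w \hat{v}(S_t, w)$$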

Now we have an update equation similar to the one used in Sutton & Barto's REINFORCE with baseline! Note that when implementing the Critic update with a neural network, we can backpropagate an MSE loss directly, so there is no need to apply the chain rule by hand.
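As a concrete illustration, here is a minimal PyTorch-style sketch of such a Critic update. The network shape, the 4-dimensional state, and the critic_update helper are hypothetical placeholders of mine, not code from the original post; the point is simply that backpropagating an MSE loss handles the chain rule for us.

```python
import torch
import torch.nn as nn

# Hypothetical critic: maps a 4-dimensional state to a scalar value estimate.
value_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 1))
optimizer = torch.optim.Adam(value_net.parameters(), lr=1e-3)

def critic_update(state, target):
    """One Critic step: minimize (target - v(S, w))^2 for a single state."""
    value = value_net(state)                               # v(S_t, w)
    loss = nn.functional.mse_loss(value, target.detach())  # MSE loss, target held fixed
    optimizer.zero_grad()
    loss.backward()                                        # autograd applies the chain rule
    optimizer.step()
    return loss.item()
```

The target passed in can be any of the return estimates discussed below: Gₜ for Monte Carlo, or a bootstrapped one-step target for TD(0).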

Approximating the value function

So how do we actually implement this? The approximate value is simply the output of our value function with its current parameters. The target value is what the value function should output, and in reinforcement learning it is built from the rewards. There are several ways to construct the target value from rewards, so let's look at the different algorithms we can use to update our value function.

One way is a Monte Carlo method, which is what REINFORCE with baseline uses. We sample the whole trajectory, accumulate the loss at each step, then update the parameters. The target value is Gₜ, the cumulative discounted reward from timestep t. Because we sampled the whole trajectory beforehand, we can use the exact return, which sums all rewards from timestep t to the end of the episode.

Taken from David Silver’s UCL lecture 6 (COMPM050)
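Here is a small sketch of how the Monte Carlo targets might be computed once the episode's rewards have been collected; the function name and interface are my own, not from the post.

```python
def discounted_returns(rewards, gamma=0.99):
    """Compute G_t for every timestep of a finished episode."""
    returns = []
    g = 0.0
    for r in reversed(rewards):   # work backwards: G_t = R_{t+1} + gamma * G_{t+1}
        g = r + gamma * g
        returns.append(g)
    returns.reverse()
    return returns                # returns[t] is the target for v(S_t, w)
```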

Another way is TD(0), a form of temporal difference learning, which is what DQN uses (you can see me implementing DQN on Atari games in another post). While sampling the trajectory, we apply updates at each time step, also called online learning, where the parameters are updated while the episode is still being played out. Here, the target value is Rₜ₊₁ + γ v(Sₜ₊₁, w): we estimate the target using another estimate, v(Sₜ₊₁, w), a practice called bootstrapping. Unlike MC, the target is only an estimate, since we look ahead just one step and do not know the exact future rewards.

Taken from David Silver’s UCL lecture 6 (COMPM050)
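Putting the target and the gradient update together, the TD(0) step takes the standard semi-gradient form:

$$\Delta w = \alpha\,\big(R_{t+1} + \gamma\, \hat{v}(S_{t+1}, w) - \hat{v}(S_t, w)\big)\,\nabla_w \hat{v}(S_t, w)$$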

Forward-view TD(λ)

The last method we can use is TD(λ), a middle ground between MC and TD(0). In MC, we get an exact target at timestep t because we can "look" from timestep t to the end of the episode. In TD(0), we estimate the target because we only "look" from timestep t to timestep t+1. We can generalize this by looking n steps ahead, giving the n-step return as the target value:

Taken from David Silver’s UCL lecture 4 (COMPM050)
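In standard notation, the n-step return looks n steps ahead and then bootstraps from the value estimate:

$$G_t^{(n)} = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1} R_{t+n} + \gamma^{n}\, \hat{v}(S_{t+n}, w)$$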

and the update equation would look like this:

Taken from David Silver’s UCL lecture 4 (COMPM050)
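That is, the n-step return simply replaces the target in the update derived earlier:

$$\Delta w = \alpha\,\big(G_t^{(n)} - \hat{v}(S_t, w)\big)\,\nabla_w \hat{v}(S_t, w)$$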

There are several ways we can use these n-step returns. We could combine different values of n, for example by averaging the 1-step and 2-step returns. So how many steps ahead should we look? We are not limited to a single n: we can average n-step returns across all n to get a more robust target value. Instead of a simple average, we assign a weight to each n-step return and take the weighted sum. So how do we assign the weights?

One way is to give larger weights to small n, which looks only a few steps ahead, and smaller weights to large n, which looks toward the end of the episode. We let the weights follow a geometric decay rather than a linear one, partly because a geometrically weighted sum is efficient to compute. The diagram below shows what this looks like.

Taken from David Silver’s UCL lecture 4 (COMPM050)

The weight for the n-step return is (1 - λ)λ^(n-1), where n is the number of steps we look ahead and λ is a constant in [0, 1]. The (1 - λ) factor normalizes the geometric series so that the weights sum to 1. Changing λ determines how fast the weights decay, and therefore how many n-step returns contribute meaningfully. The target value becomes:

Taken from David Silver’s UCL lecture 4 (COMPM050)
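This geometrically weighted sum is the λ-return:

$$G_t^{\lambda} = (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1}\, G_t^{(n)}$$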

This approach of looking n steps forward and taking a weighted average of the n-step returns is called Forward-view TD(λ). We can update the value function parameters by sampling the whole trajectory and then applying offline updates at the end of each episode, similar to the Monte Carlo method.

You may notice that the summation goes to infinity. This assumes a continuing task that goes on past any terminal state, which is usually not the case. For episodic tasks, we can use this equation instead to get the target weights:

Taken from Sutton&Barto 2017
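For an episode that terminates at time T, the λ-return as given in Sutton & Barto truncates the sum and gives all of the remaining weight to the full return Gₜ:

$$G_t^{\lambda} = (1 - \lambda) \sum_{n=1}^{T-t-1} \lambda^{n-1}\, G_t^{(n)} + \lambda^{T-t-1}\, G_t$$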

Here, the n-step returns up to the end of the episode keep the same geometric weights as before. Then a single extra term gives the full return Gₜ all the weight that would otherwise have gone to n-step returns reaching past the terminal time T, which ensures that the weights still sum to 1. Here's a diagram that illustrates the two parts of this equation:

Taken from Sutton&Barto 2017

We can see that each weight corresponds to a slice of the area under the curve, and the final weight, given to the full return at termination, accounts for the rest of the curve. Here's the update equation for Forward-view TD(λ):

Taken from David Silver’s UCL lecture 4 (COMPM050)
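As before, the λ-return slots into the same semi-gradient update as the target:

$$\Delta w = \alpha\,\big(G_t^{\lambda} - \hat{v}(S_t, w)\big)\,\nabla_w \hat{v}(S_t, w)$$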

We can also see how MC and TD(0) fall out of Forward-view TD(λ). When λ = 1, every n-step weight becomes 0 except the final term, the full return to the end of the episode, so we recover MC. When λ = 0, the 1-step weight is 1 and all other weights are 0, so we only look one step ahead, recovering TD(0).

Backward-view TD(λ)

Another way to assign weights is by using eligibility traces, called Backward-view TD(λ). The concept of eligibility traces is simple. I’ll give a scenario described in David Silver’s UCL Lecture 4. Suppose you are a mouse. A bell rings three times, a light bulb lights up, then you get shocked.

Taken from David Silver’s UCL lecture 4 (COMPM050)

Now, did you get shocked because of the bell or because of the light? Most people would say the light, because it was what happened right before the shock. Someone else might blame the bell, since it rang three times while the light only went off once. We can think of these two reasonings as the frequency heuristic, which assigns blame to the bell because it happened more often, and the recency heuristic, which assigns blame to the light because it happened more recently. Eligibility traces combine the two heuristics to decide how much credit each visited state receives.

Taken from Sutton&Barto 2017

We define the eligibility trace as a vector z, initialized to 0. At each step we decay the current trace by the timestep discount factor γ and the constant λ, then add the most recent gradient ∇v(Sₜ, w). The constant λ ∈ [0, 1] determines how much we value frequency versus recency. A high λ means the trace decays slowly, so frequency matters more: if the same state appears again, the trace is pushed in the ∇v(Sₜ, w) direction again. A low λ means recency matters more, as most of the trace comes from the most recent ∇v(Sₜ, w). This eligibility trace z serves as the weight applied to each update.
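In symbols, the accumulating trace update described above is:

$$z_{-1} = \mathbf{0}, \qquad z_t = \gamma \lambda\, z_{t-1} + \nabla_w \hat{v}(S_t, w)$$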

So how do we use this z to perform TD(λ)? Here is the pseudocode from Sutton & Barto for using the eligibility trace to update our value function.

Taken from Sutton&Barto 2017

Unlike Forward-view TD(λ), eligibility traces let us perform online updates, where we update the value function at every time step. First, we initialize an eligibility trace z at the start of each episode. At every step we update the trace using the equation above, compute a loss from the one-step-ahead target just as in TD(0), and then update the parameters w in proportion to the eligibility trace.
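Here is a minimal sketch of that loop for a linear value function v̂(s, w) = w·x(s), to make the steps concrete. The environment interface (env.reset, env.step), the feature map features, and the policy function are hypothetical placeholders rather than anything from the original pseudocode.

```python
import numpy as np

def semi_gradient_td_lambda(env, features, n_features, policy,
                            alpha=0.01, gamma=0.99, lam=0.9, episodes=500):
    """Semi-gradient TD(lambda) with a linear value function v(s, w) = w @ features(s)."""
    w = np.zeros(n_features)
    for _ in range(episodes):
        state = env.reset()
        z = np.zeros(n_features)              # eligibility trace, reset every episode
        done = False
        while not done:
            next_state, reward, done = env.step(policy(state))
            x = features(state)               # for a linear v, grad_w v(s, w) = features(s)
            z = gamma * lam * z + x           # decay the trace, add the newest gradient
            v_next = 0.0 if done else w @ features(next_state)
            delta = reward + gamma * v_next - w @ x   # one-step TD error
            w += alpha * delta * z            # update parameters in proportion to the trace
            state = next_state
    return w
```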

We can observe some parallels with Forward-view TD(λ). In the forward view, we average all n-step returns by assigning a decaying weight to each one, and we can only update at the end of the episode because we need to "look forward" from each timestep t to compute the target. In the backward view, it seems as if we only look one step ahead for the target, but that is not the whole story: the eligibility trace accumulates the gradients of every state encountered so far, so when we update the value function in proportion to the trace, we are effectively distributing the one-step error backward over the states visited up to timestep t.

Taken from Sutton&Barto 2017

We can also see how MC and TD(0) fall out of Backward-view TD(λ). When λ = 1, the eligibility trace is discounted only by the timestep discount factor γ, so the trace reaches all the way back from timestep t to the beginning of the episode, matching Monte Carlo. When λ = 0, the eligibility trace equals the most recent gradient of the value function, so it only looks one step back; each update is then exactly a TD(0) update, and summing them over an episode gives the same result as TD(0).

Summary

In this post, we've covered in detail the different ways we can update the value function, the Critic of an Actor-Critic algorithm. We update the value function by adjusting its parameters to minimize the loss between a target value and the approximate value. We've gone over three ways of choosing the target in practice: Monte Carlo, TD(0), and TD(λ). Monte Carlo uses Gₜ as an exact target by looking ahead to the end of the episode, TD(0) uses Rₜ₊₁ + γ v(Sₜ₊₁, w) as an estimated target by looking ahead one step, and TD(λ) constructs an estimated target by combining n-step returns. We've also established two kinds of update mechanisms: MC and Forward-view TD(λ) update offline at the end of an episode, while TD(0) and Backward-view TD(λ) update online at every step.

Overall, this was a very valuable exercise in going more in-depth on approximating value functions. When I was learning Q-learning and implementing DQN, I didn't really understand the concepts behind approximating value functions, only the Bellman equation used to update the Q function's parameters. I then delved into policy gradients and learned how to update the policy parameters for the Actor. Now, with a deeper understanding of how to update the value function parameters for the Critic, I will be exploring and implementing Actor-Critic algorithms in the coming posts.
