Actor-Critic: Off-Policy Actor-Critic Algorithm

Cheng Xi Tsou · Published in Geek Culture · 10 min read · Aug 18, 2021


In this post, I will be exploring the ideas behind the paper Off-Policy Actor-Critic (Degris et al.), presented at ICML 2012. The paper describes an Actor-Critic method that is fully online and learns a target policy from trajectories generated by a separate behavior policy. This off-policy learning method keeps the explicit action selection of Actor-Critic methods while gaining the benefits of being off-policy: better exploration and the ability to reuse experience (e.g., experience replay). At the end, I will try to implement the algorithm described in the paper.

Critic: Policy Evaluation

Firstly, the paper describes the policy evaluation method for the Critic. As in a baseline Actor-Critic method, we use a set of weights to learn a linear approximation of the value function (in an implementation, we could instead use a neural network to learn a non-linear approximation). The paper uses Gradient-TD methods to learn the value-function weights in the off-policy setting; unlike conventional TD methods that combine bootstrapping with off-policy samples and function approximation, Gradient-TD methods come with convergence guarantees. These methods minimize the λ-weighted mean-squared projected Bellman error. The final algorithm presented in the paper uses a variant called GTD(λ), introduced by Maei (2011). I won’t go into detail about the value approximation method, but it is essentially a variation of TD(λ) adapted to the off-policy setting.

Off-policy Policy Gradient Theorem

Before we look at how the Actor learns the policy function, let’s see how the policy gradient theorem works in an off-policy setting. Here’s a quick review of the policy gradient theorem. First, we define the policy to be π(a|s), where the output is the probability of action a given state s. The policy has a weight vector u (or parameters u, if you’re using a neural network). The goal is to find the u that maximizes the performance measure function J:

Taken from Off-Policy Actor-Critic (Degris et al. 2013)
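Written out, the objective is roughly the following (my transcription of the paper’s performance measure, with V^π(s) denoting the value of state s under the target policy π):

J(u) = Σₛ dᵇ(s) V^π(s),   summing over all states s.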

Here, dᵇ(s) = lim_{t→∞} P(sₜ = s | s₀, b) is the limiting distribution of states when actions are selected by the behavior policy b. This is slightly different from the on-policy performance measure, since states are weighted by the behavior policy’s state distribution rather than by the target policy’s own. Recall how we updated the weights in REINFORCE: uₜ₊₁ = uₜ + α∇J(uₜ). Since we cannot compute ∇J(u) exactly in this setting, we instead use an approximation g(u) ≈ ∇J(u). For a complete derivation of this approximation, refer to my post on policy gradients.

Taken from Off-Policy Actor-Critic (Degris et al. 2013)
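In symbols, the approximate gradient is roughly (again my transcription, with Q^π(s, a) the action-value function of the target policy):

g(u) = Σₛ dᵇ(s) Σₐ ∇ᵤπ(a|s) Q^π(s, a)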

Here’s a quick rundown of the equation from right to left. Q(s, a) is a value function that evaluates an action a taken in state s. We multiply that by the gradient, with respect to the weights u, of the probability of action a given state s under the policy π. We then sum up these products over all actions a in our action space, giving us an expectation of the gradient of the value for state s. Then, we weight this by the behavior distribution dᵇ(s) for state s and sum the weighted values over all states s in our state space. The final value is the approximate gradient of the performance measure with respect to our weights u.

The paper provides two theorems to justify this approximation:

Taken from Off-Policy Actor-Critic (Degris et al. 2013)
Taken from Off-Policy Actor-Critic (Degris et al. 2013)

Theorem 1 states that updating the policy parameters in the direction of the approximation gives us a better policy. In an on-policy setting, the policy gradient theorem states that our approximation g(u) is equal to the gradient of the performance measure function J. While we can’t say the same in an off-policy setting, Theorem 2 establishes a relationship between the solutions found using the approximation and those of the performance measure function. For proofs of these two theorems, see the Appendix of the paper.

Actor: Online Updates with Eligibility Traces

Now that we have established the approximation for ∇J, we can come up with the update equation for the Actor. Just like with REINFORCE’s update equation, we want to work with an expectation of the approximation instead of actually summing over all states and actions.

We start off with the off-policy policy gradient approximation in (1). Then in (2), we get rid of the summation over all states by writing it as an expectation over states sampled from the distribution dᵇ. In (3), we multiply the terms inside the expectation by two ratios that each equal 1, then rearrange. The ratio ρ = π(a|s) / b(a|s) is the importance weight; here, b(a|s) is the probability of action a given state s under the behavior policy.
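Spelling those steps out in my own notation (using ∇ᵤ log π(a|s) = ∇ᵤπ(a|s) / π(a|s)):

g(u) = Σₛ dᵇ(s) Σₐ ∇ᵤπ(a|s) Q^π(s, a)                                                      (1)
     = E_{s∼dᵇ} [ Σₐ ∇ᵤπ(a|s) Q^π(s, a) ]                                                  (2)
     = E_{s∼dᵇ} [ Σₐ b(a|s) · (π(a|s) / b(a|s)) · (∇ᵤπ(a|s) / π(a|s)) · Q^π(s, a) ]        (3)
     = E_{s∼dᵇ, a∼b} [ ρ(s, a) ∇ᵤ log π(a|s) Q^π(s, a) ],   where ρ(s, a) = π(a|s) / b(a|s)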

By introducing b(a|s), we can also get rid of the summation over actions in state s by sampling an action from the behavior distribution b. The paper simplifies the expectation with a new notation:

we introduce the new notation E_b[·] to denote the expectation implicitly conditional on all the random variables (indexed by time step) being drawn from their limiting stationary distribution under the behavior policy.

Now we have converted our approximation into an expectation that can be estimated from samples for the update equation. The paper then subtracts a baseline and approximates the Q function using off-policy λ-returns, defined as:

Taken from Off-Policy Actor-Critic (Degris et al. 2013)

We can subtract a baseline from the Q function because the baseline term contributes nothing to the gradient in expectation (it sums to zero over actions). For a more detailed explanation, check my post REINFORCE with baseline. Here we have the final expectation that approximates g(u); the approximation is no longer exact, since the λ-return is only an estimate of the Q function.

Taken from Off-Policy Actor-Critic (Degris et al. 2013)
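In my reading, this final expression is roughly of the form

g(u) ≈ E_b[ ρ(s, a) ∇ᵤ log π(a|s) (Rₜ^λ − V̂(sₜ)) ],

where Rₜ^λ is the off-policy λ-return and the learned value estimate V̂(sₜ) serves as the baseline.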

The expectation is then used in the forward-view update equation:

Taken from Off-Policy Actor-Critic (Degris et al. 2013)

With this equation, we would have to sample a whole trajectory to calculate the λ-return. Fortunately, we can convert this forward-view update into a backward-view update for the algorithm to work fully online. Similar to the backward-view TD(λ) algorithm, we can use eligibility traces to update the parameters u instead. For an explanation of these traces, you can check out my post on value function approximations. Here’s the expectation written with eligibility traces,

Taken from Off-Policy Actor-Critic (Degris et al. 2013)

where δ is the TD error: δₜ = Rₜ₊₁ + γV(sₜ₊₁) - V(sₜ) and the eligibility trace is updated with the following equation:

Taken from Off-Policy Actor-Critic (Degris et al. 2013)

Then, we can use this expectation for a backward-view update equation:

Taken from Off-Policy Actor-Critic (Degris et al. 2013)

Here’s the pseudocode of the Off-PAC algorithm from the paper:

Taken from Off-Policy Actor-Critic (Degris et al. 2013)

The vectors eᵥ and eᵤ are eligibility traces for the Critic and Actor respectively, and w is an auxiliary vector used by the GTD(λ) algorithm to update the Critic. The vectors v and u are the weight vectors for the Critic and Actor. Note that in the implementation we can simplify the Actor update by using ψ(s, a) = ∇ ln π(a | s). The hyperparameters used in this algorithm are listed below, with a rough sketch of one update step after the list:

  • αᵛ, αʷ : step sizes for the Critic (for the weights v and the GTD auxiliary vector w)
  • αᵘ : step size for the Actor
  • λ : trace decay rate
  • b(⋅|s) : a fixed (stationary) behavior policy
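To make the pseudocode concrete, here is a rough sketch of what one Off-PAC update step looks like with linear function approximation. This follows my reading of the algorithm: the softmax form of the target policy, and the exact placement of ρ, γ, and λ in the trace updates and the GTD correction term, are my own assumptions rather than a line-by-line transcription of Algorithm 1, and all the names are mine.

import torch

def offpac_step(u, v, w, e_u, e_v, x, a, r, x_next, done, b_probs,
                gamma=0.99, lam=0.4, alpha_u=1e-3, alpha_v=1e-3, alpha_w=1e-4):
    # u: (n_features, n_actions) Actor weights; v, w: (n_features,) Critic and
    # GTD(lambda) auxiliary vectors; e_u, e_v: eligibility traces shaped like u
    # and v; b_probs: behavior policy probabilities b(.|s). Plain tensors only,
    # since the gradient of log pi is computed analytically below.

    # Target policy as a softmax over the linear preferences x^T u (my choice).
    pi = torch.softmax(x @ u, dim=-1)
    rho = pi[a] / b_probs[a]                     # importance weight

    # TD error with the linear critic.
    v_next = torch.tensor(0.0) if done else x_next @ v
    delta = r + gamma * v_next - x @ v

    # Critic: GTD(lambda)-style trace and updates. Where rho, gamma, and lambda
    # enter here follows my reading and may differ in detail from the paper.
    e_v = rho * (x + gamma * lam * e_v)
    v = v + alpha_v * (delta * e_v - gamma * (1 - lam) * (w @ e_v) * x_next)
    w = w + alpha_w * (delta * e_v - (w @ x) * x)

    # Actor: psi(s, a) = grad_u log pi(a|s); for a linear-softmax policy this is
    # the outer product of x with (one_hot(a) - pi).
    onehot = torch.zeros_like(pi)
    onehot[a] = 1.0
    e_u = rho * (torch.outer(x, onehot - pi) + lam * e_u)
    u = u + alpha_u * delta * e_u

    return u, v, w, e_u, e_v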

Implementation

For the implementation of this algorithm, I decided to attempt to implement it exactly as it is in the paper. But first, I wanted to implement a vanilla off-policy policy gradient method without the λ-return machinery or the off-policy learning of the value function.

Off-Policy policy gradient implementation

Recall the expectation of the off-policy policy gradient theorem we derived:
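In the notation from earlier, this was (my shorthand):

g(u) ≈ E_b[ ρ(s, a) ∇ᵤ log π(a|s) Q^π(s, a) ],   with ρ(s, a) = π(a|s) / b(a|s).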

For a vanilla Off-Policy policy gradient, we are just going to use the discounted cumulative reward Gₜ as the Q value. Here’s the pseudocode I typed up:

As you can see, it is basically the same as REINFORCE, except that we sample the trajectory from a stationary behavior policy and multiply by an importance weight ρ in the update step.
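Here is a minimal sketch of that update, assuming a small softmax policy network and a uniform behavior policy over the action space; policy_net and the other names are placeholders rather than my exact code:

import torch

def update(policy_net, optimizer, trajectory, gamma=0.99, n_actions=2):
    # One REINFORCE-style update from a trajectory sampled under a uniform
    # behavior policy b, using importance weights rho = pi(a|s) / b(a|s).
    states, actions, rewards = zip(*trajectory)

    # Discounted returns G_t, computed backwards over the trajectory.
    returns, G = [], 0.0
    for r in reversed(rewards):
        G = r + gamma * G
        returns.insert(0, G)

    loss = 0.0
    for s, a, G in zip(states, actions, returns):
        probs = policy_net(torch.as_tensor(s, dtype=torch.float32))
        rho = probs[a].detach() / (1.0 / n_actions)   # b(a|s) is uniform
        # Gradient ascent on rho * G * log pi(a|s): minimize the negative.
        loss = loss - rho * G * torch.log(probs[a])

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()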

Here are the hyperparameters I used for this implementation:

  • α (learning rate): 0.01
  • γ (discount factor): 0.99
  • MAX_STEPS: 5000
  • NUM_EPISODES: 1000
  • behavior policy: uniform distribution over action space

For the experiments, I generated two trajectories in each episode: one from the behavior policy to train the target policy, and one from the target policy so we can evaluate the target policy at that episode. Although this roughly doubles the work per episode, the runtime of each episode is still 2n = O(n).

Here’s the training history of our agent over 1000 episodes with the CartPole environment:

The grey line is the reward history for our stationary behavior policy and the blue line is the reward history for our target policy. As expected, the grey line does not improve since we are sampling from a stationary behavior policy. The blue line shows the improvement of our agent over the 1000 episodes.

Our agent was able to achieve an average score of 142.37 over 50 playthroughs, averaged across 3 trials. This is much better than our vanilla on-policy policy gradient, which achieved a score of 79.6. The advantages of being off-policy show up here: since we sample trajectories from a stationary behavior policy, the agent maintains the same level of exploration throughout the trial, which helps it avoid getting stuck in suboptimal local maxima.

Off-Policy Actor-Critic implementation

Now, let’s try to implement the Off-PAC algorithm described in the paper. For my implementation, I decided to follow the paper’s setup exactly and used learned linear approximators for the Actor and the Critic. Here’s a quick walkthrough of some snippets of my code:

Here, we initialize the Actor and Critic weight vectors u and v, and the trace vectors eᵥ, eᵤ, and w. The shape of each of these is (input × output), where the input dimension is the size of the state space; the output dimension is the size of the action space for the Actor weights and 1 for the Critic weights. Then we have some helper functions:

Since the paper uses linear approximators for the Actor and Critic, instead of a forward pass through a neural network we simply multiply the feature vector by the weight matrix to get an output. The rest is pretty straightforward: we calculate the per-step quantities we need and update all the vectors:
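As a simplified sketch of what those helpers boil down to (my own stripped-down version, not the exact code, with x as the state feature vector):

import torch

def pi_probs(u, x):
    # Target policy pi(.|s): a softmax over the linear preferences x^T u.
    return torch.softmax(x @ u, dim=-1)

def value(v, x):
    # Linear state-value estimate V(s) = x^T v.
    return (x @ v).squeeze()

def step_quantities(u, v, x, a, r, x_next, b_probs, gamma=0.99):
    # The two per-step scalars the updates need: the importance weight rho
    # (detached so it is treated as a constant even if u tracks gradients)
    # and the TD error delta.
    rho = pi_probs(u, x)[a].detach() / b_probs[a]
    delta = r + gamma * value(v, x_next) - value(v, x)
    return rho, delta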

After trying to run the agent with the MountainCar environment, I did not get the results I was looking for. Here’s the reward history following the behavior policy:

As expected, the behavior policy performs essentially the same as a random policy, with a few lucky runs scoring above -5000. However, the agent’s average score over 50 playthroughs is still -5000. Even with the same hyperparameters as described in the paper, the policy did not improve, and with the CartPole environment the results were also around the level of a random policy. Upon investigating the learned weights, the problem seems to be exploding gradients; however, after reducing the learning rates, the agent gets stuck in a suboptimal policy.

Final Thoughts

Although I’m a bit disappointed that I couldn’t replicate the results in the paper, it was still interesting to learn how policy gradients and value function approximation work in an off-policy setting. For the implementation of the Off-PAC algorithm described in the paper, I decided to try to copy the algorithm exactly, and used matrices for my weights instead of using a neural network to parameterize the learned functions. I definitely got a bit better at using PyTorch tensors since I had to do all the calculations manually.

A problem I was stuck on for a while was needing to do in-place operations on the weight vector for the target policy. Firstly, I needed to enable the requires_grad flag on the weight vector, since I needed the gradient of the target policy’s action probability with respect to the weights. To be able to read off that gradient, the weight vector needs to be a leaf tensor, since PyTorch’s autograd traverses the DAG from non-leaf tensors back to leaf tensors when applying the chain rule. Note that in PyTorch, autograd keeps track of a DAG whose nodes are tensors and whose edges are operations, with each edge pointing to the tensor produced by the operation. A leaf tensor is a tensor that was not produced by a tracked operation, and in-place operations count as operations here too.

The solution was to use detach() on the updated weight tensor so that the result is detached from the DAG. Then, I re-enable the requires_grad flag, since I’m technically not doing an in-place operation but creating a new leaf tensor.
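Concretely, the pattern looks something like this simplified sketch (not my exact code; the numbers are placeholders):

import torch

u = torch.zeros(4, 2, requires_grad=True)        # actor weights, a leaf tensor

# Suppose loss.backward() has already filled u.grad, and delta / e_u are known.
delta, e_u, alpha_u = 0.5, torch.ones(4, 2), 1e-3

# An in-place update like `u += alpha_u * delta * e_u` on a leaf that requires
# grad is rejected by autograd (outside of a torch.no_grad() block), so build
# the updated tensor out of place...
updated = u + alpha_u * delta * e_u

# ...then detach it from the graph and mark the result, which is a fresh leaf,
# as requiring grad again.
u = updated.detach()
u.requires_grad_(True)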

Code:

References:

  • Degris, T., White, M., & Sutton, R. S. (2012). Off-Policy Actor-Critic. Proceedings of the 29th International Conference on Machine Learning (ICML 2012).
  • Maei, H. R. (2011). Gradient Temporal-Difference Learning Algorithms. PhD thesis, University of Alberta.

