Actor-Critic: Implementing Actor-Critic Methods

Cheng Xi Tsou · Published in Geek Culture · Aug 3, 2021

In this post, I’ll be implementing some Actor-Critic methods using the policy gradient methods and value function approximations from my previous posts. I won’t focus too much on the theory and details, but rather on combining the previously covered ideas of the Actor and the Critic.

Actor-Critic Algorithms

First, let’s look at the One-step Actor-Critic algorithm given by Sutton&Barto:

Taken from Sutton&Barto 2017

We’ve learned how to update the Critic (the value function), which “critiques” the current state, and the Actor (the policy function), which is updated in the direction the Critic points us. The One-step Actor-Critic algorithm here is fully online, and the Critic uses the TD(0) algorithm to update the value function’s parameters w. Recall the TD(0) update equation:

Taken from David Silver’s UCL Lecture 7

Here, we use the temporal difference between a target value and the approximated value to update the value function in the direction of the gradient. Since the TD(0) algorithm updates at each time step, the target value is estimated by bootstrapping from the next state’s value.
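Written out in the post’s notation (with β as the Critic’s step size), the TD(0) update is:

δ = R + γ·v̂(S′, w) − v̂(S, w)
w ← w + β·δ·∇w v̂(S, w)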

Now let’s look at the Actor. The Actor updates the policy parameters θ using the evaluation from the Critic. Recall the policy update equation we derived from the policy gradient theorem:

Taken from David Silver’s UCL Lecture 7

This policy update equation is used in the REINFORCE algorithm, which updates only after sampling the whole trajectory. Since Sutton&Barto’s One-step Actor-Critic uses online updates, we update the policy parameters in the direction of the state-action value approximated by the Critic instead of using the cumulative return Gₜ. If we scaled the update only by this approximated value, the updates would have high variance (and the approximation itself introduces some bias). We can subtract a baseline to reduce the variance without changing the expected gradient. You can read more about my implementation of REINFORCE with baseline here.
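To make the contrast concrete (α is the Actor’s step size and q_w is the approximated state-action value):

REINFORCE:              θ ← θ + α·Gₜ·∇θ ln π(Aₜ|Sₜ, θ)
One-step Actor-Critic:  θ ← θ + α·q_w(S, A)·∇θ ln π(A|S, θ)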

In REINFORCE with baseline, we used the value function as the baseline, so naturally we can do the same here. Instead of updating the policy in the direction of the approximated state-action value, we scale it by the advantage of choosing a specific action in this state. We can calculate this advantage as the difference between the state-action value and the state value.

Taken from David Silver’s UCL Lecture 7

A quick review: Q(s, a) is the expected return from state s after taking action a, while V(s) is the expected return from state s itself.

The naive way to do this is to learn a state-action value function Q and a state value function V with separate parameters. Doing this would make our algorithm considerably more complicated, as we’d have to keep track of the Q, V, and policy parameters. Sutton&Barto use a simpler approach here: we can use R + γV(s’) as the estimated state-action value when computing the advantage, which makes the advantage the same as the TD error!
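In symbols, the advantage estimated with the Critic’s value function collapses to the one-step TD error:

A(S, A) = Q(S, A) − V(S) ≈ R + γ·v̂(S′, w) − v̂(S, w) = δ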

If you look at different implementations of the Actor-Critic algorithm, you may notice some differences. For example, the official Keras and Pytorch examples use a Monte Carlo method to update the Actor and Critic. While Sutton&Barto don’t consider the Monte Carlo approach a true Actor-Critic method, since the Critic is then only used as a baseline to reduce variance, those approaches are still effective.

A question you may have now is, can the Critic use different value approximation methods? The answer is yes. Recall the different approximation methods we have discussed:

Monte Carlo:

Taken from David Silver’s UCL Lecture 7

TD(0), the one used by Sutton&Barto:

Taken from David Silver’s UCL Lecture 7

Forward-view TD(λ):

Taken from David Silver’s UCL Lecture 7

Backward-view TD(λ):

Taken from David Silver’s UCL Lecture 7

We can even do the same for the Actor’s advantage estimate. In the update equations, simply substitute the Critic’s gradient ∇v̂(S, w) with the Actor’s score function ∇ln π(A|S, θ).
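Concretely, for the Backward-view TD(λ) case this gives one eligibility trace per set of parameters (δ is the one-step TD error and I = γᵗ):

z_w ← γ·λ_w·z_w + ∇w v̂(S, w),        w ← w + β·δ·z_w
z_θ ← γ·λ_θ·z_θ + I·∇θ ln π(A|S, θ),  θ ← θ + α·δ·z_θ

This is exactly what the pseudocode below implements.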

Here’s the pseudocode for an online Actor-Critic using Backward-view TD(λ) for the Actor and the Critic by Sutton&Barto:

Taken from Sutton&Barto 2017

We can also implement a Forward-view TD(λ) for Actor and Critic, but similar to a Monte Carlo method, we would have to sample the full trajectory and perform offline updates episodically.

Implementations

For the rest of the post, I’ll be showing my Pytorch implementations of the Actor-Critic method and talking about some challenges along the way. I’ve implemented Actor-Critic using TD(0), Backward-view TD(λ), and Forward-view TD(λ). For the Monte Carlo implementation, refer to my post REINFORCE with Baseline. Here are the neural networks I used to parameterize my Policy and Value functions:

import torch.nn as nn
import torch.nn.functional as F

class PolicyNetwork(nn.Module):

    #Takes in observations and outputs a probability distribution over actions
    def __init__(self, observation_space, action_space):
        super(PolicyNetwork, self).__init__()
        self.input_layer = nn.Linear(observation_space, 128)
        self.output_layer = nn.Linear(128, action_space)

    #forward pass
    def forward(self, x):
        #input states
        x = self.input_layer(x)

        #relu activation
        x = F.relu(x)

        #action logits
        actions = self.output_layer(x)

        #softmax to get a probability distribution over actions
        action_probs = F.softmax(actions, dim=1)

        return action_probs

State Value Network:

class StateValueNetwork(nn.Module):

    #Takes in a state and outputs its estimated value
    def __init__(self, observation_space):
        super(StateValueNetwork, self).__init__()

        self.input_layer = nn.Linear(observation_space, 128)
        self.output_layer = nn.Linear(128, 1)

    #forward pass
    def forward(self, x):
        #input layer
        x = self.input_layer(x)

        #relu activation
        x = F.relu(x)

        #get state value
        state_value = self.output_layer(x)

        return state_value
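For reference, here’s a minimal sketch of how an action might be sampled from the policy network during training; the helper name is my own, and I assume the state arrives from the environment as a NumPy array:

import torch
from torch.distributions import Categorical

def select_action(policy_net, state):
    #convert the state to a float tensor and add a batch dimension
    state_tensor = torch.from_numpy(state).float().unsqueeze(0)

    #forward pass to get action probabilities
    action_probs = policy_net(state_tensor)

    #sample an action and keep its log-probability for the policy update
    dist = Categorical(action_probs)
    action = dist.sample()
    return action.item(), dist.log_prob(action)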

Here are the hyperparameters I used for my implementations:

  • γ (discount factor): 0.99
  • NUM_EPISODES: 1000
  • MAX_STEPS: 10000
  • α (Actor LR): 0.01
  • β (Critic LR): 0.01
  • λ (Actor): 0.8
  • λ (Critic): 0.8

For each of the implementations, I’ll be using OpenAI Gym’s solved condition for CartPole: an average score of 195 over the most recent 100 episodes.
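For concreteness, here’s how those hyperparameters might look as constants, along with a sketch of the solved-condition check (the constant and helper names are my own, not the post’s exact code):

from collections import deque

GAMMA = 0.99            #discount factor γ
NUM_EPISODES = 1000
MAX_STEPS = 10000
ACTOR_LR = 0.01         #α
CRITIC_LR = 0.01        #β
ACTOR_LAMBDA = 0.8      #λ for the Actor
CRITIC_LAMBDA = 0.8     #λ for the Critic
SOLVED_SCORE = 195      #average over the most recent 100 episodes

#scores of the most recent 100 episodes
recent_scores = deque(maxlen=100)

def is_solved(scores):
    return len(scores) == 100 and sum(scores) / 100 >= SOLVED_SCORE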

TD(0) implementation

Taken from Sutton&Barto 2017

For the TD(0) AC, I followed the pseudocode given by Sutton&Barto above. In my implementation, I used MSE as the loss function for the Critic; with the bootstrapped target treated as a constant, minimizing it gives the same gradient direction as the update in the pseudocode (it’s what you get by applying the chain rule once). I also used Stochastic Gradient Descent (SGD) as my optimizer, as described in the pseudocode, so that I could get a fair comparison between the different implementations.
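As a rough sketch of what this per-step update might look like in Pytorch (the helper name and arguments are my own illustration, not the exact code from my repo; log_prob is the log-probability of the sampled action, like the one returned by the select_action sketch earlier, and I is the γᵗ factor from the pseudocode):

import torch

def td0_update(value_net, policy_optimizer, value_optimizer,
               state_tensor, next_state_tensor, log_prob, reward, done, I, gamma=0.99):
    #v̂(S, w)
    value = value_net(state_tensor)

    #bootstrapped target R + γ·v̂(S′, w), treated as a constant
    with torch.no_grad():
        next_value = torch.zeros_like(value) if done else value_net(next_state_tensor)
        target = reward + gamma * next_value

    #TD error δ
    delta = (target - value).item()

    #Critic: MSE between target and prediction, stepped with SGD
    value_loss = (target - value).pow(2).mean()
    value_optimizer.zero_grad()
    value_loss.backward()
    value_optimizer.step()

    #Actor: ascend I·δ·∇θ ln π(A|S, θ), as in the pseudocode
    policy_loss = -I * delta * log_prob
    policy_optimizer.zero_grad()
    policy_loss.backward()
    policy_optimizer.step()

    #I ← γ·I for the next time step
    return gamma * I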

Let’s look at the training history of my result:

Our agent was able to achieve an average score of 61.828 in 50 playthroughs over 5 trials. In the training history, we can see that there is a steady upward trend in score despite the high variance. Compared to the human benchmark (me) score of 30.8 and random policy score of 21.48, a score of 61.8 is definitely much better.

An explanation for the high variance could be the raw returns. For REINFORCE with baseline, I was able to whiten (normalize) the rewards to reduce variance; that was possible because the algorithm samples the whole trajectory before applying updates. With fully online updates, there is no trajectory to normalize over. Another reason for the variance is the sudden drops in score. This is caused by a phenomenon called catastrophic forgetting, in which the agent “forgets” past states and needs to relearn those patterns again.

Here’s the training result when using the Adam optimizer:

With the Adam optimizer, the agent was able to achieve an average score of 500 in 50 playthroughs. The agent also hit our solved condition in just 443 episodes of training.

Forward-view TD(λ) implementation

Taken from Sutton&Barto 2017

For the Forward-view TD(λ) AC, there wasn’t any pseudocode, but the concept is similar to the REINFORCE with baseline algorithm described by Sutton&Barto above. Instead of calculating the return Gₜ from step t, we calculate the lambda-return Gₜ^λ, which averages the n-step returns for all n with geometrically decaying weights. I used Stochastic Gradient Descent (SGD) as my optimizer and MSE as my loss function to learn the value function. I also whitened the rewards to combat exploding gradients and reduce variance.
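To make the lambda-return concrete, here’s a minimal sketch of computing Gₜ^λ for every step of one finished episode, using the recursive identity Gₜ^λ = Rₜ₊₁ + γ·((1 − λ)·V(Sₜ₊₁) + λ·Gₜ₊₁^λ). The function and argument names are my own, and this is not my exact implementation:

def lambda_returns(rewards, state_values, gamma=0.99, lam=0.8):
    #rewards[t] is R_{t+1}; state_values[t] is the Critic's estimate of V(S_t)
    T = len(rewards)
    returns = [0.0] * T
    g = 0.0    #lambda-return of the step after the episode ends (terminal, so 0)
    for t in reversed(range(T)):
        next_value = state_values[t + 1] if t + 1 < T else 0.0
        g = rewards[t] + gamma * ((1 - lam) * next_value + lam * g)
        returns[t] = g
    return returns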

Let’s look at the training history of my results:

Our agent was able to achieve an average score of 146.27 in 50 playthroughs over 3 trials. I only ran 3 trials since the algorithm has a longer run time, and it was enough to ensure the result wasn’t an outlier. Compared to TD(0)’s 61.83, our agent achieved a better score of 146.27. We can also see that there isn’t as much variance, thanks to the more robust targets that combine n-step returns across all n. There is also an instance of catastrophic forgetting around episode ~160, after the agent achieves a high score of 500.

When I was calculating the n-step returns, my solution’s runtime was O(n³), which I probably could’ve reduced to O(n²), so it resulted in a much longer runtime of around 16 minutes compared to TD(0)’s 2 minutes. This is a downside of Forward-view TD(λ): it needs more computation to calculate the n-step returns for all n at each time step, and more memory to store the trajectories. This is also a reason why most algorithms use Backward-view TD(λ) instead of the forward view, since eligibility traces avoid this issue.

Here’s the training result when using the Adam optimizer:

With the Adam optimizer, the agent was able to achieve an average score of 500 in 50 playthroughs. The agent also hit our solved condition in just 262 episodes of training.

Backward-view TD(λ) implementation

For the Backward-view TD(λ) AC, I followed the pseudocode given by Sutton&Barto above. In my implementation, I didn’t use an optimizer and instead performed manual updates to the network parameters so that I could apply the eligibility traces. This is also why I used SGD for the other implementations: to get a fairer comparison of results.
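Here’s a minimal sketch of what one such manual update step might look like; the helper name, the trace tensors, and the handling of the I = γᵗ factor are my own illustration rather than the exact code from my repo. It assumes actor_traces and critic_traces are lists of zero tensors shaped like the corresponding parameters, and that action_log_prob is ln π(A|S, θ) for the sampled action:

import torch

def ac_lambda_step(policy_net, value_net, actor_traces, critic_traces,
                   state_tensor, action_log_prob, reward, next_state_tensor, done, I,
                   gamma=0.99, lam_theta=0.8, lam_w=0.8, alpha=0.01, beta=0.01):
    #v̂(S, w), with gradient tracking
    value = value_net(state_tensor)

    #TD error δ as a plain float (the target is treated as a constant)
    with torch.no_grad():
        next_value = 0.0 if done else value_net(next_state_tensor).item()
    delta = reward + gamma * next_value - value.item()

    #∇w v̂(S, w) and ∇θ ln π(A|S, θ)
    critic_params = list(value_net.parameters())
    actor_params = list(policy_net.parameters())
    critic_grads = torch.autograd.grad(value.sum(), critic_params)
    actor_grads = torch.autograd.grad(action_log_prob.sum(), actor_params)

    #manual parameter updates with eligibility traces (no optimizer)
    with torch.no_grad():
        for z, g, w in zip(critic_traces, critic_grads, critic_params):
            z.mul_(gamma * lam_w).add_(g)            #z_w ← γ·λ_w·z_w + ∇v̂(S, w)
            w.add_(beta * delta * z)                 #w ← w + β·δ·z_w
        for z, g, p in zip(actor_traces, actor_grads, actor_params):
            z.mul_(gamma * lam_theta).add_(I * g)    #z_θ ← γ·λ_θ·z_θ + I·∇ln π(A|S, θ)
            p.add_(alpha * delta * z)                #θ ← θ + α·δ·z_θ

    #I ← γ·I for the next time step
    return gamma * I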

Let’s look at the training history of my results:

Our agent was able to achieve an average score of 334.116 in 50 playthroughs over 5 trials. The agent also hit the solved condition in an average of 266.2 episodes. Compared to TD(0), our agent achieved a much better score. Theoretically, Forward-view and Backward-view TD(λ) are equivalent, but as always in reinforcement learning, exact results are hard to replicate due to high variance. The training time of Backward-view TD(λ) is around the same as TD(0), as both do online updates. Compared to Forward-view TD(λ), which has high time and space complexity, Backward-view TD(λ) accumulates an eligibility trace at each step, which lets each update assign credit to past states. The neat thing is that the traces only need O(|θ| + |w|) extra memory, the same size as the Actor and Critic parameters, and each update only needs the one-step TD error, so the cost per episode is O(n).

As we have to perform updates manually, the Adam optimizer would not work in this case.

Summary

Here’s a summary of results including Monte Carlo AC from REINFORCE with Baseline:

Note that, as I only ran 1 trial for the Adam optimizer runs, the number of episodes it took to solve may vary. An interesting thing to note is that among the solved implementations, the ones using the SGD optimizer achieve a score of ~300 while the ones using the Adam optimizer achieve a score of ~500. This may be because the Adam optimizer adds momentum and adaptive step sizes, which let the agent climb to a local maximum faster than plain SGD. Another thing to note is that the choice of the CartPole environment has a huge impact on the relative effectiveness of each algorithm. CartPole only gives a uniform reward of +1 for every timestep in a non-terminal state, so the TD(λ) algorithms aren’t particularly better here.

Final Thoughts

The three implementations took much longer than expected, largely due to me misunderstanding the pseudocode and the Pytorch library. While TD(0) and Forward-view TD(λ) were fairly straightforward, I was stuck on Backward-view TD(λ) for a while. I was using the prebuilt optimizers and was unsure where to apply my eligibility traces. After watching this video that gives a short overview of the different optimization algorithms, I realized that I should just perform the updates manually.

Another thing I learned was to set the requires_grad attribute to True when creating a tensor that I need to backpropagate through, which tells Pytorch to track operations on that tensor. When calling backward() multiple times through graphs that share intermediate tensors, I needed to call backward(retain_graph=True) on the earlier passes so that the computational graph isn’t freed from memory after the first backward() call.
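As a small, self-contained illustration of both points (my own toy example, not code from the project):

import torch

x = torch.tensor([1.0, 2.0], requires_grad=True)   #track operations on x
h = x * x                                           #shared intermediate result
y = h.sum()
z = (h * 2).sum()

y.backward(retain_graph=True)   #without retain_graph=True, the shared graph is freed here
z.backward()                    #second backward pass through the same graph
print(x.grad)                   #tensor([6., 12.]): 2x from y plus 4x from z accumulate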

Overall, this was a challenging but rewarding task from which I learned a lot, and I feel more confident proceeding to harder algorithms. Since there weren’t a lot of resources on eligibility traces, I ended up being stuck on them for a while. In future posts, I want to address Actor-Critic algorithms for continuous action spaces, Natural Actor-Critic, Asynchronous Advantage Actor-Critic (A3C), Deterministic Policy Gradients, and Proximal Policy Optimization.
