Advantage Actor Critic (A2C) implementation

Alvaro Durán Tovar
Deep Learning made easy
3 min read · Dec 30, 2019

The internet is full of very good resources for learning about reinforcement learning algorithms, and of course advantage actor critic is no exception. Here, here and here you have some good articles.

On the other hand, I implemented this algorithm recently and couldn’t find any good resources explaining how to implement it: plenty of theory and some code, yes, but not a good explanation in my opinion.

In this article I want to cover the following topics:

  • What is the advantage and how to calculate it in A2C
  • Critic loss
  • Actor loss

Brief summary of A2C

A2C is a policy gradient algorithm and part of the on-policy family. That means we learn the value function for one policy while following it; in other words, we can’t learn the value function by following another policy. We would be following another policy if, for example, we used experience replay, because by learning from old data we would be using information generated by a policy (i.e. the network) slightly different from the current one.

Why does it matter to know that it’s on-policy?

Because that tells us how we can update the network. We can still collect experiences, but we have to process them right away. The process looks like this:

  1. Interact with the environment and collect state transitions.
  2. After n-steps, or end of the episode, calculate updates and apply them.
  3. Throw away the data.

This works because the data we are going to use was generated following the same policy we are updating.
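
As a rough sketch of that loop (assuming hypothetical `env`, `select_action` and `update_model` helpers with a Gym-style interface; this is schematic, not a complete program):

```python
# Schematic n-step rollout loop for an on-policy algorithm like A2C.
# `env`, `select_action` and `update_model` are hypothetical placeholders.
N_STEPS = 5

state = env.reset()
while True:  # training loop
    transitions = []
    for _ in range(N_STEPS):
        action = select_action(state)                    # sample from the current policy
        next_state, reward, done, info = env.step(action)
        transitions.append((state, action, reward, next_state, done))
        state = env.reset() if done else next_state
        if done:
            break

    update_model(transitions)   # 2. calculate updates and apply them
    transitions.clear()         # 3. throw away the data
```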

What is the advantage and how to calculate it for A2C

This is the main topic of this post. I struggled to understand this concept, but it is actually damn simple!! You probably already know what the TD error is, so here are some tips to understand it:

  • The TD error is a value. If we are talking about the TD error of a state value function (it could be an action value function instead, as in Q-learning, SARSA, etc.), it is defined as: δ = r + γ·V(s′) − V(s)
  • The advantage function is a function, not a value. We can approximate the advantage function with the TD error, but they aren’t the same thing; the advantage function is defined as: A(s, a) = Q(s, a) − V(s)

The trick is that we can replace Q (the action value) with its one-step estimate, r + γ·V(s′). This allows us to use only one network that predicts state values, not action values; otherwise we would need two networks to calculate the advantage (one for the action value and another for the state value):

A(s, a) = Q(s, a) − V(s) ≈ r + γ·V(s′) − V(s) = δ

Did you notice it? The advantage function and the TD error look the same, right? How does this look in code?
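
A minimal sketch in PyTorch (assuming a hypothetical `value_net` that maps a state tensor to a scalar V(s), a discount factor `gamma`, and `done` equal to 1.0 for terminal transitions):

```python
import torch

# V(s) for the current state; gradients will flow through this term.
v_s = value_net(state)

# V(s') for the next state; detached so we don't backprop through the bootstrap.
with torch.no_grad():
    v_next = value_net(next_state)

# TD target: r + gamma * V(s'), with V(s') zeroed out for terminal states.
td_target = reward + gamma * v_next * (1 - done)

# The TD error doubles as our estimate of the advantage A(s, a).
advantage = td_target - v_s
```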

Critic loss

The critic basically applies MSE between the TD target and the current state value. Since we already have that difference in the advantage, the critic loss reduces to the squared advantage:
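
Continuing the sketch above (the TD target was already detached, so only V(s) receives gradients here):

```python
# MSE between the TD target and V(s): since advantage = td_target - V(s),
# this is simply the squared advantage, averaged over the batch.
critic_loss = advantage.pow(2).mean()
```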

Actor loss

For the actor (in the case of categorical actions) we sample from a categorical distribution, then calculate the loss as we usually do for probabilities, that is, using the negative log likelihood scaled by the advantage:
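
Again continuing the same sketch, with a hypothetical `policy_net` that outputs logits over the discrete actions:

```python
# Build a categorical distribution from the policy logits and sample an action.
dist = torch.distributions.Categorical(logits=policy_net(state))
action = dist.sample()

# Negative log likelihood of the taken action, scaled by the advantage.
# detach() keeps the actor loss from pushing gradients into the critic.
actor_loss = -(dist.log_prob(action) * advantage.detach()).mean()
```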

Here you can find the full implementation for 1-step and n-step A2C:
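
As a rough, hypothetical sketch (not the linked implementation) of how the pieces above combine into a single 1-step update:

```python
import torch

def a2c_update(policy_net, value_net, optimizer,
               state, action, reward, next_state, done, gamma=0.99):
    """One-step A2C update: squared TD error for the critic plus
    advantage-weighted negative log likelihood for the actor (sketch)."""
    # TD target with a detached bootstrap value; zero for terminal states.
    with torch.no_grad():
        td_target = reward + gamma * value_net(next_state) * (1 - done)

    v_s = value_net(state)
    advantage = td_target - v_s

    # Critic: squared TD error.
    critic_loss = advantage.pow(2).mean()

    # Actor: NLL of the taken action, scaled by the (detached) advantage.
    dist = torch.distributions.Categorical(logits=policy_net(state))
    actor_loss = -(dist.log_prob(action) * advantage.detach()).mean()

    loss = actor_loss + critic_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```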
