Continuous control with A2C and Gaussian Policies: MuJoCo, PyTorch and C++

Vittorio la Barbera
Apr 16, 2020



I’ll take for granted that the reader has some knowledge of deep learning and reinforcement learning, so that I don’t have to explain what a reward function or a policy is. If any of the following material is confusing or unclear, please let me know! What follows are some step-by-step considerations on a little project I’m working on. The project is all in C++ and you can find the code here:

I’m trying to keep the code as simple as possible so that anyone can take it and modify it as they please without going crazy navigating it. Why C++ and not Python? Well, it’s faster and I think it’s a nice skill to polish :) Plus, I can access MuJoCo directly without any wrapping. NOTE: this is still a work in progress and any feedback on the code/algorithm would be very much appreciated!


The goal of this project is to have an underpowered pendulum learn how to swing up and stay upright. I think this is an interesting problem because the setup is fairly easy and it forces us to deal with continuous state and action spaces. There are many ways to deal with continuous action spaces. I’ve decided to tackle this problem with policy gradient algorithms because they are well suited to this task, and instead of using DDPG (Deep Deterministic Policy Gradient) I’m going to use A2C (Advantage Actor Critic) with Gaussian policies.

Gaussian Policies

Gaussian policies are a specific case of stochastic policies. In policy gradient methods we have a parametrised policy:

$$\pi_\theta(a \mid s) = P(A_t = a \mid S_t = s;\, \theta)$$

It can be any function approximator with parameters theta; in my project we’re going to use simple feed-forward neural nets. With a continuous state space and a discrete action space, the number of possible actions determines the number of neurons in the last layer, and what is learned is a distribution over actions, produced by a softmax in the last layer. In a continuous action space we have infinitely many possible actions, so we need something a bit more involved. What we’re going to use is a Gaussian policy.

The idea is to get a mean and standard deviation and then sample a Gaussian to get our action.

A gaussian looks like this :)

How do we implement this? Well, to my knowledge there are three ways:

1. Use two networks, one that outputs a mean and another that outputs a standard deviation. Feeding a state into both gives us the two values, and we can then sample the action from a Gaussian distribution with that mean and standard deviation. One important note is that the standard deviation has to be positive, so we need to apply a softplus function to it (many neural net frameworks already implement it for us).

2. Use a single network with two neurons at the end, one for the mean and the other for the standard deviation. The same softplus function is required on the output of the second neuron.

3. The third way is something in between the first two, which I’ve seen implemented here: . It has a common first layer and then splits the network into two heads, one outputting the mean and the other the standard deviation.

A simple diagram showing a single neural network producing both mean and standard deviation; you can imagine how the others would be represented.

A common problem with all three approaches is that the action we want is usually constrained to a given interval (in our case [-2, 2]), so we need to clamp what we sample from the distribution.
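To make the softplus-and-clamp idea concrete, here is a minimal sketch in plain C++ (standard library only, no PyTorch). It is not the project’s code: `sample_action`, `raw_mean`, `raw_std` and the small `kMinStd` floor are hypothetical names and choices used only for illustration, and the two raw values stand in for whatever the network outputs.

```cpp
#include <algorithm>
#include <cmath>
#include <random>

// Softplus, log(1 + e^x), maps any real number to a positive one.
// Written in a form that does not overflow for large |x|.
double softplus(double x) {
    return std::max(x, 0.0) + std::log1p(std::exp(-std::abs(x)));
}

// Sample an action from a Gaussian and clamp it to the allowed range.
// `raw_mean` and `raw_std` are the two raw outputs of the policy
// (hypothetical names); [-2, 2] is the action interval from the text.
double sample_action(double raw_mean, double raw_std, std::mt19937& rng,
                     double low = -2.0, double high = 2.0) {
    // Assumption: a tiny floor keeps the standard deviation strictly
    // positive even if softplus underflows to zero.
    const double kMinStd = 1e-6;
    const double std_dev = softplus(raw_std) + kMinStd;

    std::normal_distribution<double> gaussian(raw_mean, std_dev);
    const double action = gaussian(rng);

    // Clamp the sample to [low, high].
    return std::min(std::max(action, low), high);
}
```

Calling `sample_action(mean, raw_std, rng)` with a seeded `std::mt19937` gives reproducible actions, which is handy when debugging. One thing to keep in mind is that clamping changes the distribution of the action that is actually applied, so the log-probability of the unclamped sample is only an approximation near the bounds.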

Now let’s talk about how we train our network. As we all know, to train a neural network we need a loss function, and that’s where the REINFORCE trick comes in handy:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[ G_t \, \nabla_\theta \log \pi_\theta(A_t \mid S_t) \right]$$

where J(theta) is our objective function with respect to the parameters theta and G_t is the total return, i.e. the sum of all the rewards collected from time t onwards. The expectation can be approximated with Monte Carlo by running several episodes.

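OK, nice, but what’s that weird logarithm? Well, since we’re using a Gaussian, it is equal to:

$$\log \pi_\theta(A_t \mid S_t) = -\frac{\left(A_t - \mu_\theta(S_t)\right)^2}{2\,\sigma_\theta(S_t)^2} - \log \sigma_\theta(S_t) - \frac{1}{2}\log 2\pi$$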
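For reference, the log-density of a Gaussian can be evaluated directly. This is a minimal standalone sketch (the name `gaussian_log_prob` is hypothetical); in a PyTorch implementation the same expression would be built from tensors so that autograd can differentiate it.

```cpp
#include <cmath>

// Log-density of N(mean, std_dev^2) evaluated at `action`:
//   -(action - mean)^2 / (2 * std_dev^2) - log(std_dev) - 0.5 * log(2 * pi)
double gaussian_log_prob(double action, double mean, double std_dev) {
    const double kHalfLog2Pi = 0.9189385332046727;  // 0.5 * log(2 * pi)
    const double z = (action - mean) / std_dev;
    return -0.5 * z * z - std::log(std_dev) - kHalfLog2Pi;
}
```

Note that the value gets larger as the action gets closer to the mean, and that a small standard deviation makes the log-probability very sensitive to the distance from the mean.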

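We could compute the gradient of this logarithm by hand if we were using a linear function approximator. For instance, assuming a mean $\mu = \theta_\mu^\top x(S_t)$ and a standard deviation $\sigma = \exp\left(\theta_\sigma^\top x(S_t)\right)$, where $x(S_t)$ is a feature vector, we get:

$$\nabla_{\theta_\mu} \log \pi_\theta(A_t \mid S_t) = \frac{A_t - \mu}{\sigma^2}\, x(S_t), \qquad \nabla_{\theta_\sigma} \log \pi_\theta(A_t \mid S_t) = \left( \frac{(A_t - \mu)^2}{\sigma^2} - 1 \right) x(S_t)$$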

I used different subscripts for the parameters theta in case we’re using two different networks to estimate mean and standard deviation.

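I need one more thing before showing my implementation, and that is the Advantage Actor Critic method. This method works with value functions V(S_t) and state-action value functions Q(S_t, A_t), which tell us, respectively, how good the state we’re currently in is and how good the action taken from that state is. Note that the value function is a neural network as well, which we train using a TD error, as you can see in the screenshot of the code below. In policy gradients we usually subtract this value function as a baseline when computing the loss. However, that way it still doesn’t play an active role in learning, so:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[ \left( Q(S_t, A_t) - V(S_t) \right) \nabla_\theta \log \pi_\theta(A_t \mid S_t) \right]$$

or more compactly:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[ A(S_t, A_t) \, \nabla_\theta \log \pi_\theta(A_t \mid S_t) \right]$$

where the advantage $A(S_t, A_t)$ can be estimated with the TD error $\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$, with $\gamma$ the discount factor.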

Now we have all the ingredients to cook our code:

This is how we sample an action from the Gaussian distribution.

This is how we learn after selecting an action and stepping in the environment.

The Environment

Now, since I’m coding everything in C++, I need to create my own environment, and to do so I’ll use the MuJoCo physics engine.

The only thing I’d like to discuss here is the reward function and some weird stuff I do in the step function when applying an action in the environment.

The reward function I chose is this one:

Where theta in this case is the angular position of the pendulum, theta dot is the angular velocity and A_t is the action selected. Theta is bounded in [-pi, pi], so the lowest value the position term can take is roughly -9 and the highest is 0 (because the desired angle is in the middle, at zero), which is great because we want to penalise the pendulum as much as possible when it’s far from the desired position. We also want to penalise the velocity, otherwise the pendulum would just swing very fast indefinitely and collect a high reward each time it passes through angle 0. For the same reason we want to penalise the action.

Now, if you look at the code, I do some weird stuff to the angle, and that’s because of how MuJoCo handles angular positions. First of all, the angle is unbounded, so I need to take it modulo 2*pi so that the range becomes [0, 2pi] or [-2pi, 0], depending on whether it’s spinning clockwise or counter-clockwise. The rest of the code just remaps the angle to [-pi, pi].
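The remapping described above can be sketched as follows. This is one possible way to do it, not necessarily the one used in the project’s step function; `wrap_angle` is a hypothetical name and the input is assumed to be the raw, unbounded joint angle in radians.

```cpp
#include <cmath>

// Map an unbounded angle (radians) to the interval [-pi, pi].
double wrap_angle(double angle) {
    const double kPi = 3.14159265358979323846;
    const double kTwoPi = 2.0 * kPi;

    // std::fmod keeps the sign of `angle`, so the result lies in
    // (-2*pi, 2*pi): non-negative for one spin direction, non-positive
    // for the other.
    double wrapped = std::fmod(angle, kTwoPi);

    // Shift the outer halves back into [-pi, pi].
    if (wrapped > kPi) {
        wrapped -= kTwoPi;
    } else if (wrapped < -kPi) {
        wrapped += kTwoPi;
    }
    return wrapped;
}
```

For example, an angle of 3*pi/2 comes out as -pi/2, which is the same physical position expressed on the other side of zero.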

the code for the step function


What I’ve learned so far is that the RL framework is more delicate than I thought. Maybe more advanced methods can produce better and more robust results.


I would like to thank Kushashwa Ravi Shrimali and Isaac Poulton for their opinions on the code/algorithm.