Vanilla Policy Gradient from Scratch
Build one of the simplest reinforcement learning algorithms, with PyTorch
Ever wondered how reinforcement learning (RL) works?
In this article we’ll build one of the simplest forms of RL from scratch: a vanilla policy gradient (VPG) algorithm. We’ll then train it to complete the famous CartPole challenge, learning to move a cart left and right to balance a pole. In doing this, we’ll also be completing the first challenge from OpenAI’s Spinning Up learning resource.
The code for this article can be found at https://github.com/alan-cooney/cartpole-algorithms/blob/main/src/vanilla_policy_gradient.py
Our approach
We’ll tackle this problem by creating a simple deep learning model that takes in observations and outputs a stochastic policy (i.e. the probability of taking each possible action).
Then, all we need to do is collect experience by acting in the environment, using this policy.
After we have enough experience for a batch (a collection of a few episodes of experience), we’ll use gradient descent to improve the model. At a high level, we want to increase the expected return of the policy, which means adjusting the weights and biases to increase the probability of high expected-return actions. In the case of VPG, this means using the policy gradient theorem, which gives an equation for the gradient of this expected return (shown below, when we calculate the loss).
And that’s really all there is to it, so let’s start coding!
Creating the model
We’ll start by creating a pretty simple model with one hidden layer. The first linear layer takes the input features from CartPole’s observation space, and the last layer returns a value (logit) for each of the two possible actions.
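A minimal sketch of such a model is below. The hidden layer size of 32 is my own choice here, not necessarily what the full code in the repository uses.

```python
from torch import nn

# CartPole's observation space has 4 features and its action space has
# 2 discrete actions (push the cart left or push it right).
# The hidden layer size of 32 is an assumption.
model = nn.Sequential(
    nn.Linear(in_features=4, out_features=32),
    nn.ReLU(),
    nn.Linear(in_features=32, out_features=2),
)
```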
Getting a policy
We’ll also need to get a policy from the model once per timestep, so that we know how to act. To do this we’ll create a get_policy function, which uses the model to output the probability of each action under the current policy. From these probabilities we can then return a categorical (multinomial) distribution, which can be used to pick specific actions at random according to those probabilities.
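A sketch of that helper, building on the model above (the function name matches the description in the text, but the exact signature is an assumption):

```python
import torch
from torch import nn
from torch.distributions.categorical import Categorical


def get_policy(model: nn.Module, observation: torch.Tensor) -> Categorical:
    """Return the action distribution for a single observation."""
    logits = model(observation)
    # Categorical applies a softmax to the logits, giving a multinomial
    # distribution over the two possible actions.
    return Categorical(logits=logits)
```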
Sampling actions from the policy
At each timestep we can sample this categorical distribution to get an action. We’ll also keep the log probability of that action, which will be useful later when we calculate the gradient.
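One way to do this, using the get_policy helper above (the sample_action name is my own, used for illustration):

```python
def sample_action(model: nn.Module, observation: torch.Tensor):
    """Sample an action for one timestep, returning it with its log probability."""
    policy = get_policy(model, observation)
    action = policy.sample()  # tensor(0) or tensor(1)
    # The log probability keeps its link to the model's parameters, so we can
    # backpropagate through it later.
    return action.item(), policy.log_prob(action)
```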
Calculating the loss
The gradient, which is derived in full in OpenAI’s Spinning Up documentation, is given below. Loosely speaking, it’s the gradient of the sum of the log probability of each state-action pair, multiplied by the return of the whole trajectory that pair was part of. The expectation is estimated over several episodes (i.e. a batch), so that we have enough data for a reasonable gradient estimate.
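In the notation used by Spinning Up, with τ a trajectory collected by acting under the policy π_θ and R(τ) that trajectory’s total return, it reads:

```latex
\nabla_{\theta} J(\pi_{\theta})
  = \mathbb{E}_{\tau \sim \pi_{\theta}}
    \left[ \sum_{t=0}^{T} \nabla_{\theta} \log \pi_{\theta}(a_t \mid s_t) \, R(\tau) \right]
```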
To calculate this with PyTorch, we can compute the pseudo-loss below and then call .backward() to get the gradient above (note that the pseudo-loss is just the term inside the brackets, with the gradient operator removed):
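A minimal sketch of that pseudo-loss. The negation is a common convention so that a standard optimiser, which minimises, ends up maximising expected return; the function name is my own.

```python
import torch


def compute_pseudo_loss(log_probs: torch.Tensor, returns: torch.Tensor) -> torch.Tensor:
    """Pseudo-loss whose gradient (via .backward()) is the policy gradient.

    log_probs: log probability of each action taken in the batch.
    returns:   the full-episode return of the episode each action belongs to.
    """
    # Negated so that minimising the pseudo-loss maximises expected return.
    return -(log_probs * returns).mean()
```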
This is commonly called a loss, but it isn’t a loss function in the usual sense, as it doesn’t measure performance. It’s just a convenient quantity for getting the policy gradient.
Training an epoch
Putting all the above together, we’re now ready to train an epoch. To do this, we simply loop through episodes to build up a batch. Within each episode, we record the actions taken and the rewards received (i.e. the experience), which can then be used to train the model.
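A sketch of that loop, tying together the pieces above. It assumes the classic gym API (where env.step returns four values); the batch size of 5,000 timesteps and the variable names are my own choices.

```python
import gym
import torch


def train_one_epoch(env: gym.Env, model: nn.Module,
                    optimizer: torch.optim.Optimizer,
                    batch_size: int = 5000) -> float:
    """Collect a batch of episodes and take one policy-gradient step."""
    log_probs = []        # log prob of every action taken this epoch
    returns = []          # full-episode return, repeated for each timestep
    episode_returns = []  # one entry per episode, for reporting

    while len(log_probs) < batch_size:
        observation = env.reset()
        episode_rewards = []
        done = False

        while not done:
            action, log_prob = sample_action(
                model, torch.as_tensor(observation, dtype=torch.float32))
            log_probs.append(log_prob)
            observation, reward, done, _ = env.step(action)
            episode_rewards.append(reward)

        # Credit every timestep in the episode with the whole episode's return.
        episode_return = sum(episode_rewards)
        returns += [episode_return] * len(episode_rewards)
        episode_returns.append(episode_return)

    optimizer.zero_grad()
    pseudo_loss = compute_pseudo_loss(
        torch.stack(log_probs), torch.as_tensor(returns, dtype=torch.float32))
    pseudo_loss.backward()
    optimizer.step()

    return sum(episode_returns) / len(episode_returns)
```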
Running the algorithm
And with that, you’re ready to run the algorithm. You can find the full code at https://github.com/alan-cooney/cartpole-algorithms/blob/main/src/vanilla_policy_gradient.py, and you should see the model learn the environment well (scoring 180+ out of 200) after around 40 epochs.
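For reference, a minimal way to put it all together (CartPole-v0 caps episodes at 200 steps; the learning rate of 0.01 and the 40 epochs are assumptions based on the result above):

```python
def main() -> None:
    env = gym.make("CartPole-v0")
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)

    for epoch in range(40):
        average_return = train_one_epoch(env, model, optimizer)
        print(f"epoch {epoch:2d}\taverage return {average_return:.1f}")


if __name__ == "__main__":
    main()
```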
I hope you enjoyed reading this, and if you have any questions just let me know in the comments!
Alan