Simple PPO implementation

Alvaro Durán Tovar
Deep Learning made easy
5 min read · May 20, 2020

I'm back with another simple implementation for reinforcement learning. I'm far from being an expert, but I hope you can learn something new. This time I wanted to keep it as simple as possible so I could concentrate on learning the PPO basics. Here I'm solving the CartPole-v1 environment with TD(0), also called one-step TD.

You can read Reinforcement Learning: An Introduction for a better explanation of this topic, but basically: take one step in the environment, run the algorithm, take another step in the environment, run the algorithm… so there is no need to collect batches and then compute discounted returns. Just use the most recent reward and update the policy on every step. Very likely this doesn't work on real-world problems, but I found it's good enough for playing with the simplest OpenAI Gym environments.
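To make that concrete, here is a minimal sketch of the per-step quantities involved (the names and the 0.99 discount factor are my own placeholders, not necessarily what the notebook uses; `critic` is the value network described in the Architecture section below):

```python
import torch

gamma = 0.99  # discount factor (assumed value)

def one_step_td(reward, done, state, next_state, critic):
    """One-step TD(0): bootstrap the target from the critic, no batches of discounted returns."""
    with torch.no_grad():
        # TD target: r + gamma * V(s'), with no bootstrap on terminal states
        target = reward + gamma * critic(next_state) * (1.0 - done)
    value = critic(state)
    advantage = (target - value).detach()         # fed to the policy loss
    critic_loss = (target - value).pow(2).mean()  # critic regresses towards the TD target
    return advantage, critic_loss
```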

What a weird policy! It learned to vibrate rather than balance the pole.

Useful resources:

- Reinforcement Learning: An Introduction (Sutton & Barto)
- OpenAI Spinning Up, PPO section
- The Stable Baselines repository

Architecture

I've used the same architecture as in my previous post, copied from the Stable Baselines repo. Nothing fancy here.

Actor
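The actor is a small MLP that maps the CartPole observation (4 numbers) to a probability for each of the 2 actions. A minimal sketch of it (hidden sizes are assumptions, and I'm using the Mish activation discussed in cool tip 3; nn.Mish needs a recent PyTorch, otherwise see the one-liner further down):

```python
import torch.nn as nn

class Actor(nn.Module):
    """Policy network: observation -> action probabilities."""
    def __init__(self, state_dim=4, n_actions=2, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.Mish(),
            nn.Linear(hidden, hidden), nn.Mish(),
            nn.Linear(hidden, n_actions), nn.Softmax(dim=-1),
        )

    def forward(self, state):
        return self.net(state)  # wrap in torch.distributions.Categorical to sample actions
```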

Critic
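The critic has the same shape but outputs a single state value V(s). Again, a sketch with assumed sizes:

```python
import torch.nn as nn

class Critic(nn.Module):
    """Value network: observation -> scalar state value V(s)."""
    def __init__(self, state_dim=4, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.Mish(),
            nn.Linear(hidden, hidden), nn.Mish(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state):
        return self.net(state)
```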

Policy loss
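The policy loss is the clipped surrogate objective discussed in the Implementation section below. A standalone sketch of it (the 0.2 clip range is the paper's default, assumed here):

```python
import torch

def policy_loss(log_prob_old, log_prob_new, advantage, eps=0.2):
    """Clipped PPO surrogate; returns a loss to minimize (the negative of the objective)."""
    ratio = torch.exp(log_prob_new - log_prob_old)              # pi_new / pi_old (see cool tip 1)
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * advantage  # clipped surrogate term
    return -torch.min(ratio * advantage, clipped).mean()
```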

Implementation

And here is where the interesting part starts. I started by looking into the Spinning Up PPO section, as I knew they explain RL topics very well. They opt for implementing PPO-Clip rather than PPO-Penalty:

PPO-Penalty approximately solves a KL-constrained update like TRPO, but penalizes the KL-divergence in the objective function instead of making it a hard constraint, and automatically adjusts the penalty coefficient over the course of training so that it’s scaled appropriately.

PPO-Clip doesn’t have a KL-divergence term in the objective and doesn’t have a constraint at all. Instead relies on specialized clipping in the objective function to remove incentives for the new policy to get far from the old policy.

Here, we’ll focus only on PPO-Clip (the primary variant used at OpenAI).

Here is the formula:
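L^{CLIP}(\theta) = \hat{\mathbb{E}}_t\left[\,\min\!\big(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}(r_t(\theta),\,1-\epsilon,\,1+\epsilon)\,\hat{A}_t\big)\right]

(This is the clipped surrogate objective from the PPO paper: r_t(θ) is the probability ratio between the new and old policies, defined just below, Â_t is the advantage estimate, and ε is the clip range, 0.2 by default.)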

Important note: I believe that L stands for likelihood, not for loss, because although the paper calls it a loss, it says to maximize it rather than minimize it. As with the negative log likelihood, that means we minimize -L in practice (solvers usually only minimize, and minimizing -L is the same as maximizing L).

Something that confused me was the ratio between the two policies. Which one is the old policy and which one is the new policy? From that notation alone it isn't clear to me. Among other things, the paper gives us this:
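r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}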

New policy on top (the numerator), old policy on the bottom (the denominator).

Cool tip 1:

If you look at the Spinning Up repo you will see they use the difference between the log probabilities, not the division of the probabilities. Both are equivalent, but logs tend to be used more often because they are numerically more stable (and also because the frameworks return log_prob rather than the probability, so getting rid of the logarithm just to divide would be an unnecessary waste of CPU time).

https://github.com/openai/spinningup/blob/20921137141b154454c0a2698709d9f9a0302101/spinup/algos/pytorch/ppo/ppo.py#L232
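In code the two forms look like this (a small illustration; the two Categorical distributions and the action are made up for the example):

```python
import torch
from torch.distributions import Categorical

# Hypothetical old/new policy outputs for a single state, plus the action that was taken
dist_old = Categorical(probs=torch.tensor([0.6, 0.4]))
dist_new = Categorical(probs=torch.tensor([0.5, 0.5]))
action = torch.tensor(0)

# Two equivalent ways of computing the ratio pi_new(a|s) / pi_old(a|s)
ratio_probs = dist_new.probs[action] / dist_old.probs[action]
ratio_logs = torch.exp(dist_new.log_prob(action) - dist_old.log_prob(action))  # numerically nicer
```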

Cool tip 2:

Something I don't understand yet is why they say they use the following (equivalent) formula instead of the first one:

This is a pretty complex expression, and it’s hard to tell at first glance what it’s doing, or how it helps keep the new policy close to the old policy. As it turns out, there’s a considerably simplified version of this objective which is a bit easier to grapple with (and is also the version we implement in our code):

And the simplified formula becomes:
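L(s, a, \theta_k, \theta) = \min\!\left(\frac{\pi_\theta(a \mid s)}{\pi_{\theta_k}(a \mid s)}\,A^{\pi_{\theta_k}}(s, a),\ \ g\big(\epsilon,\,A^{\pi_{\theta_k}}(s, a)\big)\right),
\qquad
g(\epsilon, A) = \begin{cases} (1+\epsilon)A & A \ge 0 \\ (1-\epsilon)A & A < 0 \end{cases}

(In Spinning Up's notation θ_k plays the role of the old policy's parameters.)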

Unless I'm missing something, that won't work (and it didn't work for me), at least if the value network and the policy network don't share weights. If you only use the advantage to calculate the policy loss (which is what happens whenever g(ε, A) is the smaller of the two terms), the gradients will be 0. Calculated the way I do it, the advantage doesn't depend on the policy at all, so the gradients are 0:
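A_t = r_t + \gamma\,V_\phi(s_{t+1}) - V_\phi(s_t)

(Only the critic V_φ appears here, not π_θ, so whenever the min picks the g(ε, A) term there is no gradient flowing to the policy.)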

So do what they actually do in the code instead: implement the first formula, which is already pretty simple (the policy loss sketch in the Architecture section above follows it):

https://github.com/openai/spinningup/blob/20921137141b154454c0a2698709d9f9a0302101/spinup/algos/pytorch/ppo/ppo.py#L232-L234

Cool tip 3:

tanh activations didn't work well for me, as the gradients looked like they vanished pretty fast. Using the Mish activation (as in my previous posts) worked like a charm. Still don't know about it? What are you doing with your life!! Go and learn about it!! I didn't try ReLUs because I wanted to avoid dying ReLUs.
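For reference, Mish is just x · tanh(softplus(x)); recent PyTorch versions ship it as nn.Mish, and otherwise it is a one-liner:

```python
import torch
import torch.nn.functional as F

def mish(x: torch.Tensor) -> torch.Tensor:
    """Mish activation: x * tanh(softplus(x))."""
    return x * torch.tanh(F.softplus(x))
```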

Cool tip 4:

Use TensorBoard. The more I use it, the more I learn how to debug neural networks with it. Your network doesn't learn anything? Take a look at the gradients! It learns and then forgets everything? Take a look at the gradients! In general, most of the time, take a look at the gradients. Of course other things can go wrong too, like implementing the value function, the advantage function or the policy update incorrectly…
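A minimal way to get gradient histograms into TensorBoard with PyTorch (a sketch; the log directory and the model/step names are placeholders):

```python
from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter("runs/ppo_cartpole")  # assumed log directory

def log_gradients(model, step):
    """One histogram per parameter gradient, so vanishing/exploding gradients stand out."""
    for name, param in model.named_parameters():
        if param.grad is not None:
            writer.add_histogram(f"grad/{name}", param.grad, global_step=step)
```

Call it right after loss.backward() and charts like the ones below come almost for free.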

The results of the final setup for this mini project:

Probability of selecting each action. I wanted to be sure it didn’t collapse into a suboptimal policy like always selecting right or left.
We can see actor loss and critic loss approaching 0 pretty fast. That means the environment is easy to learn and indeed the agent soon stopped being surprised by new states.
The actor's gradients seem to be vanishing.
The maximum reward per episode is 500. We can see the agent solved the environment after about 180 episodes, not bad! But for some reason it forgot everything shortly after. Learning rate annealing would probably have been a good idea, or using KL divergence for early stopping as the Spinning Up implementation does.

Final words

And that's about it. I was very reluctant to start looking into PPO as I thought it would be much more difficult. I know I'm not implementing most of the important stuff, but I just wanted to get a taste of it. It's always super rewarding to see an agent solve an environment!!

Notebook: https://github.com/hermesdt/reinforcement-learning/blob/master/ppo/cartpole_ppo_online.ipynb
