Coding PPO from Scratch with PyTorch (Part 1/4)

Eric Yang Yu
Published in Analytics Vidhya
6 min read · Sep 17, 2020


A roadmap of my 4-part series.

Introduction

This is part 1 of an anticipated 4-part series in which the reader will learn to implement a bare-bones Proximal Policy Optimization (PPO) algorithm from scratch using PyTorch. Refer to the diagram above for a rough roadmap of the series. I wrote this guide because, when learning about policy gradient algorithms and PPO and eventually attempting to implement PPO from scratch myself, I found it ridiculous how few PPO implementations online are simple, well documented, well styled, and actually correct. Even the many self-proclaimed “simple” implementations of PPO that turn up in the first few Google searches for “ppo github” are often confusing, poorly commented, unreadable, and/or plainly incorrect.

My goal in this series is to address this issue by providing and walking through a reliable, well-documented, well-styled, and, most importantly, bare-bones implementation of PPO. I will take you through the steps in which I coded PPO from scratch and explain my thought process behind the decisions I made along the way. The ideal reader is someone who has experience with Python and PyTorch and knows the basic theory behind Reinforcement Learning (RL), policy gradient (pg) algorithms, and PPO (I include PPO because this is a guide on how to write PPO, not on the theory behind it). If you are uncomfortable with any of the above, here are a few great links I found that you should get familiar with before continuing:

Learning Python

Learning PyTorch

Intro to Reinforcement Learning

Intro to Policy Gradient

Intro to PPO

If you are unfamiliar with all of the above, I recommend you go through each link in order from top to bottom before continuing.

Before we dive into the nitty-gritty of the code (that will be starting in Part 2), I’d like to first give an overview of the code I wrote and some statistics on its performance.

Code Overview

Code: PPO for Beginners

In my PPO implementation, I split all my training code into 4 separate files: main.py, ppo.py, network.py, and arguments.py. A rough sketch of how they fit together follows the list below.

main.py: Our executable. It will parse command line arguments using arguments.py, then initialize our environment and PPO model. Here is where we can train or test our PPO model.

ppo.py: Our PPO model. All the learning magic happens in this file.

network.py: A neural network module used to define our Actor/Critic networks in the PPO model. It contains a sample feed-forward neural network.

arguments.py: Parses command line arguments. Provides a function that is called by main.py.
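To make the layout concrete, here is a minimal sketch of how these pieces might fit together in main.py. The names get_args, PPO, and FeedForwardNN are assumptions based on the file descriptions above, not the repository’s exact code.

```python
# main.py: a rough sketch of the training entry point (names are assumptions).
import gym

from arguments import get_args      # hypothetical helper that parses command line arguments
from ppo import PPO                 # the PPO learning algorithm
from network import FeedForwardNN   # the sample actor/critic network class


def main():
    args = get_args()
    env = gym.make(args.env)  # e.g. "Pendulum-v0"

    model = PPO(policy_class=FeedForwardNN, env=env)
    if args.mode == "train":
        model.learn(total_timesteps=200_000)
    else:
        # Testing is delegated to eval_policy.py (described below).
        pass


if __name__ == "__main__":
    main()
```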

Actor/Critic models are periodically saved into binary files, ppo_actor.pth and ppo_critic.pth, which can be loaded up when testing or continuing training.
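For reference, saving and loading those checkpoints with PyTorch could look something like the sketch below; the FeedForwardNN constructor signature is an assumption for illustration.

```python
import gym
import torch

from network import FeedForwardNN  # the sample feed-forward network described above

env = gym.make("Pendulum-v0")
obs_dim = env.observation_space.shape[0]
act_dim = env.action_space.shape[0]

# During training, periodically checkpoint the actor and critic networks.
actor = FeedForwardNN(obs_dim, act_dim)   # constructor signature is an assumption
critic = FeedForwardNN(obs_dim, 1)        # critic outputs a single value estimate
torch.save(actor.state_dict(), "ppo_actor.pth")
torch.save(critic.state_dict(), "ppo_critic.pth")

# When testing or resuming training, rebuild the networks and restore the weights.
actor.load_state_dict(torch.load("ppo_actor.pth"))
critic.load_state_dict(torch.load("ppo_critic.pth"))
```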

My testing code lives primarily in eval_policy.py, which is called by main.py.

eval_policy.py: Tests the trained policy on a specified environment. This module is completely independent from all the other files.
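Roughly speaking, testing boils down to rolling out the trained actor and recording episodic returns. Here is a minimal sketch of that loop; the function name and signature are assumptions, not the repository’s exact code.

```python
import torch


def rollout_episode(policy, env, render=False):
    """Run one episode with the given policy and return its total reward."""
    obs = env.reset()
    done = False
    ep_ret = 0.0
    while not done:
        if render:
            env.render()
        # Query the trained actor network for an action (no exploration noise).
        with torch.no_grad():
            action = policy(torch.tensor(obs, dtype=torch.float)).numpy()
        obs, reward, done, _ = env.step(action)
        ep_ret += reward
    return ep_ret
```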

Here are two diagrams roughly illustrating the training and testing pipelines:

Training Pipeline

Testing Pipeline

Results

I benchmarked my PPO implementation, PPO for Beginners, against Stable Baselines PPO2 on various environments, as shown below. In the graphs, solid lines represent the mean over all trials and shaded regions represent the variance over all trials. If you are unfamiliar with it, Stable Baselines is a repository of optimized implementations of classical RL algorithms that we can use as benchmarks and as reliable reference implementations in research. All hyperparameters used can be found here.

Note that PPO for Beginners sticks mainly to the vanilla PPO pseudocode, whereas PPO2 adds a number of optimizations and tricks, which we will explore in Part 4 of this series.

Pendulum-v0

Link to Pendulum-v0

GIF of Pendulum-v0 after solving with PPO for Beginners (Left) | (Right) Graph showing performance of PPO2 vs PPO for Beginners on Pendulum-v0.

BipedalWalker-v3

Link to BipedalWalker-v3

GIF of BipedalWalker-v3 after solving with PPO for Beginners (Left) | (Right) Graph showing performance of PPO2 vs PPO for Beginners on BipedalWalker-v3.

LunarLanderContinuous-v2

Link to LunarLanderContinuous-v2

GIF of LunarLanderContinuous-v2 after solving with PPO for Beginners (Left) | (Right) Graph showing performance of PPO2 vs PPO for Beginners on LunarLanderContinuous-v2

MountainCarContinuous-v0

Link to MountainCarContinuous-v0

GIF of MountainCarContinuous-v0 after failing to solve with PPO for Beginners (Left) | (Right) Graph showing performance of PPO2 vs PPO for Beginners on MountainCarContinuous-v0

Neither PPO implementation managed to solve MountainCarContinuous-v0 (a good score is near 100); any guesses why that might be?

Answer: PPO is an on-policy algorithm that, like most classical RL algorithms, learns best with a dense reward system; in other words, it needs consistent signals that scale with improved performance in order to reliably converge toward the desired behavior. This particular environment has a sparse reward system: reward = 100 - 0.1(action²) if the flag is reached, else -0.1(action²). Thus, PPO gets few useful learning signals unless it hits the flag through random exploration, which is rare. Even worse, if you look closely at the reward function, it actually penalizes movement over time; so unless the agent gets lucky and hits the flag a few times in a row, PPO tends to settle into a local maximum (minimizing the action penalty) by moving as little as possible, getting stuck at the bottom of the valley and never reaching the flag.
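To make the sparsity concrete, the per-step reward described above can be sketched roughly as follows (a paraphrase of the reward structure, not the environment’s actual source code):

```python
# A rough sketch of MountainCarContinuous-v0's per-step reward as described above.
# Until the car reaches the flag, the only signal is a small penalty for acting,
# so "do nothing" looks locally optimal.
def step_reward(action, flag_reached):
    reward = -0.1 * action ** 2   # every step: penalize the energy spent
    if flag_reached:
        reward += 100.0           # sparse bonus, only on success
    return reward
```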

You might notice that the PPO for Beginners graph shows unusual spikes up to an average episodic return of around 70, yet the agent still fails to learn on average. This is because, though PPO may get lucky in a few runs and hit the flag through exploration, a single round of policy gradient updates with a low learning rate is not enough for PPO to hit the flag consistently. As a result, performance collapses in the next iteration until the agent gets lucky and hits the flag again. Off-policy algorithms tend to fare slightly better in these scenarios since they use replay buffers to train on past experiences (including successes), but because PPO is on-policy, it needs a fresh batch of data after each iteration and therefore must regenerate successful rollouts through exploration in order to learn.

Recap

As you can see, the vanilla PPO for Beginners results aren’t very impressive compared to Stable Baselines; however, the point of this repository isn’t to beat PPO2 (which uses many tricks and optimizations not found in the original paper), but to walk the reader through implementing a simple PPO from scratch. We will explore some optimizations and tricks to incorporate into our vanilla PPO in Part 4.

Again, if you just want the PPO for Beginners code, here it is.

Special thanks to Zhizhen Qin and Professor Sicun Gao for helping me along the way in writing this series.

Feel free to contact me about anything at eyyu@ucsd.edu.

I’ll see you in Part 2 to start writing PPO from scratch in-depth!
