Coding PPO From Scratch With PyTorch (Part 2/4)

Eric Yang Yu
10 min read · Sep 17, 2020


A roadmap of my 4-part series.

Welcome to Part 2 of our series, where we shall start coding Proximal Policy Optimization (PPO) from scratch with PyTorch. If you haven’t read Part 1, please do so first.

Note that going forward, I will be posting code screenshots rather than GitHub gists because I don’t want you to just copy-paste code (you can just go to the main repository for that). Instead, you’re encouraged to follow along with this tutorial while writing the code yourself in another window.

We will be following the PPO-clip variant with pseudocode found in OpenAI’s Spinning Up docs and an Actor-Critic Framework. Here’s a picture of the pseudocode:

Pseudocode of PPO on OpenAI’s Spinning Up doc.

Initial Thoughts: Only 8 steps? Nice. Since this is pseudocode for a learning algorithm, it might be wise to first design how our code will flow. The pseudocode looks like it can all fit in one function; we’ll call it learn. It also appears that we’ll need subroutines for many of the steps (e.g., Step 3 basically asks us to roll out a bunch of simulations, so we can define something like rollout later), so it’s best to encapsulate everything into a class PPO. This way, to train on an environment, we can first create a PPO object, then simply call learn.

First, let’s set up our PPO class in a file called ppo.py:
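
A minimal skeleton is enough to get going; something roughly like this (the bodies get filled in over the rest of this part):

class PPO:
    def __init__(self, env):
        # Environment info, networks, and hyperparameters will go here (Step 1).
        pass

    def learn(self, total_timesteps):
        # The main training loop from the pseudocode will live here.
        pass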

Cool, pat on the back. Let’s look at Step 1:

Step 1

Here’s where we’ll initialize our actor and critic networks. This means we’ll either need to import a neural network module or write our own. Let’s do the latter; we’ll do something similar to PyTorch’s tutorial on creating a neural network with torch.nn. We’ll create a very basic Feed Forward Neural Network. If you’re not comfortable with neural networks, watch this series.

Let’s set up our neural network module in a new file network.py:

import torch
from torch import nn
import torch.nn.functional as F
import numpy as np

class FeedForwardNN(nn.Module):
    def __init__(self):
        super(FeedForwardNN, self).__init__()

We’ll need to define our neural network layers now. We can use a few basic nn.Linear layers, nothing too fancy. We need to define the input and output dimensions, so let’s add some parameters to __init__ to capture that.

def __init__(self, in_dim, out_dim):
    super(FeedForwardNN, self).__init__()

    self.layer1 = nn.Linear(in_dim, 64)
    self.layer2 = nn.Linear(64, 64)
    self.layer3 = nn.Linear(64, out_dim)

Note that I chose 64 arbitrarily; it doesn’t matter too much. Our __init__ is done; now we can define a forward function to do a forward pass on our network. We can use ReLU for the activation (again picked arbitrarily). Since we’re planning on using this network module to define both our actor and critic, and both will take in an observation and return either an action or a value, we’ll make the observation a parameter. One thing to note is that the input to our network must be a tensor, so we should convert our observation to a tensor first in case it’s passed in as a numpy array.

def forward(self, obs):
    # Convert observation to tensor if it's a numpy array
    if isinstance(obs, np.ndarray):
        obs = torch.tensor(obs, dtype=torch.float)

    activation1 = F.relu(self.layer1(obs))
    activation2 = F.relu(self.layer2(activation1))
    output = self.layer3(activation2)
    return output

We are now done defining our network module and are ready to define our actor and critic networks. Here’s how network.py should look:

Complete network.py code.
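
Assembling the snippets above, the file comes out to roughly:

import torch
from torch import nn
import torch.nn.functional as F
import numpy as np

class FeedForwardNN(nn.Module):
    def __init__(self, in_dim, out_dim):
        super(FeedForwardNN, self).__init__()

        self.layer1 = nn.Linear(in_dim, 64)
        self.layer2 = nn.Linear(64, 64)
        self.layer3 = nn.Linear(64, out_dim)

    def forward(self, obs):
        # Convert observation to tensor if it's a numpy array
        if isinstance(obs, np.ndarray):
            obs = torch.tensor(obs, dtype=torch.float)

        activation1 = F.relu(self.layer1(obs))
        activation2 = F.relu(self.layer2(activation1))
        output = self.layer3(activation2)
        return output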

Back to ppo.py; Step 1 should be really easy now: define our initial policy (actor) parameters and value function (critic) parameters.

from network import FeedForwardNN

self.actor = FeedForwardNN(

Uh oh, road block. We don’t have any information on input or output size, which depends on the environment. Since we’ll need access to that environment in many subroutines as well, let’s just add it as an instance variable in our PPO __init__.

def __init__(self, env):
    # Extract environment information
    self.env = env
    self.obs_dim = env.observation_space.shape[0]
    self.act_dim = env.action_space.shape[0]

Eh, we’ll need our actor and critic networks later anyway, so let’s define them as instance variables in __init__ too.

# ALG STEP 1
# Initialize actor and critic networks
self.actor = FeedForwardNN(self.obs_dim, self.act_dim)
self.critic = FeedForwardNN(self.obs_dim, 1)

And we’re done with step 1! Officially done with 1/8 of PPO. Here’s the code so far:
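
Pieced together from the snippets above, ppo.py currently looks roughly like this:

from network import FeedForwardNN

class PPO:
    def __init__(self, env):
        # Extract environment information
        self.env = env
        self.obs_dim = env.observation_space.shape[0]
        self.act_dim = env.action_space.shape[0]

        # ALG STEP 1
        # Initialize actor and critic networks
        self.actor = FeedForwardNN(self.obs_dim, self.act_dim)
        self.critic = FeedForwardNN(self.obs_dim, 1)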

Onto Step 2 now.

Step 2

Easy. They want us to define a for loop to learn for some number of iterations. Now we could loop by iterations, but we also know that Stable Baselines PPO2 makes you specify how many timesteps to train in total when calling learn. Let’s follow that design. This way, instead of counting off to infinite iterations, we can specify how many timesteps to train before we stop.

def learn(self, total_timesteps):
    t_so_far = 0 # Timesteps simulated so far

    while t_so_far < total_timesteps:      # ALG STEP 2
        # Increment t_so_far somewhere below

Step 2, done. Here’s the code so far:

Step 3:

Step 3

Our first mini-challenge. We need to collect data from a set of episodes by running our current actor policy. Sure, sounds like a rollout to me. We can call our data collected in each rollout a batch. Now what data do we need? Let’s take a little look ahead in our pseudocode.

Pseudocode of PPO on OpenAI’s Spinning Up doc.

Looks like we’ll need observations per timestep, as I see sₜ in steps 6 and 7. We’ll also need actions per timestep with aₜ in steps 6 and 7, action probabilities with π_θ(aₜ | sₜ) in step 6, and rewards-to-go with Rₜ in steps 4 and 7. Oh, and don’t forget that in order to increment t_so_far in learn, we’ll need to know how many timesteps are simulated per batch; let’s return the length of each episode run in our batch (not summed yet, since the per-episode lengths are also handy for logging average episodic length later; you can also just sum the episodic lengths before returning, it doesn’t really matter).

We’ll also have to figure out how many timesteps to run per batch; sounds like a hyperparameter to me. We’ll first create a function _init_hyperparameters to define some default hyperparameters, and call the function from our __init__.

def __init__(self, env):
    ...
    self._init_hyperparameters()

def _init_hyperparameters(self):
    # Default values for hyperparameters, will need to change later.
    self.timesteps_per_batch = 4800            # timesteps per batch
    self.max_timesteps_per_episode = 1600      # timesteps per episode

Next, let’s create a rollout function to collect our data.

def rollout(self):
    # Batch data
    batch_obs = []             # batch observations
    batch_acts = []            # batch actions
    batch_log_probs = []       # log probs of each action
    batch_rews = []            # batch rewards
    batch_rtgs = []            # batch rewards-to-go
    batch_lens = []            # episodic lengths in batch

In our batch, we’ll be running episodes until we hit self.timesteps_per_batch timesteps; in the process, we shall collect observations, actions, log probabilities of those actions, rewards, rewards-to-go, and lengths of each episode. We’ll need these for our PPO algorithm later. The respective shapes of each list will be:

  • observations: (number of timesteps per batch, dimension of observation)
  • actions: (number of timesteps per batch, dimension of action)
  • log probabilities: (number of timesteps per batch)
  • rewards: (number of episodes, number of timesteps per episode)
  • rewards-to-go: (number of timesteps per batch)
  • batch lengths: (number of episodes)
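
For example, with a hypothetical environment whose observations are 3-dimensional and whose actions are 1-dimensional, a batch of 4800 timesteps would give batch_obs a shape of (4800, 3), batch_acts a shape of (4800, 1), and batch_log_probs and batch_rtgs shapes of (4800,), while batch_rews and batch_lens would each have one entry per episode run in that batch.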

For why we keep track of log probabilities instead of raw action probabilities, here is a resource that explains why, and here is another. TL;DR: it makes gradient ascent easier behind the scenes. Let’s write a generic gym rollout for one episode first.

obs = self.env.reset()
done = False

for ep_t in range(self.max_timesteps_per_episode):
    action = self.env.action_space.sample()
    obs, rew, done, _ = self.env.step(action)

    if done:
        break

A few things we need to change. We’re not sampling a random action, but querying our actor network. We need to collect observations, actions, log probs, episodic rewards, and episodic lengths. We need to stop once we hit self.timesteps_per_batch. Let’s do that now, assuming we have some get_action function to help us query an action and its log prob.

# Number of timesteps run so far this batch
t = 0

while t < self.timesteps_per_batch:
    # Rewards this episode
    ep_rews = []

    obs = self.env.reset()
    done = False

    for ep_t in range(self.max_timesteps_per_episode):
        # Increment timesteps ran this batch so far
        t += 1

        # Collect observation
        batch_obs.append(obs)

        action, log_prob = self.get_action(obs)
        obs, rew, done, _ = self.env.step(action)

        # Collect reward, action, and log prob
        ep_rews.append(rew)
        batch_acts.append(action)
        batch_log_probs.append(log_prob)

        if done:
            break

    # Collect episodic length and rewards
    batch_lens.append(ep_t + 1) # plus 1 because timestep starts at 0
    batch_rews.append(ep_rews)

Okay, so we need a get_action. Let’s go ahead and write that. During training, we’ll need a way to “explore” actions; we’ll use something called a “Multivariate Normal Distribution” for that. The idea is to have the actor network output a “mean” action on a forward pass, then build a covariance matrix with some fixed variance along the diagonal. We can then use this mean and covariance matrix to create a Multivariate Normal Distribution with PyTorch’s distributions module, sample an action close to our mean, and extract the log probability of that action under the distribution. If you’re uncomfortable with Multivariate Normal Distributions, here’s a great lecture by Andrew Ng on it.

Note: actions will be deterministic when testing, meaning that the “mean” action will be our actual action at test time. During training, however, we need an exploratory factor, which this distribution can help us with.
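
As a rough sketch of what that could look like at test time (this snippet is hypothetical and not part of the code we write in this part), you would just take the actor’s mean output and skip the sampling:

# Hypothetical test-time behavior: act deterministically on the
# actor's mean output instead of sampling from a distribution.
action = self.actor(obs).detach().numpy()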

from torch.distributions import MultivariateNormal

def __init__(self, env):
    ...
    # Create our variable for the matrix.
    # Note that I chose 0.5 arbitrarily.
    self.cov_var = torch.full(size=(self.act_dim,), fill_value=0.5)

    # Create the covariance matrix
    self.cov_mat = torch.diag(self.cov_var)

def get_action(self, obs):
    # Query the actor network for a mean action.
    # Same thing as calling self.actor.forward(obs)
    mean = self.actor(obs)

    # Create our Multivariate Normal Distribution
    dist = MultivariateNormal(mean, self.cov_mat)

    # Sample an action from the distribution and get its log prob
    action = dist.sample()
    log_prob = dist.log_prob(action)

    # Return the sampled action and the log prob of that action
    # Note that I'm calling detach() since the action and log_prob
    # are tensors with computation graphs, so I want to get rid
    # of the graph and just convert the action to a numpy array.
    # The log prob as a tensor is fine. Our computation graph will
    # start later down the line.
    return action.detach().numpy(), log_prob.detach()
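
As a quick preview of why the log probability is the thing worth returning (we’ll put it to use when we finish the algorithm in Part 3): the probability ratio in Step 6 of the pseudocode can be computed as an exponentiated difference of log probs, which is simpler and more numerically stable than dividing raw probabilities. A tiny, self-contained sketch with made-up values:

import torch

# Made-up log probs under the current and old policies
curr_log_probs = torch.tensor([-0.9, -1.2])
batch_log_probs = torch.tensor([-1.0, -1.0])

# The Step 6 ratio pi_theta(a|s) / pi_theta_k(a|s), computed in log space
ratios = torch.exp(curr_log_probs - batch_log_probs)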

Finally, back in our rollout function, we should convert our batch_obs, batch_acts, batch_log_probs, and batch_rtgs to tensors since we’ll need them in that form later to draw our computation graphs. Assume that we have a function compute_rtgs that will compute the rewards-to-go of the batch rewards. Funnily enough, finding the rewards-to-go is Step 4 in our algorithm:

Step 4

# Reshape data as tensors in the shape specified before returning
batch_obs = torch.tensor(batch_obs, dtype=torch.float)
batch_acts = torch.tensor(batch_acts, dtype=torch.float)
batch_log_probs = torch.tensor(batch_log_probs, dtype=torch.float)
# ALG STEP #4
batch_rtgs = self.compute_rtgs(batch_rews)
# Return the batch data
return batch_obs, batch_acts, batch_log_probs, batch_rtgs, batch_lens

Let’s figure out now how to calculate rewards-to-go. Typically, when calculating rewards-to-go over a set of rewards from a single episode, you iterate from the end: keep a running sum, multiply it by the discount factor (gamma) at each timestep, add the immediate reward, and insert the result at the front of a rewards-to-go array. In case you’re fuzzy on how to calculate the reward-to-go, or return, given some observation, here’s the formula.

Reward-to-go formula:

G(sₖ) = Σᵢ₌ₖᵀ γⁱ⁻ᵏ R(sᵢ)

where G is the reward-to-go function, sₖ is our observation at timestep k, T is the number of timesteps in the episode, γ is the discount factor, and R(sᵢ) is the reward given some observation sᵢ.
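
As a quick sanity check with made-up numbers: for a 3-timestep episode with rewards [1, 2, 3] and γ = 0.9, the rewards-to-go would be

G(s₁) = 1 + 0.9·2 + 0.9²·3 = 5.23
G(s₂) = 2 + 0.9·3 = 4.7
G(s₃) = 3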

We’ll apply this exact same workflow, except on multiple episodes (to keep the order consistent, we’ll need to iterate the episodes backward too).

def compute_rtgs(self, batch_rews):
    # The rewards-to-go (rtg) per episode per batch to return.
    # The shape will be (num timesteps per batch)
    batch_rtgs = []

    # Iterate through each episode backwards to maintain same order
    # in batch_rtgs
    for ep_rews in reversed(batch_rews):
        discounted_reward = 0 # The discounted reward so far

        for rew in reversed(ep_rews):
            discounted_reward = rew + discounted_reward * self.gamma
            batch_rtgs.insert(0, discounted_reward)

    # Convert the rewards-to-go into a tensor
    batch_rtgs = torch.tensor(batch_rtgs, dtype=torch.float)

    return batch_rtgs

def _init_hyperparameters(self):
    ...
    self.gamma = 0.95

Finally, let’s call our rollout function in learn.

def learn(self, total_timesteps):
    ...
    while t_so_far < total_timesteps:
        # ALG STEP 3
        batch_obs, batch_acts, batch_log_probs, batch_rtgs, batch_lens = self.rollout()

And there we go! We’re done with Steps 3 and 4, and halfway done with our PPO implementation. Here’s the code so far:

__init__, learn
rollout
get_action, compute_rtgs, _init_hyperparameters
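
In outline form, ppo.py now has roughly this structure (the method bodies are the snippets from above):

from network import FeedForwardNN
import torch
from torch.distributions import MultivariateNormal

class PPO:
    def __init__(self, env):
        # Environment info, actor/critic networks, covariance matrix,
        # and default hyperparameters (Step 1)
        ...

    def learn(self, total_timesteps):
        # Main training loop; calls rollout each iteration (Steps 2-3)
        ...

    def rollout(self):
        # Collect batch observations, actions, log probs,
        # rewards-to-go, and episode lengths (Steps 3-4)
        ...

    def get_action(self, obs):
        # Sample an action and its log prob from a Multivariate Normal
        # centered on the actor's mean output
        ...

    def compute_rtgs(self, batch_rews):
        # Discounted rewards-to-go per timestep across the batch (Step 4)
        ...

    def _init_hyperparameters(self):
        # timesteps_per_batch, max_timesteps_per_episode, gamma
        ...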

Congratulations! We are already halfway through implementing a bare-bones PPO and have finished the majority of the code. In Part 3, we will finish up the PPO implementation.

If you have any questions up to this point, don’t hesitate to leave a comment or reach out to me at eyyu@ucsd.edu. Otherwise, see you in Part 3!
