Coding PPO from Scratch with PyTorch (Part 4/4)

Zhirui Xia
8 min read · Aug 16, 2023


Welcome to Part 4 of our series, where we will briefly discuss some of the most common optimization tricks for Proximal Policy Optimization (PPO). If you haven’t read Parts 1, 2, and 3 yet, please do so first.

Let’s get started!

Here’s the code if you want to take a look first: https://github.com/ericyangyu/PPO-for-Beginners

Learning Rate Annealing

It is a common technique to dynamically change the learning rate during the training loop. The goal of learning rate annealing is to find a balance between rapid progress in the initial stages of optimization and fine-tuning toward convergence as training progresses.

A higher learning rate can introduce instability into the optimization process, leading to divergence or overshooting of the optimal parameter values. Conversely, a lower learning rate can make optimization progress slowly, or increase the likelihood of the algorithm becoming trapped in local minima. It is therefore preferable to change the learning rate over time instead of keeping it fixed. Since we want to adjust the learning rate every time we perform an update, we place the annealing code inside the optimization loop where the actor and critic networks are updated.

for _ in range(self.n_updates_per_iteration):
    # Anneal the learning rate linearly from self.lr down to 0 over the course of training
    frac = (t_so_far - 1.0) / total_timesteps
    new_lr = self.lr * (1.0 - frac)
    new_lr = max(new_lr, 0.0)  # guard against a negative learning rate
    self.actor_optim.param_groups[0]["lr"] = new_lr
    self.critic_optim.param_groups[0]["lr"] = new_lr

The frac term is (roughly) the fraction of training completed so far: the number of timesteps simulated, t_so_far, divided by the total planned timesteps, total_timesteps (subtracting 1.0 just nudges the earliest updates slightly closer to the full self.lr). At the start of training frac is close to 0, so the learning rate starts near self.lr, and it decreases as training advances towards total_timesteps.

The new learning rate new_lr is reduced proportionally using the formula self.lr * (1.0 - frac), clamped at zero, and then applied to both the actor and critic optimizers. By reducing the learning rate over time, we stabilize the optimization process, leading to more effective training of PPO.
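
To make the schedule concrete, here is a tiny standalone sketch of how new_lr evolves over training; the learning rate of 3e-4 and the 1,000,000 total timesteps are made-up values just for illustration:

lr = 3e-4
total_timesteps = 1_000_000

for t_so_far in [1, 250_000, 500_000, 750_000, 1_000_000]:
    frac = (t_so_far - 1.0) / total_timesteps
    new_lr = max(lr * (1.0 - frac), 0.0)
    print(f"t_so_far = {t_so_far:>9,}: new_lr = {new_lr:.2e}")

# Prints roughly 3.00e-04, 2.25e-04, 1.50e-04, 7.50e-05, and finally about 0 at the very end of training.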

Mini-batch Update

Currently, we update our actor and critic networks using the entire batch at once. Training on the full batch can be memory-intensive and computationally expensive, especially for large batches and complex models. It also makes the optimization follow a fixed path, which might not allow for as much exploration of the parameter space. By using mini-batches, we avoid memory issues since we only process a subset of the data at a time, and the reduced computational load per update leads to faster computations. Although the stochastic nature of mini-batch updates introduces randomness and noise, learning rate annealing helps strike a balance between accurate gradient estimation, exploration of the parameter space, and computational efficiency. With learning rate annealing, we can converge faster while still benefiting from the noise introduced by mini-batch updates.

It is crucial to choose an appropriate mini-batch size. Larger mini-batches provide more accurate gradient estimates but slow down each update, while smaller mini-batches introduce more noise but can lead to faster convergence. We will add the number of mini-batches as a hyperparameter in _init_hyperparameters, as shown below.
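
For reference, here is roughly what the new entries in _init_hyperparameters could look like once all the tricks from this part are in place; the specific values are just reasonable defaults to start from, not the only valid choices:

def _init_hyperparameters(self, hyperparameters):
    # ... the hyperparameters from the earlier parts (lr, gamma, clip, etc.) stay as they were ...

    # New hyperparameters for the tricks in this part (values are a starting point; tune per environment)
    self.num_minibatches = 6     # number of mini-batches per update pass
    self.ent_coef = 0            # entropy bonus coefficient (try small values like 0.01 for more exploration)
    self.max_grad_norm = 0.5     # gradient clipping threshold
    self.target_kl = 0.02        # early-stop threshold for the approximate KL divergence
    self.lam = 0.98              # lambda parameter for Generalized Advantage Estimation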

step = batch_obs.size(0)
inds = np.arange(step)
minibatch_size = step // self.num_minibatches

for _ in range(self.n_updates_per_iteration):
    # Learning rate annealing, as before
    frac = (t_so_far - 1.0) / total_timesteps
    new_lr = self.lr * (1.0 - frac)
    new_lr = max(new_lr, 0.0)
    self.actor_optim.param_groups[0]["lr"] = new_lr
    self.critic_optim.param_groups[0]["lr"] = new_lr

    # Shuffle the indices and split the batch into random mini-batches
    np.random.shuffle(inds)
    for start in range(0, step, minibatch_size):
        end = start + minibatch_size
        idx = inds[start:end]
        mini_obs = batch_obs[idx]
        mini_acts = batch_acts[idx]
        mini_log_prob = batch_log_probs[idx]
        mini_advantage = A_k[idx]
        mini_rtgs = batch_rtgs[idx]

        # Evaluate the current policy and compute the clipped surrogate loss on this mini-batch
        V, curr_log_probs, entropy = self.evaluate(mini_obs, mini_acts)
        logratios = curr_log_probs - mini_log_prob
        ratios = torch.exp(logratios)
        surr1 = ratios * mini_advantage
        surr2 = torch.clamp(ratios, 1 - self.clip, 1 + self.clip) * mini_advantage
        actor_loss = (-torch.min(surr1, surr2)).mean()
        critic_loss = nn.MSELoss()(V, mini_rtgs)

In the snippet above, we first determine the batch size so that we can calculate the size of each mini-batch. In the training loop, after annealing the learning rate, we shuffle the indices so that the batch is divided into mini-batches randomly. The rest should look familiar, since we only swap the full-batch tensors for their mini-batch counterparts.

You might have noticed that the evaluate function now returns a new value called entropy. That’s right: the next optimization trick is entropy regularization.

Entropy Regularization

Entropy regularization is a tool used to strike a balance between exploration and exploitation in reinforcement learning. By incorporating entropy into the loss function, the algorithm encourages the policy to exhibit greater exploration and uncertainty, resulting in more varied actions. As used in the mini-batch code above, we modify the evaluate function to return not only the values and log probabilities but also the entropy of the action distribution:

def evaluate(self, batch_obs, batch_acts):
    # Value estimates for each observation in the batch
    V = self.critic(batch_obs).squeeze()

    # Log probabilities and entropy of the batch actions under the current policy
    mean = self.actor(batch_obs)
    dist = MultivariateNormal(mean, self.cov_mat)
    log_probs = dist.log_prob(batch_acts)

    return V, log_probs, dist.entropy()

After we have calculated the actor loss, we add these two lines:

entropy_loss = entropy.mean()
actor_loss = actor_loss - self.ent_coef * entropy_loss

We subtract the entropy term, scaled by a coefficient that is another hyperparameter, from the actor loss. A higher coefficient penalizes overly deterministic policies, pushing the optimization to explore different actions instead of converging prematurely to the currently best-looking solution. This balance between exploration and exploitation contributes to the algorithm’s performance in complex environments.
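
As a side note on what the entropy term actually measures for a Gaussian policy: the entropy of a MultivariateNormal depends only on its covariance, so a wider (more exploratory) distribution gets a larger bonus than a near-deterministic one. A quick standalone check, with arbitrary dimensions and variances that are not tied to our networks:

import torch
from torch.distributions import MultivariateNormal

mean = torch.zeros(3)  # 3-dimensional action space; the mean does not affect the entropy

wide = MultivariateNormal(mean, 0.5 * torch.eye(3))     # more exploratory distribution
narrow = MultivariateNormal(mean, 0.01 * torch.eye(3))  # near-deterministic distribution

print(wide.entropy())    # larger entropy -> larger bonus -> lower actor loss
print(narrow.entropy())  # much smaller (here negative) entropy -> relatively penalized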

Gradient Clipping

Gradient clipping is a technique used during the training of neural networks to mitigate the problem of exploding gradients. When gradients become too large during backpropagation, they can lead to unstable training and hinder convergence.

self.actor_optim.zero_grad()
actor_loss.backward(retain_graph=True)
nn.utils.clip_grad_norm_(self.actor.parameters(), self.max_grad_norm)
self.actor_optim.step()

self.critic_optim.zero_grad()
critic_loss.backward()
nn.utils.clip_grad_norm_(self.critic.parameters(), self.max_grad_norm)
self.critic_optim.step()

When computing gradients and performing backpropagation for both networks, we add nn.utils.clip_grad_norm_(), which calculates the L2 norm of the gradients and scales them down if the norm surpasses the given threshold. This helps maintain stable gradient updates and smoother convergence during optimization. The details can be found in the PyTorch documentation. Usually, self.max_grad_norm is set to 0.5.
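
As a side note, clip_grad_norm_ returns the total gradient norm it measured before clipping, so you can log it to see how often the threshold is actually being hit. A small standalone toy example (not part of the PPO code):

import torch
import torch.nn as nn

layer = nn.Linear(4, 1)
loss = (layer(torch.randn(8, 4)) * 100.0).pow(2).sum()  # deliberately large loss -> large gradients
loss.backward()

total_norm = nn.utils.clip_grad_norm_(layer.parameters(), max_norm=0.5)
print(f"norm before clipping: {total_norm.item():.2f}")  # almost certainly well above 0.5

clipped = torch.norm(torch.cat([p.grad.flatten() for p in layer.parameters()]))
print(f"norm after clipping:  {clipped.item():.2f}")  # capped at (approximately) 0.5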

Approximating KL Divergence

This trick estimates the KL divergence between the updated policy distribution and the old policy distribution. It is calculated from the ratios and log-ratios we already computed:

approx_kl = ((ratios - 1) - logratios).mean()

After every pass over the mini-batches, we check whether this estimate (approx_kl) exceeds the threshold (self.target_kl); if so, we break out of the update loop early, since the new policy has already drifted far enough from the old one.

if approx_kl > self.target_kl:
    break

For more information about why this is a good and unbiased estimator of KL, check out John Schulman’s blog post on approximating KL divergence!
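
As a quick sanity check of the estimator itself, you can compare it against a KL divergence that PyTorch can compute in closed form; the two 1-D Gaussians below are arbitrary and unrelated to the actual policy networks:

import torch
from torch.distributions import Normal, kl_divergence

old_policy = Normal(loc=0.0, scale=1.0)
new_policy = Normal(loc=0.3, scale=1.1)

x = old_policy.sample((100_000,))  # sample from the *old* distribution, just like PPO's rollout data
logratios = new_policy.log_prob(x) - old_policy.log_prob(x)
ratios = torch.exp(logratios)

approx_kl = ((ratios - 1) - logratios).mean()
exact_kl = kl_divergence(old_policy, new_policy)  # closed-form KL(old || new)

print(approx_kl.item(), exact_kl.item())  # the two values should be close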

Generalized Advantage Estimation (GAE)

GAE is used to compute the advantages for the collected data. Originally, we computed the advantages by subtracting the values from the rewards-to-go, which meant we had to compute the rewards-to-go first. By using GAE, we can better handle delayed and noisy rewards and obtain a more reliable estimate of how good each action actually was.

Here is the key result from the GAE paper. Writing the TD residual as

δ_t = r_t + γ·V(s_{t+1}) − V(s_t),

the generalized advantage estimate is

Â_t = Σ_{l=0}^{∞} (γλ)^l · δ_{t+l},

which can be computed recursively, backwards in time, as Â_t = δ_t + γλ·Â_{t+1}.

GAE introduces two key concepts to compute advantages effectively:

Temporal Difference Residuals (TD Residuals): these measure the difference between what actually happened at a step (the observed reward plus the discounted value of the next state) and what the critic expected (the estimated value of the current state). TD residuals capture the immediate impact of an action and tell us how much better or worse the outcome was compared to the expected outcome.

Lambda Parameter (λ): Lambda determines the balance between short-term and long-term advantages. It weighs the contribution of TD residuals from different time steps, allowing GAE to consider both immediate rewards and potential future rewards. A value of λ close to 1 gives more weight to future rewards, whereas a value closer to 0 gives more weight to immediate rewards.

Notice that we should start from the end of each episode and work backwards, so that we don’t recompute the same quantities over and over again.

Our GAE function looks like this:

def calculate_gae(self, rewards, values, dones):
    batch_advantages = []

    # Iterate over episodes: each has its own lists of rewards, value estimates, and done flags
    for ep_rews, ep_vals, ep_dones in zip(rewards, values, dones):
        advantages = []
        last_advantage = 0  # advantage at t+1; zero beyond the end of the episode

        # Work backwards through the episode
        for t in reversed(range(len(ep_rews))):
            if t + 1 < len(ep_rews):
                # TD residual: delta_t = r_t + gamma * V(s_{t+1}) * (1 - done_{t+1}) - V(s_t)
                delta = ep_rews[t] + self.gamma * ep_vals[t+1] * (1 - ep_dones[t+1]) - ep_vals[t]
            else:
                # Last step of the episode: no next state to bootstrap from
                delta = ep_rews[t] - ep_vals[t]

            # GAE recursion: A_t = delta_t + gamma * lambda * (1 - done_t) * A_{t+1}
            advantage = delta + self.gamma * self.lam * (1 - ep_dones[t]) * last_advantage
            last_advantage = advantage
            advantages.insert(0, advantage)

        batch_advantages.extend(advantages)

    return torch.tensor(batch_advantages, dtype=torch.float)

The inputs are batch rewards, values and dones. For each episode’s rewards, values, and done flags, the algorithm computes advantages that take into account the temporal relationships between states and rewards. Using reversed() to start from the end and work backward, GAE calculates advantages by considering the difference between the observed rewards and the estimated values of the current and future time steps. This difference is weighted by a combination of factors including the discount factor (gamma) and the lambda (lam), which helps balance the influence of short-term and long-term rewards.

Notice that besides rewards and values, we also pass in done flags. These flags are essential for computing GAE in PPO because they let the algorithm know when an episode ends, so we only take into account the rewards and values of the current episode and never bootstrap across episode boundaries.

We also need to collect the state values and done flags during the rollout, as sketched below.
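
Concretely, this just means querying the critic once per timestep and recording the done flag alongside the reward. The variable names below (ep_vals, ep_dones, batch_vals, batch_dones) are placeholders mirroring the per-episode structure we already use for rewards; this is a sketch of the rollout additions, not a full rollout function:

# Inside the rollout, at each timestep (in addition to obs, actions, log probs, and rewards):
val = self.critic(obs)           # critic's value estimate for the current observation
ep_vals.append(val.flatten())    # store per-episode values
ep_dones.append(done)            # store per-episode done flags

# At the end of each episode, alongside batch_rews.append(ep_rews):
batch_vals.append(ep_vals)
batch_dones.append(ep_dones)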

After the rollout, we use the collected data to calculate GAE, and then recover the rewards-to-go from the advantages and values: since the advantage is defined as rewards-to-go minus value, the rewards-to-go are simply the advantages plus the values.
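
In code, the post-rollout computation could look like the sketch below; treating the rewards-to-go as advantages plus values follows directly from the definition A = RTG − V, and the advantage normalization at the end is optional but common:

# Compute advantages with GAE from the per-episode rewards, values, and done flags
A_k = self.calculate_gae(batch_rews, batch_vals, batch_dones)

# Recover the rewards-to-go used as the critic's target: A = RTG - V, so RTG = A + V
V = self.critic(batch_obs).squeeze()
batch_rtgs = A_k + V.detach()

# Optionally normalize the advantages for more stable updates
A_k = (A_k - A_k.mean()) / (A_k.std() + 1e-10)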

In essence, GAE enhances the stability, efficiency, and convergence properties of PPO by providing a more accurate and informative measure of the quality of actions.

Besides the tricks we discussed above, there is still plenty left for you to explore. Fine-tuning the hyperparameters can lead to different performance in the same environment: for example, you can adjust the number of mini-batches (which changes the mini-batch size), or try larger or smaller entropy coefficients. Generally, with these tricks, your algorithm should perform better and be more stable.

Below are some comparisons of the bare-bones PPO vs. Stable Baselines2 and our new PPO vs. Stable Baselines3 on the same environments with the same corresponding seeds.

(Note: Pendulum-v1 is a new version of Pendulum-v0)

Here are the comparisons:

Pendulum-v0 / v1

LunarLander-v2

BipedalWalker-v3
