Policy Optimizations: TRPO/PPO

Cheng Xi Tsou
Published in Geek Culture · Sep 17, 2021

In this post, I will be talking about policy optimization methods from the papers Trust Region Policy Optimization (Schulman et al. 2015) and Proximal Policy Optimization Algorithms (Schulman et al. 2017). I will briefly go over the Trust Region Policy Optimization method, then implement two variants of Proximal Policy Optimization: an adaptive KL (Kullback-Leibler) penalty on the surrogate objective and a clipped surrogate objective.

Introduction

In a traditional policy gradient method, we sample a trajectory of states, actions, and rewards, then update the policy using the sampled trajectories. While this method works and solves basic control problems, it tends to be unstable and inconsistent in solving an environment. One problem is that each policy update changes the distribution of states and actions the agent encounters, so the data the policy is trained on keeps shifting, which leads to instability.

Another problem is that when performing gradient ascent, we have no guarantee that the policy is being updated in the right direction. The advantage estimates come from a randomly initialized value function and are noisy, so updating the policy with respect to them can cause degradation instead of improvement. A solution is to keep the new policy within some safe distance of the old policy, so that each update guarantees monotonic improvement and the policy does not diverge.

Trust Region Policy Optimization

Let’s start by introducing the concept of trust regions from the TRPO paper.

Trust Region Policy Optimization (Schulman et al. 2015)

Here, we have η, the expected discounted returns following policy π. We can use this expected discounted reward in the following identity that expresses the expected discounted returns for a policy in terms of the advantage over another policy (Kakade & Langford 2002):

Trust Region Policy Optimization (Schulman et al. 2015)
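Written out (my transcription of the paper's equations, using π' for the new policy), the expected discounted return and the identity are:

η(π) = E_{s₀, a₀, …∼π}[ Σ_{t=0}^∞ γ^t r(s_t) ]

η(π') = η(π) + E_{s₀, a₀, …∼π'}[ Σ_{t=0}^∞ γ^t A_π(s_t, a_t) ]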

We use the standard definition of the advantage function A, where A(s, a) = Q(s, a) − V(s), with Q being the value of taking action a in state s and V being the value of state s. The identity measures the benefit of switching from policy π to π' by evaluating the advantages under π while sampling trajectories from π'. Then, we rewrite this identity as a sum over states and actions instead of timesteps:

Trust Region Policy Optimization (Schulman et al. 2015)
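My transcription of the rewritten form, with ρ_π(s) = P(s₀ = s) + γ P(s₁ = s) + γ² P(s₂ = s) + …:

η(π') = η(π) + Σ_s ρ_π'(s) Σ_a π'(a|s) A_π(s, a)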

In the last line, we define ρ_π(s) to be the discounted visitation frequencies for state s under policy π so we can get rid of the timesteps. From this equation, we can see that as long as the advantage is non-negative for all states s, then updating the policy from π → π’ will guarantee a non-negative improvement. The paper then introduces a local approximation of this identity:

Trust Region Policy Optimization (Schulman et al. 2015)
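As I read it, the approximation simply swaps the new policy's visitation frequencies ρ_π' for the old policy's ρ_π:

L_π(π') = η(π) + Σ_s ρ_π(s) Σ_a π'(a|s) A_π(s, a)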

Instead of sampling state transitions from the new policy, we use the state distribution from the old policy. Computing the new policy's discounted visitation frequencies would require collecting trajectories under the new policy, which we don't have yet. With this approximation, we can simply reuse the trajectories collected under the old policy before updating it.

Since L is only a local approximation, it does not match η globally. The paper derives a lower bound on η, which defines a trust region where the approximation is accurate, so we know how big a step we can take toward the new policy while the approximation remains viable.

Taken from UWaterloo CS885 Lecture 15a

Extending a result by Kakade and Langford, the paper derives the following lower bound:

Trust Region Policy Optimization (Schulman et al. 2015)
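As written in the paper (Theorem 1), the bound is:

η(π') ≥ L_π(π') − C · D_KL^max(π, π'),  where C = 4εγ / (1 − γ)² and ε = max_{s,a} |A_π(s, a)|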

The penalty uses D_KL, the Kullback-Leibler divergence between π and π', a standard measure of the distance between two probability distributions; here the bound takes the maximum over states of the KL divergence between the two action distributions. Here's a visual of the local approximation:

Taken from UWaterloo CS885 Lecture 15a

Now with our surrogate function and trust-region lower bound, we can come up with an algorithm that guarantees non-decreasing expected return η:

Trust Region Policy Optimization (Schulman et al. 2015)
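Paraphrasing Algorithm 1: at each iteration i, compute all advantage values A_{π_i}(s, a), then update

π_{i+1} = argmax_π [ L_{π_i}(π) − C · D_KL^max(π_i, π) ]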

We can see why this algorithm guarantees improvement by looking at the derived lower bound. Let M_i(π) = L_{π_i}(π) − C · D_KL^max(π_i, π) be the penalized surrogate on the right-hand side. The bound says η(π) ≥ M_i(π) for any π, and M_i(π_i) = η(π_i), since the approximation is exact at the old policy and the KL term vanishes there. Because we maximize M_i at every iteration, we get η(π_{i+1}) ≥ M_i(π_{i+1}) ≥ M_i(π_i) = η(π_i), so the expected discounted return never decreases.

The TRPO paper then proposes a more practical way to implement the algorithm when the policy π is parameterized by θ. If we use the penalty coefficient C as proposed in Algorithm 1, the update steps end up being very small. Instead, we drop the penalty from the surrogate objective and impose a trust-region constraint on the KL divergence between the old and new policies, which allows larger steps while still compensating for the fact that L is only a local approximation.

Trust Region Policy Optimization (Schulman et al. 2015)
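In parameterized form (my transcription), the problem is:

maximize_θ  L_{θ_old}(θ)   subject to   D_KL^max(θ_old, θ) ≤ δ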

In practice, it is intractable to enforce the KL constraint at every possible state, so we use an approximation instead: as long as the average KL divergence over states is within the constraint, we accept the update. Next, let's rewrite the surrogate objective function. Recall the optimization problem we are currently trying to solve:

Trust Region Policy Optimization (Schulman et al. 2015)

First, we replace the summation over states with an expectation over the state visitation distribution; normalizing the discounted visitation frequencies introduces a factor of 1/(1 − γ), the sum of the infinite geometric series of discount factors. Then, we replace the advantage function with the Q function, which only changes the objective by a constant that does not depend on θ. Finally, we replace the summation over actions with an expectation over a sampling distribution by using importance sampling. Now we have the following objective:

Trust Region Policy Optimization (Schulman et al. 2015)
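My transcription of the resulting problem, where q is the distribution actions are sampled from (in the single-path case, simply π_{θ_old}):

maximize_θ  E_{s∼ρ_{θ_old}, a∼q}[ (π_θ(a|s) / q(a|s)) · Q_{θ_old}(s, a) ]
subject to  E_{s∼ρ_{θ_old}}[ D_KL( π_{θ_old}(·|s) ‖ π_θ(·|s) ) ] ≤ δ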

The TRPO paper uses this objective with a Q function and develops a practical algorithm that solves the constrained problem at each iteration. I won't go into detail about the TRPO algorithm, as it is computationally expensive and complex compared to its successor, PPO.

Proximal Policy Optimization

Now, with the background knowledge of TRPO, we can discuss the changes proposed in the PPO paper, which simplify TRPO's constrained optimization problem, something that can be complex and computationally expensive to solve. Recall the objective that TRPO tries to maximize:

Proximal Policy Optimization Algorithms (Schulman et al. 2017)
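As a reminder of the notation (my transcription from the PPO paper):

rₜ(θ) = π_θ(aₜ|sₜ) / π_{θ_old}(aₜ|sₜ),   L^CPI(θ) = Êₜ[ rₜ(θ) Âₜ ]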

Here, we are just defining rₜ(θ) as the importance sampling ratio between the new and old policies. The PPO paper proposes a new kind of objective: the clipped surrogate objective.

Proximal Policy Optimization Algorithms (Schulman et al. 2017)
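Transcribed, the clipped objective is:

L^CLIP(θ) = Êₜ[ min( rₜ(θ) Âₜ,  clip(rₜ(θ), 1 − ϵ, 1 + ϵ) Âₜ ) ]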

Without a constraint, maximizing the surrogate objective can lead to an excessively large policy update. Instead of constraining the surrogate objective directly, we constrain the ratio between the new and old policies to the range [1 − ϵ, 1 + ϵ], where ϵ is a hyperparameter that dictates how far the ratio can deviate from 1. Then, we take the minimum of the clipped and unclipped objectives. Intuitively, we can think of the clipping like this:

If the advantage is positive, we want the ratio to be >1 so that the action becomes more likely under the new policy. The clip then caps the ratio at 1 + ϵ so the new policy does not stray too far from the old policy. If the advantage is negative, we want the ratio to be <1 so that the action becomes less likely. The clip then bounds the ratio at 1 − ϵ, so again the new policy does not stray too far from the old policy.
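Here's a tiny, self-contained sketch of how the min/clip combination behaves for made-up ratio and advantage values (the helper function and the numbers are just for illustration):

```python
# Toy illustration of the clipped surrogate term for a single sample.
# ratio = pi_new(a|s) / pi_old(a|s); the values below are made up.
def clipped_objective(ratio, advantage, eps=0.2):
    clipped = max(min(ratio, 1 + eps), 1 - eps)          # clip(r, 1 - eps, 1 + eps)
    return min(ratio * advantage, clipped * advantage)   # take the pessimistic (minimum) term

print(clipped_objective(1.5, +1.0))   # 1.2  -> positive advantage: the gain is capped at (1 + eps) * A
print(clipped_objective(0.5, -1.0))   # -0.8 -> negative advantage: no incentive to push r below 1 - eps
print(clipped_objective(1.5, -1.0))   # -1.5 -> the unclipped term is used, so a bad large step is fully penalized
```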

Another method that the paper proposes is an adaptive KL penalty. Recall the derived lower bound for the deviation of the local approximation from the true expected discounted return. As in TRPO, we put a penalty on the surrogate objective, but instead of the computed constant C, which results in small updates, we use a hyperparameter coefficient β. Here is the surrogate objective with the penalty:

Proximal Policy Optimization Algorithms (Schulman et al. 2017)
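Transcribed, the penalized objective is:

L^KLPEN(θ) = Êₜ[ rₜ(θ) Âₜ − β · D_KL( π_{θ_old}(·|sₜ) ‖ π_θ(·|sₜ) ) ]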

Since it's hard to choose a β that encompasses all situations, we can use an adaptive penalty instead. After several epochs of policy updates, we change the β constant under the following conditions (Let d = KL divergence between old and new policy):

Proximal Policy Optimization Algorithms (Schulman et al. 2017)
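A small sketch of that update rule in code (the threshold 1.5 and the factor 2 are the values the paper uses; it notes the algorithm is not very sensitive to them):

```python
# Adaptive KL penalty update: d is the measured mean KL divergence after an
# update, d_targ is the target KL, and beta is the current penalty coefficient.
def update_beta(beta, d, d_targ):
    if d < d_targ / 1.5:
        beta /= 2    # KL too small: loosen the penalty so updates can be larger
    elif d > d_targ * 1.5:
        beta *= 2    # KL too large: tighten the penalty to keep the policies close
    return beta
```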

We choose a target KL divergence of our new and old policy. If the KL divergence is under our target, we can lessen the penalty on the surrogate objective. If the KL divergence is over our target, we will increase the penalty on the surrogate objective.

With either the clipped surrogate objective or the adaptive KL penalty, we can modify the objective a bit more in practice. If we use a neural network that shares parameters between the actor and the critic (we need a critic to estimate the advantage), we can add two more terms to the objective function.

Proximal Policy Optimization Algorithms (Schulman et al. 2017)
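Transcribed, the combined objective is:

L_t^{CLIP+VF+S}(θ) = Êₜ[ L_t^CLIP(θ) − c1 · L_t^VF(θ) + c2 · S[π_θ](sₜ) ],  with L_t^VF(θ) = (V_θ(sₜ) − V_t^targ)²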

Here, the surrogate objective uses the clipped objective, though it works with the adaptive KL penalty as well. We have weights c1 for the value function loss (typically mean-squared error loss), and c2 for the entropy bonus S. Putting everything together, we have the PPO algorithm using an Actor-Critic structure:

Proximal Policy Optimization Algorithms (Schulman et al. 2017)

Implementation

For my implementation, I used the Actor-Critic structure with shared parameters for my training loop, computed my advantage estimates with a value network, and implemented the two optimization methods, clipped and KL adaptive penalty. Then, I trained my PPO agent using both optimization methods on the CartPole-v1 and LunarLander-v2 environments. Here’s a quick recap of the two environments.

CartPole is a classic control problem where an agent tries to balance a pole on a cart. The environment has a continuous state space with 4 variables (cart position, cart velocity, pole angle, and pole angular velocity) and a discrete action space with two actions: push left or push right. The agent is rewarded for each step the pole stays balanced.

LunarLander-v2 is a rocket trajectory optimization problem where an agent tries to land a rocket on a surface. The environment has a continuous state space with 8 variables (x position, y position, horizontal velocity, vertical velocity, lander angle, angular velocity, and two booleans indicating whether the left and right legs are touching the ground) and a discrete action space with four actions: fire main engine, fire left engine, fire right engine, and do nothing. The agent is rewarded for landing correctly within the landing pad and penalized for fuel usage.

For my Actor-Critic network, I used the same layers as the PPO paper, with two hidden layers of 64 units and tanh activations. Here are the hyperparameters and experiment setup I used for training the agent (a rough sketch of the network and update step follows the list):

  • γ (discount factor): 0.99
  • ϵ (surrogate clip): 0.2
  • β (initial KL penalty): 1
  • δ (target KL divergence): 0.01
  • c1 (value loss weight): 0.5
  • c2 (entropy weight): 0.01
  • k_epoch (number of training epochs in an update): 40
  • α_θ (actor learning rate): 0.0003
  • α_v (critic learning rate): 0.001
  • total training steps: 300,000
  • max episode steps: 400
  • batch size: 1600
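To make the setup concrete, here is a rough PyTorch sketch of the shared Actor-Critic network and the clipped-objective update under these hyperparameters. It is my own simplified reconstruction, not the exact code from the linked notebook (names like ActorCritic and ppo_update are mine), and for brevity it folds everything into a single optimizer instead of separate actor and critic learning rates:

```python
import torch
import torch.nn as nn

class ActorCritic(nn.Module):
    """Shared body with two 64-unit tanh layers, plus policy and value heads."""
    def __init__(self, obs_dim, n_actions, hidden=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
        )
        self.policy_head = nn.Linear(hidden, n_actions)  # actor: action logits
        self.value_head = nn.Linear(hidden, 1)           # critic: state value

    def forward(self, obs):
        x = self.body(obs)
        dist = torch.distributions.Categorical(logits=self.policy_head(x))
        return dist, self.value_head(x).squeeze(-1)

def ppo_update(net, optimizer, obs, actions, old_log_probs, returns, advantages,
               eps=0.2, c1=0.5, c2=0.01, k_epoch=40):
    """One PPO update over a collected batch; old_log_probs, returns, and
    advantages are computed (and detached) before calling this."""
    for _ in range(k_epoch):
        dist, values = net(obs)
        log_probs = dist.log_prob(actions)
        ratio = torch.exp(log_probs - old_log_probs)              # r_t(theta)
        surr1 = ratio * advantages
        surr2 = torch.clamp(ratio, 1 - eps, 1 + eps) * advantages
        policy_loss = -torch.min(surr1, surr2).mean()             # clipped surrogate (negated for descent)
        value_loss = (returns - values).pow(2).mean()             # critic MSE loss
        entropy = dist.entropy().mean()                           # entropy bonus
        loss = policy_loss + c1 * value_loss - c2 * entropy
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```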

Here’s the training history of CartPole with clipped surrogate objective:

The agent was able to solve the environment in 30,000 steps with an average reward of 242.26 over 50 test runs. The training history looks a bit unstable, but I suspect this is because I didn't do any kind of hyperparameter search. Another reason could be the way I computed my advantage: the PPO paper uses generalized advantage estimation (Schulman et al. 2015) to calculate the advantage, while I just subtracted a learned value baseline from the cumulative discounted returns. After training the agent again without early stopping, the agent was able to achieve the max average score of 400 with 100,000 training steps.

I also tried to train the agent with an adaptive KL penalty, but the results were inconsistent and unstable, though it was still able to solve the CartPole environment.

Next, here’s the training history and video of LunarLander with clipped surrogate objective:

The agent was trained for 300,000 steps and reached an average reward of 142.4 over 50 test runs. As with the CartPole environment, the training history is a bit unstable due to the aforementioned factors. Although a test score of 142.4 is not considered solved, the lander is able to land smoothly on the surface and does much better than a random policy. With a hyperparameter sweep, the PPO agent should be able to achieve a solved score of over 200. Here's a video of the agent in action:

LunarLander-v2 trained with PPO

Code: https://github.com/chengxi600/RLStuff/blob/master/Policy%20Optimization%20Algorithms/PPO_Discrete.ipynb

References:

  • Schulman, J., Levine, S., Moritz, P., Jordan, M. I., & Abbeel, P. (2015). Trust Region Policy Optimization. arXiv:1502.05477.
  • Schulman, J., Wolski, F., Dhariwal, P., Radford, A., & Klimov, O. (2017). Proximal Policy Optimization Algorithms. arXiv:1707.06347.
  • Schulman, J., Moritz, P., Levine, S., Jordan, M. I., & Abbeel, P. (2015). High-Dimensional Continuous Control Using Generalized Advantage Estimation. arXiv:1506.02438.
  • Kakade, S., & Langford, J. (2002). Approximately Optimal Approximate Reinforcement Learning. ICML 2002.
  • University of Waterloo, CS885 Reinforcement Learning, Lecture 15a.