PPO Hyperparameters and Ranges

AurelianTactics · Jul 25, 2018
Clip parameter illustration from Schulman et al., https://arxiv.org/pdf/1707.06347.pdf

Proximal Policy Optimization (PPO) is one of the leading Reinforcement Learning (RL) algorithms. PPO is the algorithm powering OpenAI Five, which recently beat a group of experienced Dota 2 players and will challenge some former professional players in a couple of weeks. A PPO variant — Joint PPO — won the OpenAI Retro Contest. Helping PPO’s spread are open-source implementations like OpenAI’s baselines, TensorForce, RLlib, and Unity ML Agents.

PPO is a policy gradient method that makes policy updates using a surrogate loss function to avoid catastrophic drops in performance. The algorithm is robust in that hyperparameter initializations are fairly forgiving and it can work out of the box on a wide variety of RL tasks. However, proper hyperparameter initialization and search can still lead to improved results. This post explains the hyperparameters and their ranges as used in the small number of examples I gathered. If you come across more hyperparameters used in tuned, successful PPO experiments, please let me know. I hope this post can be somewhat useful until an experienced PPO user writes a definitive guide.

Hyperparameters: Epochs, Minibatches, Horizon

Some background. A policy is the mapping an RL agent uses to choose actions in each state. In policy gradient methods, the agent generally starts with an initial policy, interacts with the environment, receives rewards, and uses those rewards to improve the policy, producing a new policy. Policy gradient algorithms typically alternate between two steps: gathering transitions and improving the policy. This raises two main issues: how much experience the agent should gather before updating the policy, and how to actually update the old policy to the new policy.

The experience issue arises because policy gradient methods gather transitions (sequences of states, actions, and rewards) and use those transitions to update the policy. The old transitions are then thrown away and new transitions are gathered under the new policy. This is why on-policy, policy gradient methods are typically less sample efficient than off-policy methods like DQN (which store and reuse transitions in a replay buffer). This leads to our first grouping of hyperparameters, which deal with experience collection: epochs, minibatches, and horizon.

PPO gathers trajectories as far out as the horizon limits, then performs stochastic gradient descent (SGD) updates on minibatches of the specified size over all of the gathered trajectories, for the specified number of epochs.
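To make the interaction of these three hyperparameters concrete, here is a minimal Python sketch of the collect-then-update loop (not taken from any particular library; collect_step and update_policy are hypothetical placeholders for environment interaction and one SGD step):

import numpy as np

def ppo_data_loop(collect_step, update_policy, horizon=2048,
                  minibatch_size=64, num_epochs=10):
    # 1. Gather `horizon` transitions with the current policy.
    transitions = [collect_step() for _ in range(horizon)]
    # 2. Run several epochs of SGD over the same batch of transitions,
    #    shuffling into minibatches each epoch.
    indices = np.arange(horizon)
    for _ in range(num_epochs):
        np.random.shuffle(indices)
        for start in range(0, horizon, minibatch_size):
            batch = [transitions[i] for i in indices[start:start + minibatch_size]]
            update_policy(batch)
    # 3. Throw the transitions away and repeat with the improved policy.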

Horizon Range: 32 to 5000

Horizon also known as: horizon (PPO paper and RLlib), nsteps (ppo2 baselines), timesteps_per_actorbatch (ppo baselines), time_horizon (Unity ML), (TensorForce: unclear)

Another point to consider is the balance between the horizon and the discount factor gamma, as discussed in the OpenAI Five blog post: how far into the future should rewards influence the policy?
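A rough rule of thumb: with discount factor gamma, rewards more than about 1/(1 - gamma) steps in the future have little influence on the update, so gamma and horizon should be chosen together. For example:

# Effective planning horizon for a few common gamma values.
for gamma in (0.9, 0.99, 0.999):
    print(f"gamma={gamma}: effective horizon ~ {1.0 / (1.0 - gamma):.0f} steps")
# gamma=0.9 -> ~10 steps, gamma=0.99 -> ~100 steps, gamma=0.999 -> ~1000 steps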

Minibatch Range: 4 to 4096 (can be much higher with distributed implementations)

Minibatch also known as: minibatch size (PPO paper), timesteps_per_batch (RLlib), nminibatches (ppo2 baselines), optim_batchsize (ppo baselines), batch_size (Unity ML), (TensorForce: unclear)

Epoch Range: 3 to 30

Epoch also known as: Num. epochs (PPO paper), num_sgd_iter (RLlib), noptepochs (ppo2 baselines), optim_epochs (ppo baselines), num_epoch (Unity ML), (TensorForce: unclear)

Hyperparameters: Clipping, Gamma, Lambda, KL Target

Let’s return to the second major issue with policy gradients: how to actually update the new policy from the old policy. If the policy is updated in too large a step, policy performance can collapse drastically and never recover. PPO uses a surrogate loss function to keep the step from the old policy to the new policy within a safe range. From the PPO paper:

Surrogate Objectives Analyzed in PPO Paper

PPO uses either the second line, the third line, or a combination of the two, depending on the implementation. The clip parameter is the epsilon in the second line.
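Here is a minimal NumPy sketch of the clipped surrogate objective (the second line), assuming you already have per-transition log probabilities under the new and old policies and advantage estimates; it is an illustration, not any library’s implementation:

import numpy as np

def clipped_surrogate_loss(logp_new, logp_old, advantages, clip_epsilon=0.2):
    # Probability ratio r_t = pi_new(a_t | s_t) / pi_old(a_t | s_t).
    ratio = np.exp(logp_new - logp_old)
    # Clip the ratio to [1 - epsilon, 1 + epsilon] and take the pessimistic minimum.
    clipped = np.clip(ratio, 1.0 - clip_epsilon, 1.0 + clip_epsilon)
    surrogate = np.minimum(ratio * advantages, clipped * advantages)
    # Negate because optimizers minimize; PPO maximizes this surrogate.
    return -np.mean(surrogate)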

Clipping Range: 0.1, 0.2, 0.3

Clipping also known as: Clipping parameter epsilon (PPO Paper), clip_param (RLlib), cliprange (ppo2 baselines), clip_param (ppo baselines), epsilon (Unity ML), likelihood_ratio_clipping (TensorForce)

The KL penalty implementation (third line in the picture above) is available in RLlib’s PPO implementation. The parameters kl_coeff (the initial coefficient for the KL divergence penalty) and kl_target control it; a sketch of the adaptive update follows the ranges below.

KL Target Range: 0.003 to 0.03

KL Initialization Range: 0.3 to 1
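The PPO paper adjusts the penalty coefficient between updates based on how far the measured KL divergence landed from the target. A sketch of that adaptive rule, with names chosen to match RLlib’s kl_coeff and kl_target:

def update_kl_coeff(kl_coeff, measured_kl, kl_target):
    # Adaptive rule from the PPO paper: loosen the penalty when the policy
    # moved much less than the target KL, tighten it when it moved much more.
    if measured_kl < kl_target / 1.5:
        kl_coeff /= 2.0
    elif measured_kl > kl_target * 1.5:
        kl_coeff *= 2.0
    return kl_coeff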

The capital A-hat symbol in the pictures above is the advantage function, which alters the reward stream using the parameters gamma and lambda, as outlined in the Generalized Advantage Estimation (GAE) paper. Lambda and gamma perform a bias-variance trade-off on the trajectories and can also be viewed as a form of reward shaping. GAE at a glance:

Excerpt from GAE paper
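As a rough sketch of how gamma and lambda shape the advantage estimates (assuming arrays of per-step rewards, value estimates with one bootstrap value appended, and done flags; not tied to any particular implementation):

import numpy as np

def compute_gae(rewards, values, dones, gamma=0.99, lam=0.95):
    # values holds one extra entry: the value estimate of the state after the
    # final step, used to bootstrap the last TD error.
    advantages = np.zeros(len(rewards))
    gae = 0.0
    for t in reversed(range(len(rewards))):
        nonterminal = 1.0 - dones[t]
        # TD error: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * values[t + 1] * nonterminal - values[t]
        # Exponentially weighted sum of TD errors, decayed by gamma * lambda.
        gae = delta + gamma * lam * nonterminal * gae
        advantages[t] = gae
    return advantages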

Discount Factor Gamma Range: 0.99 (most common), 0.8 to 0.9997

Discount Factor Gamma also known as: Discount (gamma) (PPO Paper), gamma (RLlib), gamma (ppo2 baselines), gamma (ppo baselines), gamma (Unity ML), discount (TensorForce)

GAE Parameter Lambda Range: 0.9 to 1

GAE Parameter Lambda also known as: GAE Parameter (lambda) (PPO Paper), lambda (RLlib), lambda (ppo2 baselines), lambda (ppo baselines), lambda (Unity ML), gae_lambda (TensorForce)

Hyperparameters: Value Function and Entropy Coefficients

In addition to the surrogate loss functions discussed above, PPO contains two other losses in the objective function.

PPO Objective. Note c1 and c2.

c1 is the value function coefficient and c2 is the entropy coefficient. Explanation for the Value Function loss (2nd term) from the PPO paper:

If using a neural network architecture that shares parameters between the policy and value function, we must use a loss function that combines the policy surrogate and a value function error term.
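A sketch of how the three terms combine into one quantity to minimize, with c1 and c2 as the value function and entropy coefficients (signs flipped relative to the paper, which maximizes the objective):

def ppo_total_loss(surrogate_loss, value_loss, entropy, vf_coeff=0.5, ent_coeff=0.01):
    # surrogate_loss: the (already negated) clipped surrogate from the sketch above.
    # value_loss: squared error between predicted and target values (the c1 term).
    # entropy: mean entropy of the action distribution (the c2 term), subtracted
    # so that higher entropy lowers the loss and encourages exploration.
    return surrogate_loss + vf_coeff * value_loss - ent_coeff * entropy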

Value Function Coefficient Range: 0.5, 1

Value Function Coefficient also known as: VF coeff. (PPO Paper), vf_loss_coef (RLlib), vf_coef (ppo2 baselines), (ppo baselines: unclear), (Unity ML: unclear), (TensorForce: unclear)

The entropy coefficient acts as a regularizer. A policy has maximum entropy when all actions are equally likely and minimum entropy when a single action’s probability dominates. The policy’s entropy is multiplied by the entropy coefficient and added to the loss, which discourages premature convergence to one dominant action probability that would shut down exploration. Denny Britz’s implementation of a different RL algorithm shows how the entropy coefficient can be used as a regularizer in general:

# Entropy of the action distribution for each state in the batch.
self.entropy = -tf.reduce_sum(self.probs * tf.log(self.probs), 1)
...
# Policy gradient loss with an entropy bonus weighted by 0.01.
self.losses = - (tf.log(self.picked_action_probs) * self.targets + 0.01 * self.entropy)
self.loss = tf.reduce_sum(self.losses, name="loss")

In the above example, the 0.01 is the entropy coefficient.

Entropy Coefficient Range: 0 to 0.01

Entropy Coefficient also known as: Entropy coeff. (PPO Paper), entropy_coeff (RLlib), ent_coeff (ppo2 baselines), entcoeff (ppo baselines), beta (Unity ML), entropy_regularization (TensorForce)

Other Hyperparameters: Ending Condition and Learning Rate

This last section covers some more general hyperparameters that appear in many deep learning experiments. An obvious one is how long to run the experiment: either a set number of timesteps, or until some end condition is met (such as the average reward over the past 100 episodes exceeding a threshold). Another is the learning rate of the optimizer. PPO uses the Adam optimizer, and some implementations also expose a hyperparameter for Adam’s epsilon.

Learning Rate Range: 5e-6 to 0.003

Learning Rate also known as: Adam stepsize (PPO Paper), sgd_stepsize (RLlib), lr (ppo2 baselines), (ppo baselines: unclear), learning_rate (Unity ML), learning_rate (TensorForce)
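Many implementations also anneal the learning rate over the course of training rather than keeping it fixed; linear decay toward zero is a common choice. A sketch, not any specific library’s API:

def linear_lr_schedule(initial_lr, fraction_remaining):
    # fraction_remaining goes from 1.0 at the start of training down to 0.0
    # at the end, so the learning rate decays linearly to zero.
    return initial_lr * fraction_remaining

# Example: with initial_lr = 2.5e-4, the rate is 1.25e-4 halfway through training.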
