# Understanding PPO Plots in TensorBoard

OpenAI Baselines and Unity Machine Learning have TensorBoard integration for their Proximal Policy Optimization (PPO) algorithms. It’s helpful to plot and visualize as much as possible in Reinforcement Learning (RL). Doing so can aid in debugging, lead to better insights, surface things to explore, and highlight problems. This post will summarize the default plots, try to provide explanations, and ask for help in trying to understand some of the metrics plotted.

Unity provides an explanation of its PPO implementation with TensorBoard: a sample image of the plots, an explanation for each plot (sometimes two), and what to look for. Below are the aggregated explanations, with the alternate explanations italicized.

• Lesson — Plots the progress from lesson to lesson. Only interesting when performing curriculum training.
• Cumulative Reward — The mean cumulative episode reward over all agents. Should increase during a successful training session. *The general trend in reward should consistently increase over time. Small ups and downs are to be expected. Depending on the complexity of the task, a significant increase in reward may not present itself until millions of steps into the training process.*
• Entropy — How random the decisions of the model are. Should slowly decrease during a successful training process. If it decreases too quickly, the `beta` hyperparameter should be increased. *This corresponds to how random the decisions of a Brain are. This should consistently decrease during training. If it decreases too soon or not at all, `beta` should be adjusted (when using discrete action space).*
• Episode Length — The mean length of each episode in the environment for all agents.
• Learning Rate — How large a step the training algorithm takes as it searches for the optimal policy. Should decrease over time. *This will decrease over time on a linear schedule.*
• Policy Loss — The mean magnitude of the policy loss function. Correlates to how much the policy (the process for deciding actions) is changing. The magnitude should decrease during a successful training session. *These values will oscillate during training. Generally they should be less than 1.0.*
• Value Estimate — The mean value estimate for all states visited by the agent. Should increase during a successful training session. *These values should increase as the cumulative reward increases. They correspond to how much future reward the agent predicts itself receiving at any given point.*
• Value Loss — The mean loss of the value function update. Correlates to how well the model is able to predict the value of each state. This should increase while the agent is learning, and then decrease once the reward stabilizes. *These values will increase as the reward increases, and then should decrease once reward becomes stable.*
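Two of these quantities are easy to compute by hand. As a minimal NumPy sketch (the probabilities and rewards below are made up for illustration, not from Unity's code), here is the entropy of a categorical policy and a discounted cumulative episode reward:

```python
import numpy as np

# Entropy of a categorical policy: high when the action distribution is
# near-uniform (random decisions), low when the policy is confident.
probs = np.array([0.7, 0.2, 0.1])  # hypothetical action probabilities
entropy = -np.sum(probs * np.log(probs))

# Discounted cumulative reward for one episode of one agent.
rewards = [1.0, 0.0, 0.5, 2.0]  # hypothetical per-step rewards
gamma = 0.99                    # discount factor
cumulative = sum(r * gamma**t for t, r in enumerate(rewards))
```

A uniform policy over 3 actions would give the maximum entropy `log(3) ≈ 1.10`; the confident policy above scores lower, which is the decrease you want to see over training.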

Unlike Unity, OpenAI Baselines does not provide a formal guide for these measurements. John Schulman’s excellent lecture addresses some of them, and the Baselines code addresses others. For the rest, the explanations below are my best understanding.

• approxkl — Approximate Kullback–Leibler divergence of the old policy from the new policy. I’m not sure how to diagnose this, or why an approximation is used instead of the exact KL. Perhaps this metric can be used in the same way Schulman recommended using KL.
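For reference, Baselines estimates this quantity from the sampled negative log-probabilities rather than the full action distributions. A NumPy sketch of that estimator (variable names mirror the TensorFlow code; the sampled values here are made up) looks like:

```python
import numpy as np

# Negative log-probabilities of the sampled actions under the old and
# new policies (hypothetical values for illustration).
OLDNEGLOGPAC = np.array([1.2, 0.8, 2.0])
neglogpac = np.array([1.1, 0.9, 1.7])

# approxkl = 0.5 * E[(log pi_old(a|s) - log pi_new(a|s))^2], a cheap
# sample-based approximation to the KL divergence between the policies.
approxkl = 0.5 * np.mean(np.square(neglogpac - OLDNEGLOGPAC))
```

One appeal of this form is that it only needs the per-sample log-probabilities already computed for the ratio, not a sum over the whole action space.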
• policy_loss — The loss of the policy-gradient part of PPO. This, along with the optional KL loss, the value function loss, and the entropy regularization, makes up the total PPO loss. Also see the Unity section above.
```python
# Calculate ratio (pi current policy / pi old policy)
ratio = tf.exp(OLDNEGLOGPAC - neglogpac)

# Defining Loss = -J is equivalent to max J
pg_losses = -ADV * ratio
pg_losses2 = -ADV * tf.clip_by_value(ratio, 1.0 - CLIPRANGE, 1.0 + CLIPRANGE)

# Final PG loss
pg_loss = tf.reduce_mean(tf.maximum(pg_losses, pg_losses2))
```
• clipfrac — Fraction of times the clip range hyperparameter is used. PPO clips the new policy within the clip range of the old policy, allowing stable learning. I’m not sure what to look for in plots of this (other than for guidance in choosing the clip range hyperparameter).
```python
# Calculate ratio (pi current policy / pi old policy)
ratio = tf.exp(OLDNEGLOGPAC - neglogpac)
clipfrac = tf.reduce_mean(tf.to_float(tf.greater(tf.abs(ratio - 1.0), CLIPRANGE)))
```
• fps — Frames (environment timesteps) processed per second, computed per batch.
```python
# Calculate the fps (frames per second)
fps = int(nbatch / (tnow - tstart))
```
• nupdates — The total number of policy updates: total timesteps divided by the batch size.
```python
nupdates = total_timesteps // nbatch
```
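Tying this back to the Learning Rate plot: Baselines anneals the learning rate once per update, and a minimal sketch of the linear schedule over `nupdates` (the initial rate of `3e-4` is an assumed hyperparameter, not from the post) is:

```python
total_timesteps = 1_000_000
nbatch = 2048  # assumed batch size (timesteps collected per update)
nupdates = total_timesteps // nbatch

lr_initial = 3e-4  # assumed starting learning rate

def lr_at(update):
    # Fraction of training remaining, decayed linearly toward zero;
    # updates are numbered starting from 1.
    frac = 1.0 - (update - 1.0) / nupdates
    return lr_initial * frac
```

Plotted over `update = 1 .. nupdates`, this produces exactly the straight downward line that the Learning Rate panel shows for a linear schedule.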