Trust Region Policy Optimization (TRPO) and Proximal Policy Optimization (PPO)

Sanket Gujar
6 min read · Apr 21, 2018

Policy gradient methods are fundamental to using neural networks for control, but they are very sensitive to the choice of step size: too small and progress is painfully slow, too large and a single noisy update can push the policy into a poorly performing region, making training unstable and hard to recover.
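For reference, a rough sketch of the vanilla policy gradient update in standard notation (α, J, π_θ and Â_t are the usual learning rate, expected return, policy and advantage estimate; they are not defined in this excerpt):

```latex
% Gradient ascent step of size \alpha on the expected return J(\theta),
% using the standard likelihood-ratio estimate of its gradient.
\theta_{k+1} = \theta_k + \alpha\,\hat{g}_k,
\qquad
\hat{g}_k = \hat{\mathbb{E}}_t\!\left[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\,\hat{A}_t\right]\Big|_{\theta = \theta_k}
```

Everything hinges on α: the gradient is only a local, noisy estimate, so a step that is too large moves the policy into a region where it collects poor data, and the next batch is then sampled from that degraded policy.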

Policy gradients also have very poor sample efficiency, taking a huge number of steps to make progress. The most common approaches to this problem are TRPO, ACER and PPO, all of which constrain or otherwise control the size of each policy update (a sketch of PPO's clipped objective follows the list below).

  • ACER requires additional code for off-policy corrections and a replay buffer; it is considerably more complicated than PPO yet does only marginally better.
  • TRPO is useful for continuous control tasks, but it isn’t easily compatible with algorithms that share parameters between a policy and a value function (as in tasks where visual input is significant).
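To make “constraining the size of a policy update” concrete, here is a minimal sketch of PPO’s clipped surrogate loss, assuming PyTorch; the function and argument names are illustrative, not from this article:

```python
import torch

def ppo_clipped_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """Negative PPO clipped surrogate objective (illustrative sketch)."""
    # Probability ratio r(theta) = pi_theta(a|s) / pi_theta_old(a|s),
    # computed from log-probabilities for numerical stability.
    ratio = torch.exp(logp_new - logp_old)
    # Unclipped surrogate vs. surrogate with the ratio clipped to [1 - eps, 1 + eps].
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # PPO maximizes the element-wise minimum; return its negative as a loss.
    return -torch.min(unclipped, clipped).mean()
```

Clipping removes any incentive for the ratio to move outside [1 − ε, 1 + ε], which is how PPO keeps each update small without TRPO’s second-order machinery.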


So what do Policy Gradients do?

Policy gradient methods solve an optimization problem that allows you to make a small update to the policy based on data sampled from that same policy (on-policy data).
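One standard way to write that “small update from on-policy data” is the constrained surrogate problem that TRPO solves (standard formulation; δ is the trust-region size and Â_t the advantage estimate, neither defined in this excerpt):

```latex
% Maximize the importance-sampled surrogate while keeping the new policy
% within a KL ball of radius \delta around the data-collecting policy.
\max_{\theta}\;
\hat{\mathbb{E}}_t\!\left[
  \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}\,\hat{A}_t
\right]
\quad \text{subject to} \quad
\hat{\mathbb{E}}_t\!\left[
  \mathrm{KL}\!\left(\pi_{\theta_{\mathrm{old}}}(\cdot \mid s_t)\,\big\|\,\pi_\theta(\cdot \mid s_t)\right)
\right] \le \delta
```

The KL constraint is what keeps the new policy close to the one that generated the data, so the importance-sampled objective remains a trustworthy estimate.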

