Trust Region Policy Optimization (TRPO) and Proximal Policy Optimization (PPO)
Policy gradient methods are fundamental to using neural networks for control, but they are very sensitive to the choice of step size: if it is too small, progress is painfully slow; if it is too large, the update is overwhelmed by noise and performance can collapse.
Policy gradient methods also have very poor sample efficiency, taking a huge number of environment steps to optimize. The most common approaches to this problem are TRPO, ACER and PPO, which constrain or otherwise control the size of each policy update.
- ACER requires additional machinery for off-policy corrections and a replay buffer, making it more complicated than PPO while performing only marginally better.
- TRPO is useful for continuous control tasks but isn't easily compatible with algorithms that share parameters between a policy and a value function (as is common when visual input is significant), as discussed in the sketch below.
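To make the contrast concrete, here is a minimal sketch of the PPO clipped surrogate objective, assuming PyTorch; names like `clip_eps` and `advantages` are illustrative rather than taken from any particular library.

```python
import torch

def ppo_clip_loss(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    # Probability ratio r_t = pi_new(a|s) / pi_old(a|s), computed in log space.
    ratio = torch.exp(new_log_probs - old_log_probs)
    # Unclipped surrogate objective.
    surr1 = ratio * advantages
    # Clipped surrogate: the ratio is not allowed to leave [1 - eps, 1 + eps].
    surr2 = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Take the pessimistic minimum of the two, negated because optimizers minimize.
    return -torch.min(surr1, surr2).mean()
```

The clipping is what keeps each policy update small without the second-order machinery TRPO uses to enforce a trust region.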
So what do policy gradient methods actually do?
They solve an optimization problem that makes a small update to the policy based on data sampled from that same policy (on-policy data).
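As a rough illustration of that update, here is a vanilla policy gradient (REINFORCE-style) step, again assuming PyTorch; `policy`, `returns` and the discrete-action setup are assumptions for the sketch, not part of the original text.

```python
import torch

def policy_gradient_step(policy, optimizer, states, actions, returns):
    # Log-probabilities of the actions actually taken, under the current
    # policy (the same policy that generated the data: on-policy).
    logits = policy(states)
    log_probs = torch.distributions.Categorical(logits=logits).log_prob(actions)
    # Policy gradient loss: increase the log-probability of actions in
    # proportion to the return they achieved.
    loss = -(log_probs * returns).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()  # one gradient step; the step size is what TRPO/PPO try to control
    return loss.item()
```

Because each step uses fresh on-policy data, a single badly sized step both wastes that data and can damage the policy, which is exactly the problem the constrained updates above are meant to address.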