…d environment are building an entirely different product than one that can operate on public roads. Operating in an environment without outliers doesn’t improve the software; it confines the self-driving car software to the simplest of use cases.
…nuous actions with policy gradient, broadening the application of RL to more tasks such as control. TRPO improves the performance of DDPG as it introduces a surrogate objective function and a KL divergence constraint, guaranteeing non-decreasing long-term reward. PPO further optimizes TRPO by modifying the surrogate objective function, which improves the perfo…
The idea behind TRPO’s constraint is to keep the policy from changing too much in a single update. Instead of imposing an explicit constraint, PPO slightly modifies TRPO’s objective function so that overly large policy updates are penalized.
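
To make the idea concrete, here is a minimal sketch of one of the two objective variants described in the PPO paper, the clipped surrogate (the other is an adaptive KL penalty). It is an illustration under stated assumptions, not a reference implementation: the function name `clipped_surrogate`, its argument names, and the default `clip_eps=0.2` are chosen here for the example, and the inputs are assumed to be per-sample log-probabilities and advantage estimates computed elsewhere.

```python
import math


def clipped_surrogate(logp_new, logp_old, advantages, clip_eps=0.2):
    """Mean clipped surrogate objective over a batch (to be maximized).

    logp_new / logp_old: log-probabilities of the taken actions under the
    updated and the previous policy (hypothetical inputs for this sketch).
    advantages: advantage estimates for the same samples.
    """
    total = 0.0
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        # Probability ratio pi_new(a|s) / pi_old(a|s).
        ratio = math.exp(lp_new - lp_old)
        # Keep the ratio inside [1 - eps, 1 + eps].
        clipped = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
        # Take the more pessimistic of the two terms, so the objective
        # gains nothing from moving the ratio beyond the clip range.
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)


if __name__ == "__main__":
    # The new policy doubles the action probability; advantage is positive.
    print(clipped_surrogate([math.log(2.0)], [0.0], [1.0]))
```

With a positive advantage, raising the ratio above `1 + clip_eps` yields no further gain, so the optimizer has no incentive to push the policy far from the previous one. With a negative advantage the unclipped term is kept whenever it is lower, so a harmful large step is still fully penalized.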