Thanks for the implementations, good work. But I don't understand why VPG gets better performance than PPO or TRPO. I thought VPG is a plain policy-gradient algorithm while the other two are actor-critic algorithms, so it should be PPO or TRPO that have the better convergence properties.