Deep Reinforcement Learning-PG 13.0


  1. To learn a parameterized policy that can select actions without consulting a value function.
  2. We want to learn policy weight vector


Gradient of some performance measure

These method seek to maximize performance, so their updates approximate gradient ascent in eta:

PG 法主要是要最大化效能,所以用梯度上升法 gradient ascent

All methods that follow this general schema called policy gradient methods.


To learn both policy and value functions are called actor-critic method.

( the learned policy - the learned value function = actor — critic )


For the episodic case, performance is defined as the value of start state under the parameterized policy:

For the continuous case, performance is defined as the average reward rate: