Deep Reinforcement Learning-PG 13.0
- To learn a parameterized policy that can select actions without consulting a value function.
- We want to learn policy weight vector
Gradient of some performance measure
These method seek to maximize performance, so their updates approximate gradient ascent in eta:
PG 法主要是要最大化效能，所以用梯度上升法 gradient ascent
All methods that follow this general schema called policy gradient methods.
To learn both policy and value functions are called actor-critic method.
( the learned policy - the learned value function = actor — critic )
For the episodic case, performance is defined as the value of start state under the parameterized policy:
For the continuous case, performance is defined as the average reward rate: