Using Deep Policy Gradient to play Pong — Part 5
1. Pre-process the observations (input images).
2. Build the policy network and sample an action (every action).
3. Prepare for backprop (every action).
4. Backprop (every game, i.e. every episode).
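Steps 1 and 2 can be sketched in NumPy. This is a minimal illustration, not the full implementation: the network shape (one hidden layer `W1`/`W2` with 200 units), the 80x80 downsampled input, and the background pixel values (144, 109) follow the common Karpathy-style Pong setup and are assumptions here; the random frame stands in for a real emulator observation.

```python
import numpy as np

H, D = 200, 80 * 80          # hidden units; flattened 80x80 input (assumed sizes)
rng = np.random.default_rng(0)
model = {
    "W1": rng.standard_normal((H, D)) / np.sqrt(D),  # scaled random init
    "W2": rng.standard_normal(H) / np.sqrt(H),
}

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def prepro(frame):
    """Step 1: crop, downsample, binarize, and flatten a 210x160x3 frame."""
    frame = frame[35:195]        # crop to the play area
    frame = frame[::2, ::2, 0]   # downsample by 2, keep one color channel
    frame = frame.copy()
    frame[frame == 144] = 0      # erase one background color (assumed value)
    frame[frame == 109] = 0      # erase the other background color
    frame[frame != 0] = 1        # paddles and ball become 1
    return frame.astype(np.float64).ravel()

def policy_forward(x):
    """Step 2: forward pass; returns P(action = UP) and the hidden state."""
    h = np.maximum(0, model["W1"] @ x)   # ReLU hidden layer
    p = sigmoid(model["W2"] @ h)
    return p, h

# Fake observation in place of a real emulator frame:
frame = rng.integers(0, 256, size=(210, 160, 3)).astype(np.uint8)
x = prepro(frame)
p, h = policy_forward(x)
action = 2 if rng.random() < p else 3   # sample UP (2) or DOWN (3)
# Step 3 would record x, h, and the action taken, for use in the backprop pass.
```

The binary action (UP/DOWN) makes the policy a single Bernoulli output, which keeps the sampling and the later gradient bookkeeping simple.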
- f(x): the score function (the reward obtained for a sample x).
- p(x): the policy network, p(a|I), a distribution over actions a for any input image I.
The question is: how do we change the network's parameters so that sampled actions get higher rewards?
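The standard answer is the score-function (REINFORCE) gradient identity, which ties together the f(x) and p(x) defined above: the gradient of the expected score is an expectation we can estimate from samples.

```latex
\nabla_\theta \, \mathbb{E}_{x \sim p(x;\theta)}\!\left[ f(x) \right]
  = \mathbb{E}_{x \sim p(x;\theta)}\!\left[ f(x)\, \nabla_\theta \log p(x;\theta) \right]
```

Intuitively: nudge the parameters to make high-reward samples more probable and low-reward samples less probable, weighting each log-probability gradient by its score f(x).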
- ravel(): flattens a 2-D array into a 1-D array.
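A quick NumPy example of what `ravel()` does (here on a small 2x3 array; in the Pong code it flattens the 80x80 pre-processed frame into the network's input vector):

```python
import numpy as np

a = np.arange(6).reshape(2, 3)   # 2-D array: [[0, 1, 2], [3, 4, 5]]
flat = a.ravel()                 # 1-D array in row-major order
print(flat)                      # [0 1 2 3 4 5]
print(flat.shape)                # (6,)
```

Note that `ravel()` returns a view when it can, so it avoids copying the frame data.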