Using Deep Policy Gradient to play Pong — Part 5

1. Pre-process the observations (input images).

2. Build the policy network and sample an action from it (every action).

3. Prepare for backprop by caching intermediate values (every action).

4. Backprop (every game, i.e. episode).
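The four steps above can be sketched in code. This is a minimal sketch, not the exact implementation: the 80x80 preprocessing, the 2-layer network with a 200-unit hidden layer, and the UP/DOWN action codes (2 and 3) are assumptions following the common Pong-from-pixels setup.

```python
import numpy as np

H, D = 200, 80 * 80  # assumed hidden size and flattened input size
rng = np.random.default_rng(0)
model = {
    "W1": rng.standard_normal((H, D)) / np.sqrt(D),  # input -> hidden weights
    "W2": rng.standard_normal(H) / np.sqrt(H),       # hidden -> output weights
}

def prepro(frame):
    """Step 1: crop, downsample by 2, erase background, binarize, flatten."""
    frame = frame[35:195]        # crop to the playing field (assumed Atari layout)
    frame = frame[::2, ::2, 0]   # downsample by 2, keep one color channel
    frame[frame == 144] = 0      # erase background type 1 (assumed pixel value)
    frame[frame == 109] = 0      # erase background type 2 (assumed pixel value)
    frame[frame != 0] = 1        # paddles and ball become 1
    return frame.astype(np.float64).ravel()

def policy_forward(x):
    """Step 2: compute p(UP | I) with a 2-layer net; return h for backprop (step 3)."""
    h = model["W1"] @ x
    h[h < 0] = 0                           # ReLU nonlinearity
    logit = model["W2"] @ h
    p_up = 1.0 / (1.0 + np.exp(-logit))    # sigmoid -> probability of UP
    return p_up, h

# One time step on a dummy frame (a real frame would come from the emulator):
x = prepro(rng.integers(0, 256, size=(210, 160, 3)))
p_up, h = policy_forward(x)
action = 2 if rng.random() < p_up else 3   # sample UP (2) or DOWN (3)
```

The cached `x` and `h` are exactly the "prepare for backprop" bookkeeping of step 3; step 4 would use them once the episode's rewards are known.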

  • f(x): the score function (here, the reward obtained)
  • p(x): the policy network, p(a|I), a distribution over actions for any input image I

The question is: how do we change the network’s parameters so that sampled actions get higher rewards?
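One standard answer, written in terms of the f(x) and p(x) defined above, is the score-function (REINFORCE) gradient estimator:

```latex
\nabla_\theta \, \mathbb{E}_{x \sim p(x;\theta)}\left[ f(x) \right]
  = \mathbb{E}_{x \sim p(x;\theta)}\left[ f(x) \, \nabla_\theta \log p(x;\theta) \right]
```

Intuitively: for each sampled action, take the gradient that would make that action more likely, then scale it by the score f(x). Actions followed by high reward are reinforced; actions followed by low or negative reward are discouraged.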

Policy backward
1. ravel(): flattens a 2-dim array to 1-dim
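A quick illustration of `ravel()` flattening a 2-D array, as used when collapsing a gradient matrix into a flat vector during the backward pass:

```python
import numpy as np

dW = np.arange(6).reshape(2, 3)  # a 2x3 matrix standing in for a gradient
flat = dW.ravel()                # 1-D view of the same data, shape (6,)
# flat -> array([0, 1, 2, 3, 4, 5])
```

Note that `ravel()` returns a view when possible (no copy), unlike `flatten()`, which always copies.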