Hi Arthur, first of all, thanks for your fantastic work!
Larry Guo

Hi Larry,

  1. input_y - probability determines the direction in which the gradient moves, depending on the action taken. Multiplying by the advantage (* advantage) then scales the loss according to whether the reward was positive or negative.
  2. This line collects the gradients for all the trainable variables. In this case there are only two (W1 and W2). We accumulate these gradients, and once we have enough traces we send them back into the network to update it.
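
As a rough illustration of both points, here is a minimal pure-Python sketch. It assumes the log-likelihood has the form log(y*(y - p) + (1 - y)*(y + p)), which is an assumption about the surrounding code rather than something stated in this thread. The names loglik, loss, and GradBuffer are hypothetical, and W1 and W2 are treated as plain scalars instead of weight matrices, so this is not the actual network code.

```python
import math


def loglik(input_y, probability):
    # Assumed form: for input_y = 1 this reduces to log(1 - probability),
    # for input_y = 0 it reduces to log(probability).
    return math.log(input_y * (input_y - probability)
                    + (1 - input_y) * (input_y + probability))


def loss(input_ys, probabilities, advantages):
    # Each log-likelihood term is scaled by its advantage, so a negative
    # advantage flips the sign of that term's contribution to the loss.
    terms = [loglik(y, p) * a
             for y, p, a in zip(input_ys, probabilities, advantages)]
    return -sum(terms) / len(terms)


class GradBuffer:
    """Hypothetical helper: sums per-variable gradients across episodes."""

    def __init__(self, names):
        self.totals = {name: 0.0 for name in names}
        self.count = 0

    def add(self, grads):
        # Accumulate one episode's gradients instead of applying them at once.
        for name, grad in grads.items():
            self.totals[name] += grad
        self.count += 1

    def flush(self):
        # Return the accumulated gradients and reset the buffer to zero.
        accumulated = dict(self.totals)
        for name in self.totals:
            self.totals[name] = 0.0
        self.count = 0
        return accumulated
```

In use, you would call add() once per episode with that episode's gradients for W1 and W2, and only when count reaches your chosen batch size would you call flush() and apply the summed gradients to the weights.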

Hope that makes sense!