The advantage is fed in separately because we don’t want gradients from the advantage term to change the value estimate; they should only affect the policy probabilities. This implementation also uses “generalized advantage estimation”, where the advantages are built from rewards and value estimates weighted across the steps of a rollout, and as such can’t be taken directly from the network output.
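
To make the weighting concrete, here is a minimal sketch of how generalized advantage estimates might be computed outside the network. It is not the code from this implementation: the function name `generalized_advantages` and the parameters `bootstrap_value`, `gamma` and `lam` are assumptions chosen for illustration, and the default values are just common choices.

```python
def generalized_advantages(rewards, values, bootstrap_value, gamma=0.99, lam=0.95):
    """Return one advantage per step of a single rollout (illustrative sketch).

    rewards[t]      -- reward received at step t
    values[t]       -- critic's estimate V(s_t), already evaluated to a plain float
    bootstrap_value -- V of the state after the last step (0.0 if the episode ended)
    gamma, lam      -- discount factor and GAE weighting parameter (assumed defaults)
    """
    # V(s_{t+1}) for every step, using the bootstrap value after the final step.
    next_values = list(values[1:]) + [bootstrap_value]
    advantages = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step can reuse the accumulated sum from the step after it.
    for t in reversed(range(len(rewards))):
        # One-step TD error: r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * next_values[t] - values[t]
        # Exponentially weighted sum of the current and all later TD errors.
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages
```

With `lam=1.0` this reduces to the discounted return minus the value estimate, and with `lam=0.0` it is just the one-step TD error. Because the result is a list of plain numbers that is then fed to the policy loss as an input, no gradient can flow back through it into the value estimate.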

