Hi Arthur,
Pengcheng Xu


The advantage is fed in separately because we don’t want gradients from the policy loss to flow back through the advantage and change the value estimate. We only want them to affect the policy probabilities. This implementation also uses “generalized advantage estimation,” where the advantage is built from rewards and value estimates weighted in a specific way across the rollout, so it can’t be taken directly from the network output.
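To make that a bit more concrete, here is a rough sketch of how such an advantage could be computed outside the network and then passed in as plain numbers. This is not the exact code from the implementation: the function names, the `lam` parameter, and the default values are my own assumptions for illustration.

```python
def discounted_cumsum(xs, discount):
    """Return y where y[t] = xs[t] + discount * y[t + 1], computed back to front."""
    out = [0.0] * len(xs)
    running = 0.0
    for t in reversed(range(len(xs))):
        running = xs[t] + discount * running
        out[t] = running
    return out


def gae_advantages(rewards, values, bootstrap_value, gamma=0.99, lam=1.0):
    """Generalized advantage estimate for one rollout (hypothetical helper).

    rewards[t] and values[t] are plain floats collected during the rollout;
    bootstrap_value is the value estimate for the state after the last step
    (0.0 if the episode ended).
    """
    values_ext = list(values) + [bootstrap_value]
    # One-step TD residuals: r_t + gamma * V(s_{t+1}) - V(s_t)
    deltas = [
        rewards[t] + gamma * values_ext[t + 1] - values_ext[t]
        for t in range(len(rewards))
    ]
    # Exponentially weighted sum of the residuals with factor gamma * lam
    return discounted_cumsum(deltas, gamma * lam)


# The result is a list of ordinary numbers, so when it is fed to the policy
# loss as an input it acts as a constant: no gradient reaches the value head.
advantages = gae_advantages([1.0, 1.0, 1.0], [0.5, 0.5, 0.5], 0.0, gamma=1.0, lam=1.0)
```

With `lam=1.0` this reduces to the discounted return minus the value estimate, and with `lam=0.0` it reduces to the one-step TD error.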

I hope that makes things a little clearer.
