Hi Arthur,
Apollo Lee

Hi Apollo,

Your code most likely fails because its loss function no longer depends on the output of the network, only on the placeholders you feed it. Unless the loss is computed from the “probability” tensor, TensorFlow has no path along which to send gradients back through the network, so the weights never get updated.
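
To make that concrete, here is a minimal sketch in plain Python rather than TensorFlow. It assumes a toy one-weight "network" and estimates gradients by finite differences; the function names are made up for illustration and are not taken from your code or mine. A loss built only from fed-in numbers has zero gradient with respect to the weight, while a loss built from the network's own output does not.

```python
import math

def policy_probability(weight, observation):
    """Toy one-weight 'network': probability of choosing action 1."""
    return 1.0 / (1.0 + math.exp(-weight * observation))

def loss_from_placeholders(weight, fed_probability, advantage):
    # Uses only fed-in numbers; `weight` never enters the computation.
    return -math.log(fed_probability) * advantage

def loss_from_network(weight, observation, advantage):
    # Uses the network's own output, so the loss changes with `weight`.
    return -math.log(policy_probability(weight, observation)) * advantage

def numerical_gradient(loss_fn, weight, eps=1e-6):
    """Central finite-difference estimate of d(loss)/d(weight)."""
    return (loss_fn(weight + eps) - loss_fn(weight - eps)) / (2 * eps)

weight, observation, advantage = 0.5, 2.0, 1.0
# Once a value is fed in from outside, it is a constant as far as the loss is concerned.
fed_probability = policy_probability(weight, observation)

grad_disconnected = numerical_gradient(
    lambda w: loss_from_placeholders(w, fed_probability, advantage), weight)
grad_connected = numerical_gradient(
    lambda w: loss_from_network(w, observation, advantage), weight)

print("gradient via placeholders:", grad_disconnected)
print("gradient via network output:", grad_connected)
```

In TensorFlow the disconnected case usually shows up as a gradient of None for the network variables, which is why the optimizer has nothing to apply.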

The purpose of the fake labels is to ensure that credit is assigned to the action that was actually taken. Since there are only two actions, we know that if the reward was negative, the opposite action would have been better, and the mask allows us to take that into account.
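
A rough sketch of how such a mask can work, again in plain Python. It assumes the network outputs the probability of action 1 and that the fake label is 1 when action 0 was taken and 0 when action 1 was taken; that convention and the function names are assumptions for illustration. The mask picks out the log-probability of whichever action was taken, and the sign of the advantage decides whether minimizing the loss pushes toward that action or toward the opposite one.

```python
import math

def log_likelihood_of_taken_action(prob_action_1, fake_label):
    # Assumed convention: fake_label == 1 means action 0 was taken,
    # fake_label == 0 means action 1 was taken.
    # fake_label == 1 -> log(1 - prob_action_1), i.e. log P(action 0)
    # fake_label == 0 -> log(prob_action_1),     i.e. log P(action 1)
    masked = (fake_label * (fake_label - prob_action_1)
              + (1 - fake_label) * (fake_label + prob_action_1))
    return math.log(masked)

def policy_loss(prob_action_1, fake_label, advantage):
    # Weight the taken action's log-likelihood by how well things went.
    return -log_likelihood_of_taken_action(prob_action_1, fake_label) * advantage

# Action 1 was taken (fake_label = 0).
# Positive advantage: raising P(action 1) lowers the loss.
print(policy_loss(0.6, 0, 1.0), policy_loss(0.8, 0, 1.0))
# Negative advantage: raising P(action 1) raises the loss,
# so minimizing it shifts probability toward the opposite action.
print(policy_loss(0.6, 0, -1.0), policy_loss(0.8, 0, -1.0))
```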

Hope that is helpful,

