Your code may not work because the loss function you are using is no longer dependent on the output of the network, but rather on the placeholders you feed it. Without using the “probability” tensor, Tensorflow won’t be able to send the gradients through the network to properly update it.
The purpose of the fake labels is to ensure that the credit is assigned for the action that is actually taken. Since there are only two actions, we know that the opposite action would have been better if there was a negative reward, and the mask allows us to take that into account.
Hope that is helpful,