I am confused about the entropy term too.
Say you have two outputs. Then, based on `-tf.reduce_sum(self.policy * tf.log(self.policy))`:
(0.1, 0.9): -(0.1*ln(0.1) + 0.9*ln(0.9)) ≈ 0.33 (low entropy, good)
(0.5, 0.5): -(0.5*ln(0.5) + 0.5*ln(0.5)) ≈ 0.69 (high entropy, bad)
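Just to double-check the arithmetic, here is a small NumPy sketch of the same entropy expression (a stand-in for the TF graph, not the original code):

```python
import numpy as np

def entropy(policy):
    """Shannon entropy of a discrete policy, mirroring
    -tf.reduce_sum(policy * tf.log(policy))."""
    p = np.asarray(policy, dtype=float)
    return float(-np.sum(p * np.log(p)))

print(entropy([0.1, 0.9]))  # ~0.325
print(entropy([0.5, 0.5]))  # ~0.693, i.e. ln(2), the maximum for two actions
```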
However, looking at the loss function, the negative sign on the entropy term implies that a larger entropy minimises the loss, i.e. the network would be pushed towards (0.5, 0.5):
loss = 0.5 * self.value_loss + self.policy_loss - self.entropy * 0.01
I would have expected it to be:
loss = 0.5 * self.value_loss + self.policy_loss + self.entropy * 0.01
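To spell out the effect of the sign numerically, here is a sketch where I freeze `value_loss` and `policy_loss` at dummy values (my simplification, purely to isolate the entropy term):

```python
import numpy as np

def entropy(policy):
    # Mirrors -tf.reduce_sum(policy * tf.log(policy))
    p = np.asarray(policy, dtype=float)
    return float(-np.sum(p * np.log(p)))

def loss(policy, sign, value_loss=1.0, policy_loss=1.0, beta=0.01):
    # value_loss and policy_loss are held fixed at dummy values so that
    # only the entropy term differs between the two policies.
    return 0.5 * value_loss + policy_loss + sign * beta * entropy(policy)

# With the minus sign (as in the code), the uniform policy (0.5, 0.5)
# gives the smaller loss, so minimisation favours high entropy:
print(loss([0.1, 0.9], sign=-1) > loss([0.5, 0.5], sign=-1))  # True
# With a plus sign, the confident policy (0.1, 0.9) would win instead:
print(loss([0.1, 0.9], sign=+1) < loss([0.5, 0.5], sign=+1))  # True
```

So if low entropy were the goal, the plus sign would indeed be the one to use; the minus sign deliberately rewards high entropy.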