Hi Hanyoung,

Arthur Juliani

Hi,

I am confused about the entropy term too.

Say you have two outputs. Then, based on the entropy definition -tf.reduce_sum(self.policy * tf.log(self.policy)):

(0.1,0.9): -(0.1*ln(0.1)+0.9*ln(0.9)) ≈ 0.33 (good)

(0.5,0.5): -(0.5*ln(0.5)+0.5*ln(0.5)) ≈ 0.69 (bad)
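
To double-check the arithmetic, here is a minimal sketch in plain Python rather than TensorFlow. The helper name entropy is my own for illustration; it computes the same -sum(p * ln(p)) quantity as the expression above, in nats.

```python
import math

def entropy(probs):
    """Shannon entropy in nats: -sum(p * ln(p)) over the action probabilities."""
    # Skip zero probabilities, since p * ln(p) tends to 0 as p tends to 0.
    return -sum(p * math.log(p) for p in probs if p > 0)

peaked = entropy([0.1, 0.9])    # confident policy
uniform = entropy([0.5, 0.5])   # maximally uncertain policy

print(round(peaked, 2), round(uniform, 2))  # 0.33 0.69
```

So the uniform policy has the larger entropy, which matches the numbers above.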

However, the loss function implies that a larger entropy term lowers the loss, because entropy enters with a negative sign (i.e., the network would optimise towards (0.5,0.5)):

loss = 0.5 * self.value_loss + self.policy_loss - self.entropy * 0.01

I would have thought it would have been:

loss = 0.5 * self.value_loss + self.policy_loss + self.entropy * 0.01
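
To make the difference between the two sign conventions concrete, here is a rough sketch that holds the other terms fixed and varies only the entropy. The function total_loss, its entropy_sign parameter, and the placeholder values of 1.0 for the value and policy losses are all hypothetical, chosen only for illustration. In a real network the policy loss also changes with the policy, so this isolates the entropy term and nothing more.

```python
import math

def entropy(probs):
    """Shannon entropy in nats: -sum(p * ln(p))."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def total_loss(value_loss, policy_loss, ent, entropy_sign, beta=0.01):
    """Combined loss; entropy_sign=-1 is the quoted form, +1 the alternative."""
    return 0.5 * value_loss + policy_loss + entropy_sign * beta * ent

# Hypothetical placeholder losses, held constant so only entropy varies.
VALUE_LOSS, POLICY_LOSS = 1.0, 1.0

for probs in ([0.1, 0.9], [0.5, 0.5]):
    ent = entropy(probs)
    minus = total_loss(VALUE_LOSS, POLICY_LOSS, ent, entropy_sign=-1)
    plus = total_loss(VALUE_LOSS, POLICY_LOSS, ent, entropy_sign=+1)
    print(probs, "minus:", round(minus, 4), "plus:", round(plus, 4))
```

With the minus sign, (0.5,0.5) gives the lower total loss; with the plus sign, (0.1,0.9) does. That is the behaviour the question is about.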