Understanding Entropy, Cross-Entropy and Cross-Entropy Loss
Cross-entropy loss is one of the most widely used loss functions in deep learning, and this almighty loss function rides on the concept of cross-entropy. When I started using this loss function, it was hard for me to get the intuition behind it. After Googling a bit and digesting the concepts I found in different sources, I reached a satisfactory understanding, and I would like to share it in this article.
In order to develop a complete understanding, we need to cover the concepts in the following order: Surprisal, Entropy, Cross-Entropy, Cross-Entropy Loss.
Surprisal:
“Degree to which you are surprised to see the result”
Now it's easy to digest my words when I say that I will be more surprised to see an outcome with low probability than an outcome with high probability. If yi is the probability of the i-th outcome, then we could represent its surprisal (s) as:
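Using the standard information-theoretic convention of a base-2 logarithm (which measures surprisal in bits; any base works up to a constant factor):

```latex
s_i = \log_2\!\frac{1}{y_i} = -\log_2 y_i
```

The log of the reciprocal captures the intuition: an outcome with probability 1 carries zero surprisal, while surprisal grows without bound as yi approaches 0.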
Entropy:
Since I know the surprisal for individual outcomes, I would like to know the expected surprisal over the whole event. It would be intuitive to take a weighted average of the surprisals. Now the question is what weights to choose. Hmmm…since I know the probability of each outcome, taking the probabilities as weights makes sense, because that is how likely each outcome is to occur. This weighted average of surprisal is nothing but entropy (e), and if there are n outcomes it could be written as:
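Weighting each outcome's surprisal by its probability yi gives:

```latex
e = \sum_{i=1}^{n} y_i \log_2\!\frac{1}{y_i} = -\sum_{i=1}^{n} y_i \log_2 y_i
```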
Cross-Entropy:
Now, what if each outcome's actual probability is pi, but someone is estimating the probability as qi? In this case, each outcome will actually occur with probability pi, but its surprisal will be computed from qi (since that person will be surprised thinking the probability of the outcome is qi). The weighted average surprisal in this case is nothing but cross-entropy (c), and it could be scribbled as:
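Keeping the same structure as the entropy formula, but with the true probabilities pi as weights and the estimated probabilities qi inside the surprisal:

```latex
c = \sum_{i=1}^{n} p_i \log_2\!\frac{1}{q_i} = -\sum_{i=1}^{n} p_i \log_2 q_i
```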
Cross-entropy is always greater than or equal to entropy, and the two are equal only when pi = qi for every i. You could digest the last sentence after seeing a really nice plot on desmos.com.
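You can also verify this numerically. Here is a minimal sketch in plain Python (the example distributions are made up for illustration):

```python
import math

def entropy(p):
    """Entropy: average surprisal when the true probabilities p
    are also the ones used to compute surprisal."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """Cross-entropy: outcomes occur with probabilities p, but
    surprisal is computed from the estimated probabilities q."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.7, 0.2, 0.1]  # actual distribution
q = [0.5, 0.3, 0.2]  # someone's estimate

print(entropy(p))           # ≈ 1.1568 bits
print(cross_entropy(p, q))  # larger than entropy(p), since q != p
print(cross_entropy(p, p))  # equals entropy(p) exactly
```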
Cross-Entropy Loss:
In the plot I mentioned above, you will notice that as the estimated probability distribution moves away from the actual/desired probability distribution, cross-entropy increases, and vice versa. Hence, we could say that minimizing cross-entropy will move us closer to the actual/desired distribution, and that is exactly what we want. This is why we try to reduce cross-entropy, so that our predicted probability distribution ends up close to the actual one. Hence, we get the formula of cross-entropy loss as:
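With p the true distribution and q the model's predicted distribution (deep-learning frameworks conventionally use the natural log here rather than log base 2, which only rescales the loss):

```latex
L_{CE} = -\sum_{i=1}^{n} p_i \log q_i
```

When p is one-hot, as in classification with hard labels, this reduces to the negative log-probability the model assigns to the correct class.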
And in the case of a binary classification problem, where we have only two classes, we call it binary cross-entropy loss, and the above formula becomes:
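Writing y for the true label (0 or 1) and ŷ for the predicted probability of the positive class, which is the usual convention in place of p and q, the sum over the two classes collapses to:

```latex
L_{BCE} = -\bigl(y \log \hat{y} + (1 - y)\log(1 - \hat{y})\bigr)
```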