Cross Entropy and KL Divergence
If 1, 2, …, n represent a discrete set of possible events with probabilities p1, p2, …, pn, Shannon asked the following question:
“How uncertain are we about the outcome of an event drawn from the above probability distribution?”
Example: If we have a coin that lands heads with probability 1 and tails with probability 0, then there is no uncertainty associated with it. Uncertainty reaches its maximum when the probability of seeing heads or tails is 0.5.
Shannon wanted to quantify this uncertainty and formulated it as
H = -∑pi*log(pi) = ∑pi*log(1/pi)
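As a quick sketch (an illustration, not part of the original text), the entropy formula above can be computed directly; the coin example shows uncertainty is zero for a certain outcome and maximal for a fair coin:

```python
import math

def entropy(probs):
    """Shannon entropy H = sum(p * log2(1/p)), in bits.
    Terms with p == 0 contribute nothing (0 * log(1/0) is taken as 0)."""
    return sum(p * math.log2(1 / p) for p in probs if p > 0)

# A certain coin has no uncertainty; a fair coin is maximally uncertain.
print(entropy([1.0, 0.0]))  # 0.0
print(entropy([0.5, 0.5]))  # 1.0
```

The base of the logarithm only changes the unit (base 2 gives bits, natural log gives nats); the shape of the uncertainty curve is the same.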
KL (Kullback–Leibler) divergence measures how different a probability distribution p is from a distribution q, i.e. given a sample from probability distribution p, how likely it is that the sample came from distribution q.
It is given by
D_KL(p || q) = ∑pi*log(pi/qi)
Expanding the KL divergence gives
D_KL(p || q) = -∑pi*log(qi) - (-∑pi*log(pi))
The first term in the above derivation is referred to as the cross entropy, represented as H(p,q).
The cross entropy of p and q can therefore be summarized as the sum of the entropy (uncertainty) present in p and the extra cost, measured by the KL divergence, of treating samples of p as if they had been generated from distribution q: H(p,q) = H(p) + D_KL(p || q).
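The decomposition above is easy to verify numerically. The sketch below (illustrative code, not from the original text) computes cross entropy and entropy separately and recovers the KL divergence as their difference; note that KL is zero when p and q are identical:

```python
import math

def cross_entropy(p, q):
    """H(p, q) = -sum(p_i * log(q_i)), in nats."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def entropy(p):
    """H(p) = -sum(p_i * log(p_i)), in nats."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def kl_divergence(p, q):
    """D_KL(p || q) = H(p, q) - H(p)."""
    return cross_entropy(p, q) - entropy(p)

p = [0.4, 0.6]
q = [0.5, 0.5]
print(kl_divergence(p, p))  # identical distributions: divergence is 0
print(kl_divergence(p, q))  # different distributions: divergence is positive
```

Because H(p, p) equals H(p), the divergence of a distribution from itself is exactly zero, while any mismatch between p and q makes H(p, q) exceed H(p).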
So when we use a binary cross entropy loss over a training set of N samples, with true label yi and predicted probability ŷi for sample i, we represent it as
BCE = -(1/N) * ∑ [yi*log(ŷi) + (1 - yi)*log(1 - ŷi)]
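A minimal sketch of the binary cross entropy loss follows (the variable names `y_true` and `y_pred` are my own; the clipping constant is a common numerical safeguard, not something stated in the text):

```python
import math

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Mean binary cross entropy over N samples:
    -(1/N) * sum(y*log(yhat) + (1 - y)*log(1 - yhat))."""
    total = 0.0
    for y, yhat in zip(y_true, y_pred):
        # Clip predictions away from 0 and 1 to avoid log(0).
        yhat = min(max(yhat, eps), 1 - eps)
        total += y * math.log(yhat) + (1 - y) * math.log(1 - yhat)
    return -total / len(y_true)

y_true = [1, 0, 1, 1]
y_pred = [0.9, 0.1, 0.8, 0.7]
print(binary_cross_entropy(y_true, y_pred))
```

Confident, correct predictions (ŷ near the true label) drive the loss toward zero, while confident wrong predictions are penalized heavily, which is exactly the behaviour the cross entropy interpretation above predicts.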