An Intuitive Guide To Cross Entropy

Understanding the role of cross entropy in ML applications

Aayush Agarwal
7 min read · Feb 9, 2024

Prerequisite: The previous article in this series An Intuitive Guide To Entropy.

Recall that in the previous article, we discussed a coin-tossing scenario with two outcomes — heads (H) and tails (T) and their corresponding probabilities and ‘surprise’:

|  X   |  Pr(X) |  Surprise S(X)  |
|------|--------|-----------------|
| H | p | log(1/p) |
| T | 1 - p | log(1/(1-p)) |

We established that the inherent randomness or chaos in this system is called its entropy.

Entropy = - { p * log(p)   + (1-p) * log(1-p)}

Further, we discussed how entropy would vary for different values of p in the coin-toss example.

|   p   |  Entropy |
|-------|----------|
| 0 | 0 |
| 0.1 | 0.325 |
| 0.5 | 0.693 |
| 0.9 | 0.325 |
| 1 | 0 |
Entropy vs. p
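
If you prefer to see this in code, here is a tiny Python sketch (my own, not part of the original derivation) that reproduces the table above. It uses natural log, which is what the numbers here assume.

```python
import math

def coin_entropy(p):
    """Entropy (in nats) of a coin that shows heads with probability p."""
    if p in (0.0, 1.0):
        return 0.0              # a certain outcome carries no surprise at all
    return -(p * math.log(p) + (1 - p) * math.log(1 - p))

for p in (0, 0.1, 0.5, 0.9, 1):
    print(p, round(coin_entropy(p), 3))
# prints 0.0, 0.325, 0.693, 0.325, 0.0 respectively
```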

Finally we developed an intuition for why entropy is highest when outcome probabilities are equally distributed (p = 0.5 in the coin-tossing example and p = 0.25 in the book location example). This is the most ‘chaotic’ scenario.

If any of this is not clear, I urge you to read An Intuitive Guide To Entropy before proceeding further.

In addition, note that this entropy or randomness is inherent to the system and does not depend on the observer. Since the outcome of the coin-toss is non-deterministic, we can never be sure what the outcome will be. So after every toss, we experience some surprise. This surprise is a consequence of the non-determinism of the coin toss and does not originate due to oversight on our end.

With these entropy concepts in mind, let us extend the example to develop a similar intuition for cross-entropy.

In reality, we rarely know the true probability distribution over all outcomes.

Imagine a scenario where we’re given a coin with Pr(H) = p = 0.9. However, we do not know anything about this coin, at least at first. So we toss the coin a few times to learn about it. Let us use q to denote how likely we think H is. In other words, we believe the coin will show up heads with probability q. When the coin does show up heads, we register a surprise of log(1/q). Consistent with our intuition, the less likely we thought H to be, the more surprise we register when H actually does happen. Likewise, the probability and surprise values for tails (T) are 1-q and log(1/(1-q)).

One may reasonably start with the belief that this random coin given to us is fair, meaning q = 0.5.

Scenario #1: p = 0.9 | q = 0.5

|  X   | True probability  |  Our belief   | Our surprise |
|------|-------------------|---------------|--------------|
| H | 0.9 | 0.5 | log(1/0.5) |
| T | 0.1 | 0.5 | log(1/0.5) |

Average Surprise = Pr(H) * S(H) + Pr(T) * S(T)
                 = p * log(1/q) + (1-p) * log(1/(1-q))
                 = - { p * log(q) + (1-p) * log(1-q) }

Let’s talk about the asymmetry. It may be confusing at first why the average surprise is -Σ pᵢ.log(qᵢ) and not -Σ qᵢ.log(qᵢ), which would mirror the formula for entropy, where a single distribution appears both as the weight and inside the log.

The amount of surprise we experience will depend on how likely we thought the outcome to be, so that is represented by q, not p. However just because we believed the coin would show up H half the time does not mean the coin really will. How often we actually experience the surprise from an H outcome (log(1/q)) will depend on the true nature of the coin and its chances of showing up H, represented by p, not q. Hence, the average surprise from all H outcomes is p * log(1/q). Likewise for tails (T) outcomes.
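
To make that weighting concrete, here is a short sketch with a purely hypothetical belief of q = 0.7 (a value I picked just for illustration). The surprise per outcome comes from q, but how often each surprise occurs comes from p:

```python
import math

p, q = 0.9, 0.7                      # true bias vs. our (hypothetical) belief

surprise_H = -math.log(q)            # how surprised we are by heads: set by our belief q
surprise_T = -math.log(1 - q)        # how surprised we are by tails

# How often each surprise actually occurs is governed by the true p, not by q.
average_surprise = p * surprise_H + (1 - p) * surprise_T
print(round(average_surprise, 3))    # 0.441

# Weighting by our own belief q instead gives a different quantity entirely:
# the entropy of our believed distribution, not the surprise we actually experience.
print(round(q * surprise_H + (1 - q) * surprise_T, 3))   # 0.611
```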

Given p = 0.9, q = 0.5,

Average surprise = - { p * log(q) + (1-p) * log(1-q)}
= - {0.9 * log(0.5) + 0.1 * log(0.5)}
= 0.693

This average surprise is called cross-entropy.
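
In code, this is a one-liner. Here is a minimal sketch (natural log again), checked against Scenario #1:

```python
import math

def cross_entropy(p, q):
    """Average surprise when the true probability of heads is p
    but we believe it to be q (in nats)."""
    return -(p * math.log(q) + (1 - p) * math.log(1 - q))

print(round(cross_entropy(0.9, 0.5), 3))   # 0.693, matching Scenario #1
```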

Now let us say we toss the coin 5 times, with outcomes of 4 H and 1 T. We may then reasonably revise our estimate to q = 4/5 = 0.8. We are basically saying that the coin seems to be more biased towards heads than we thought, and based on our experiments so far, we believe that on average 4 out of 5 coin tosses will result in H. The true nature of the coin has not changed, so p continues to be 0.9.

Scenario #2: p = 0.9 | q = 0.8

Average surprise = - {0.9 * log(0.8) + 0.1 * log(0.2)}
= 0.362

As we have gotten better at predicting the toss outcomes, the coin’s actual behavior is more closely aligned with what we expected it to be, so it stands to reason that it surprises us a little less now. We still underestimate how biased the coin is towards H, but we do so a little less.

Suppose we keep going until we have tossed the coin 20 times in total, ending up with 19 H and 1 T. Now that we have more data, we revise our estimate to q = 19/20 = 0.95.

Scenario #3: p = 0.9 | q = 0.95

Average surprise = - {0.9 * log(0.95) + 0.1 * log(0.05)}
= 0.346

By now we have started to form an intuition for what’s happening here. As we get a more accurate idea of how the coin behaves, it surprises us less than before. Interestingly, this time we have overestimated the probability of H. But as we will see later, to minimize our average surprise we must be accurate, avoiding both over- and under-estimating the coin’s bias.

Imagine that after 10,000 tosses, we have 9,012 H and 988 T outcomes. We revise our estimate of the probability of the coin showing up heads to 9,012 / 10,000 = 0.9012. We still overestimate the probability of H, but less than before.

Scenario #4: p = 0.9 | q = 0.9012

Average surprise = - {0.9 * log(0.9012) + 0.1 * log(0.0988)}
= 0.32509

As expected, the average surprise is the lowest we’ve seen so far, thanks to the experimental tosses helping us to approximate the true probability distribution with increasing accuracy.
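
The whole progression so far fits in a few lines of Python. This is just a sketch that plugs the article’s own estimates of q into the same formula:

```python
import math

def cross_entropy(p, q):                      # same helper as in the earlier sketch
    return -(p * math.log(q) + (1 - p) * math.log(1 - q))

p = 0.9
for label, q in [("initial guess", 0.5), ("after 5 tosses", 0.8),
                 ("after 20 tosses", 0.95), ("after 10,000 tosses", 0.9012)]:
    print(f"{label}: q = {q}, average surprise = {cross_entropy(p, q):.5f}")
# The printed surprises are 0.69315, 0.36177, 0.34574 and 0.32509:
# each better estimate of q lowers the average surprise.
```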

The question must be asked — how good can our estimates get? How low can our average surprise be? What happens if we estimate q to be exactly the same as p? Let us consider the ‘ideal’ scenario.

Scenario #5: p = 0.9 | q = 0.9

Average surprise = - { p  * log(q)   + (1-p) * log(1-q)}
= - {0.9 * log(0.9) + 0.1 * log(0.1)}
= 0.32508

When p and q are equal, this calculation is actually the same as entropy!

Entropy  = - { p  * log(p)   + (1-p) * log(1-p)}
= - {0.9 * log(0.9) + 0.1 * log(0.1)}
= 0.32508

To help form a complete picture, here is a plot showing cross-entropy for various values of q, keeping p fixed at 0.9.

And here’s one more, for p = 0.5.

Cross-entropy vs. q for p = 0.5
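
Since the plots themselves are not reproduced here, this small matplotlib sketch (mine) regenerates both curves. The dip of each curve sits exactly at q = p, where cross-entropy bottoms out at the coin’s entropy.

```python
import math
import matplotlib.pyplot as plt

qs = [i / 1000 for i in range(1, 1000)]        # q from 0.001 to 0.999, avoiding log(0)
for p in (0.9, 0.5):
    ce = [-(p * math.log(q) + (1 - p) * math.log(1 - q)) for q in qs]
    plt.plot(qs, ce, label=f"p = {p}")

plt.xlabel("q (our estimated probability of heads)")
plt.ylabel("cross-entropy (average surprise)")
plt.legend()
plt.show()
```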

The takeaway here is that when we have the best possible estimate of the coin’s probability of showing up heads (H), we continue to experience some surprise, but this surprise equals the entropy of the coin. It stems from the inherent non-determinism in the coin toss. Outcomes of the coin toss will never be known in advance with certainty, so the best we can do is to estimate q = p and ensure we experience the least surprise possible. Under- and over-estimating the coin’s probability of showing up H will both cause us to experience more surprise than is necessary.

The degree to which our estimated probability distribution diverges from the true probability distribution is called Kullback–Leibler divergence (KL divergence).

KL (P || Q) = Cross-entropy (P, Q) - Entropy(P)
= - { p*log(q) + (1-p)*log(1-q) } + { p*log(p) + (1-p)*log(1-p) }
= - { p*log(q/p) + (1-p)*log((1-q)/(1-p)) }
= -Σ pᵢ.log(qᵢ/pᵢ)

When q = p (or more generally, pᵢ = qᵢ for all i), the probability distributions are identical and KL divergence is zero.
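
Here is the same relationship as a sketch, again for the two-outcome coin:

```python
import math

def cross_entropy(p, q):
    return -(p * math.log(q) + (1 - p) * math.log(1 - q))

def entropy(p):
    return cross_entropy(p, p)      # entropy is cross-entropy with a perfect estimate

def kl_divergence(p, q):
    """KL(P || Q): the extra surprise we suffer for believing q instead of p."""
    return cross_entropy(p, q) - entropy(p)

print(round(kl_divergence(0.9, 0.5), 5))   # 0.36806 -- the avoidable part of Scenario #1's surprise
print(kl_divergence(0.9, 0.9))             # 0.0     -- no divergence when q = p
```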

Applications in Machine Learning

Cross-entropy is the most popular choice of loss function for training machine learning models, including OpenAI’s ChatGPT and Google’s Gemini. Conceptually, our coin-tossing experiment is very similar to training even the most sophisticated ML models. Just as we improved our understanding of the coin’s fairness and lowered our average surprise over time, machine learning models learn by lowering the cross-entropy loss over their training data.

As training progresses, the models improve their probability estimates over the task’s outcomes. For example, a word-prediction model learns that “The soup” is more likely to be followed by “is tasty” than by “garden phone”. In doing so, it has brought its probability distribution closer to the truth and reduced its average surprise.
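
To see the connection, here is a toy sketch of the loss on a single next-word prediction. Everything in it is made up for illustration (the mini vocabulary, the probabilities, the sentence); real models predict over tens of thousands of tokens, but the arithmetic is the same:

```python
import math

# Hypothetical model output: predicted probabilities for the word after "The soup".
predicted = {"is": 0.55, "was": 0.30, "tasty": 0.13, "garden": 0.01, "phone": 0.01}

# In the training data, the actual next word is "is". The true distribution puts
# all of its mass on that word, so the sum -Σ p_i * log(q_i) collapses to one term.
actual_next_word = "is"
loss = -math.log(predicted[actual_next_word])
print(round(loss, 3))   # 0.598 -- small, because the model already rated "is" as likely
```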

However, just as in the coin-toss example, some uncertainty will remain. “The soup” could be followed by “was cold” (or “Bowl Of Park Slope”). There is no getting around the inherent uncertainty, a.k.a. randomness, a.k.a. chaos, a.k.a. entropy, of language (or of life). All the models can do is try their best to approximate q to p.
