Cross-Entropy but not without Entropy and KL-Divergence

Arpit Garg · Published in CodeX · Apr 3, 2022

When working on Machine / Deep Learning problems, a loss (or cost) function is used to check that the model is improving as it trains. The goal is to keep the loss as small as possible: the smaller the loss, the better the model. When classifying images, text, audio, etc., Cross-Entropy is the most popular loss function; it is what drives classification models to get better. In this article, we will discuss Entropy, Cross-Entropy, KL Divergence, and how they work.

TLDR: Cross-Entropy measures how far a predicted probability distribution is from the true distribution for a particular random variable or collection of events.

Terminologies required:

  1. Probability Distribution

A probability distribution is a mathematical function, in probability theory and statistics, that assigns a probability of occurrence to each possible outcome of an experiment.

2. Random Variable

Informally, a random variable is a variable whose value depends on the outcome of a random phenomenon.

3. Entropy

Informally, Entropy is the tendency of everything to become increasingly random, and this tendency is inevitable. In information theory, it has a more precise meaning.

Before moving to Cross-Entropy, we must know what Entropy means:

Entropy, in its simplest form, is a measure of the average amount of information obtained from a single sample selected from a particular probability distribution.

Let's consider a weather prediction system: if we measure the entropy of the weather in a desert, it would always be close to 0, as it would almost always be sunny. If there is little variation, the Entropy is close to 0, and vice versa.

To calculate Entropy, we need a way to measure information content. Information Theory, by Claude Shannon, solves this problem.

The information content measures how surprising a specific event from a random variable is. Based on the entropy definition, we need a function that goes close to 0 when the probability is high (as there is no surprise) and that grows large when the probability is low (as there is a lot of surprise). The log function fits these requirements, so we can calculate the information gained from any event E as:

I(E) = log(1 / P(E))

which is equivalent to:

I(E) = -log P(E)
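To make this concrete, here is a minimal Python sketch (the probabilities and the information_content helper are my own, chosen purely for illustration) that computes the information content, in bits, of a very likely and a very unlikely event:

import math

def information_content(p):
    # Information content ("surprise") of an event with probability p, in bits.
    return -math.log2(p)

print(information_content(0.99))   # ~0.014 bits: a near-certain event is barely informative
print(information_content(0.01))   # ~6.64 bits: a rare event is very informative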

Going back to the definition of Entropy, it's the average amount of information obtained from a single sample selected from a distribution. Taking the average of the information content over the distribution, the formula becomes:

H(p) = Σ p(x) · log(1 / p(x)) = -Σ p(x) · log p(x)
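Continuing the weather example, here is a rough sketch (the desert and varied weather probabilities are invented for illustration) showing that a nearly deterministic distribution has entropy close to 0, while a more varied one does not:

import math

def entropy(p):
    # Shannon entropy of a discrete distribution p (a list of probabilities), in bits.
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

desert = [0.98, 0.01, 0.01]   # almost always sunny -> very little surprise
varied = [0.40, 0.35, 0.25]   # sunny / rainy / cloudy -> much more surprise

print(entropy(desert))   # ~0.16 bits, close to 0
print(entropy(varied))   # ~1.56 bits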

In the case of Entropy, we always consider a single probability distribution (usually the true distribution). Cross-Entropy handles the situation where we have both a predicted distribution and a true distribution.

(The predicted distribution can be obtained from any function, model, etc.)

Let's denote the true probability distribution by p and the predicted probability distribution by q. The formula then becomes the Cross-Entropy formula:

H(p, q) = -Σ p(x) · log q(x)

Following the basic entropy definition, the Cross-Entropy formula gives the average information content measured under the predicted distribution q when the samples are actually drawn from the true distribution p.
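A minimal sketch of the Cross-Entropy computation (the true and predicted distributions below are made up for illustration): the closer q is to p, the smaller H(p, q).

import math

def cross_entropy(p, q):
    # Cross-Entropy H(p, q): average surprise under q when samples come from p, in bits.
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

p_true = [0.80, 0.15, 0.05]   # true distribution
q_good = [0.75, 0.20, 0.05]   # prediction close to p -> low cross-entropy
q_bad  = [0.10, 0.10, 0.80]   # prediction far from p  -> high cross-entropy

print(cross_entropy(p_true, q_good))   # ~0.90 bits
print(cross_entropy(p_true, q_bad))    # ~3.17 bits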

When the predicted and true distributions are close, the Cross-Entropy is at its smallest (and in classification, where the true distribution is usually one-hot, this minimum is close to 0); that's what we want when making predictions, since they should be close to the true values or distribution. Otherwise, the Cross-Entropy is high. So we always try to reduce the Cross-Entropy, as doing so brings the predicted distribution closer to the true distribution.

If we consider the ideal case in which the predicted distribution q is the same as the true distribution p, then the Cross-Entropy is equal to the Entropy.

Generally, in real-world scenarios, the distributions differ: the predicted distribution is never exactly equal to the true distribution, and in that case the Cross-Entropy is always larger than the Entropy by some amount. This amount by which the Cross-Entropy exceeds the Entropy is called the Relative Entropy or, in mathematical terms, the Kullback–Leibler divergence (KL Divergence).

So in simple terms:

Cross-Entropy = Entropy + KL Divergence

or KL Divergence is given by:

D_KL(p || q) = Σ p(x) · log(p(x) / q(x)) = H(p, q) - H(p)

Basically, KL divergence acts as a natural (though asymmetric) measure of distance from the true distribution to the predicted distribution.

KL divergence measures how different one probability distribution is from another: it shows how far apart they are. More specifically, it is the expected amount of extra information needed to move from one distribution to the other (as when updating beliefs with Bayes' rule).
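A small numerical check of the identity above, D_KL(p || q) = H(p, q) - H(p), using the same illustrative distributions as before (the helper functions are my own sketch, not from any particular library):

import math

def entropy(p):
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    # D_KL(p || q): the extra bits paid for describing p with a code built for q.
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.80, 0.15, 0.05]   # true distribution (illustrative)
q = [0.75, 0.20, 0.05]   # predicted distribution (illustrative)

print(kl_divergence(p, q))                # ~0.012 bits
print(cross_entropy(p, q) - entropy(p))   # same value: H(p, q) - H(p)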

With the true distribution p and the predicted distribution q, we try to reduce the KL divergence over time so that the two become similar:

(Figure: as training progresses, minimizing the KL divergence pulls the predicted distribution q towards the true distribution p.)

As a result, in classification problems, since the entropy of the true distribution does not depend on the model, optimizing the sum of the Cross-Entropy over all training samples is equivalent to optimizing the sum of the KL divergence over all training samples. That's why Cross-Entropy is the standard and most commonly used cost/loss function: it brings the predicted and true distributions closer together.
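In classification, the true distribution for each sample is typically one-hot, so its entropy is 0 and the Cross-Entropy of that sample equals its KL divergence. Here is a rough sketch with a made-up softmax output (using the natural log, as most frameworks do):

import math

def cross_entropy(p, q):
    # Cross-Entropy with natural log, as commonly used in classification losses.
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

p_true = [0.0, 1.0, 0.0]      # one-hot true label: the sample belongs to class 1
q_pred = [0.10, 0.70, 0.20]   # model's predicted class probabilities (invented)

# The entropy of a one-hot distribution is 0, so H(p, q) = D_KL(p || q):
# the loss reduces to the negative log-likelihood of the correct class.
print(cross_entropy(p_true, q_pred))   # ~0.357 (equals -ln 0.70)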

For Wrapping up:

  1. If an event has a probability close to 1, observing it gives almost no information, so its information content is close to 0, and a distribution dominated by such events has Entropy close to 0.
  2. On the contrary, if an event has a probability close to 0, observing it is very surprising and gives a lot of information, so distributions with many low-probability outcomes have high Entropy.
  3. Extending the entropy concept to supervised learning (or similar settings), where we know the true distribution and our models/functions give a predicted distribution, we can use Cross-Entropy to determine how much information the predicted distribution gives about the true distribution.
  4. When the predicted distribution is very similar to the true distribution, the Cross-Entropy is near its minimum; that's what we want. Otherwise, the Cross-Entropy is high.
  5. We always try to reduce the Cross-Entropy, as it brings the true and predicted distributions closer, but there is usually a small gap between them.
  6. Ideally, when the predicted and actual distributions are equal, Cross-Entropy and Entropy are identical.
  7. In real-world cases, the distributions are almost never identical, so the Cross-Entropy is larger than the Entropy by some amount; that amount is the KL Divergence.
