# Cross-entropy and Maximum Likelihood Estimation

So, we are on our way to training our first neural network model for classification. We design the network depth, choose the activation function, set all the hyperparameters, and then choose a loss function. As we have been taught, we use the cross-entropy loss function since it’s suitable for classification. At some point, though, we start to get curious: why do we use cross-entropy as the loss function?

To answer this, we first need to know about entropy. Entropy is a measure of the uncertainty of a random variable. If we have a discrete random variable *X* with probability mass function *p*(*x*) = Pr[*X* = *x*], we define the entropy *H*(*X*) of the random variable *X* as

$$H(X) = -\sum_{x} p(x) \log p(x) \tag{1}$$

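Now how can we know that this value *H*(*X*) corresponds to the uncertainty of *X*? Imagine that there is one value *x* that has a probability of 1. Then we know for sure that *X* always takes the value *x* and nothing else. If we put this into equation (1), we have

$$
\begin{aligned}
H(X) &= -\sum_{x'} p(x') \log p(x') \\
&= -p(x) \log p(x) - \sum_{\hat{x} \neq x} p(\hat{x}) \log p(\hat{x}) \\
&= -1 \log 1 - \sum_{\hat{x} \neq x} 0 \log 0 = 0
\end{aligned}
\tag{2}
$$

where the first term in the second line equals 0 because log(1) = 0, and the second term equals 0 because p(*x*_hat) = 0 for every other value *x*_hat, since the probabilities p(*x*) must sum to 1. Notice that we use the convention 0 log 0 = 0, which can be justified by x log x → 0 as x → 0. Hence, if we are completely certain about the value a random variable will take, the uncertainty, and therefore the entropy, is zero. This notion is captured very well by the graph of H(*p*) versus *p* for the Bernoulli distribution: the entropy is 0 at *p* = 0 and *p* = 1 and reaches its maximum at *p* = 0.5.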
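To make the Bernoulli picture concrete, here is a minimal sketch that evaluates the entropy at a few values of *p*. The function names `entropy` and `bernoulli_entropy` are illustrative, and the choice of base 2 (entropy in bits) is an assumption, since the text does not fix a logarithm base.

```python
import math


def entropy(probs, base=2.0):
    """Entropy of a discrete distribution, using the convention 0 log 0 = 0."""
    # Terms with p = 0 are skipped, which implements 0 log 0 = 0.
    return sum(p * math.log(1.0 / p, base) for p in probs if p > 0.0)


def bernoulli_entropy(p):
    """Entropy (in bits) of a Bernoulli variable with Pr[X = 1] = p."""
    return entropy([p, 1.0 - p])


# Entropy vanishes at p = 0 and p = 1 and peaks at p = 0.5.
for p in (0.0, 0.1, 0.5, 0.9, 1.0):
    print(f"p = {p:.1f}  H(p) = {bernoulli_entropy(p):.4f} bits")
```

Evaluating the function on a finer grid of *p* values and plotting the result should reproduce the familiar arch-shaped curve.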

If we look closely at the definition of entropy, we can see that it has the form of an expectation, that is, we can write

$$H(X) = \sum_{x} p(x) \log \frac{1}{p(x)} = \mathbb{E}_{X \sim p}\left[\log \frac{1}{p(X)}\right] \tag{3}$$

So the entropy of a random variable *X* is the expected value of the random variable log(1/*p*(*X*)), where *X* is drawn from *p*(*x*). Note that we can also denote entropy by *H*(*p*).
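Reading entropy as an expectation suggests a simple check: draw samples from *p* and average log(1/*p*(*x*)) over them. The sketch below does this for a small hypothetical distribution; the distribution, the sample size, and the seed are all assumptions chosen for illustration, and the natural logarithm is used.

```python
import math
import random


def entropy(probs):
    """Exact entropy (in nats) from the definition."""
    return sum(p * math.log(1.0 / p) for p in probs if p > 0.0)


def sampled_entropy(probs, n_samples=100_000, seed=0):
    """Monte Carlo estimate of E[log(1/p(X))] with X drawn from probs."""
    rng = random.Random(seed)
    outcomes = range(len(probs))
    draws = rng.choices(outcomes, weights=probs, k=n_samples)
    return sum(math.log(1.0 / probs[x]) for x in draws) / n_samples


p = [0.5, 0.25, 0.125, 0.125]  # hypothetical distribution
exact = entropy(p)
estimate = sampled_entropy(p)
print(f"exact = {exact:.4f} nats, sampled = {estimate:.4f} nats")
```

With enough samples the two numbers should agree to a couple of decimal places, which is just the law of large numbers applied to log(1/*p*(*X*)).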

Now suppose that we have an unknown true distribution *p*(*x*) and we have modeled it with an approximating distribution *q*(*x*). The inefficiency of assuming that the distribution is *q*(*x*) when the true distribution is *p*(*x*) can be measured with the relative entropy, or Kullback-Leibler distance. In other words, relative entropy is a measure of how far apart two distributions are, although it is not a true distance metric because it is not symmetric in *p* and *q*. Relative entropy, denoted *D*(*p*||*q*), is defined as

$$D(p \,\|\, q) = \sum_{x} p(x) \log \frac{p(x)}{q(x)} \tag{4}$$

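If we expand log(*p*(*x*)/*q*(*x*)), we have

$$
\begin{aligned}
D(p \,\|\, q) &= -\sum_{x} p(x) \log q(x) + \sum_{x} p(x) \log p(x) \\
&= -\sum_{x} p(x) \log q(x) - H(p)
\end{aligned}
\tag{5}
$$

where the second term on the right-hand side is the negative of the entropy of the distribution *p*(*x*), and the first term on the right-hand side is the cross-entropy. We can see that the cross-entropy is closely related to the relative entropy: it equals the entropy of *p* plus the relative entropy. We define the cross-entropy, denoted *H*(*p*,*q*), as

$$H(p, q) = -\sum_{x} p(x) \log q(x) \tag{6}$$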
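The relationship between entropy, relative entropy, and cross-entropy can be checked numerically. The sketch below uses two hypothetical distributions `p` and `q` and assumes that *q*(*x*) > 0 wherever *p*(*x*) > 0, so that every logarithm is finite.

```python
import math


def entropy(p):
    """H(p) in nats, with 0 log 0 = 0."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0.0)


def cross_entropy(p, q):
    """H(p, q) = -sum_x p(x) log q(x)."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0.0)


def kl_divergence(p, q):
    """D(p || q) = sum_x p(x) log(p(x) / q(x))."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


p = [0.7, 0.2, 0.1]  # hypothetical "true" distribution
q = [0.5, 0.3, 0.2]  # hypothetical model distribution

print("H(p)            =", entropy(p))
print("D(p || q)       =", kl_divergence(p, q))
print("H(p, q)         =", cross_entropy(p, q))
print("H(p) + D(p||q)  =", entropy(p) + kl_divergence(p, q))
print("D(q || p)       =", kl_divergence(q, p))  # differs from D(p || q)
```

Since *H*(*p*) does not depend on *q*, minimizing the cross-entropy over *q* is the same as minimizing the relative entropy.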

Okay, so that was cross-entropy, now how does it fit in with our model loss?

We need to go back to one core principle in machine learning, namely Maximum Likelihood Estimation (MLE). Suppose that for a problem, we have a set of examples X_example = {*x*_1, *x*_2, …, *x*_m} that are drawn independently from a true but unknown distribution p_data(*x*). We then try to model the true distribution with a parametric model p_model(*x*;𝜃), with 𝜃 as the parameter. We can say that p_model(*x*;𝜃) maps *x* to an estimate of the true but unknown probability p_data(*x*). To get the best model, we need to find the 𝜃 that makes p_model(*x*;𝜃) as similar as possible to p_data(*x*). We can use the MLE principle to find such a 𝜃, that is, by using the maximum likelihood estimator for 𝜃, which is defined as

$$\theta_{ML} = \arg\max_{\theta} \; p_{model}(X_{example}; \theta) \tag{7}$$

Since the examples *x*_i in X_example are independent of each other, we can write 𝜃_ML as

$$\theta_{ML} = \arg\max_{\theta} \prod_{i=1}^{m} p_{model}(x_i; \theta) \tag{8}$$

The product of p_model(*x*_i;𝜃) is going to be very close to 0 when the example set size m is large, since each probability lies in the range 0≤p_model(*x*_i;𝜃)≤1. This can cause numerical underflow on a computer, resulting in a less precise estimate of the model. One way to avoid this problem is to instead compute the sum of the logarithms of p_model(*x*_i;𝜃):

$$\theta_{ML} = \arg\max_{\theta} \sum_{i=1}^{m} \log p_{model}(x_i; \theta) \tag{9}$$
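The underflow is easy to reproduce with ordinary double-precision floats. In this sketch the 400 per-example probabilities of 0.01 are made-up values, chosen only so that the product falls below the smallest representable positive number.

```python
import math

# Hypothetical per-example likelihoods p_model(x_i; theta).
probs = [0.01] * 400

# Direct product: 0.01 ** 400 is far too small for a 64-bit float.
product = 1.0
for p in probs:
    product *= p

# Sum of logarithms: an ordinary negative number.
log_sum = sum(math.log(p) for p in probs)

print("product of probabilities:", product)
print("sum of log-probabilities:", log_sum)
```

Once the product has collapsed to zero, every 𝜃 that also underflows looks equally bad, so the arg max can no longer be located. The log-sum keeps those candidates distinguishable.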

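This solves the underflow problem since the logarithm of each probability is a moderately sized negative number, so instead of a product as tiny as 1e-30 we work with a sum of such numbers. We also turn the product into a summation, resulting in a more manageable computation. Since the logarithm is monotonically increasing, the arg max is the same, so equation 9 gives the same parameter 𝜃_ML as equation 8, while having the above advantages. Now, since the arg max also does not change when we scale the log-likelihood by a positive constant, we can divide by *m* and write

$$\theta_{ML} = \arg\max_{\theta} \frac{1}{m} \sum_{i=1}^{m} \log p_{model}(x_i; \theta) = \arg\max_{\theta} \; \mathbb{E}_{x \sim p_{example}}\left[\log p_{model}(x; \theta)\right] \tag{10}$$

where p_example(*x*) is the empirical distribution defined by the examples in X_example. This still yields the same 𝜃_ML as equations 8 and 9.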
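A small experiment can confirm that the product, the log-sum, and the averaged log-sum all pick the same parameter. The sketch assumes a Bernoulli model with parameter `theta` and ten hypothetical coin flips, and it replaces a real optimizer with a plain grid search.

```python
import math

# Hypothetical observations: 7 ones out of 10 flips.
data = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1]


def likelihood(theta, xs):
    """Product of Bernoulli probabilities (the product form)."""
    out = 1.0
    for x in xs:
        out *= theta if x == 1 else 1.0 - theta
    return out


def log_likelihood(theta, xs):
    """Sum of log-probabilities (the log-sum form)."""
    return sum(math.log(theta if x == 1 else 1.0 - theta) for x in xs)


def mean_log_likelihood(theta, xs):
    """Log-likelihood scaled by 1/m (the averaged form)."""
    return log_likelihood(theta, xs) / len(xs)


# Candidate parameters 0.01, 0.02, ..., 0.99.
grid = [i / 100 for i in range(1, 100)]

best_product = max(grid, key=lambda t: likelihood(t, data))
best_log_sum = max(grid, key=lambda t: log_likelihood(t, data))
best_mean_log = max(grid, key=lambda t: mean_log_likelihood(t, data))

print(best_product, best_log_sum, best_mean_log)
```

All three searches should land on the fraction of ones in the data, which is the known maximum likelihood estimate for a Bernoulli parameter.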

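Equation 10 shows the relation between cross-entropy and the maximum likelihood estimation principle: if we take p_example(*x*) as p(*x*) and p_model(*x*;𝜃) as q(*x*), we can write equation 10 as

$$\theta_{ML} = \arg\max_{\theta} \sum_{x} p(x) \log q(x) = \arg\min_{\theta} \left(-\sum_{x} p(x) \log q(x)\right) = \arg\min_{\theta} H(p, q) \tag{11}$$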

We are familiar with the last term of this equation since it is the cross-entropy defined in equation 6. This shows that finding the best parameter 𝜃_ML by maximum likelihood estimation is the same as minimizing the cross-entropy between the empirical example distribution p_example(*x*) and our parametric model p_model(*x*;𝜃).
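The equivalence can be verified on a toy categorical problem: the cross-entropy between the empirical distribution of the samples and a model equals the average negative log-likelihood of those samples under the model. The samples and the `model` probabilities below are hypothetical.

```python
import math
from collections import Counter

samples = ["a", "b", "a", "c", "a", "b", "a", "a"]  # hypothetical examples
model = {"a": 0.5, "b": 0.3, "c": 0.2}              # hypothetical p_model


def empirical_distribution(xs):
    """p_example(x): the fraction of examples equal to x."""
    counts = Counter(xs)
    return {x: c / len(xs) for x, c in counts.items()}


def cross_entropy(p, q):
    """H(p, q) = -sum_x p(x) log q(x) for dict-valued distributions."""
    return -sum(p[x] * math.log(q[x]) for x in p if p[x] > 0.0)


def mean_negative_log_likelihood(xs, q):
    """-(1/m) * sum_i log q(x_i)."""
    return -sum(math.log(q[x]) for x in xs) / len(xs)


p_example = empirical_distribution(samples)
ce = cross_entropy(p_example, model)
nll = mean_negative_log_likelihood(samples, model)
print(f"cross-entropy = {ce:.6f}, mean NLL = {nll:.6f}")

# Setting the model equal to the empirical distribution gives the minimum.
print(f"minimum achievable = {cross_entropy(p_example, p_example):.6f}")
```

The two printed quantities on the first line agree because grouping the sum over examples by distinct value turns the average into a sum weighted by p_example(*x*).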

This thought process shows that it is sensible to train our model by minimizing the cross-entropy loss, since doing so leads us to the maximum likelihood estimator 𝜃_ML, the parameter that yields the best model according to the training examples.
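To tie this back to the classification setting from the introduction, here is a rough sketch of a cross-entropy loss over a batch. It assumes the network ends in a softmax layer that turns raw scores (logits) into class probabilities, which the discussion above does not specify, and all names and numbers are illustrative. With one true class per example, the cross-entropy reduces to the negative log-probability assigned to that class.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy_loss(batch_logits, labels):
    """Mean negative log-likelihood of the true labels under the model."""
    total = 0.0
    for logits, y in zip(batch_logits, labels):
        probs = softmax(logits)
        total -= math.log(probs[y])  # -log q(true class)
    return total / len(labels)


# Hypothetical 3-class batch: the true class index is given in labels.
labels = [0, 2]
confident_and_right = [[4.0, 0.0, 0.0], [0.0, 0.0, 4.0]]
confident_and_wrong = [[0.0, 4.0, 0.0], [4.0, 0.0, 0.0]]

print("loss when right:", cross_entropy_loss(confident_and_right, labels))
print("loss when wrong:", cross_entropy_loss(confident_and_wrong, labels))
```

Lowering this loss raises the probability the model gives to the observed labels, which is exactly the maximum likelihood objective in averaged, negated form.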

## References:

- T. M. Cover, J. A. Thomas, “Elements of Information Theory, Second Edition” (2006)
- I. Goodfellow, Y. Bengio, A. Courville, “Deep Learning” (2015)
- S. Shalev-Shwartz, S. Ben-David, “Understanding Machine Learning, From Theory to Algorithms” (2014)