The link between Maximum Likelihood Estimation (MLE) and Cross-Entropy

Dhanoop Karunakaran
Intro to Artificial Intelligence
3 min read · May 29, 2020
Cover image: source [5]

Maximum Likelihood Estimation

Maximum Likelihood Estimation (MLE) is a method for solving the density estimation problem: determining the probability distribution and its parameters for a sample of observations [2]. In MLE, the estimate is computed by maximising the likelihood function [2].

Maximum likelihood estimation treats this as an optimisation/search problem: it looks for the parameters, represented as θ, that best fit the joint probability of the observed data, i.e. that best explain the observed data [2].
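As a minimal sketch of that search, assume a simple coin-flip (Bernoulli) model: the parameter θ is the probability of heads, and we look for the value of θ that best explains some illustrative observed flips by minimising the negative log-likelihood.

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Illustrative data: 10 coin flips, 7 heads (1) and 3 tails (0)
flips = np.array([1, 1, 1, 0, 1, 0, 1, 1, 0, 1])

def negative_log_likelihood(theta):
    # Bernoulli log-likelihood of the observed flips under parameter theta
    return -np.sum(flips * np.log(theta) + (1 - flips) * np.log(1 - theta))

# Search for the theta that best explains the observed data
result = minimize_scalar(negative_log_likelihood, bounds=(1e-6, 1 - 1e-6), method="bounded")
print(result.x)  # ~0.7, the sample frequency of heads
```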

The likelihood function measures the goodness of fit of a statistical model to a sample of data for given values of the parameters θ. For a single one-hot encoded observation, it can be written as shown below [4]:

$$L(\theta) = \prod_{c=1}^{M} p_c^{\,y_c}$$

Where M is the number of classes in the classification problem.

The above equation indicates that the likelihood is calculated by multiplying the predicted class probabilities p_c, each raised to the power of the corresponding one-hot encoded ground-truth value y_c (one-hot encoding places all of the probability mass on a single class or label).

For example:

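Here is a minimal numerical sketch with illustrative values: a three-class prediction p and a one-hot ground truth y whose true class is the second one.

```python
import numpy as np

p = np.array([0.2, 0.7, 0.1])  # predicted probability for each of the M = 3 classes
y = np.array([0, 1, 0])        # one-hot encoded ground truth (true class is class 2)

# Likelihood: product over the classes of p_c raised to the power y_c
likelihood = np.prod(p ** y)
print(likelihood)  # 0.7 -- only the true class's predicted probability survives
```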

For mathematical convenience, instead of working with the product, we can take the logarithm, which is an increasing function [3]. This converts the equation into the log-likelihood, as shown below [4]:

$$\log L(\theta) = \sum_{c=1}^{M} y_c \log(p_c)$$

Mathematically, it is easier to minimise the negative log-likelihood than to maximise the likelihood directly [1]. So the equation is modified as:

$$\mathrm{NLL}(\theta) = -\sum_{c=1}^{M} y_c \log(p_c)$$
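Continuing the same illustrative three-class values, a short sketch of the log-likelihood and its negation:

```python
import numpy as np

p = np.array([0.2, 0.7, 0.1])  # predicted class probabilities (illustrative)
y = np.array([0, 1, 0])        # one-hot encoded ground truth

log_likelihood = np.sum(y * np.log(p))     # log(0.7) ≈ -0.357
negative_log_likelihood = -log_likelihood  # ≈ 0.357, the quantity we minimise
print(log_likelihood, negative_log_likelihood)
```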

Cross-Entropy

For a multiclass classification problem, we use cross-entropy as the loss function. Intuitively, cross-entropy is the average number of bits required to identify an event drawn from the true distribution g(x) when the coding scheme is optimised for the estimated distribution f(x) instead [6]. The fewer bits required, the better f(x) approximates g(x). When we use the cross-entropy loss function, the idea is to reduce this number of bits through the optimisation process.

$$H(g, f) = -\sum_{x} g(x)\,\log f(x)$$

(source: [4])
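A minimal sketch of this definition with made-up distributions: the closer the estimate f(x) is to the true distribution g(x), the lower the cross-entropy.

```python
import numpy as np

g = np.array([0.1, 0.8, 0.1])          # true distribution (illustrative)
f_close = np.array([0.15, 0.7, 0.15])  # an estimate close to g
f_far = np.array([0.6, 0.2, 0.2])      # an estimate far from g

def cross_entropy(g, f):
    # H(g, f) = -sum_x g(x) * log f(x); np.log gives nats, np.log2 would give bits
    return -np.sum(g * np.log(f))

print(cross_entropy(g, f_close))  # ≈ 0.66 (lower)
print(cross_entropy(g, f_far))    # ≈ 1.50 (higher)
```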

If we plug the one-hot encoded ground truth y in place of the true distribution g(x), and the model's predicted probabilities p in place of the estimated distribution f(x), the cross-entropy becomes [4]:

$$H(y, p) = -\sum_{c=1}^{M} y_c \log(p_c)$$

Then we can identify that the cross-entropy is equal to the negative log-likelihood.
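A quick numerical check of that identity, reusing the illustrative one-hot example from above:

```python
import numpy as np

p = np.array([0.2, 0.7, 0.1])  # estimated distribution f(x): the model's predictions
y = np.array([0, 1, 0])        # true distribution g(x): one-hot encoded ground truth

cross_entropy = -np.sum(y * np.log(p))              # H(y, p)
negative_log_likelihood = -np.log(p[np.argmax(y)])  # -log of the true class's probability
print(np.isclose(cross_entropy, negative_log_likelihood))  # True
```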

If you like my write-up, follow me on GitHub, LinkedIn, and/or Medium.
