16 - MLE: Maximum Likelihood Estimation

Meeraj Kanaparthi
Published in The Startup
5 min read · Nov 29, 2020

Maximum Likelihood Estimation (MLE) is a tool we use in machine learning to achieve a very common goal. The goal is to create a statistical model which can perform some task on yet unseen data.

The task might be classification, regression, or something else, so the nature of the task does not define MLE. The defining characteristic of MLE is that it uses only existing data to estimate parameters of the model. This is in contrast to approaches which exploit prior knowledge besides existing data.

We have samples x1, …, xn and assume they are drawn from a distribution associated with parameters θ. We know the form of the distribution, but we want to estimate its parameters.

In one dimension, we estimate the parameters under the assumption that the distribution is a normal (Gaussian). We want to know which Gaussian distribution most likely represents the data.

Assuming the data are sampled independently, the problem becomes one of maximizing the likelihood over θ.

The probability of drawing a value xi from the distribution f(x|θ) is f(xi|θ). The probability of drawing the vector of two observations (x1, x2) from f(x|θ) is f(x1|θ)·f(x2|θ). We define the likelihood function of N draws (x1, x2, …, xN) from a model or distribution f(x|θ) as

L(θ) = f(x1|θ)·f(x2|θ) ⋯ f(xN|θ) = ∏ f(xi|θ)

Taking the log converts the product into a sum, which turns the maximization into a problem over a sum of log terms:

log L(θ) = Σ log f(xi|θ)

For the Gaussian case, we want to find μ and σ. Differentiating the log-likelihood with respect to μ and equating it to zero gives the sample mean:

μ̂ = (1/N) Σ xi

We do the same for the variance: differentiating with respect to σ and equating it to zero gives

σ̂² = (1/N) Σ (xi − μ̂)²
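These closed-form estimates are easy to check numerically. A minimal sketch, assuming NumPy and a synthetic Gaussian sample (the true values μ = 5, σ = 2 are made up for illustration):

```python
import numpy as np

# Synthetic sample from a known Gaussian, so we can compare
# the MLE against the true parameters.
rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=2.0, size=1000)

# Closed-form Gaussian MLE: the sample mean and the
# (biased, 1/N) sample standard deviation maximize the log-likelihood.
mu_hat = x.mean()
sigma_hat = np.sqrt(np.mean((x - mu_hat) ** 2))

print(mu_hat, sigma_hat)
```

With 1,000 samples, both estimates should land close to the true μ = 5 and σ = 2.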

Demo

Let us load the libraries required for the use case:

Use case:

We are interested in estimating the number of billionaires in different countries.

The number of billionaires is integer-valued.

Hence we consider distributions that take values only in the nonnegative integers.

(This is one reason least squares regression is not the best tool for the present problem, since the dependent variable in linear regression is not restricted to integer values)

One integer-valued distribution is the Poisson distribution, the probability mass function (pmf) of which is

f(y|μ) = μʸ e⁻ᵘ / y!,  y = 0, 1, 2, …

We can plot the Poisson distribution over y for different values of μ as follows:
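A sketch of such a plot using scipy.stats.poisson (the y grid and the choice of μ ∈ {1, 5, 10} are illustrative assumptions):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import poisson

y = np.arange(0, 25)
fig, ax = plt.subplots()
for mu in [1, 5, 10]:
    # pmf of Poisson: f(y | mu) = mu**y * exp(-mu) / y!
    ax.plot(y, poisson.pmf(y, mu), marker="o", label=f"$\\mu$ = {mu}")
ax.set_xlabel("y")
ax.set_ylabel("probability")
ax.legend()
plt.show()
```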

Notice that the Poisson distribution begins to resemble a normal distribution as the mean of y increases.

The dataset mle/fp.dta can be downloaded here or from its AER page.

Using a histogram, we can view the distribution of the number of billionaires per country, numbil0, in 2008
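A sketch of the histogram step. Since the fp.dta file may not be at hand, a synthetic numbil0 column stands in here; with the real data, pd.read_stata("fp.dta") would load the column used below:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Stand-in data: the real demo loads mle/fp.dta into a DataFrame;
# we fake a 'numbil0' column so the snippet runs on its own.
rng = np.random.default_rng(1)
df = pd.DataFrame({"numbil0": rng.poisson(lam=2.0, size=197)})

df["numbil0"].hist(bins=int(df["numbil0"].max()) + 1)
plt.xlabel("Number of billionaires per country (numbil0), 2008")
plt.ylabel("Number of countries")
plt.show()
```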

Conditional Distribution:

The dependent variable, the number of billionaires yi in country i, is modeled as a function of GDP per capita, population size, and years of membership in GATT and WTO.

Hence, the distribution of yi needs to be conditioned on the vector of explanatory variables xi. The standard formulation, the so-called Poisson regression model, is as follows:

yi | xi ∼ Poisson(μi), where μi = exp(xi′β)

We can see that the distribution of yi is conditional on xi (μi is no longer constant).

Maximum Likelihood Estimation

In our model for the number of billionaires, the conditional distribution contains four (k = 4) parameters that we need to estimate.

We will label our entire parameter vector as β

To estimate the model using MLE, we want to maximize the likelihood that our estimate β̂ is the true parameter β.

Intuitively, we want to find the β̂ that best fits our data.

First, we need to construct the likelihood function L(β), which is similar to a joint probability density function.

Assume we have some data yi={y1,y2} and yi∼f(yi).

If y1 and y2 are independent, the joint pmf of these data is f(y1, y2) = f(y1)⋅f(y2).

If yi follows a Poisson distribution with λ = 7, we can visualize the joint pmf like so
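A small sketch of that joint pmf for two independent Poisson(7) draws (the grid cutoff at y = 20 is an illustrative choice):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import poisson

# Joint pmf of two independent Poisson(7) draws:
# f(y1, y2) = f(y1) * f(y2)
y = np.arange(0, 21)
pmf = poisson.pmf(y, 7)
joint = np.outer(pmf, pmf)  # joint[i, j] = f(y1 = y[i]) * f(y2 = y[j])

fig, ax = plt.subplots()
ax.imshow(joint, origin="lower")
ax.set_xlabel("$y_2$")
ax.set_ylabel("$y_1$")
plt.show()
```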

Maximum Likelihood Estimation with statsmodels

We’ll use the Poisson regression model in statsmodels to obtain a richer output with standard errors, test values, and more.

statsmodels uses the same algorithm as above to find the maximum likelihood estimates.

Our output indicates that GDP per capita, population, and years of membership in the General Agreement on Tariffs and Trade (GATT) are positively related to the number of billionaires a country has, as expected.

To analyze our results by country, we can plot the difference between the predicted and actual values, then sort from highest to lowest and plot the first 15
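A sketch of that comparison with stand-in numbers; in the real demo the 'predicted' column would come from results.predict() of the fitted Poisson model and 'actual' from numbil0:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Stand-in data: hypothetical country labels, synthetic actual counts,
# and noisy 'predictions' in place of the fitted model's output.
rng = np.random.default_rng(3)
df = pd.DataFrame({
    "country": [f"country_{i}" for i in range(60)],
    "actual": rng.poisson(2.0, size=60),
})
df["predicted"] = df["actual"] + rng.normal(scale=1.0, size=60)
df["difference"] = df["predicted"] - df["actual"]

# Sort by discrepancy, highest first, and plot the top 15.
top15 = df.sort_values("difference", ascending=False).head(15)
top15.plot.barh(x="country", y="difference", legend=False)
plt.xlabel("Predicted minus actual number of billionaires")
plt.tight_layout()
plt.show()
```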

Below is the demo in English, हिंदी (Hindi), and తెలుగు (Telugu)

English

हिंदी (Hindi)

తెలుగు (Telugu)

Code:

Medium: https://kmeeraj.medium.com/16-mle-maximum-likelihood-estimation-354a0612c0ea
Github: https://github.com/kmeeraj/machinelearning/tree/develop
Github Demo: https://github.com/kmeeraj/machinelearning/blob/develop/algorithms/K%20Nearest%20Neighbour.ipynb
colab: https://colab.research.google.com/gist/kmeeraj/9c77ec63c31e3a6684be2d6035e292a7/k-nearest-neighbour.ipynb
Gist: https://gist.github.com/kmeeraj/9c77ec63c31e3a6684be2d6035e292a7
Reference : https://www.tutorialspoint.com/machine_learning_with_python/machine_learning_with_python_knn_algorithm_finding_nearest_neighbors.htm
Wiki: https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm
Confusion Matrix: https://en.wikipedia.org/wiki/Confusion_matrix
Sigmoid function: https://en.wikipedia.org/wiki/Sigmoid_function

Credit
Music: https://www.bensound.com

Social Media:
https://www.linkedin.com/in/meeraj-k-69ba76189/
https://facebook.com/meeraj.k.35
https://www.instagram.com/meeraj.kanaparthi1/
https://twitter.com/MeerajKanapart2
