Maximum Likelihood For Dummies

Josh Orrick
Published in J&T Tech
5 min read · Apr 1, 2022

Welcome! For the data science enthusiast and layperson alike, this is the first post in a series on Statistical Learning. The emphasis here is on concepts, with snippets of code where they help. Let’s jump into the first topic: Maximum Likelihood Estimation.

Background and Motivation

In data science and machine learning, a central problem is finding a parameterization of the sampling distribution of our data. In other words, the dataset is drawn from some unknown distribution governed by an unknown parameter:

Data points drawn from an unknown distribution:

$$x_1, x_2, \ldots, x_n \sim p(x \mid \theta)$$

We will discuss why this is an important problem, the procedures for solving it, and some connections to powerful machine learning algorithms such as linear regression.

As an example, imagine you were flipping a special coin with an unknown probability of getting Heads or Tails. In other words, we have the following parameterization:

$$P(H) = \theta, \qquad P(T) = 1 - \theta$$

From experience, we expect the parameter to equal 1/2, which means we have an equal chance of getting heads or tails (a “fair” coin). However, what if you flipped this magical coin 5 times and observed the sequence HHHHT? Would you expect this from an equal-probability coin? Under a fair coin, the specific sequence HHHHT has probability (1/2)^5 ≈ 3%, while a coin biased toward heads would make it noticeably more likely. If you kept flipping this coin and generating more observations, and still saw a disproportionate number of heads, you would probably start to believe the coin is not fair; in fact, you would start to believe that P(H) is a lot higher than P(T). This is the core idea behind inferring a parameter from our observations, where the data is sampled according to some underlying distribution: we would say that, for a fair coin, the likelihood of observing HHHHT is not high. Likelihood can be described as the probability of observing the dataset given a parameterization:

Likelihood function:

$$L(\theta) = P(x_1, x_2, \ldots, x_n \mid \theta)$$

The goal of maximum likelihood estimation (MLE) is then to estimate the parameter as the value that maximizes the probability (likelihood) of our data:

$$\hat{\theta} = \arg\max_{\theta} L(\theta)$$

Problem Solving

Of course, this idea of likelihood extends beyond simple coin flipping. It gives us a general problem-solving procedure for estimating distribution parameters. Let’s outline the process in general, then walk through examples for a typical Normal distribution and for the coin-flipping problem mentioned above:

1. Write down the likelihood function. This is the probability distribution of the data given the parameter. A common assumption is that each data point is generated independently, so the likelihood factors into a product:

$$L(\theta) = \prod_{i=1}^{n} p(x_i \mid \theta)$$

Example: Let’s walk through each of these steps for a Normal distribution with known variance, using MLE to estimate the mean. Step 1 gives the likelihood of the data:

Likelihood function for the Normal distribution:

$$L(\mu) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(x_i - \mu)^2}{2\sigma^2}\right)$$

2. Next, we simplify the likelihood by taking its logarithm, which gives the log-likelihood function. Remember, the goal is to maximize the likelihood. Since the logarithm is monotonically increasing (log(b) > log(a) exactly when b > a), maximizing the log-likelihood automatically maximizes the likelihood. In practice, we almost always skip directly to maximizing the log-likelihood as a proxy for the likelihood itself, since they are equivalent problems. The logarithm also turns the product over data points into a sum:

$$\ell(\theta) = \log L(\theta) = \sum_{i=1}^{n} \log p(x_i \mid \theta)$$

Example: Let’s calculate the log-likelihood for the Normal distribution example we started earlier:

Log-likelihood for the Normal distribution:

$$\ell(\mu) = \log L(\mu) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i - \mu)^2$$

3. Finally, we remember the good old days of calculus: we have a function and want to find its maximum. That means taking the derivative and finding where it equals zero. As a note, we could also use a numerical optimization technique such as gradient descent; either way, we have to compute the derivative. The argmax of this optimization problem is the parameter value that maximizes the likelihood.

Example: Let’s continue the Normal distribution example by solving for the optimal mean parameter:

MLE estimate for the mean parameter: setting the derivative of the log-likelihood to zero,

$$\frac{\partial \ell}{\partial \mu} = \frac{1}{\sigma^2}\sum_{i=1}^{n}(x_i - \mu) = 0 \quad\Longrightarrow\quad \hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

So, we derived the maximum likelihood estimate for the mean parameter, and guess what? It turns out to be the empirical mean! This matches the intuition that if we want to place a Gaussian around our data, the most likely Gaussian is the one centered at the sample mean of the data.
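To make this concrete, here is a minimal Python sketch (using NumPy and SciPy, with made-up data and the variance assumed known) that maximizes the Normal log-likelihood numerically and checks that the answer matches the sample mean:

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Made-up data: 1,000 samples from a Normal with true mean 3.0 and std 2.0
rng = np.random.default_rng(0)
data = rng.normal(loc=3.0, scale=2.0, size=1000)
sigma = 2.0  # the variance is treated as known, as in the example above

def neg_log_likelihood(mu):
    # Negative Normal log-likelihood (we minimize, so we negate)
    return -np.sum(-0.5 * np.log(2 * np.pi * sigma**2)
                   - (data - mu) ** 2 / (2 * sigma**2))

result = minimize_scalar(neg_log_likelihood)
print(result.x)     # numerical MLE of the mean
print(data.mean())  # empirical mean; the two should agree
```

Because the log-likelihood here is concave in the mean, the numerical optimizer and the closed-form solution land on the same value.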

Coin Flipping Example

Let’s go back to the coin-flip sequence from earlier. What value of P(H), our unknown parameter, makes the sequence HHHHT the most likely? Let’s follow the procedure:

Coin flip MLE: writing θ = P(H),

$$L(\theta) = \theta^4 (1 - \theta), \qquad \ell(\theta) = 4\log\theta + \log(1 - \theta)$$

$$\frac{d\ell}{d\theta} = \frac{4}{\theta} - \frac{1}{1 - \theta} = 0 \quad\Longrightarrow\quad \hat{\theta} = \frac{4}{5}$$

So, the maximum likelihood estimate for P(H) is 4/5. This should not be surprising, because our data sequence was HHHHT, so the ratio of heads was 4/5. In other words, if we choose P(H) as 4/5, then we are most likely to produce the data sequence HHHHT if we flip the coin 5 times.
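As a quick sanity check, here is a small Python sketch (NumPy assumed) that evaluates the likelihood of HHHHT over a grid of candidate values of P(H) and picks the maximizer:

```python
import numpy as np

# Candidate values for P(H)
p_grid = np.linspace(0.001, 0.999, 999)

# Likelihood of observing the sequence HHHHT: four heads, then one tail
likelihood = p_grid**4 * (1 - p_grid)

best_p = p_grid[np.argmax(likelihood)]
print(best_p)  # approximately 0.8, i.e. 4/5
```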

Connection to Regression Problems

Notice that in the Normal distribution example we came across something interesting: maximizing the log-likelihood came down to minimizing something that looked an awful lot like a least-squares error function. This is no coincidence! Minimizing the least-squares error of a linear model yields exactly the maximum likelihood estimate of the linear regression coefficients. To see this, recall the main assumption of linear regression: the regression labels y are generated by a linear function of the inputs plus Gaussian noise:

Linear regression is MLE: under the model

$$y_i = w^\top x_i + \epsilon_i, \qquad \epsilon_i \sim \mathcal{N}(0, \sigma^2),$$

maximizing the log-likelihood over w is the same problem as minimizing the squared error:

$$\hat{w}_{\mathrm{MLE}} = \arg\max_{w} \ell(w) = \arg\min_{w} \sum_{i=1}^{n} (y_i - w^\top x_i)^2$$

So, the MLE problem under this generating distribution is equivalent to minimizing the squared error. This connection between Gaussian noise and least squares goes back to Gauss and is one of the classic results in statistics. A similar result holds for logistic regression, where the MLE of the logistic coefficients is a minimizer of the logistic loss function.
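To see the equivalence numerically, here is a short Python sketch (NumPy and SciPy, with made-up data) that fits the same coefficients two ways: by ordinary least squares and by numerically maximizing the Gaussian log-likelihood:

```python
import numpy as np
from scipy.optimize import minimize

# Made-up regression data: y = 2*x + 1 + Gaussian noise
rng = np.random.default_rng(1)
X = np.column_stack([rng.uniform(-1, 1, 200), np.ones(200)])  # feature + intercept
y = X @ np.array([2.0, 1.0]) + rng.normal(scale=0.5, size=200)

# 1) Least squares: minimizes the squared error directly
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

# 2) MLE under the Gaussian-noise model: maximizes the log-likelihood
def neg_log_likelihood(w, sigma=0.5):
    residuals = y - X @ w
    return np.sum(0.5 * np.log(2 * np.pi * sigma**2)
                  + residuals**2 / (2 * sigma**2))

w_mle = minimize(neg_log_likelihood, x0=np.zeros(2)).x

print(w_lstsq)  # least-squares coefficients
print(w_mle)    # MLE coefficients; they should (numerically) agree
```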

Conclusion

We have walked through this powerful statistical parameter estimation tool and outlined the general procedure. It has many uses in machine learning, such as the formulation of many regression problems. It’s worth noting that there are times when this simple MLE procedure breaks down, such as when the log-likelihood isn’t concave or differentiable. This happens in mixture models, where a different procedure called Expectation-Maximization is used instead, and we will expand on that in a continuation of this series on statistical learning.

