Statistics from the ant’s perspective: What is maximum likelihood estimation?

James Dai
6 min read · Feb 3, 2023


A method to estimate the parameters of a model

Introduction

Likelihood

Likelihood is a concept in statistics used to describe the probability of observing a particular set of data given a set of parameters for a specific model, such as a prediction model for disease phenotypes.

It is a function that assigns a value to each set of parameters, based on how well the model with those parameters fits the observed data.

The likelihood function provides a way of comparing different models and different sets of parameter values to determine which is most likely to have generated the observed data. It is therefore a way to identify the optimal model for predictive purposes.

For example, consider a coin-tossing experiment. The likelihood of observing a particular sequence of heads and tails (e.g., HTHHT) given the parameter of a fair coin (i.e., the probability of heads is 0.5) can be calculated by multiplying the probabilities of the individual outcomes in the sequence. Comparing this value across different candidate probabilities of heads tells us which parameter value is most consistent with the observed sequence.
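To make this concrete, here is a minimal Python sketch; the sequence HTHHT and the candidate values of p are purely illustrative:

```python
# Likelihood of a coin-toss sequence under different values of p = P(heads).
def sequence_likelihood(sequence: str, p: float) -> float:
    """Multiply the probabilities of the independent tosses."""
    likelihood = 1.0
    for toss in sequence:
        likelihood *= p if toss == "H" else (1 - p)
    return likelihood

for p in [0.3, 0.5, 0.6, 0.7]:
    print(f"p = {p}: L = {sequence_likelihood('HTHHT', p):.5f}")
```

With 3 heads in 5 tosses, p = 0.6 gives the highest likelihood of these candidates, slightly beating the fair coin's 0.5.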

Likelihood and probability can even be computed from the very same formula, such as a probability mass function, yet they carry completely different meanings in mathematics.

Probability refers to the prediction of the result of the next observation based on a fixed set of model parameters.

For example, we might use a statistical model that follows a normal distribution to predict the probability that someone's weight is greater than 85 kg. So when we talk about probability, prediction is usually the goal.

[Figure: the probability of a weight above 85 as the area under the density curve of a fixed normal model. Image by author]
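As a quick sketch of the probability view in code (the mean of 70 kg and standard deviation of 10 kg are assumed, illustrative parameters, not values from the article):

```python
from scipy.stats import norm

# Probability: the parameters are fixed; we ask about a future observation.
model = norm(loc=70, scale=10)                 # assumed weight distribution (kg)
print(f"P(weight > 85) = {model.sf(85):.4f}")  # survival function = 1 - CDF
```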

Likelihood, on the other hand, evaluates a statistical model given some observed results. For example, if we observe that a person's weight is 85 kg, we can use the probability density function to calculate how likely that observation is under the model.

[Figure: the likelihood of the model parameters evaluated at the observed weight of 85. Image by author]
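And the likelihood view, holding the observation fixed and varying the parameters (the candidate parameter pairs below are again illustrative):

```python
from scipy.stats import norm

# Likelihood: the observation is fixed; we compare candidate parameters
# by evaluating the density at the observed value.
for mu, sigma in [(70, 10), (80, 10), (85, 10)]:
    print(f"L(mu={mu}, sigma={sigma} | x=85) = {norm.pdf(85, mu, sigma):.4f}")
```

The density, and hence the likelihood, is highest for the model whose mean matches the observed value.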

Consider the process of updating a statistical model. A statistical model infers or predicts the probability of a certain outcome from a specific set of parameters: for example, if the probability of heads in a coin flip is 0.5, the probability of getting heads twice in a row is 0.25. After collecting actual sample data, such as the results of several coin flips, we can calculate how likely that result is under the statistical model.

Different model parameters result in different likelihoods, and maximum likelihood estimation chooses the set of parameters with the highest likelihood.

The likelihood is given by the likelihood function L(μ, σ), which estimates how well a particular model fits the data of interest.

From the above explanation, you can also think of probability as "the probability of a data point occurring given a set of model parameters," and likelihood as "the probability that a set of model parameters generated the actually observed sample." Probability used in two different senses!

"The likelihood that any parameter (or set of parameters) should have any assigned value (or set of values) is proportional to the probability that if this were so, the totality of observations should be that observed." (Ronald Fisher, 1922)

Mathematics of the likelihood function

Based on the definition by Ronald Fisher, we can write the mathematical expression:

L(\mu, \sigma \mid x_1, \ldots, x_n) \propto P(x_1, \ldots, x_n \mid \mu, \sigma)

That is, the likelihood of a set of parameters (here, μ and σ) is proportional to the probability of observing all the samples (x₁ to xₙ) given μ and σ.

Imagine a normal distribution characterized by (μ, σ). The likelihood that this distribution fits the actual data is proportional to the probability of all the data points under (μ, σ). Thus, when a normal distribution with parameters (μ, σ) fits a set of data points well, a high proportion of samples with values around the mean yields a high overall probability of observing all the data points.
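A small sketch of this idea, using made-up data points and two candidate parameter settings:

```python
import numpy as np
from scipy.stats import norm

# Likelihood of a normal model (mu, sigma) for a whole sample: the product
# of the densities at each data point. The data values are illustrative.
data = np.array([4.8, 5.1, 5.3, 4.9, 5.4])

def likelihood(mu: float, sigma: float) -> float:
    return float(np.prod(norm.pdf(data, loc=mu, scale=sigma)))

print(f"L(mu=5.0, sigma=0.3) = {likelihood(5.0, 0.3):.4f}")
print(f"L(mu=6.0, sigma=0.3) = {likelihood(6.0, 0.3):.2e}")
```

Parameters close to the centre of the data give a far higher likelihood than parameters that place the mean away from it.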

In the context of machine learning

Consider a linear regression model, for example one with an intercept and two features:

y = \theta_1 + \theta_2 x_1 + \theta_3 x_2

We want to predict the probability of surviving longer than two years in patients with cancer. Here we have three parameters that have to be optimized: θ1, θ2 and θ3.

Next, we have to find the maximum likelihood of these three parameters:

\max \; L(\theta_1, \theta_2, \theta_3 \mid x_1, \ldots, x_n)

To make it more concise, we collect the parameters into a single symbol θ and write:

\max \; L(\theta \mid x)

Then we add arg (argument) to clarify that we want the parameter values at which the likelihood is maximized:

\arg\max_{\theta} \; L(\theta \mid x)

From what we have described above (the likelihood is proportional to the probability of the observed samples), we can get:

\arg\max_{\theta} \; P(x_1, \ldots, x_n \mid \theta)

Importantly, because the samples are independent, we can rewrite this equation as a product of individual probabilities:

\arg\max_{\theta} \; P(x_1 \mid \theta) \, P(x_2 \mid \theta) \cdots P(x_n \mid \theta)

Then, in product notation, with the product running from x₁ to xₙ:

\arg\max_{\theta} \; \prod_{i=1}^{n} P(x_i \mid \theta)

And because of arithmetic underflow (a situation in numerical computing where the result of a calculation is too small to be represented in the numerical format being used, so that significant digits are lost or the value is rounded to zero, sometimes causing numerical instability or incorrect results), we take the logarithm of both sides. Remember that log is monotonically increasing: if a > b, then log a > log b, so the logarithm does not move the maximum:

\arg\max_{\theta} \; \log \prod_{i=1}^{n} P(x_i \mid \theta)

Remember that the log of a product is the sum of the logs, so this can be changed into:

\arg\max_{\theta} \; \sum_{i=1}^{n} \log P(x_i \mid \theta)

Finally, combining the above notations, we can derive the equation:

\hat{\theta} = \arg\max_{\theta} \; \sum_{i=1}^{n} \log P(x_i \mid \theta)
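The sketch below, with simulated and purely illustrative data, shows both halves of this argument: the raw product of densities underflows to zero, while the sum of log densities stays well behaved and its argmax recovers the parameter:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(42)
data = rng.normal(loc=5.0, scale=2.0, size=10_000)  # simulated sample

# The product of 10,000 densities underflows to 0.0 ...
densities = norm.pdf(data, loc=5.0, scale=2.0)
print(np.prod(densities))           # 0.0 -- arithmetic underflow

# ... but the log-likelihood is a perfectly ordinary number.
print(np.sum(np.log(densities)))

# Because log is monotonic, the argmax is unchanged: a grid search over
# candidate means lands near the true value used to simulate the data.
candidates = np.linspace(3.0, 7.0, 81)
log_liks = [np.sum(norm.logpdf(data, loc=mu, scale=2.0)) for mu in candidates]
print(f"MLE of mu on the grid: {candidates[np.argmax(log_liks)]:.2f}")
```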

Example of normal distribution

Assume we know that the standard deviation of tumor size for hepatocellular tumors is 2.5 cm.

If we randomly sample 150 patients in our hospital, how do we estimate the mean tumor size for all patients?

To identify the optimal mean value, we first write down the probability density function of the normal distribution and incorporate the concept of maximum likelihood. For example, if we have three samples, the likelihood is the product of their densities:

F(\mu) = \prod_{i=1}^{3} \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left(-\frac{(x_i-\mu)^2}{2\sigma^2}\right)

which is equivalent to:

F(\mu) = \left(\frac{1}{\sigma\sqrt{2\pi}}\right)^{3} \exp\!\left(-\frac{(x_1-\mu)^2+(x_2-\mu)^2+(x_3-\mu)^2}{2\sigma^2}\right)

To obtain the maximum value of F, we first take the logarithm of this equation and compute the derivative of log F with respect to μ:

\frac{\partial}{\partial\mu} \log F = \frac{(x_1-\mu)+(x_2-\mu)+(x_3-\mu)}{\sigma^2}

Because we are looking for the maximum, we set the expression above equal to zero:

\frac{(x_1-\mu)+(x_2-\mu)+(x_3-\mu)}{\sigma^2} = 0 \quad\Rightarrow\quad \hat{\mu} = \frac{x_1+x_2+x_3}{3}

The result of the differentiation tells us that the sample mean is the maximum likelihood estimate of the population mean. Quite intuitive!
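We can check this numerically. The sketch below simulates 150 illustrative tumor sizes (not real patient data; the true mean of 4.0 cm is an arbitrary choice), maximizes the log-likelihood with σ fixed at 2.5 cm, and confirms that the estimate coincides with the sample mean:

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import norm

rng = np.random.default_rng(0)
sizes = rng.normal(loc=4.0, scale=2.5, size=150)  # 150 simulated "patients"

def neg_log_lik(mu: float) -> float:
    """Negative log-likelihood with sigma known to be 2.5 cm."""
    return -np.sum(norm.logpdf(sizes, loc=mu, scale=2.5))

result = minimize_scalar(neg_log_lik, bounds=(0.0, 10.0), method="bounded")
print(f"MLE of mu:   {result.x:.4f}")
print(f"Sample mean: {sizes.mean():.4f}")  # the two agree
```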

Conclusion

Maximum likelihood estimation is a method for estimating model parameters. The goal is to find, from the actually observed sample, the model parameters that are most likely to have produced those observations.

I appreciate all your support if you like this article!
