Image from https://statistics-analytics.uark.edu/

Maximum Likelihood for the Normal Distribution

Lorenzo Castagno

--

Let’s start with the equation for the normal distribution, or normal curve.
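In symbols, the equation is:

```latex
f(x \mid \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}
```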

It has two parameters. The first parameter, the Greek character μ (mu), determines the location of the normal distribution’s mean.

a) A smaller value for μ moves the mean of the distribution to the left.

b) A larger value for μ moves the mean of the distribution to the right.


The second parameter, the Greek character σ (sigma), is the standard deviation and determines the normal distribution’s width.

a) A larger value for σ makes the normal curve shorter and wider.

b) A smaller value for σ makes the normal curve taller and narrower.

We’re going to use the likelihood of the normal distribution to find the optimal parameters, μ (the mean) and σ (the standard deviation), given some data x.

Let’s start with the simplest data set of all: a single measurement.

The goal of this super simple example is to convey the basic concepts of how to find the maximum likelihood estimates for μ and σ.

Here we’ve measured a light bulb, and it weighs 32 grams.

Now just to see what happens…

We can overlay a normal distribution with μ = 28 and σ = 2 onto the data

and then plug the numbers into this equation.
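With x = 32, μ = 28, and σ = 2, that looks like this:

```latex
L(\mu = 28, \sigma = 2 \mid x = 32) = \frac{1}{2\sqrt{2\pi}}\, e^{-\frac{(32-28)^2}{2 \cdot 2^2}} = \frac{1}{2\sqrt{2\pi}}\, e^{-2}
```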

The likelihood of the curve with μ = 28 and σ = 2, given the data, is 0.03.
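As a quick numerical check, here is a minimal sketch that evaluates the same density with SciPy (the library choice is mine, not part of the original walkthrough):

```python
# The likelihood of the curve with mu = 28 and sigma = 2, given the single
# measurement x = 32 grams, is the normal density evaluated at x.
from scipy.stats import norm

likelihood = norm.pdf(32, loc=28, scale=2)
print(likelihood)  # ~0.027, which rounds to the 0.03 quoted above
```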

Now we can shift the distribution a little bit to the right by setting μ = 30 and then calculate the likelihood.

Again, we just plug the numbers into the likelihood function.
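Working through the arithmetic with the new mean gives:

```latex
L(\mu = 30, \sigma = 2 \mid x = 32) = \frac{1}{2\sqrt{2\pi}}\, e^{-\frac{(32-30)^2}{2 \cdot 2^2}} \approx 0.12
```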

If we decide to fix σ = 2, so that it is a given just like the data, then we can plug in a whole bunch of values for μ and see which one gives the maximum likelihood.

For example, if we start with the mean of the distribution over on the left, at 20 grams, we get a very, very small likelihood, approximately 0.000000003.

If we plot the likelihoods for a whole range of values for μ, the maximum likelihood estimate is at the peak of the curve, where the slope equals zero. In this case, the slope equals zero when μ = 32.
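Here is a minimal sketch of that grid search, again using SciPy (the grid bounds and step size are illustrative choices):

```python
import numpy as np
from scipy.stats import norm

x = 32       # the single measurement, in grams
sigma = 2    # held fixed, just like the data

mus = np.arange(20, 45, 0.5)                     # candidate values for mu
likelihoods = norm.pdf(x, loc=mus, scale=sigma)  # one likelihood per candidate

best_mu = mus[np.argmax(likelihoods)]
print(best_mu)  # 32.0: the likelihood peaks when mu equals the measurement
```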

Now we can fix μ = 32 and treat it like a given just like the data.

And we can plug in different values for σ to find the one that gives the maximum likelihood.

Note: You actually need more than one measurement to find the optimal value for σ.

If we had more data, then we could plot the likelihoods for different values of σ, and the maximum likelihood estimate for σ would be at the peak, where the slope of the curve equals zero.

To solve for the maximum likelihood estimate for μ we treat σ like it’s a constant and then find where the slope of its likelihood function is 0.

And to solve for the maximum likelihood estimate for σ we treat μ like it’s a constant and then find where the slope of its likelihood function is 0.

The example with one measurement kept the math simple, but now I think we’re ready to dive in a little deeper.

So let’s use a two-sample data set to calculate the likelihood of a normal distribution.

To keep track of things, let’s call the first bulb, which weighs 32 grams, X_1, and the second bulb, which weighs 34 grams, X_2.

We’ve already seen how to calculate the likelihood of this curve given X_1, the light bulb that weighs 32 grams, and we can calculate the likelihood of the curve given X_2 by plugging 34 into the same likelihood function.

But what’s the likelihood of this normal curve given both X_1 and X_2?

These measurements are independent (i.e., weighing X_1 did not have an effect on weighing X_2), so the likelihood given both measurements is just the product of the two individual likelihoods.

So we just plug in the numbers and do the math.

And that gives us a really small number:
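Assuming, for illustration, the same curve as before (μ = 28 and σ = 2), the product works out to:

```latex
L(\mu = 28, \sigma = 2 \mid x_1 = 32, x_2 = 34) = \frac{1}{2\sqrt{2\pi}}\, e^{-\frac{(32-28)^2}{2 \cdot 2^2}} \times \frac{1}{2\sqrt{2\pi}}\, e^{-\frac{(34-28)^2}{2 \cdot 2^2}} \approx 0.00006
```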

If we had a third data point, then we would just add it to the given side of the overall likelihood function and multiply in its individual likelihood.

With n data points, we add all n data points to the given side of the overall likelihood function and then multiply together all n individual likelihood functions.

Now we know how to calculate the likelihood of a normal distribution when we have more than one measurement: we just multiply together the individual likelihoods.

Let’s solve for the maximum likelihood estimates for μ and σ.

Here’s the likelihood function without any values specified for μ and σ. It equals the product of the likelihood functions for the n individual measurements, and here’s what the equation looks like:
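```latex
L(\mu, \sigma \mid x_1, x_2, \ldots, x_n) = \prod_{i=1}^{n} \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x_i-\mu)^2}{2\sigma^2}}
```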

What we need to do is take two different derivatives of this equation:

One derivative will be with respect to μ, where we treat σ like it’s a constant, and we can find the maximum likelihood estimate for μ by finding where this derivative equals zero.

The other derivative will be with respect to σ, where we treat μ like it’s a constant, and we can find the maximum likelihood estimate for σ by finding where this derivative equals zero.

But before we try to take any derivatives, let’s take the log of the likelihood function.

We do this because it makes taking the derivative way way easier

The likelihood function and the log of the likelihood function both peak at the same values for μ and σ.

Now we’re going to go, step by step, through all of the transformations that taking the log applies to this function.

First, the log transforms the multiplication into addition:
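```latex
\ln L(\mu, \sigma \mid x_1, \ldots, x_n) = \ln\!\left[\frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x_1-\mu)^2}{2\sigma^2}}\right] + \cdots + \ln\!\left[\frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x_n-\mu)^2}{2\sigma^2}}\right]
```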

Let’s focus on the first term in this sum.

Convert the multiplication into addition.

Then convert 1 over the square root into the exponent −1/2.

On the right side, convert the exponent into multiplication, using the fact that ln(e) = 1.

Back in the first part, convert the −1/2 exponent into multiplication.

Putting everything together and summarizing:
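```latex
\ln\!\left[\frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x_1-\mu)^2}{2\sigma^2}}\right] = \ln\!\left[(2\pi\sigma^2)^{-1/2}\right] + \ln\!\left[e^{-\frac{(x_1-\mu)^2}{2\sigma^2}}\right] = -\frac{1}{2}\ln(2\pi) - \ln(\sigma) - \frac{(x_1-\mu)^2}{2\sigma^2}
```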

For the term in the middle, I used the same rule: the exponent 2 was added as a multiplication, so −(1/2) ln(σ²) becomes −ln(σ).

And by following the same steps, we can transform the remaining parts of the sum into:
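```latex
\ln L = \left[-\frac{1}{2}\ln(2\pi) - \ln(\sigma) - \frac{(x_1-\mu)^2}{2\sigma^2}\right] + \cdots + \left[-\frac{1}{2}\ln(2\pi) - \ln(\sigma) - \frac{(x_n-\mu)^2}{2\sigma^2}\right]
```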

Just to be clear about how we simplify, keep in mind that since we have n data points, we have a term like this for the first data point, X_1, and similar terms for the remaining n − 1 data points.

Then all n of the −ln(σ) terms can be combined into −n ln(σ),

and the last parts of each term stay the same.

This is the log of the likelihood function after simplification, and it is what we will take the derivatives of, so let’s keep it handy for reference:
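```latex
\ln L(\mu, \sigma \mid x_1, \ldots, x_n) = -\frac{n}{2}\ln(2\pi) - n\ln(\sigma) - \frac{(x_1-\mu)^2}{2\sigma^2} - \cdots - \frac{(x_n-\mu)^2}{2\sigma^2}
```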

We’ll start by taking the derivative with respect to μ.

This derivative is the slope function for the log of the likelihood curve and we’ll use it to find the peak.

The first term doesn’t contain μ, so its derivative is 0. The second term doesn’t contain μ either, so its derivative is also 0.

The third term contains μ, so now we have to do some work. Specifically, the numerator contains μ, so we have to apply the chain rule. Remember, the derivative is with respect to μ (σ is a constant and, thus, the denominator doesn’t change):
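```latex
\frac{\partial}{\partial\mu}\left[-\frac{(x_1-\mu)^2}{2\sigma^2}\right] = -\frac{2(x_1-\mu)}{2\sigma^2} \cdot (-1) = \frac{x_1-\mu}{\sigma^2}
```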

We can apply the same logic to the remaining terms and get:
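```latex
\frac{\partial}{\partial\mu}\ln L = \frac{x_1-\mu}{\sigma^2} + \frac{x_2-\mu}{\sigma^2} + \cdots + \frac{x_n-\mu}{\sigma^2}
```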

Then we can pull the σ² out, add the numerators together, and combine the measurements and the μ’s:
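```latex
\frac{\partial}{\partial\mu}\ln L = \frac{1}{\sigma^2}\left[(x_1 + x_2 + \cdots + x_n) - n\mu\right]
```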

Now, let’s take the derivative of the log-likelihood function with respect to σ.

This derivative is the slope function for the log of the likelihood curve, and we’ll use it to find the peak.

So, from here on out, because they peak at the same spot, I’ll show you the likelihood functions instead of the log-likelihood functions.

Recall the simplified log of the likelihood function from above.

The first term doesn’t contain σ, so its derivative is zero. The derivative of the second term, −n ln(σ), is just −n/σ.

The derivative of the third term isn’t tricky, but it’s easier to figure out when we rewrite 1/σ² as σ⁻²:
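```latex
\frac{\partial}{\partial\sigma}\left[-\frac{(x_1-\mu)^2}{2}\,\sigma^{-2}\right] = -\frac{(x_1-\mu)^2}{2} \cdot (-2\sigma^{-3}) = \frac{(x_1-\mu)^2}{\sigma^3}
```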

We can apply the same logic to the remaining terms and get the derivative of the log-likelihood function with respect to σ:
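```latex
\frac{\partial}{\partial\sigma}\ln L = -\frac{n}{\sigma} + \frac{(x_1-\mu)^2}{\sigma^3} + \cdots + \frac{(x_n-\mu)^2}{\sigma^3}
```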

Simplifying:
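```latex
\frac{\partial}{\partial\sigma}\ln L = -\frac{n}{\sigma} + \frac{(x_1-\mu)^2 + \cdots + (x_n-\mu)^2}{\sigma^3}
```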

To find the maximum likelihood estimate for μ, we need to solve for where the derivative with respect to μ equals 0, because the slope is zero at the peak of the curve.

Likewise, to find the maximum likelihood estimate for σ, we need to solve for where the derivative with respect to σ equals 0.

So let’s set the derivative with respect to μ to 0 and solve for μ.

We start by multiplying both sides by σ², which makes the σ² go away. Then we add n times μ to both sides, divide both sides by n, and solve:
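```latex
0 = \frac{1}{\sigma^2}\left[(x_1 + \cdots + x_n) - n\mu\right] \;\Rightarrow\; 0 = (x_1 + \cdots + x_n) - n\mu \;\Rightarrow\; n\mu = x_1 + \cdots + x_n \;\Rightarrow\; \mu = \frac{x_1 + x_2 + \cdots + x_n}{n}
```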

The maximum likelihood estimate for μ is the mean of the measurements.

So that is where the center of our normal curve will go.

Now we need to set the derivative with respect to σ to 0.

We multiply both sides by σ, add n to both sides, multiply both sides by σ², divide both sides by n, and take the square root of both sides. At long last:
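```latex
0 = -\frac{n}{\sigma} + \frac{(x_1-\mu)^2 + \cdots + (x_n-\mu)^2}{\sigma^3} \;\Rightarrow\; n\sigma^2 = (x_1-\mu)^2 + \cdots + (x_n-\mu)^2 \;\Rightarrow\; \sigma = \sqrt{\frac{(x_1-\mu)^2 + \cdots + (x_n-\mu)^2}{n}}
```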

We see that the maximum likelihood estimate for σ is the standard deviation of the measurements (the version that divides by n rather than n − 1).

We use the formula for the standard deviation to determine the width of the normal curve that best fits the given data.

In summary, the mean of the data is the maximum likelihood estimate for where the center of the normal distribution should go, and the standard deviation of the data is the maximum likelihood estimate of how wide the normal curve should be.
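To make the summary concrete, here is a minimal sketch that computes both estimates for the two-bulb data set (using NumPy; the variable names are mine):

```python
import numpy as np

weights = np.array([32.0, 34.0])     # the two light-bulb measurements, in grams

mu_hat = np.mean(weights)            # MLE for mu: the mean of the data
sigma_hat = np.std(weights, ddof=0)  # MLE for sigma: sqrt(sum((x - mu)^2) / n)

print(mu_hat, sigma_hat)             # 33.0 1.0
```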
