Gaussian Distribution and Maximum Likelihood Estimate Method (Step-by-Step)

Anel Music · Published in The Startup · Jun 11, 2020 · 7 min read

Aim of this article:

  • Understand 1D Gaussian Distribution
  • Use the MLE-Method to determine the Gaussian model parameters

1. What is a Gaussian Distribution?

When we use the term Gaussian distribution, also known as Normal distribution, we think of data that looks like this:

Histogram of normally distributed data. The Bell curve is shown in red.

If something looks “Gaussian” or normally distributed, we typically think of data that is symmetric about its mean and can be described by a characteristic bell curve. This bell curve is defined by the Gaussian function, often simply called a Gaussian. It turns out that the Gaussian is pretty simple, as it can be described using only two parameters, namely the mean μ and the variance σ²:

Gaussian function:

f(x | μ, σ²) = 1/√(2πσ²) · exp(−(x − μ)² / (2σ²))

1.2. Standard Normal Distribution:

If we set the mean μ = 0 and the variance σ² = 1, we get the so-called Standard Normal Distribution:

Standard Normal Distribution:

f(x) = 1/√(2π) · exp(−x² / 2)

Now, you may wonder how these two parameters influence the shape of the Gaussian. Let’s look at some examples.

1.3. Changing the mean μ:

As described above, a Gaussian distribution is symmetric about its mean. If the mean is positive, the distribution is shifted to the right of zero, and if the mean is negative, it is shifted to the left.

1.3.1 Positive mean:

Gaussian with μ = 1 and σ² = 1

1.3.2 Negative mean:

Gaussian with μ = -1 and σ² = 1

1.4. Changing the variance σ²:

The variance describes how widely the data is spread. Distributions with a higher variance are spread more widely and have a lower peak, whereas distributions with a lower variance are concentrated mostly around the mean and have a higher peak.

1.4.1 Higher variance:

Gaussian with μ = 0 and σ² = 2

1.4.2 Lower variance:

Gaussian with μ = 0 and σ² = 0.7
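To make the effect of the two parameters concrete, here is a minimal plotting sketch. It assumes NumPy and Matplotlib are available; the specific parameter values are just the illustrative ones used above:

import numpy as np
import matplotlib.pyplot as plt

def gaussian_pdf(x, mu, sigma_sq):
    # Gaussian density with mean mu and variance sigma_sq
    return np.exp(-(x - mu) ** 2 / (2 * sigma_sq)) / np.sqrt(2 * np.pi * sigma_sq)

x = np.linspace(-5, 5, 500)

# Shifting the mean moves the bell curve left or right,
# changing the variance makes it wider (lower peak) or narrower (higher peak).
for mu, sigma_sq in [(0, 1), (1, 1), (-1, 1), (0, 2), (0, 0.7)]:
    plt.plot(x, gaussian_pdf(x, mu, sigma_sq), label=f"μ={mu}, σ²={sigma_sq}")

plt.legend()
plt.xlabel("x")
plt.ylabel("density")
plt.show()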

1.5 Time to get more practical:

1.5.1 Ball Color Example:

Let’s suppose that, for some tracking purposes, we want to model the color of a yellow ball.

If we apply an algorithm to segment the yellow ball, we can plot the histogram of its pixel values:

Illustration of ball segmentation and resulting pixel value distribution

As we can see, the data looks “Gaussian” and seems to have its mean around 54. The remaining question is how to determine the parameters of the Gaussian model that describes this data. Fortunately, there is a method for determining the parameters of a probability distribution, called the Maximum Likelihood Estimate, or simply MLE.
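As a rough sketch of how such a histogram could be produced: the synthetic image, the single intensity channel, and the threshold-based mask below are purely illustrative assumptions, not the segmentation method used for the figure above.

import numpy as np
import matplotlib.pyplot as plt

# Illustrative placeholder: any 2D array of pixel intensities will do here.
image = np.random.normal(loc=54, scale=8, size=(200, 200))

# Hypothetical segmentation: keep only pixels whose intensity is "ball-like".
mask = (image > 30) & (image < 80)
ball_pixels = image[mask]

# Histogram of the segmented pixel values.
plt.hist(ball_pixels, bins=50, color="gold", edgecolor="black")
plt.xlabel("pixel value")
plt.ylabel("count")
plt.show()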

1.5.2 Maximum-Likelihood-Estimate:

Our objective is to determine the model parameters of the ball color distribution, namely μ and σ². We can do that by maximizing the probability of our data x given the model parameters μ and σ², often referred to as the Likelihood. This means that we are interested in obtaining the parameters of our model that maximize the Likelihood for a given set of observations. In our case, the observations are the pixel values of the yellow ball.

If we express this mathematically we can write this objective as:

Objective:

μ*, σ²* = argmax_{μ, σ²} p(x | μ, σ²)

Assuming independence of the observations, which means that the probability that a certain pixel value occurs is independent of the other pixel values, we can express the Joint Likelihood (the Likelihood of all pixel values occurring together) as the product of the individual pixel Likelihoods:

p(x | μ, σ²) = ∏_{i=1}^{N} p(x_i | μ, σ²)

where N is the number of pixel values, i.e. the number of measurements or observations.

However, we can simplify this objective even further. One simple approach is to use the Log-Likelihood instead of the Likelihood. At this point, you probably want to know why we should use the logarithmic function. It turns out that there are three main reasons:

Log-Properties:

  • 1. Log turns products into sums, which is often easier to handle (see the quick check after this list): ln(a·b) = ln(a) + ln(b), and likewise ln(a/b) = ln(a) − ln(b)
  • 2. ln(x) is concave and strictly increasing, so taking the log does not introduce any new maxima
  • 3. If x* is the maximum value of a given set of values, then ln(x*) is the maximum of their logarithms
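A quick numerical check of properties 1 and 3, as a minimal sketch assuming NumPy; the sample values are arbitrary:

import numpy as np

values = np.array([0.2, 0.5, 0.9, 0.4])

# Property 1: the log of a product equals the sum of the logs.
print(np.log(np.prod(values)))       # -3.324...
print(np.sum(np.log(values)))        # same value

# Property 3: the largest value and the largest log-value sit at the same index,
# so maximizing the Log-Likelihood picks the same winner as maximizing the Likelihood.
print(np.argmax(values), np.argmax(np.log(values)))   # 2 2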

Using property 1 from above, we can rewrite our objective function from a product of Likelihoods to a sum of Log-Likelihoods:

μ*, σ²* = argmax_{μ, σ²} Σ_{i=1}^{N} ln p(x_i | μ, σ²)

Keep in mind that due to property 3, maximizing the Log-Likelihood is the same as maximizing the Likelihood. Therefore the parameter values that maximize the Likelihood are the same parameters that maximize the Log-Likelihood.

Remember the form of the Gaussian function:

p(x_i | μ, σ²) = 1/√(2πσ²) · exp(−(x_i − μ)² / (2σ²))

which gives the following term:

μ*, σ²* = argmax_{μ, σ²} Σ_{i=1}^{N} ln( 1/√(2πσ²) · exp(−(x_i − μ)² / (2σ²)) )

Now, we can use property 1 to simplify the expression inside the ln(…):

μ*, σ* = argmax_{μ, σ} Σ_{i=1}^{N} [ ln(1/√(2πσ²)) + ln(exp(−(x_i − μ)² / (2σ²))) ]
       = argmax_{μ, σ} Σ_{i=1}^{N} [ −½·ln(2π) − ln(σ) − (x_i − μ)² / (2σ²) ]

Finally, we can remove the constant terms that are not influenced by our model parameters μ, σ:

μ*, σ* = argmax_{μ, σ} Σ_{i=1}^{N} [ −ln(σ) − (x_i − μ)² / (2σ²) ]

There is just one last thing that we can change to use the standard notation, which is turning the maximization problem into a minimization problem by changing the sign:

μ*, σ* = argmin_{μ, σ} Σ_{i=1}^{N} [ ln(σ) + (x_i − μ)² / (2σ²) ]

Substituting the objective function as

J(μ, σ) = Σ_{i=1}^{N} [ ln(σ) + (x_i − μ)² / (2σ²) ]

gives us the final statement:

μ*, σ* = argmin_{μ, σ} J(μ, σ)

There is still one remaining question: how do we find the parameters μ and σ? In other words, how do we solve the minimization problem? If we use the optimality condition from convex optimization, which says that the slope at the minimum is zero, we can determine μ and σ analytically by calculating the partial derivatives and setting them to zero, like so:

1.5.3 Partial derivative with respect to μ:

For educational purposes I will provide the full derivation of the partial derivative with respect to μ:

∂J/∂μ = Σ_{i=1}^{N} ∂/∂μ [ ln(σ) + (x_i − μ)² / (2σ²) ] = Σ_{i=1}^{N} −(x_i − μ) / σ²

Setting this to zero:

Σ_{i=1}^{N} (x_i − μ) = 0  ⟹  Σ_{i=1}^{N} x_i = N·μ  ⟹  μ = (1/N) · Σ_{i=1}^{N} x_i

In other words, the maximum likelihood estimate of μ is simply the sample mean of the observations.

1.5.4 Partial derivative with respect to σ:

For σ, I will only provide the final formula. However, the main idea remains the same: first determine the partial derivative with respect to σ, set it to zero, and solve the equation for σ:

∂J/∂σ = Σ_{i=1}^{N} [ 1/σ − (x_i − μ)² / σ³ ] = N/σ − (1/σ³) · Σ_{i=1}^{N} (x_i − μ)² = 0

σ² = (1/N) · Σ_{i=1}^{N} (x_i − μ)²

which is the sample variance of the observations, computed around the estimated mean μ.
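The two closed-form results are easy to check numerically. Here is a minimal sketch assuming NumPy; the synthetic pixel values centered around 54 are only an illustration of the ball example above:

import numpy as np

# Synthetic "pixel values" standing in for the segmented yellow-ball pixels.
rng = np.random.default_rng(0)
x = rng.normal(loc=54.0, scale=8.0, size=5000)
N = len(x)

# Maximum likelihood estimates derived above:
mu_hat = np.sum(x) / N                        # sample mean
sigma_sq_hat = np.sum((x - mu_hat) ** 2) / N  # sample variance (dividing by N, not N-1)

print(mu_hat)        # close to 54
print(sigma_sq_hat)  # close to 8² = 64

# Sanity check against NumPy's built-ins (np.var uses ddof=0 by default,
# which matches the MLE formula).
print(np.mean(x), np.var(x))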

1.6 Final words:

If your data looks normally distributed, you can model it using a Gaussian. A Gaussian is simple, as it has only two parameters, μ and σ. To determine these two parameters we use the Maximum Likelihood Estimate method. This method estimates the parameters of a model given some data: we have the data, and what we are looking for are the parameters that are most likely to have produced it. It turns out that using the log function further simplifies the problem, since the multiplication of Likelihoods turns into a summation of Log-Likelihoods. By changing the sign of the equation we can turn the maximization problem into a minimization problem. Solving the minimization problem means that we must determine the partial derivatives with respect to each model parameter, which in our case are μ and σ. Once we have the partial derivatives, we set them to zero and solve for μ and σ respectively, which gives us both formulas: the sample mean and the sample variance.

I hope you learned something new and if so, please leave a like.

Cheers
