Maximum Likelihood Estimation (MLE) for Machine Learning

Brijesh Singh
Published in Nucleusbox · 3 min read · Jun 18, 2020

In the Logistic Regression for Machine Learning using Python blog, I introduced the basic idea of the logistic function and the likelihood, that is, finding the best fit for the sigmoid curve.

We discussed the cost function, and we also saw two ways of optimizing a cost function:

  1. Closed-form solution
  2. Iterative form solution

In the iterative approach, we focused on the Gradient Descent optimization method (see An Intuition Behind Gradient Descent using Python).

In this section, we introduce the Maximum Likelihood cost function, which we would like to maximize.

Maximum Likelihood Cost Function

There are two types of random variables:

  • Discrete
  • Continuous

A discrete variable takes on a finite (or countable) number of separate values. For example, in a coin-toss experiment only heads or tails can appear, and a die roll can only show the values 1 through 6.
A continuous variable can take any value in a range; the height of a person, such as 5 ft, 5.5 ft, or 6 ft, is an example.

The value a random variable takes is governed by a probability distribution.

Let's say you have N observations x1, x2, x3, …, xN.

For example, suppose each data point represents the height of a person. For these data points, we'll assume that the data-generation process is described by a Gaussian (normal) distribution.

As we know, any Gaussian (normal) distribution has two parameters: the mean μ and the standard deviation σ. So if we minimize or maximize the cost function as needed, we obtain the optimized μ and σ.
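As an illustration (not from the original post), here is a minimal Python sketch using made-up height values: for a Gaussian, the maximum-likelihood estimates of μ and σ have a closed form, namely the sample mean and the biased sample standard deviation.

```python
import numpy as np

# Hypothetical heights in feet (made-up data for illustration).
heights = np.array([5.0, 5.5, 6.0, 5.8, 5.2, 5.6])

# For a Gaussian, the MLE has a closed form:
mu_hat = heights.mean()            # MLE of the mean mu
sigma_hat = heights.std(ddof=0)    # MLE of sigma divides by N, not N - 1

print(mu_hat, sigma_hat)
```

Note that the MLE of σ divides by N (`ddof=0`), not the N − 1 of the usual unbiased estimator.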

In the example above, the red curve is the distribution that maximizes the likelihood.

Since we choose the θ of the red curve, we want the probability of the data under it to be high. In other words, we would like to choose the θ that maximizes the probability of the observations x1, x2, x3, …, xN.

Now that we have this cost function defined in terms of θ, we add an assumption to simplify it.

We assume X1, X2, X3, …, XN are independent, which means the observations are a random sample drawn from the same distribution. With this random-sampling assumption, the joint distribution factorizes, so we can write the cost function as a product of the individual probabilities.
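To make the independence step concrete, here is a small sketch with made-up numbers: the joint likelihood of independent observations is the product of their individual densities, and its logarithm is the sum of the log-densities.

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) evaluated at x."""
    return np.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

x = np.array([5.0, 5.5, 6.0])   # hypothetical observations
mu, sigma = 5.5, 0.5            # a candidate theta

# Independence: the joint likelihood is the product of individual densities.
likelihood = np.prod(gaussian_pdf(x, mu, sigma))

# Taking the log turns the product into a sum.
log_likelihood = np.sum(np.log(gaussian_pdf(x, mu, sigma)))

assert np.isclose(np.log(likelihood), log_likelihood)
```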

General Steps

We take the logarithm to turn the product of exponential terms into a sum of linear terms. So, in general, these three steps are used:

  • Define the Cost function
  • Making the independent assumption
  • Taking the log to simplify

So let's follow all three steps for the Gaussian distribution, where θ is nothing but μ and σ.
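The three steps can be sketched in Python on a hypothetical sample: we write down the Gaussian negative log-likelihood (the quantity MLE minimizes) and check that the closed-form estimates, the sample mean and the biased standard deviation, score at least as well as nearby perturbed parameters.

```python
import numpy as np

data = np.array([4.8, 5.1, 5.5, 5.9, 6.2])  # hypothetical sample

def neg_log_likelihood(mu, sigma, x):
    """Negative Gaussian log-likelihood; MLE minimizes this."""
    n = len(x)
    return (0.5 * n * np.log(2 * np.pi * sigma ** 2)
            + np.sum((x - mu) ** 2) / (2 * sigma ** 2))

# Closed-form MLE for the Gaussian.
mu_hat, sigma_hat = data.mean(), data.std(ddof=0)

# The closed-form estimates should beat nearby parameter values.
best = neg_log_likelihood(mu_hat, sigma_hat, data)
assert best <= neg_log_likelihood(mu_hat + 0.1, sigma_hat, data)
assert best <= neg_log_likelihood(mu_hat, sigma_hat + 0.1, data)
```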

Maximum Likelihood Estimation for Continuous Distributions

The MLE technique finds the parameters that maximize the likelihood of the observations. For example, in a normal (Gaussian) distribution, the parameters are the mean μ and the standard deviation σ.

For example, suppose we have the ages of 1,000 randomly chosen people, and the data are normally distributed. There is a general rule of thumb that many natural quantities follow a Gaussian distribution; the central limit theorem plays a big role here, but it only applies to large datasets.
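As a sketch of that example (with simulated rather than real ages, and made-up true parameters), fitting a Gaussian by MLE to 1,000 draws recovers estimates close to the true values:

```python
import numpy as np

rng = np.random.default_rng(0)
true_mu, true_sigma = 35.0, 10.0                 # hypothetical population values
ages = rng.normal(true_mu, true_sigma, size=1000)  # simulated ages of 1,000 people

mu_hat = ages.mean()            # MLE of mu
sigma_hat = ages.std(ddof=0)    # MLE of sigma

# With 1,000 samples, the estimates should land near the truth.
assert abs(mu_hat - true_mu) < 1.0
assert abs(sigma_hat - true_sigma) < 1.0
```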

