Introduction to Maximum Likelihood Estimate

M S SASIDHAR
Oct 23, 2018

When I first heard the term MLE (Maximum Likelihood Estimate) in regression, it didn't make any sense. I naively thought of it as nothing but the probability function. But after a lot of searching on Google, I realized that the likelihood and probability functions are completely different from each other. So I wanted to share my understanding of MLE in the simplest manner and clarify it for those who still think it is the same as the probability function. Let us dive into it!

INTRODUCTION:-

Imagine that you now want to prepare a milkshake that you liked most last week in a restaurant. You get the milk, sugar, ice cream, nuts, etc., start pouring them in proportions into a juicer, and vary the proportions until it tastes exactly the way you wanted!

Remember the childhood days when you eagerly waited for your favorite song to be played on the radio? So how did you listen to it? You rotated the tuner slowly until the song was heard perfectly (without noise), exactly the way you wanted.

In all the above cases you are adjusting a few things in order to obtain the desirable outcome you imagined or experienced, and the adjustments are done until you obtain the desired taste and sound, as in the examples above.

Hurrah! You have been applying the Maximum Likelihood Estimate all your life without even consciously knowing it! Not convinced? OK… let's go into the details!

Mathematical and Statistical Aspects:

Before we get into Maximum Likelihood, let us first discuss probabilities, conditional probabilities, and probability distributions, as we use them extensively in MLE.

Note:- This is just a very short introduction to the statistics; you can always refer to other resources to learn more.

Probability refers to the chance of something happening.

Example:- In a coin toss, the probability of a head is 0.5 or 50% (assuming the coin is fair).

Notation:- P(H) = 0.5

Conditional Probability refers to the chance of something happening given that some other event has already happened… Sounds confusing?

Example:- Assume that you already know you have a fever; now the probability that it is a viral fever is 0.3. This is a conditional probability.

Notation:- P(Viral fever | fever) = 0.3
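As a quick sketch with made-up numbers (the joint probability below is a hypothetical value chosen to match the example), conditional probability is just the joint probability divided by the probability of the event we already know:

# Hypothetical joint and marginal probabilities for the fever example
p_viral_and_fever = 0.24   # assumed: P(viral fever AND fever)
p_fever = 0.8              # assumed: P(fever)

# Conditional probability: P(viral fever | fever) = P(viral AND fever) / P(fever)
p_viral_given_fever = p_viral_and_fever / p_fever
print(p_viral_given_fever)  # 0.3 (up to floating point), matching the example above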

A Random Variable (X) is a variable whose possible values are numerical outcomes of a random phenomenon.

Example:- In a single coin toss, the random variable captures whether a head or a tail occurs.

X = 1 if a head occurs

X = 0 if a tail occurs

(or vice versa, depending on whether head or tail is the outcome of interest)
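A minimal simulation sketch using Python's standard random module, mapping one toss of a fair coin to this random variable:

import random

# One toss of a fair coin: X = 1 for a head, X = 0 for a tail
X = random.choice([1, 0])
print(X)  # prints 1 or 0, each with probability 0.5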

Probability Distribution is a mathematical function that provides the probabilities of occurrence of different possible outcomes in an experiment.

It can be discrete or continuous depending on the values that the random variable can take.

Popular Discrete probability Distributions:-

· Binomial distribution

· Geometric distribution

· Hypergeometric distribution

· Negative binomial distribution

· Poisson distribution

Popular Continuous probability Distributions:-

· Normal distribution

· Gamma distribution

· Exponential distribution

· Beta distribution
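As an illustrative sketch (assuming NumPy is available), you can draw samples from several of these distributions; note that the discrete ones produce integer counts while the continuous ones produce real values:

import numpy as np

rng = np.random.default_rng(seed=0)

# Discrete distributions: integer-valued outcomes
print(rng.binomial(n=10, p=0.5))       # number of heads in 10 fair tosses
print(rng.poisson(lam=3.0))            # count of events with mean rate 3

# Continuous distributions: real-valued outcomes
print(rng.normal(loc=0.0, scale=1.0))  # a draw from the standard normal
print(rng.exponential(scale=2.0))      # an exponential draw with mean 2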

Let us understand the case below.

In the coin toss experiment, when you repeat the toss twice you get four possible outcomes: {HH, HT, TH, TT}, where H = head and T = tail.

Now assume that you are interested in the number of heads and their probabilities.

X refers to the random variable, which in this case is nothing but the number of heads.

Probability of no heads: P(TT) = 1/4 → P(X=0)

Probability of 1 head: P(HT or TH) = 1/2 → P(X=1)

Probability of 2 heads: P(HH) = 1/4 → P(X=2)

[Chart: probability distribution of the number of heads in the two-toss coin experiment.]

These three probabilities together form the probability distribution of X.
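These numbers are easy to verify with a brute-force enumeration sketch in plain Python:

from itertools import product
from collections import Counter

# All four equally likely outcomes of two tosses: HH, HT, TH, TT
outcomes = list(product("HT", repeat=2))
counts = Counter(outcome.count("H") for outcome in outcomes)

for heads in sorted(counts):
    print(f"P(X={heads}) = {counts[heads]}/{len(outcomes)}")
# P(X=0) = 1/4, P(X=1) = 2/4, P(X=2) = 1/4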

Congrats!! You have just learnt the Binomial Distribution!!

Binomial Distribution:-

When a trial with two outcomes (success and failure) is repeated n times, and the probabilities of the number of success events are logged, the resulting distribution is called a binomial distribution. For example, let's toss a coin 10 times (n = 10), where success is getting a head. If we log the probabilities of getting a head exactly one time, two times, three times, and so on, that distribution of probabilities is a binomial distribution.

The Binomial Distribution Formula:-

b(x; n, P) = [n! / (x!(n - x)!)] · P^x · q^(n - x)

Where:
b = binomial probability
x = total number of “successes” (heads or tails, pass or fail, etc.)
P = probability of a success on an individual trial
q = 1 - P, the probability of failure on an individual trial
n = number of trials

Note: this binomial distribution formula uses factorials; the n! / (x!(n - x)!) term counts the number of ways to choose which x of the n trials are successes.
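This formula translates directly into Python; here is a sketch using math.comb from the standard library (Python 3.8+) for the factorial term:

from math import comb

def binomial_probability(x, n, P):
    """Probability of exactly x successes in n trials with success probability P."""
    q = 1 - P                             # probability of failure on one trial
    return comb(n, x) * P**x * q**(n - x)

# Sanity check against the two-coin-toss distribution above
print(binomial_probability(0, 2, 0.5))    # 0.25
print(binomial_probability(1, 2, 0.5))    # 0.5
print(binomial_probability(2, 2, 0.5))    # 0.25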

Now that we are armed with enough knowledge, let us explore MLE in detail.

MAXIMUM LIKELIHOOD ESTIMATE:-

As we saw in the examples at the beginning of this text, we adjusted a few things in order to obtain results that matched the expectations we already had in our minds, and those expectations were subjective.

Now, in statistics, the things we adjusted are the parameters, and the subjective expectations we had in the examples are the data.

Definition:-

Maximum likelihood estimation is a method that determines values for the parameters of a model. The parameter values are found such that they maximize the likelihood that the process described by the model produced the data that were actually observed.

Probability vs Likelihood Functions:-

So generally, a likelihood expression has the form L(parameters | data), meaning “the likelihood of these parameters, given that the data are these.”

Likelihood and probability are two different things, although they look and behave similarly. We talk about probability when we know the model parameters and are predicting a value from that model; there we ask how probable it is that a given value comes out of that model. So probability is P(data | parameters).

Now we can see that likelihood is the other side of probability: we guess the model parameters from the data. There we know the results already, and we know for sure that they have occurred.
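To make the contrast concrete, here is a small sketch (reusing the binomial_probability function defined in the sketch above): probability fixes the parameter and varies the data, while likelihood fixes the data and varies the parameter:

# Probability: the parameter P is fixed, the data (number of heads) varies
P = 0.5
for heads in range(3):
    print(f"P(X={heads} | P={P}) =", binomial_probability(heads, 2, P))

# Likelihood: the data are fixed (1 head in 2 tosses), the parameter P varies
heads = 1
for P in (0.2, 0.5, 0.8):
    print(f"L(P={P} | X={heads}) =", binomial_probability(heads, 2, P))

The same function produces both numbers; only what we hold fixed changes.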

MAXIMUM LIKELIHOOD ESTIMATION FOR THE BINOMIAL DISTRIBUTION (Closed-form solution):

In the likelihood function, let us assume that we know there were k successes out of n trials, and we need to find the P that maximizes the chance of getting k successes out of n trials.

Likelihood function for the binomial distribution:

L(P | k, n) = [n! / (k!(n - k)!)] · P^k · (1 - P)^(n - k)

To make the derivation simpler, we use the log-likelihood, which is nothing but the log transformation of the likelihood function; taking the log turns the products into sums, which are easier to differentiate.

The maximum likelihood value of P for the given k and n is obtained by differentiating the log-likelihood with respect to P and equating it to zero.

This gives P = k/n, the proportion of successes, which is the parameter value with maximum likelihood for k successes out of n trials.
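Written out, the derivation is short; here is a sketch in standard notation (the binomial coefficient drops out because it does not depend on P):

\log L(P) = \log \binom{n}{k} + k \log P + (n-k)\log(1-P)

\frac{d}{dP}\log L(P) = \frac{k}{P} - \frac{n-k}{1-P} = 0

\Rightarrow k(1-P) = (n-k)P \Rightarrow P = \frac{k}{n}

With n = 10 and k = 6, this derivative is 6/P - 4/(1 - P), which is exactly the slope the gradient-ascent code below climbs.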

Finding the MLE parameter using Gradient-ascent optimization technique:

The gradient ascent algorithm is used to find the maximum of a function.

The algorithm works in the following way.

For a random starting value of x we compute the slope. If the slope is negative, in the next iteration x is decreased by the product of the learning rate and the magnitude of the slope; if the slope is positive, x is increased by the same product. This process continues until the slope reaches zero, at which point x no longer moves, and that point is the maximum.

Pseudo-code of the gradient ascent algorithm:-

if f'(x) > 0:
    move x right (increase x)
elif f'(x) < 0:
    move x left (decrease x)

Continue this process until f'(x) = 0.

Python implementation of gradient ascent using a simple while loop:

Assuming that n = 10 (number of trials) and k = 6 (number of successes):

P = 0.2                          # random initialization of the parameter
l = 0.009                        # learning rate
count = 0                        # iteration counter
y_diff = (6 / P) - 4 / (1 - P)   # slope of the log-likelihood at P: k/P - (n-k)/(1-P)

while abs(y_diff) > 0.0005:      # stop once the slope is (almost) zero
    y_diff = (6 / P) - 4 / (1 - P)
    P = P + l * y_diff           # move P in the direction of increasing likelihood
    count += 1

The output of this algorithm converges at P = 0.6, which matches the closed-form solution k/n = 6/10.

Applications of this can be seen in linear regression parameter estimation, where we assume that at each value of X the Y values are normally distributed, so the maximum likelihood estimate (equivalent to the ordinary least squares method) finds the mean at each point in order to explain the maximum variability in the data. You can do further reading here:

https://medium.com/quick-code/maximum-likelihood-estimation-for-regression-65f9c99f815d

Finally, we have learnt that MLE is nothing but finding the parameters of a function, given the data, such that the function maximizes the chance of that data occurring, just the way you tuned the radio to listen to your favorite song with the knowledge of the song (how it needs to sound) already in your mind!

Happy learning!!
