Maximum Likelihood Estimation
Maximum Likelihood Estimation (MLE) is a method of estimating the parameters of a statistical model given observed data.
Suppose we have n independent and identically distributed (i.i.d.) samples x₁, …, xₙ drawn from an unknown probability density function f(x|θ), where θ is the unknown parameter.
Properties
1. Joint Density Function: for i.i.d. samples, the joint density factors into a product,

f(x₁, x₂, …, xₙ | θ) = f(x₁|θ) · f(x₂|θ) · … · f(xₙ|θ)

2. Likelihood: the same expression read the other way around — the x samples are fixed “parameters” and θ is the function’s variable,

L(θ; x₁, …, xₙ) = ∏ᵢ f(xᵢ|θ)

3. Log-Likelihood: products are awkward to differentiate, so it is easier to deal with the log, which turns the product into a sum,

ℓ(θ) = log L(θ; x₁, …, xₙ) = Σᵢ log f(xᵢ|θ)
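The product/sum relationship between the likelihood and log-likelihood can be sketched in a few lines of Python. This is a minimal illustration only; the exponential density f(x|λ) = λe^(−λx) and the sample values are hypothetical choices, not part of the original derivation.

```python
import numpy as np

def density(x, lam):
    """Exponential density f(x | lambda) = lambda * exp(-lambda * x), used as a stand-in f."""
    return lam * np.exp(-lam * x)

x = np.array([0.5, 1.2, 0.3, 2.0])  # observed samples, treated as fixed
lam = 1.5                           # a candidate value for the parameter theta

likelihood = np.prod(density(x, lam))              # L(theta): product over samples
log_likelihood = np.sum(np.log(density(x, lam)))   # l(theta): sum of log densities

# The log of the product equals the sum of the logs.
assert np.isclose(np.log(likelihood), log_likelihood)
```

Because log is monotonic, the θ that maximizes ℓ(θ) also maximizes L(θ), which is why we are free to work with whichever is more convenient.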
The properties above are the building blocks for estimating our parameter θ. By fixing the x sample data as parameters and treating θ as a free variable, we can find the maximum likelihood by taking the partial derivative with respect to θ and setting it to zero.
Let’s say we have an interesting coin that we tossed 80 times, and it landed on heads 49 times. We don’t yet know the probability of heads for this interesting coin.
Each toss is a Bernoulli trial, so the joint density function — our likelihood — is a Binomial distribution with unknown parameter p:

L(p) = C(80, 49) · p⁴⁹ · (1 − p)³¹
We could try plugging in different values for p and see which one gives the maximum likelihood (e.g. L ≈ 0.012 when p = 1/2, L ≈ 0.054 when p = 2/3).
A smarter way is to differentiate the likelihood function with respect to p and set it to zero (a zero derivative indicates a local minimum or maximum). The constant C(80, 49) does not affect where the maximum lies, so we can drop it before differentiating:

d/dp [p⁴⁹ (1 − p)³¹] = 0  ⟹  49(1 − p) − 31p = 0  ⟹  p̂ = 49/80
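Both the trial-and-error search and the analytic answer p̂ = 49/80 can be checked with a short script using only the standard library:

```python
from math import comb

def likelihood(p, n=80, k=49):
    """Binomial likelihood of k heads in n tosses when heads has probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

p_hat = 49 / 80  # analytic MLE from setting the derivative to zero

# The analytic estimate beats both guesses from the trial-and-error approach.
print(likelihood(1 / 2))   # ~0.012
print(likelihood(2 / 3))   # ~0.054
print(likelihood(p_hat))   # larger than either guess above
```

A grid search over p would converge to the same 0.6125, but the derivative gives it in one step.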
We can repeat the above steps for a Gaussian Distribution as well.
Probability Density Function

f(x | μ, σ²) = (1 / √(2πσ²)) · exp(−(x − μ)² / (2σ²))

Joint Density Function / Likelihood

L(μ, σ²; x₁, …, xₙ) = ∏ᵢ (1 / √(2πσ²)) · exp(−(xᵢ − μ)² / (2σ²))

Log-Likelihood

ℓ(μ, σ²) = −(n/2) log(2πσ²) − (1 / (2σ²)) Σᵢ (xᵢ − μ)²

Maximum Log-Likelihood, for two different parameters

Since there are two unknown parameters, we take a partial derivative with respect to each and set both to zero:

∂ℓ/∂μ = (1/σ²) Σᵢ (xᵢ − μ) = 0

∂ℓ/∂σ² = −n/(2σ²) + (1/(2σ⁴)) Σᵢ (xᵢ − μ)² = 0

Parameter Estimation

Solving the two equations gives the familiar sample mean and (biased) sample variance:

μ̂ = (1/n) Σᵢ xᵢ

σ̂² = (1/n) Σᵢ (xᵢ − μ̂)²
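The Gaussian estimators can be verified numerically. Here is a minimal sketch with a made-up 1-D sample (the data values are hypothetical, chosen only for illustration):

```python
import numpy as np

# Hypothetical sample data, purely for illustration.
x = np.array([2.1, 1.8, 2.5, 2.0, 1.6, 2.3])
n = len(x)

mu_hat = x.sum() / n                         # MLE of the mean: sample average
sigma2_hat = ((x - mu_hat) ** 2).sum() / n   # MLE of the variance: note 1/n, not 1/(n-1)

# These match numpy's mean and biased variance (np.var uses ddof=0 by default).
assert np.isclose(mu_hat, np.mean(x))
assert np.isclose(sigma2_hat, np.var(x))
```

Note that the MLE of the variance divides by n, so it is slightly biased for small samples; the familiar unbiased estimator divides by n − 1 instead.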
As seen in the Binomial and Normal Distribution cases, the estimated parameters rely entirely on our x data — the “fixed parameters.”
How could we use this in building a classifier or a regression model?
Classifier

Our problem has data from three classes: red, blue, and green. The independent variables are x and y, and we assume each class follows a Gaussian distribution. Using the steps above, we can estimate the parameters of each class from its x and y data.
Let’s try to predict the class of a new data point. How would we use the estimated parameters to classify it?


This is done by evaluating the new point under each of the three estimated parameter sets — red, blue, and green. For the case above, the likelihood of the point being blue is the highest. We could also pass these values through a softmax to squash them into probabilities that sum to 1.
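The whole pipeline — fit a Gaussian per class by MLE, score a new point under each, then softmax — can be sketched as follows. The class clusters and the new point are hypothetical stand-ins for the figure’s data, and for simplicity each class is modeled with independent x and y axes:

```python
import numpy as np

def fit_gaussian(points):
    """MLE of an axis-aligned 2-D Gaussian: per-axis mean and biased (1/n) variance."""
    return points.mean(axis=0), points.var(axis=0)

def log_likelihood(point, mu, var):
    """Log density of `point` under the fitted Gaussian (sum over the two axes)."""
    return np.sum(-0.5 * np.log(2 * np.pi * var) - (point - mu) ** 2 / (2 * var))

# Hypothetical training clusters for the three classes.
classes = {
    "red":   np.array([[1.0, 1.2], [0.8, 1.0], [1.1, 0.9]]),
    "blue":  np.array([[4.0, 4.2], [4.3, 3.9], [3.8, 4.1]]),
    "green": np.array([[1.0, 4.0], [0.9, 4.2], [1.2, 3.8]]),
}
params = {name: fit_gaussian(pts) for name, pts in classes.items()}

# Score a new point under each class's estimated parameters.
new_point = np.array([4.1, 4.0])
scores = {name: log_likelihood(new_point, mu, var) for name, (mu, var) in params.items()}

# Softmax squashes the log-likelihoods into probabilities that sum to 1.
logits = np.array(list(scores.values()))
probs = np.exp(logits - logits.max())
probs /= probs.sum()

prediction = max(scores, key=scores.get)  # class with the highest likelihood
print(prediction)
```

With these made-up clusters the new point sits in the blue region, so the blue likelihood dominates; the softmax step is what lets the scores be read as class probabilities.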
