A Family Dinner over Maximum Likelihood Estimation
Maximum likelihood estimation, more popularly known as MLE, is an important concept but is unfortunately largely ignored. This article aims to develop an intuition about MLE and its application in machine learning.
What is MLE? Simply put, MLE is a parameter estimation approach. Maximum likelihood estimation is a method for estimating model parameters from observed data in a way that maximizes the probability of obtaining that data. That’s it. Notice that MLE doesn’t try to find the model; it only tries to find the parameters of a model. Before applying MLE, you are expected to know what process generated the data and to have a mathematical model of that process; MLE is then used to estimate the parameters of this model. At this point, however, let’s forget everything and enjoy this dinner conversation about MLE at a professor’s house.
Grandmother: Can you explain to me what MLE is? I keep hearing a lot about it.
Professor: MLE is simply an approach to making inferences about a population using the data at hand. Let me give you an example: Suppose a mining tycoon is interested in buying a gold mine, but in order to determine the price he should pay for it, he wants to know what percentage of the rocks in this mine are gold ores. He sends three people to pick what they think are gold ores, and each of them collects 5. Upon actual inspection (rock or gold ore), it turns out that all 5 picks of the first person are ores, 2 picks of the second person are rocks, and all 5 picks of the third person are rocks (poor guy!). They then need to report back on the quality of the mine based on their observations. What do you think each person would report? Using MLE, the first person would say there are no rocks in this mine, the second person would say the mine contains 40% rocks, while the third person would say it’s a scam; the mine has only rocks in it. So, you see, each person made assumptions about the population in the way that best explains the data available to him.
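To make the second person’s reasoning concrete, here is a minimal sketch in Python (the grid search and variable names are an illustration, not part of the story): the value of p that maximizes the binomial likelihood of seeing 2 rocks in 5 picks is exactly the sample proportion, 40%.

```python
import numpy as np

# Second person's sample: 5 picks, 2 turned out to be rocks.
n_picks, n_rocks = 5, 2

# Candidate values of p = fraction of rocks in the mine.
p_grid = np.linspace(0.01, 0.99, 99)

# Binomial likelihood of seeing exactly 2 rocks in 5 picks for each candidate p.
likelihood = p_grid**n_rocks * (1 - p_grid)**(n_picks - n_rocks)

best_p = p_grid[np.argmax(likelihood)]
print(best_p)  # 0.4, i.e. the sample proportion 2/5
```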
Mother: Okay! So in the previous example, you simply computed the sample measure, and since you can’t see the entire population, you call this the population parameter. But I thought inferences about the population are made using hypothesis testing?
Professor: If you don’t assume any underlying parametric model for the population and are only interested in a fixed measure, then hypothesis testing alone is sufficient to make inferences about the population. But if you think your population follows a parametric model and you want to know the parameters of this model, then you need both MLE and hypothesis testing. The two techniques work together to make inferences about the population: MLE is used to estimate the model parameters, and hypothesis testing is used to confirm whether these estimates actually hold true for the population.
Let’s take an example: Suppose you want to install a cooling system in a furnace so that you can always maintain the furnace temperature at a certain level. For this, you need to know how the temperature inside the furnace varies with time. So you install a device and take hourly temperature readings for two days. Now, based on industry research, you already know that furnace temperature follows a Gaussian distribution. But to define a Gaussian distribution you need two parameters: the mean, μ, and the standard deviation, σ. Different values of these parameters result in different curves. One way to proceed is to assume different values of μ and σ, draw the corresponding curves, and compare the probabilities of obtaining the data points you have under each of them, as in the sketch below. The curve under which these probabilities are highest is the best estimate of the model. Once you have the estimates of μ and σ, you use a hypothesis test to confirm whether these estimates can be generalized or not.
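A minimal sketch of that curve-comparison idea, using hypothetical readings (the generating values 450 and 15, the grid, and the two-day sample size are assumptions made up for the demo):

```python
import numpy as np
from scipy.stats import norm

# Hypothetical hourly furnace readings for two days (48 points);
# in reality these would come from the logging device.
rng = np.random.default_rng(0)
readings = rng.normal(loc=450.0, scale=15.0, size=48)

# Try a coarse grid of candidate curves and keep the (mu, sigma) pair
# under which the observed readings are most probable.
best_ll, best_params = -np.inf, None
for mu in np.arange(400, 501, 5):
    for sigma in np.arange(5, 31, 1):
        ll = norm.logpdf(readings, loc=mu, scale=sigma).sum()
        if ll > best_ll:
            best_ll, best_params = ll, (mu, sigma)

print(best_params)  # lands close to (450, 15), the values that generated the data
```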
Wife: This doesn’t sound practical, assuming different curves and comparing the probabilities for each of them. There has to be a more scientific approach.
Professor: Yes, there is. In actual practice, a mathematical function is created that represents the data at hand, and this function is maximized to estimate the values of the population parameters. In our example, for the Gaussian distribution, we would write the Gaussian probability density function for each of the data points and then multiply them together (assuming the data points are independent) to get the final function. This function is then solved for its maximum value to obtain the MLE estimates.
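For the Gaussian case, this maximization even has well-known closed-form answers: taking the log turns the product into a sum, and setting the derivatives to zero yields the sample mean and standard deviation. A short sketch, reusing the hypothetical readings from above:

```python
import numpy as np

rng = np.random.default_rng(0)
readings = rng.normal(loc=450.0, scale=15.0, size=48)  # same hypothetical data as above

# Setting the derivatives of the Gaussian log-likelihood to zero gives:
mu_hat = readings.mean()                                # MLE of mu
sigma_hat = np.sqrt(((readings - mu_hat) ** 2).mean())  # MLE of sigma (divides by n, not n-1)
print(mu_hat, sigma_hat)
```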
Wife: You are talking about maximizing the probability; however, MLE works on maximizing the likelihood. What is the connection between the two?
Professor: For all practical purposes, and even mathematically, likelihood and probability are the same; that’s why I have been using the word probability rather than likelihood in all the previous instances. However, a fine difference does exist between the two for statisticians. Take a look at the equation below:

L(μ, σ; data) = P(data; μ, σ)
P(data; μ, σ) means “the probability of observing the data with model parameters μ and σ.” L(μ, σ; data) means “the likelihood of the parameters μ and σ taking certain values, given the data we have.” When we know the model parameters and use them to calculate the probabilities of individual observations, it is called the probability function; when we already have the data points and calculate the model parameters from them, it is called the likelihood function. For all practical purposes, you can think of likelihood as probability if that makes it easier. Mathematically they are the same function; they only use different notations.
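A small illustration of the two readings, assuming scipy is available: the same density function answers a probability question when the parameters are held fixed and the data varies, and a likelihood question when the data is held fixed and the parameters vary.

```python
from scipy.stats import norm

# Probability reading: parameters fixed (mu=0, sigma=1), data varies.
print(norm.pdf(2.0, loc=0.0, scale=1.0))  # how probable is observing x = 2.0?
print(norm.pdf(0.5, loc=0.0, scale=1.0))  # ...and x = 0.5?

# Likelihood reading: data fixed (x = 2.0), parameters vary.
print(norm.pdf(2.0, loc=0.0, scale=1.0))  # how plausible is mu = 0 given x = 2.0?
print(norm.pdf(2.0, loc=2.0, scale=1.0))  # ...versus mu = 2? (larger value = more plausible)
```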
Wife: Okay, but I read on many blogs that gradient descent is also a popular technique for parameter estimation. What is the connection between the two?
Professor: Gradient descent is an optimization algorithm. Similar to how MLE and hypothesis testing work in conjunction to make inferences about the population, MLE and optimization work together on parameter estimation. They have different jobs to perform: MLE is the approach that defines the problem, and optimization is the engine that solves it. The role of MLE is to define the objective function that needs to be maximized; once the objective function is defined, MLE’s job is over. Optimization algorithms are then used to solve this objective function and estimate the parameter values, as in the sketch below. Once these parameter values are estimated, they are further subjected to hypothesis tests to decide whether they can be generalized to the population or not.
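A minimal sketch of that division of labor, reusing the hypothetical furnace readings: MLE supplies the objective function (here a negative log-likelihood, since optimizers conventionally minimize), and an off-the-shelf optimizer, standing in for gradient descent, solves it.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(0)
readings = rng.normal(loc=450.0, scale=15.0, size=48)  # hypothetical data again

# MLE's job: define the objective. Optimizing log(sigma) keeps sigma
# positive during the search.
def neg_log_likelihood(params):
    mu, log_sigma = params
    return -norm.logpdf(readings, loc=mu, scale=np.exp(log_sigma)).sum()

# The optimizer's job: solve it.
result = minimize(neg_log_likelihood, x0=[400.0, np.log(10.0)], method="Nelder-Mead")
mu_hat, sigma_hat = result.x[0], np.exp(result.x[1])
print(mu_hat, sigma_hat)  # ~ the same answers as the closed form above
```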
Daughter: I am able to connect all the dots. Basically, MLE is used to develop a mathematical function for estimating the model parameters, optimization techniques are used to solve this function, and hypothesis tests are used to confirm whether these estimates hold true for the population or not. But that’s all theory; can you explain these in the context of an actual machine learning problem?
Professor: Let’s take the case of the logistic regression algorithm. In logistic regression, it is assumed that the sigmoid function represents the probability of obtaining the target class for given feature values. Hence, the sigmoid function is used to calculate the individual probability of each observation in the training set. All the observations in the training set are assumed to be independent of each other, so these probabilities are multiplied together. The resulting function is then set up for maximization. The process up to here is based on the MLE approach.
Then comes the optimization part. The natural logarithm of this function is taken to make the partial derivative computations easier; this is called the log-likelihood function. The maximization problem is then multiplied by -1 to convert it into a minimization problem. This gives us the negative log-likelihood, which is the cost function for logistic regression. This function is then solved using the gradient descent algorithm, as in the sketch below.
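A minimal sketch of that pipeline on a small synthetic dataset (the data, learning rate, and iteration count are assumptions chosen for the demo): the negative log-likelihood is written down exactly as described, and plain batch gradient descent minimizes it.

```python
import numpy as np

# Tiny synthetic training set: one feature plus an intercept column,
# with labels drawn from a known sigmoid model.
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(200), rng.normal(size=200)])
true_w = np.array([-0.5, 2.0])
y = (rng.random(200) < 1 / (1 + np.exp(-X @ true_w))).astype(float)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# The cost function derived above: negative log-likelihood of the training set.
def negative_log_likelihood(w):
    p = sigmoid(X @ w)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p)).sum()

# Plain batch gradient descent on that cost.
w = np.zeros(2)
learning_rate = 0.5
for _ in range(5000):
    gradient = X.T @ (sigmoid(X @ w) - y) / len(y)  # gradient of the mean NLL
    w -= learning_rate * gradient

print(w, negative_log_likelihood(w))  # w should land near true_w = (-0.5, 2.0)
```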
Once the negative log-likelihood function is solved, we obtain the estimates of the coefficients of each feature in the model. These are the MLE estimates. Whether these estimates are significant is tested using the Wald test, while the overall fit of the model is evaluated using a chi-square test.
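For completeness, here is how those tests surface in practice, assuming the statsmodels library is installed: its Logit model is fitted by maximum likelihood, and the summary reports a Wald z-statistic for each coefficient and a likelihood-ratio chi-square p-value for the overall fit.

```python
import numpy as np
import statsmodels.api as sm

# Same kind of synthetic data as before (illustrative only).
rng = np.random.default_rng(1)
x = rng.normal(size=200)
y = (rng.random(200) < 1 / (1 + np.exp(0.5 - 2.0 * x))).astype(int)

# statsmodels fits logistic regression by maximum likelihood and prints
# the significance tests alongside the estimated coefficients.
model = sm.Logit(y, sm.add_constant(x)).fit()
print(model.summary())
# In the summary, the "z" column reports the Wald test for each coefficient,
# and "LLR p-value" reports the chi-square test of overall model fit.
```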
This ends the long discussion. If interested, you can refer to my Hypothesis testing article here.
Originally published at https://www.linkedin.com.