Maximum Likelihood Estimation: How it Works and Implementing in Python

Vivek Palaniappan
Published in Engineer Quant · Dec 15, 2018

Previously, I wrote an article about estimating distributions using nonparametric estimators, where I discussed the various methods of estimating statistical properties of data generated from an unknown distribution. This article covers a very powerful method of estimating parameters of a probability distribution given the data, called the Maximum Likelihood Estimator.

This article is part of a series that looks into the mathematical framework of portfolio optimization, and explains its implementation as seen in OptimalPortfolio.

Maximum Likelihood Estimator

We first begin by understanding what a maximum likelihood estimator (MLE) is and how it can be used to estimate the distribution of data. Maximum likelihood estimators, when a particular distribution is specified, are considered parametric estimators.

In essence, MLE aims to maximize the probability of every data point occurring given a set of probability distribution parameters. In other words, it finds the set of parameters of the probability distribution that maximizes the probability (likelihood) of the observed data points. Formally, this can be expressed as
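\hat{\theta} = \arg\max_{\theta} \prod_{i=1}^{n} p(x_i \mid \theta)

where x_1, ..., x_n are the observed data points and \theta is the vector of distribution parameters.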

The problem with optimizing this product of probabilities is that it almost always involves quite nasty exponentials of the parameters, which makes finding the optimal value much harder. Hence, the notion of log-likelihood is introduced. The log-likelihood is simply the logarithm of the probability of observing the data. Formally,
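\ell(\theta) = \sum_{i=1}^{n} \log p(x_i \mid \theta)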

The benefit of using the log-likelihood is twofold:

  1. The exponentials in the probability density function become more manageable and easier to optimize.
  2. The product of the probabilities becomes a sum, which allows the individual terms to be handled separately instead of working with a product of n probability density functions, as the sketch below illustrates.
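
To make this concrete, here is a minimal sketch that numerically maximizes a Gaussian log-likelihood with SciPy (a toy illustration, assuming SciPy is available; the data and starting values are arbitrary choices):

import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

# Toy data drawn from a normal distribution with "unknown" mean and scale
rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=1.5, size=500)

def neg_log_likelihood(params):
    mu, sigma = params
    # Negative sum of log-densities, since the optimizer minimizes
    return -np.sum(norm.logpdf(data, loc=mu, scale=sigma))

result = minimize(neg_log_likelihood, x0=[0.0, 1.0],
                  bounds=[(None, None), (1e-6, None)])
print(result.x)  # estimates of (mu, sigma), close to (2.0, 1.5)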

The concept of MLE is surprisingly simple. The difficulty comes in effectively applying this method to estimate the parameters of the probability distribution given data. Before we discuss the implementations, we should develop some mathematical grounding as to whether MLE works in all cases. For this, consider the following:

Let
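\ell_n(\theta) = \frac{1}{n} \sum_{i=1}^{n} \log p(x_i \mid \theta)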

This is the function to be maximized to find the parameters. The added factor of 1/n clearly does not affect the location of the maximum, but it is needed for the argument below. Consider
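\mathbb{E}_{\theta_0}\left[ \log p(x \mid \theta) \right]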

This is the expected value of the log-likelihood under the true parameters; in other words, it is, in some sense, the log-likelihood we are aiming for. The Law of Large Numbers states that the arithmetic mean of iid random variables converges to their expected value as the number of data points tends to infinity. Hence, we can prove that
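\ell_n(\theta) \to \mathbb{E}_{\theta_0}\left[ \log p(x \mid \theta) \right] \quad \text{as } n \to \infty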

Since the expected log-likelihood is maximized at the true parameters (a consequence of the non-negativity of the Kullback-Leibler divergence), the maximizer of the sample average converges to the true parameter values. In other words, MLE is consistent given enough data.

MLE of Student-t

Since the usual introductory example for MLE is the Gaussian, I want to explain it using a slightly more complicated distribution, the Student-t. This is also the distribution used in my OptimalPortfolio implementation. The difference from the Gaussian case is that the Student-t distribution does not yield a closed-form MLE solution, so we need some form of iterative optimization algorithm. This gives us an opportunity to learn the Expectation-Maximization (EM) algorithm. The EM algorithm alternates between computing the expected value of the log-likelihood given the data and the current estimate of the parameters, and then maximizing this expected log-likelihood over the parameters. In general, the first step is
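Q(\theta \mid \theta^{(t)}) = \mathbb{E}_{Z \mid X, \theta^{(t)}}\left[ \log L(\theta; X, Z) \right]

where X is the observed data, Z the latent variables, and \theta^{(t)} the current estimate of the parameters.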

Then
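\theta^{(t+1)} = \arg\max_{\theta} Q(\theta \mid \theta^{(t)})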

This is repeated until the parameter values converge or the change falls below a given tolerance. The algorithm can be applied to the Student-t distribution with relative ease. The crucial fact is that the Student-t distribution can be written as a Gaussian scale mixture whose latent mixing variable follows a Gamma distribution; hence the expected value calculated in the first step, a weight w_i for each observation, is
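\mathbb{E}[w_i] = \frac{\nu + d}{\nu + M_i}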

where \nu is the degrees of freedom (fixed at 5 in the implementation below), d is the dimension of the random variable, and M_i is the squared Mahalanobis distance of observation x_i from the current mean, defined as
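M_i = (x_i - \mu)^{T} \Sigma^{-1} (x_i - \mu)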

Once these weights are computed, we can maximize the expected log-likelihood for the Student-t distribution; the maximizing mean and covariance turn out to have an analytic form:
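\mu = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}, \qquad \Sigma = \frac{1}{n} \sum_{i=1}^{n} w_i (x_i - \mu)(x_i - \mu)^{T}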

The calculation of these estimates and of the expectation weights is iterated until convergence. In Python, it looks something like this:

import pandas as pd
import numpy as np

def expectation_max(data, dof=5, max_iter=1000):
    # EM estimation of the mean and covariance of a multivariate Student-t
    # distribution with fixed degrees of freedom (dof).
    data = pd.DataFrame(data).to_numpy(dtype=float)
    n, d = data.shape
    mu0 = data.mean(axis=0)
    c0 = np.cov(data, rowvar=False)
    for _ in range(max_iter):
        # E-step: weight for each observation from its squared Mahalanobis distance
        diff = data - mu0
        maha = np.sum(diff @ np.linalg.inv(c0) * diff, axis=1)
        w = (dof + d) / (dof + maha)
        # M-step: weighted mean and weighted scatter matrix
        mu = (w @ data) / np.sum(w)
        diff = data - mu
        cov = (w[:, None] * diff).T @ diff / n
        mu0, c0 = mu, cov
    return mu0, c0
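
For instance, on synthetic heavy-tailed data (a toy illustration; the sample size and mean vector below are arbitrary choices), the estimator can be used as follows:

import numpy as np

rng = np.random.default_rng(0)
# 1000 samples of a 3-dimensional Student-t with 5 degrees of freedom, shifted mean
sample = rng.standard_t(df=5, size=(1000, 3)) + np.array([1.0, -2.0, 0.5])
mu_hat, cov_hat = expectation_max(sample)
print(mu_hat)   # should be close to [1.0, -2.0, 0.5]
print(cov_hat)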

Conclusion

Estimation of parameters of distributions is at the core of statistical modelling of data. It is an essential skill for any data scientist and quantitative analyst. In order to see how this all ties together, do visit OptimalPortfolio.
