# Maximum Likelihood Estimation (MLE)

In this guide, we will cover the basics of Maximum Likelihood Estimation (MLE) and learn how to program it in Stata.

If you are here, then you are most likely a graduate student dealing with this topic in a course, or programming some estimation command in Stata. But MLEs also have another purpose: they form the backbone of many Machine Learning techniques. At their core, MLEs are parametric estimations where, given some data and some assumption about its functional form, we need to find the optimal values of a parameter set. These parameters give us the distribution function that is most likely to describe the data we are observing. There are also non-parametric MLEs, where the distribution functions are derived from a training data set or other data that closely resembles ours. For example, if we have a million pictures of cats, we can use them to classify a cat in our own picture.

In this guide we will deal with parametric MLEs as commonly used in standard econometrics. We will introduce the basic concept of MLEs and learn how to apply them in Stata. As a side note, MLEs are not new in Stata. In fact, the `ml` commands exist as far back as Stata's early versions. The earliest documentation I found was in v6, which was in use around 2002-03, but they could easily be much older since they are also required for the standard `probit` and `logit` regressions. The `ml` functions rely heavily on Mata's `optimize` routines, which are also fairly old, but their documentation and official release happened not so long ago (see the Mata guide here for an intro).

Before we start: in order to replicate the graphs exactly as they are in this guide, you can install the TSG_schemes and use the white Tableau scheme (`set scheme white_tableau`). This guide also assumes that you have knowledge of basic probability theory, can deal with optimization problems, and have a broad overview of MLEs.

# A simple example: Coin tosses

Let’s start with the classical example of a coin toss. Coin tosses have binary outcomes: a Head (H) or a Tail (T). Each coin toss has the same initial conditions, that is, the coin remains unchanged, and different coin tosses don’t impact each other. In statistics, these properties imply that coin tosses are i.i.d., or independently and identically distributed, which allows us to derive other important properties of estimators. This stuff you will probably cover in your courses.

Let’s assume that the probability (Pr) of getting a Head (H) equals some unknown value `p` or `Pr(H)=p`. In other words, the probability of getting a Tail (T), which is the probability of not getting a head (H’ or 1-H) equals `1-p`. More formally `Pr(T)=Pr(1-H)=1-p`.

So how do we find that value of `p` using MLE? We do some coin tosses. Let’s do five tosses to keep this tractable, since we will also be solving some stuff by hand. We toss the coin five times, and we get the following sequence: H,T,T,H,H. And now we ask: what is the Likelihood (L) of observing this sequence? More formally, what is `L(p)`? How do we define the likelihood? Here we can assume that the likelihood is derived from the joint probability distribution of the outcomes in the five tosses we observe. From statistics we know that the joint probability of independent events equals the probability of outcome 1 times the probability of outcome 2 times the probabilities of all the other outcomes. So for our problem the likelihood function can be written as:

`L(p) = Pr(H) × Pr(T) × Pr(T) × Pr(H) × Pr(H) = p × (1-p) × (1-p) × p × p`

or the probabilities of the sequence H,T,T,H,H multiplied with each other. We can simplify this to:

`L(p) = p^3 × (1-p)^2`

where 3 is the number of times H appears and 2 is the number of times T appears. In simpler terms, if we have a total of `N` tosses, out of which `n` are Heads, then we can reduce the above problem to this generic functional form:

`L(p) = p^n × (1-p)^(N-n)`

This is effectively a Bernoulli distribution, which is used to represent coin toss-like binary outcomes. Given that we now have a function with one unknown, `p`, this simply becomes an optimization problem. Here we can differentiate L with respect to p and set it equal to zero to find the optimal value of p (`L'(p) = dL/dp = 0`). But since we have powers in the equation above, this will just give us this very ugly expression:

`L'(p) = 3 × p^2 × (1-p)^2 - 2 × p^3 × (1-p) = 0`

And good luck with solving this for p! Dealing with powers and exponentials is a common problem with likelihood functions, since most distributions have powers in one form or the other. That is why likelihood functions are transformed into log likelihoods (LL). Logs effectively get rid of the powers, convert multiplications into additions and therefore linearize the likelihood function, and, most importantly, they are easy to differentiate. Plus, since logs are a monotonic transformation of the original likelihood function, the solutions are exactly the same.

So the LL of the first expression would look like this:

`LL(p) = 3 × log(p) + 2 × log(1-p)`

Taking the derivative with respect to `p` and solving for `p` gives us:

`LL'(p) = 3/p - 2/(1-p) = 0  =>  p = 3/5`

or, in the generic form, `p = n/N`.

A much more tractable solution that does not involve dealing with exponentials.

From our example, since we know that `n=3` (3 Heads), and `N=5` (5 tosses), the probability `p` equals `p = 3/5 = 0.6`. If we start increasing the number of tosses (`N` becomes large), then the probability will converge to 0.5 if the coin is fair.
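Since we now have both the LL and its closed-form solution, we can cross-check the result outside Stata. The following Python sketch (my own illustration, not part of the original Stata workflow; all names are mine) maximizes the same LL by brute-force grid search and compares it with `n/N`:

```python
import math

n, N = 3, 5  # 3 Heads out of 5 tosses

def loglik(p):
    # LL(p) = n*log(p) + (N-n)*log(1-p)
    return n * math.log(p) + (N - n) * math.log(1 - p)

# brute-force grid search over (0, 1), excluding endpoints where log() blows up
grid = [i / 10000 for i in range(1, 10000)]
p_hat = max(grid, key=loglik)

print(round(p_hat, 3))  # → 0.6
print(n / N)            # closed-form solution: 0.6
```

Any reasonable maximizer lands on the same value; the closed form `n/N` is just the analytical shortcut.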

There are some nice proofs that deal with the asymptotic properties of these distributions and MLE estimators. In summary, MLEs’ asymptotic properties are exactly what makes them an attractive tool. Given certain conditions they are asymptotically efficient, consistent, and normally distributed, making them “BLUEer” than other classes of estimators.

# The Stata part

So how do we do this in Stata? Since we have already solved a bunch of stuff by hand, setting up this problem is fairly easy. Here we will take two routes. First we will solve this problem again “by hand” in Stata, and then we will use the built-in `ml` functions.

Let’s generate a variable that represents the coin tosses:

```stata
clear all
set obs 5
gen n = _n
gen x = .
replace x = 1 in 1
replace x = 0 in 2
replace x = 0 in 3
replace x = 1 in 4
replace x = 1 in 5
```

Since we know that the log likelihood (LL) function equals: `LL(p) = 3 *log(p) + 2 * log(1-p)`, we can also plot this as follows:

`twoway (function 3 * log(x) + 2 * log(1-x), range(0 1)), ytitle("LL")`

which gives us this figure:

From the figure we can see that the LL has a maximum point around 0.6. As we discussed in the Mata guide, one of the features we can utilize is the `optimize` functions. Let’s use them to find the maximum point:

```stata
mata
void myfunc(todo, x, y, g, H) {
    y = 3 * log(x) + 2 * log(1-x)
}
maxval = optimize_init()
optimize_init_which(maxval, "max")
optimize_init_evaluator(maxval, &myfunc())
optimize_init_params(maxval, 0.2)
xmax = optimize(maxval)
xmax
end
```

Without going too much into the details: here we define the function that we want to maximize and start the search with an arbitrarily chosen value of 0.2, which, as the figure shows, is reasonably close to the maximum point. For such a simple optimization problem, the starting value shouldn’t really matter. From the code above, we get the value of `xmax = 0.6`, which is what we calculated by hand as well.
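For intuition on what an optimizer like `optimize` does behind the scenes, here is a minimal Python sketch of Newton-Raphson on the same LL, using the derivatives we computed above. This is a toy illustration of the algorithm family, not Mata's actual implementation:

```python
def newton_mle(p=0.2, tol=1e-10, max_iter=100):
    """Newton-Raphson on LL(p) = 3*log(p) + 2*log(1-p)."""
    for _ in range(max_iter):
        grad = 3 / p - 2 / (1 - p)          # LL'(p)
        hess = -3 / p**2 - 2 / (1 - p)**2   # LL''(p), negative: LL is concave
        step = grad / hess
        p -= step                            # move toward the root of LL'
        if abs(step) < tol:
            break
    return p

print(round(newton_mle(), 6))  # → 0.6
```

Starting from 0.2, the iterates climb the concave LL and settle on 0.6 in a handful of steps, which mirrors the iteration log that `optimize` prints.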

We can also plot this on the graph above:

```stata
mata
void myfunc(todo, x, y, g, H) {
    y = 3 * log(x) + 2 * log(1-x)
}
maxval = optimize_init()
optimize_init_which(maxval, "max")
optimize_init_evaluator(maxval, &myfunc())
optimize_init_params(maxval, 0.2)  // 0 would fail since log(0) is undefined
xmax = optimize(maxval)
xmax
ymax = 3 * log(xmax) + 2 * log(1-xmax)
st_local("maxx", strofreal(xmax))
st_local("maxy", strofreal(ymax))
end

twoway (function 3 * log(x) + 2 * log(1-x), range(0 1)), ///
    yline(`maxy') xline(`maxx') ytitle("LL") xlabel(0(0.2)1)
```

from which we get a LL max value of -3.365:

But since Stata has built-in MLE functions (see `help ml`) with fairly extensive documentation, we don’t need to jump through the hoops of defining an optimization problem in Mata. We can simply use the `ml` commands. If you look at the documentation, don’t be surprised if you are immediately confused: the use of this command is a bit convoluted, and it takes some time to get used to the various options. In order to use the `ml` commands, we need to write small programs where we define the LL functions that need to be optimized. In the coin toss case, since we already know the LL function, which is linear in form, we can make use of the `mlexp` command (see `help mlexp`). The documentation of `mlexp` says the following:

So here, since we know an LL that fulfills the conditions defined above, we make use of the special `mlexp` command. The `ml` command is more generic, but we will get to it later. We can express our function as follows:

`mlexp ((x * log({p})) + ((1 - x) * log(1 - {p})))`

and that is all we need: a single-line equation. There are of course a host of other options that go with it, including initialization of parameters, constraints, etc. Note in the command above that `x` is the Stata variable that we generated, and `{p}` is the parameter we need to estimate. Parameters should always be written in curly brackets, and there can be more than one, each in its own curly brackets. From the expression above, we get this output:

The iteration values are similar to what `optimize` gives us. In fact, here I would like to point out that in the background the `ml` functions actually use the Mata `optimize` routines. So now you know where all the magic happens.

The value of `p` is correctly estimated as `0.6`. The standard error is derived from the asymptotic variance of the finite sample, which in our case is very small. If we increase the number of coin tosses, then `p` should approach 0.5, the true population value, and the standard error should shrink toward 0. We can check this as follows:

```stata
clear all
set obs 1000000  // or higher or lower
gen x = uniform() < .5  // a 0/1 variable
mlexp ((x * log({p})) + ((1 - x) * log(1 - {p})))
```

which gives us:

The `mlexp` command also has several post-estimation options for conducting various tests, linear combinations of parameters, creating margins plots, etc. See `help mlexp postestimation` for more details. See also this Stata blog entry for a more advanced application of `mlexp`.
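As a quick sanity check outside Stata, the same large-N experiment can be simulated in a few lines of Python (my own sketch, mirroring the `mlexp` example above): with many fair-coin tosses the MLE `n/N` converges to 0.5 and its asymptotic standard error `sqrt(p(1-p)/N)` shrinks toward zero.

```python
import random

random.seed(42)
N = 1_000_000
n = sum(random.random() < 0.5 for _ in range(N))  # simulated fair-coin tosses
p_hat = n / N                                     # the MLE, n/N
se = (p_hat * (1 - p_hat) / N) ** 0.5             # asymptotic standard error

print(abs(p_hat - 0.5) < 0.01)  # → True
print(se < 0.001)               # → True
```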

Now on to more standard applications.

# Logit regressions

Another common example of MLEs are Logit and Probit regressions. These are standard tools when dealing with dichotomous outcomes that can be explained by some independent set of variables. For binary outcomes, a linear regression is obviously a bad choice, since it will predict outcomes that fall outside the plausible [0,1] range. This is where the S-shaped Logit and Probit functions are really handy, since they contain the outcomes within this range. From these models, we basically recover the probability of observing a value of one. Again, this is theory you should either know or read up on.

The likelihood of the Logit is derived from a logistic distribution and takes the following functional form:

`L = Π_n [exp(z_n) / (1 + exp(z_n))]^(y_n) × [1 / (1 + exp(z_n))]^(1 - y_n)`

While this formula might seem complex, it follows the same logic as the coin toss example. Here we have a probability for `y = 1`:

`Pr(y = 1) = exp(z) / (1 + exp(z))`

and a probability for `y = 0`:

```
Pr(y = 0) = 1 - exp(z) / (1 + exp(z))
          = 1 / (1 + exp(z))
```

The likelihood formula shown above is the joint probability distribution of the binary outcome variable y.
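To see that the two probabilities above really are complements, here is a small Python check (the helper name `pr_one` is mine, purely for illustration):

```python
import math

def pr_one(z):
    """Pr(y = 1) = exp(z) / (1 + exp(z))"""
    return math.exp(z) / (1 + math.exp(z))

for z in (-2.0, -0.5, 0.0, 0.5, 2.0):
    pr_zero = 1 - pr_one(z)
    # Pr(y = 0) simplifies to 1 / (1 + exp(z))
    assert abs(pr_zero - 1 / (1 + math.exp(z))) < 1e-12
    assert abs(pr_one(z) + pr_zero - 1) < 1e-12  # probabilities sum to one
print("ok")
```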

The likelihood of the Probit model is derived from a normal distribution and looks like this expression:

`L = Π_n Φ(z_n)^(y_n) × (1 - Φ(z_n))^(1 - y_n)`

where `Φ` is the standard normal CDF (`normal()` in Stata).

In both likelihood functions, `z = X beta`, where `X` is the matrix of explanatory variables and `beta` the parameter vector, and `n` indexes the observations going from 1 to `N`.

Since these formulas are a pain to solve “by hand”, especially the Probit model, I will stick to the Logit, the easier one to deal with. Like the coin toss example, I will use a custom set of values to keep this relatively tractable. Let’s generate five observations:

```stata
clear all
set obs 5
gen y = .
gen x = .
replace y = 1 in 1
replace y = 0 in 2
replace y = 0 in 3
replace y = 1 in 4
replace y = 1 in 5
replace x = -1 in 1
replace x =  1 in 2
replace x = -4 in 3
replace x =  8 in 4
replace x = -5 in 5
```

where `y` is the outcome variable and `x` is the explanatory variable. There can be many more explanatory variables, but since we want to graph the likelihood functions, we just use a single one. In our model we also include an intercept, such that `y` is explained as a logistic function of `z = alpha + beta x`.

For the observations above, the likelihood function can be derived as:

`L(alpha, beta) = invlogit(alpha - beta) × (1 - invlogit(alpha + beta)) × (1 - invlogit(alpha - 4 beta)) × invlogit(alpha + 8 beta) × invlogit(alpha - 5 beta)`

where `invlogit(z) = exp(z) / (1 + exp(z))`.

Since this is a pain to differentiate and solve for `{alpha, beta}`, we therefore make use of the log likelihood (LL) expression:

`LL(alpha, beta) = ln invlogit(alpha - beta) + ln(1 - invlogit(alpha + beta)) + ln(1 - invlogit(alpha - 4 beta)) + ln invlogit(alpha + 8 beta) + ln invlogit(alpha - 5 beta)`

which seems a bit more benign than the likelihood. The generic Logit LL can be written as:

`LL = Σ_n [ y_n × ln invlogit(z_n) + (1 - y_n) × ln(1 - invlogit(z_n)) ]`

Since the LL is explained by two parameters, `alpha` and `beta`, we can also visualize it as follows:

where we can see that the LL has a maximum point somewhere in the red zone.
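Before turning to Stata, we can sketch the same maximization in Python for intuition. The following toy Newton-Raphson loop (my own code, not the `ml` internals, though `ml` defaults to the same family of algorithm) maximizes the five-observation Logit LL; at the maximum the score (gradient) must be essentially zero:

```python
import math

ys = [1, 0, 0, 1, 1]     # the five outcomes from the data above
xs = [-1, 1, -4, 8, -5]  # the five x values

def invlogit(z):
    return 1 / (1 + math.exp(-z))

a, b = 0.0, 0.0  # starting values for alpha, beta
for _ in range(50):
    ga = gb = 0.0          # gradient (score)
    haa = hab = hbb = 0.0  # negative Hessian terms
    for y, x in zip(ys, xs):
        p = invlogit(a + b * x)
        ga += y - p
        gb += (y - p) * x
        w = p * (1 - p)
        haa += w
        hab += w * x
        hbb += w * x * x
    det = haa * hbb - hab * hab
    da = (hbb * ga - hab * gb) / det  # Newton step: (neg. Hessian)^-1 * score
    db = (haa * gb - hab * ga) / det
    a += da
    b += db
    if abs(da) + abs(db) < 1e-10:
        break

assert abs(ga) < 1e-6 and abs(gb) < 1e-6  # score ~ 0 at the maximum
print("converged:", round(a, 3), round(b, 3))
```

The converged `(a, b)` pair sits at the top of the surface plotted above; the `ml` run below recovers the same kind of solution with standard errors attached.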

In order to solve this in Stata, we make use of the standard `ml` functions (see `help ml`). For the Logit case, we need to define a program as follows:

```stata
cap prog drop mllogit
program define mllogit
    args lnf Xb
    quietly replace `lnf' = ln(    invlogit(`Xb')) if $ML_y1==1
    quietly replace `lnf' = ln(1 - invlogit(`Xb')) if $ML_y1==0
end
```

Based on the code above, the `ml` arguments are described as follows. The program requires arguments (`args`), where we provide two: `lnf` and `Xb`. These could be any names, but these are suggested in the Stata documentation for clarity. The first, `lnf`, is the variable that holds each observation’s contribution to the log likelihood (think “ln f”). The second, `Xb`, is the generic term for the linear combination of covariates. We specify the conditions for the dependent variable y=0 and y=1 separately. Here y is denoted by the generic term `$ML_y1`. The `invlogit` is a built-in function where `invlogit(x) = exp(x)/(1 + exp(x))`. An alternative way to specify the above conditions would have been:

```stata
cap prog drop mllogit2
program define mllogit2
    args lnf Xb
    quietly replace `lnf' =       - ln(1+exp(-`Xb')) if $ML_y1==1
    quietly replace `lnf' = -`Xb' - ln(1+exp(-`Xb')) if $ML_y1==0
end
```

which is a bit more verbose, but this is up to taste. More advanced programmers would save the program in a separate ado-file that is called from a do-file. But for now, we keep the program inside the do-file itself to avoid complicating the whole thing further. Note that for the optimization it does not matter whether you use `log` or `ln`; it won’t change the results.
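We can also verify numerically that the two programs define the same LL. A small Python check (my own, outside Stata) of the two identities used in `mllogit2`:

```python
import math

def invlogit(z):
    return 1 / (1 + math.exp(-z))

for z in (-3.0, -1.0, 0.0, 1.0, 3.0):
    # mllogit vs mllogit2, y == 1 branch: ln(invlogit(z)) == -ln(1 + exp(-z))
    assert abs(math.log(invlogit(z)) - (-math.log(1 + math.exp(-z)))) < 1e-12
    # y == 0 branch: ln(1 - invlogit(z)) == -z - ln(1 + exp(-z))
    assert abs(math.log(1 - invlogit(z)) - (-z - math.log(1 + math.exp(-z)))) < 1e-12
print("identical")
```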

Next step we need to call this program in the `ml` instance. This is done as follows:

`ml model lf mllogit (y = x)`

Here `lf` is the linear-form evaluator. There are several other evaluators built into Stata, which are used for the `optimize` functions as well. I won’t go into their details here since that requires introducing some more theory, but basically `lf` corresponds to `lf0` (the default), which estimates the parameters only; `lf1` estimates the parameters plus the gradients (first derivatives); and `lf2` calculates the parameters, gradients, and the Hessian (second derivatives). These require additional inputs in the `ml` functions and are used for more complex problems, especially if the functions are highly non-linear, some custom initialization is required, or there are restrictions on the gradients and the Hessians.

After we run the above `ml` command, a bunch of options open up. You can try these and also look up their documentation:

```stata
ml check  // conducts tests to see if the ML is properly set up
ml query  // a summary of the problem setup
ml search // searches for decent initial values
ml plot x // same as search but uses a graphical interface
ml graph  // graphs the log likelihood value against the iterations
```

In order to actually solve the model, we can write:

`ml maximize`

which gives us this output:

where we get the intercept (`alpha`) and the slope on the x variable (`beta`). We can also check this against the standard `logit` command:

`logit y x`

which also gives the same output but with fewer iterations. Here `logit` (or `logistic`; both commands fit the same model in Stata) also utilizes the `ml` machinery in the background to derive these results.

Just as an additional point, and since several people struggle with this, we can also draw the logistic curve as follows:

```stata
logit y x
mat li e(b)
global b0 = e(b)[1,2]  // intercept (_cons)
global b1 = e(b)[1,1]  // slope on x
display "($b1 * x) + $b0"
cap drop yhat
predict yhat
twoway ///
    (scatter yhat x) ///
    (function exp(($b1 * x) + $b0) / (1 + exp(($b1 * x) + $b0)), range(-60 60)) ///
    , ytitle("Probability") legend(order(1 "Fitted values" 2 "Fitted curve"))
```

which gives us this figure:

With logit functions one can also derive margins plots, odds ratios, etc., but we won’t touch these additional post-estimation calculations here. They can be extracted from the derived estimates, and it is useful to refer to the Stata manual or some textbooks.

# Probit functions

Instead of the Logit, we can also use the Probit function derived from the normal distribution. The LL for the Probit is simply the log of the likelihood we defined earlier. Since the actual LL for our data looks fairly ugly (there wouldn’t be enough space to fit it even in one or two lines), I won’t post it here, but this sort of stuff is easy to derive symbolically in Mathematica.

Just out of curiosity, if we compare the LL functions of the Probit and the logit, we can see the differences clearly:

The flatter surface is the Logit LL, while the more concave shape is the Probit LL. Here you can also see that the edges of the Probit surface start getting a bit rough. These are most likely out-of-sample domains where one starts hitting some non-linear combination of the variables.

In Stata, we can define the LL for the Probit as follows:

```stata
cap prog drop mlprobit
program mlprobit
    args lnf xb
    quietly replace `lnf' = ln(    normal(`xb')) if $ML_y1==1
    quietly replace `lnf' = ln(1 - normal(`xb')) if $ML_y1==0
end

ml model lf mlprobit (y = x)
ml maximize
```

which gives us:

This can also be compared with the standard Stata command:

`probit y x`

and to draw the graphs that compare the Probit with the Logit, we can use the following set of commands:

```stata
probit y x
global p0 = e(b)[1,2]  // intercept
global p1 = e(b)[1,1]  // slope
* $b0 and $b1 are the logit estimates stored in the previous section
twoway ///
    (function y = exp(($b1 * x) + $b0) / (1 + exp(($b1 * x) + $b0)), range(-60 60)) ///
    (function y = normal(($p1 * x) + $p0), range(-60 60)) ///
    , ytitle("Probability") ///
    legend(order(1 "Logit" 2 "Probit"))
```

which gives us this graph:

Here we can see differences in the cumulative density functions (CDFs) for the two functions.
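The tail behavior behind those differences can also be checked numerically. A small Python comparison of the two CDFs (logistic versus standard normal, the latter built from the error function; this is my own cross-check, not Stata output):

```python
import math

def logit_cdf(z):
    return 1 / (1 + math.exp(-z))

def probit_cdf(z):
    # standard normal CDF via the error function
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

# both are proper CDFs centered at zero ...
assert abs(logit_cdf(0) - 0.5) < 1e-12
assert abs(probit_cdf(0) - 0.5) < 1e-12
# ... but the logistic has fatter tails
assert logit_cdf(-3) > probit_cdf(-3)
print(round(logit_cdf(-3), 4), round(probit_cdf(-3), 4))  # → 0.0474 0.0013
```

Both curves cross 0.5 at `z = 0`, but the logistic keeps noticeably more mass in the tails, which is exactly the gap visible in the graph.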

## Using actual data

Let’s try what we learned on actual data. Here we will use Stata’s union dataset, which is also frequently used in the help files as an example with binary outcomes.

`webuse union, clear`

Here we use a simple specification, where we explain union membership (a binary outcome) as a function of age, grade (education completed), (living in the) south, and (being) black.

We can run a simple `logit` and `probit` in Stata as follows:

```stata
logit  union age grade south black
probit union age grade south black
```

In ML form we can use the functions we defined above:

```stata
cap prog drop mllogit
program define mllogit
    args lnf Xb
    quietly replace `lnf' = ln(    invlogit(`Xb')) if $ML_y1==1
    quietly replace `lnf' = ln(1 - invlogit(`Xb')) if $ML_y1==0
end

cap prog drop mlprobit
program mlprobit
    args lnf xb
    quietly replace `lnf' = ln(    normal(`xb')) if $ML_y1==1
    quietly replace `lnf' = ln(1 - normal(`xb')) if $ML_y1==0
end

ml model lf mllogit (union = age grade south black)
ml maximize

ml model lf mlprobit (union = age grade south black)
ml maximize
```

The code above will spit out the parameter values that maximize the LL function. You can also check the results against the standard commands. As mentioned earlier, the Stata `probit` and `logit` commands utilize the `ml` functions, which themselves use Mata’s `optimize`. The inter-connectedness of these routines also gives a picture of how various functions are built on top of each other. So while running `logit y x1 x2 …` seems like a fairly easy command to use, there is a lot happening in the background that makes it that easy!

# Other ML functions?

Since there are tons of distributions that one can apply to various data problems, there are tons of MLE functions as well. This UCLA website lists a bunch of them. But one I would like to highlight here is the MLE to deal with standard regressions. Using the auto dataset, this takes the following form:

```stata
sysuse auto, clear

capture program drop myols
program myols
    args lnf Xb lnsigma
    local y "$ML_y1"
    quietly replace `lnf' = ln(normalden(`y', `Xb', exp(`lnsigma')))
end

ml model lf myols (xb: mpg = weight foreign) (lnsigma:)
ml maximize
display exp([lnsigma]_cons)
```

which gives this output:

where `exp(lnsigma)` is the ML estimate of the standard deviation of the error term, comparable to the root MSE of the regression. Now if we compare it with the standard OLS command:

`regress mpg weight foreign`

we get this output:

Note here that the coefficients are exactly the same, but the standard errors and the error-variance estimates are different. Unlike `logit` and `probit`, which also utilize `optimize` and hence give matching results, for OLS the estimation routes differ. The `regress` command uses standard matrix algebra and divides the residual sum of squares by `N - k`, while the ML route divides by `N` and relies on the asymptotic properties of the estimator. The differences are very small and vanish as the sample grows, but in finite samples the `regress` estimates are the unbiased ones.
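The `N` versus `N - k` point can be illustrated with a toy example outside Stata. The following Python sketch (made-up numbers, textbook simple-OLS formulas) fits a line and compares the two variance estimates:

```python
# toy data roughly following y = 2 + 3x + noise (made-up numbers)
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [5.1, 7.9, 11.2, 13.8, 17.1]

N, k = len(xs), 2  # k = slope + intercept
xbar = sum(xs) / N
ybar = sum(ys) / N
b1 = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / \
     sum((x - xbar) ** 2 for x in xs)
b0 = ybar - b1 * xbar
rss = sum((y - b0 - b1 * x) ** 2 for x, y in zip(xs, ys))

sigma2_ml = rss / N         # what the ml/lnsigma route estimates
sigma2_ols = rss / (N - k)  # what regress reports (unbiased)

assert sigma2_ml < sigma2_ols
print(round(sigma2_ols / sigma2_ml, 3))  # → 1.667, i.e. N/(N-k) = 5/3
```

The coefficients `b0` and `b1` are identical under both routes; only the variance scaling differs, and the ratio is exactly `N/(N-k)` regardless of the data.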

There are still several things not covered in this guide that are worth looking up if you are serious about learning `ml`. These include more complex functional forms, constraints on MLE functions, optimization techniques (Newton-Raphson, Berndt-Hall-Hall-Hausman, etc.), search bounds, etc. Currently there is only one book, by Gould, Pitblado, and Poi, which covers this topic in detail:

And that is it for this guide! Hope you found it useful. Please send comments, feedback and errors if you find any. These guides are revised every few months so feedback is always taken into account.