Maximum Likelihood Estimation Part 2: The Deriving

Deriving the Binomial estimator from first principles

Rob Taylor, PhD
8 min read · Aug 27, 2022
Photo by Richard Horvath on Unsplash

Introduction

In an earlier post, I stepped through maximum likelihood estimation and showed how to derive the estimator for the Binomial model from first principles. It was a bit of a whirlwind, for sure, and I threw a lot of equations at you. If you were left wondering where those equations came from, you’ve come to the right place!

In this post, I’ll take you through some of the math in greater detail. This post does assume that you’re comfortable with some calculus and algebra. If these topics aren’t really your thing, fear not — I’ll try not to get too bogged down in the weeds so you can still follow along. However, I do run through some elements rather quickly, so if you find yourself getting lost, that's okay. I will provide links to resources where I can.

Okay, back to coin flips.

Deriving the Log-Likelihood

Let’s assume we have flipped a coin N times and have observed a total of K heads. Strictly speaking, each coin flip is a Bernoulli trial that can take on values of either 0 or 1. The probability of each of these outcomes is governed by the rate parameter, p, where P(X = 1) = p and P(X = 0) = 1-p. We can arbitrarily assign heads (or tails) to either 0 or 1, so, in keeping with the previous post, we’ll let heads = 1 and tails = 0.

Our sequence of coin flips will now be a series of 0’s and 1’s, which is convenient because we can derive the number of observed heads using a simple summation:

A simple summation (image by author).
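With x_i denoting the outcome of the i-th flip, that summation is just:

$$K = \sum_{i=1}^{N} x_i$$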

Having the situation set up in this way simplifies things somewhat and means the likelihood function can be expressed in the usual way:

The Binomial likelihood function (image by author).

where p is the unknown parameter to be estimated.
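Written out in full, with \mathcal{L} denoting the likelihood and keeping the binomial coefficient (which doesn’t depend on p), that’s:

$$\mathcal{L}(p \mid N, K) = \binom{N}{K} p^{K} (1 - p)^{N - K}$$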

Now that we have the likelihood function, we typically want to work with the log of this function. Why? Well, two reasons (actually, it’s really just one reason…).

If you take a small value — a value less than 1 — and multiply it by itself many times, the result gets smaller and smaller, and pretty soon you’ll be drowning in leading zeros and scientific notation (and, eventually, numerical underflow). Not ideal. If you take a look at the likelihood function above, you’ll see that we have small values raised to the power of K and N-K. The log transformation helps with this by essentially expanding the range between 0 and 1: it maps small positive values onto increasingly large negative numbers, spreading them out over a much more manageable scale.

The flip side of this is where you multiply a large value — a value greater than 1 — by itself many times. Here, the result grows larger and larger, which means you can quickly be dealing with some very unwieldy numbers (and, eventually, numerical overflow). In this case, the log transformation compresses these growing values, bringing them down to a more manageable range.
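To make that concrete, consider a probability of 0.5 multiplied by itself 100 times:

$$0.5^{100} \approx 7.9 \times 10^{-31}, \qquad \ln\left(0.5^{100}\right) = 100 \ln(0.5) \approx -69.3$$

The raw product is vanishingly small, but its log is a perfectly ordinary negative number.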

That was a very quick and dirty digression, but hopefully it provided enough of a rationale for why we like log transformations. Speaking of which, let’s now derive the log-likelihood function. To do so, we just need to take the log of both sides and simplify:

The log-transformed Binomial likelihood function (image by author).

So, all that’s happening here is that each term in the likelihood function is log-transformed.
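Written out, with \ell(p) denoting the log-likelihood, that gives:

$$\ell(p) = \ln\binom{N}{K} + K \ln(p) + (N - K) \ln(1 - p)$$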

For those who are still a little unsure of what has happened here, there are two things you need to know. First, if you take the log of the product of two values (actually, you can take any number of terms), that’s the same as adding the log-transformed values:

First thing you need to know (image by author).
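In symbols:

$$\ln(a \cdot b) = \ln(a) + \ln(b)$$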

The second is what happens when you take the log of values that have been raised to some power:

The second thing you need to know (image by author).
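That is:

$$\ln\left(a^{b}\right) = b \ln(a)$$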

If you equate b with K and N-K, and a with p and 1-p, hopefully you can see where the terms in the log-likelihood come from.

Onward!

First Derivative of Log-Likelihood Function

Right. To find the maximum likelihood estimate we need the first derivative of the log-likelihood function with respect to p. Remember, the derivative equals zero wherever the function reaches a minimum or a maximum. Now, because we know the log-likelihood for this example is a concave function of p, we know that a global maximum exists for some value of p, which will be the point at which the derivative is zero.

So, let’s take the first derivative of the log-likelihood function with respect to p:

Derivative of the log-likelihood function with respect to p (image by author).

Phoar!

If all these hieroglyphics seem like gobbledegook to you, all that’s happening here is the application of certain derivative rules. If the rules are unfamiliar to you, I have provided a very quick primer in the appendix below. But all that matters, really, is the result in the bottom line, because that is the derivative.
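For reference, that bottom-line result is:

$$\frac{d\ell}{dp} = \frac{K}{p} - \frac{N - K}{1 - p}$$

(The log of the binomial coefficient is a constant, so by the constant rule it contributes nothing.)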

Deriving the Maximum Likelihood Estimate

Okay, so we now have an expression for the derivative. The next step is to actually find the value that reduces this expression to zero. So, how do we do that?

Well, one method is trial and error: just plug in different values for p and see what you get. That’d get tedious pretty quickly, and fortunately there’s a better way. Remember, we want the value of p that coincides with the zero point of the derivative. This means that if we set the derivative to zero, we can then solve for p. So let’s do that.

But before we do, I always found it helpful to note the following equality:

Useful equality (image by author).
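Putting the two fractions of the derivative over a common denominator gives:

$$\frac{K}{p} - \frac{N - K}{1 - p} = \frac{K - Np}{p(1 - p)}$$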

If we then set the left-hand side to zero and do some algebra, we obtain the following:

Solving for p (image by author).
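For any p strictly between 0 and 1, that fraction is zero exactly when its numerator is zero:

$$\frac{K - Np}{p(1 - p)} = 0 \;\Longrightarrow\; K - Np = 0 \;\Longrightarrow\; p = \frac{K}{N}$$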

and ta-da! We have the result:

The maximum likelihood estimate (image by author).
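That is, with \hat{p} denoting the estimate, the MLE is simply the observed proportion of heads:

$$\hat{p} = \frac{K}{N}$$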

You can check this result by plugging the MLE into the first derivative. It should return a value of zero. But just to be sure, suppose we observe K = 7 heads out of N = 10 flips. The MLE for this data is therefore 0.7. Let’s plug these values in and see what happens:

Sanity checking (image by author).
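Plugging K = 7, N = 10, and p = 0.7 into the derivative:

$$\frac{7}{0.7} - \frac{10 - 7}{1 - 0.7} = 10 - 10 = 0$$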

Happy days.
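If you’d rather let a computer do the checking, here’s a quick sketch in Python (using NumPy) that runs the trial-and-error approach mentioned earlier: evaluate the log-likelihood over a grid of candidate values for p and see where it peaks.

```python
import numpy as np

K, N = 7, 10  # 7 heads out of 10 flips

# Trial and error: evaluate the log-likelihood over a grid of candidate p values.
# The binomial coefficient doesn't depend on p, so it's safe to drop it here.
p_grid = np.linspace(0.01, 0.99, 981)
log_lik = K * np.log(p_grid) + (N - K) * np.log(1 - p_grid)

# The candidate with the highest log-likelihood should sit at (or very near) K/N.
p_best = p_grid[np.argmax(log_lik)]
print(round(p_best, 3))  # 0.7

# And the derivative at the MLE should be zero (up to floating-point error).
p_hat = K / N
print(K / p_hat - (N - K) / (1 - p_hat))  # ~0.0
```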

A Note on Second Derivatives

In my earlier post, I mentioned that it is also necessary to check the second derivative of the likelihood function to ensure that a global maximum has been obtained. This is generally a good practice — but I’m not going to do that here. Given that we know the log-likelihood is concave, the solution we have derived is unique. However, for models with several parameters and complex likelihood functions, it’s probably a good idea to check your solutions — so long as an analytical solution exists. If not, you’ll likely need to use numerical methods to estimate the parameters of your model. All of this is a little beyond the scope of this post, but down the line I’ll discuss some of these methods.

Summing Up

If you’ve got to this point, thanks for hanging in there.

I covered a lot of rather technical stuff in this post and, to be fair, it was pitched with a more technical audience in mind. Nevertheless, even if calculus and algebra aren’t your strong suit, I’m hoping you got something useful out of this demonstration. I’m not sure how successful I was in avoiding the weeds, but hopefully the slight digressions helped you navigate them in some way.

Thanks for reading, and as always, feel free to reach out, or leave a comment.

Appendix

Some rules for derivatives.

The constant rule states that the derivative of a constant is zero:

Derivative of a constant (image by author).
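In symbols, for any constant c:

$$\frac{d}{dx}\,c = 0$$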

The identity rule states that the derivative of the identity function g(x) = x is 1. This function simply returns the value of its argument, which implies:

Derivative of the identity function (image by author).
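That is:

$$\frac{d}{dx}\,g(x) = \frac{d}{dx}\,x = 1$$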

The logarithmic rule states that the derivative of the natural logarithm (ln) is equal to the reciprocal of its argument:

Derivative of the natural logarithm (image by author).
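In symbols:

$$\frac{d}{dx}\,\ln(x) = \frac{1}{x}$$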

The multiplication by a constant rule states that the derivative of a differentiable function multiplied by a constant is the same as the constant multiplied by the derivative:

Derivative of a function multiplied by a constant (image by author).
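For a constant c and a differentiable function f:

$$\frac{d}{dx}\left[c \cdot f(x)\right] = c \cdot \frac{d}{dx}\,f(x)$$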

The sum rule states that the derivative of the sum of two differentiable functions is equal to the sum of the derivatives:

Derivative for the sum of two functions (image by author).
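That is, for differentiable functions f and g:

$$\frac{d}{dx}\left[f(x) + g(x)\right] = \frac{d}{dx}\,f(x) + \frac{d}{dx}\,g(x)$$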

The difference rule is just like the sum rule, except…you subtract rather than add. Madness.

The chain rule states that the derivative of the composite of two differentiable functions is equal to the derivative of the outer function, evaluated at the inner function, multiplied by the derivative of the inner function (with some abuse of notation):

Derivative for a composite function (image by author).
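In symbols:

$$\frac{d}{dx}\,f\big(g(x)\big) = f'\big(g(x)\big) \cdot g'(x)$$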

If you enjoyed this post and would like to stay up to date then please consider following me on Medium. This will ensure you don’t miss out on new content. You can also follow me on LinkedIn and Twitter if that’s more your thing 😉

