Engineering a Minion Army to prevent the moon from being stolen with Maximum Likelihood Estimation

Karthik Subramanian
9 min read · Mar 26, 2022


Introduction

These days, our interactions with technology generate data at an ever-increasing rate. With all of this data, how do we make sense of it? There are many ways to approach this million-dollar question. One of them is to try to understand how our data might have been generated, and that approach is the focus of this article.

Sometimes we understand the source of our data (e.g., sensor readings with known configurations, or camera images with known calibration), but other times we know very little about it. If we don't know the underlying source of our data, can we at least estimate it?

This article focuses on one approach to estimating the underlying distribution of our data: maximum likelihood estimation (MLE). When I first came across the term, I didn't gain an intuitive understanding of why the technique works, so I hope to clear up some of that confusion and provide a friendly introduction to the topic.

So, we need to estimate the underlying distribution, but where does MLE come into the picture?

When we get data, we are typically working with a sample of the population or problem of interest; collecting data on the entire population is usually infeasible (imagine the time, cost, overhead, and labor involved). Given our sample, the goal is to estimate the underlying distribution from which the dataset could have originated.

Sound daunting? I totally understand. There are countless distributions we could potentially come up with, but several well-known ones pop up in many real-world applications (e.g., the binomial, Poisson, Gaussian, F, and chi-square distributions). Since we know the problem we are trying to solve, we can let the problem domain dictate which distribution to try. Once we have settled on a distribution, we still need to tweak it to get something that could have reasonably generated our sample data.

Now that we have settled on a distribution, couldn’t we just try every possible combination of parameters until we get a reasonable-looking source distribution?

Theoretically, you could, but this approach fails badly in practice: there are infinitely many combinations of parameters we could try. Also, what is our proxy for knowing whether we have found a "reasonable-looking" underlying distribution? In the case of, say, a normal distribution, below is a small subset of the many possible mean and standard deviation combinations (tweaking the mean and standard deviation changes the width and height of the bell curve).

Figure 1: Subset of many possible mean and standard deviation combinations for a normal distribution
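To get a sense of how many shapes are on the table, here is a minimal sketch (using NumPy, SciPy, and Matplotlib, with made-up parameter values chosen purely for illustration) that plots a handful of candidate bell curves:

```python
import numpy as np
from scipy.stats import norm
import matplotlib.pyplot as plt

# A tiny, hand-picked subset of the infinitely many (mean, std) combinations.
# These particular values are made up purely for illustration.
params = [(0, 1), (0, 2), (2, 0.5), (-3, 1.5)]

x = np.linspace(-10, 10, 500)
for mu, sigma in params:
    plt.plot(x, norm.pdf(x, loc=mu, scale=sigma),
             label=f"mean={mu}, std={sigma}")

plt.legend()
plt.title("A few of the many possible normal distributions")
plt.show()
```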

Now that you are convinced that trying every possible combination of dials and knobs for our chosen distribution is inefficient, we still need to address how we know we have found a "reasonable-looking distribution that generated our data." This is where MLE comes into the picture.

What is our proxy?

In the case of MLE, our "proxy" is the probability of observing our dataset under the chosen distribution. In other words, given our dataset and a chosen distribution (based on our problem statement and other information), we want to find the settings of the dials and knobs that make our dataset as likely as possible. This proxy is a good starting point because the dataset is what we actually have in hand, so it is what we must work from.

All of these things will hopefully come together in our example that follows.

Storytime!

Before diving into the math, it's storytime! Grab a warm glass of milk (I personally am a Swiss Miss guy, but you do you!), some popcorn, a cozy blanket, and get comfortable.

Imagine that we are in the world of Despicable Me. Gru has been hatching a plan to steal the moon with the help of his horde of 100,000+ minions. He has developed a shrink ray that can shrink the moon down to a size small enough for him to keep.

Figure 2: Gru’s devious plan to steal the moon

If Gru steals the moon, humanity is threatened, since the moon drives tides and other natural processes on Earth. We can't let Gru get away with this, so we need to stop him!

Unfortunately, Gru is able to get his devious plan moving quickly because he has an army of minions to help him out. We will need to design 200,000+ of our own minions to put together a force that will stop Gru. Fortunately, you were able to obtain the blueprint for Gru’s minions as shown below:

Figure 3: Gru’s Minion Blueprint

We are able to design the minion exterior easily (though we skip the classic shade of yellow so that our minions can be distinguished from Gru's). Our minions can move around, but a formidable force of minions is nothing without communication. We need to design the minion brain so that the minions can communicate with and understand one another. This is where minion language comes into play.

Minion language is the coordinated set of sounds that minions use to communicate. We humans have a hard time understanding it, but we do know that it has a distinctive start sound and end sound, and that when a minion speaks one word after another, there are obvious pauses between words (which aren't too hard to detect). We want to keep our minion communication as fluid and efficient as possible, and one way to do that is to understand the average length (word count) of a minion dialogue and use it to find the probability of hearing the end sound.

Your chief scientist has recorded 50,000+ minion dialogues through hidden cameras in Gru's lab. Each dialogue is analyzed to produce its length (the recordings are annotated with where the pauses are, so finding the word count of a particular dialogue is easy).

Now that we are able to find the length of each of the 50,000+ minion dialogues, your job is to figure out the probability of hearing the end sound in order for you to engineer a more efficient force of minions and save humanity from Gru’s actions!

Carrying out MLE

We have a sample of 50,000+ minion dialogues, and we were able to find the word count of each one. To make our problem a bit easier, let's assume that the dialogues are independent (probably not the best assumption in context, but it will suffice); in other words, each dialogue isn't affected by the other dialogues that go on. We will also assume that the words within each dialogue are independent of one another, since minion language is mumbo jumbo to us.

From the problem setup, we know that the end sound occurs exactly once and that it signals the end of the dialogue. Also, since we know the word count of each dialogue, it makes sense to treat the end sound as the "last word" of the dialogue. To keep the approach general, let p be the probability that any given word is the end sound, so the probability that a given word is not the end sound is (1 - p). If we consider a dialogue with m words, then the probability of hearing the end sound on the m-th word (and not before) is:

Figure 4: Geometric Distribution
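Concretely, a dialogue with m words means we hear m - 1 non-end words followed by the end sound, so

P(\text{dialogue has } m \text{ words}) = (1 - p)^{m - 1} \, p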

Take a moment to examine why this probability makes sense in the context of our assumptions. In case you were wondering, this is the geometric distribution, which gives the probability of the first success occurring on the m-th trial. Now that we have this grounding in place, let's apply the observation to each of our 50,000+ dialogues. Suppose that we have a list of word counts of the form

Figure 5: Sample of word counts (general)

where W_{i} represents the word count of the i-th dialogue. We have assumed that each dialogue occurs independently, so we can easily find the probability of observing our entire sample under the geometric distribution.

Figure 6: Probability of sample
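Writing n for the number of dialogues in our sample and Geom(W_i; p) for the geometric probability of the i-th word count, independence lets us multiply the individual probabilities together:

P(W_1, \ldots, W_n) = \prod_{i=1}^{n} \mathrm{Geom}(W_i;\, p) = \prod_{i=1}^{n} (1 - p)^{W_i - 1} \, p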

Notice that we can re-arrange and condense this product into the following

Figure 7: Condensed form of the probability of the sample

Here I used properties of exponents and summation notation. Take a moment to verify that the equality makes sense. What we have now created is a likelihood function.

Instead of writing "Geom" over and over, notice that in our geometric distribution the only dial or knob to turn is the value of p, which is exactly what we want to solve for. Therefore, our likelihood function is:

Figure 8: Another way to write the equation in Figure 7
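That is, with p as the only dial left to turn,

L(p) = \prod_{i=1}^{n} (1 - p)^{W_i - 1} \, p = p^{n} \, (1 - p)^{\sum_{i=1}^{n} W_i - n}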

Generally, products are harder to work with due to numerical instability (imagine multiplying one very small number by another very small number: the product would be so tiny that your computer might not be able to store all the digits you need). To deal with this, we borrow a trick from algebra.

We can turn products into sums through the use of a logarithm. A neat property is that

Figure 9: Product Property of Logarithms
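In symbols,

\log(a \cdot b) = \log(a) + \log(b)

and applying this repeatedly turns any finite product into a sum of logarithms.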

Since the logarithm is an increasing function (i.e., monotonically increasing), maximizing our likelihood function is equivalent to maximizing the log of the likelihood function. Therefore, we can instead maximize the following:

Figure 10: Alternative optimization problem using log properties
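Taking the log of L(p) and using the product property above, the quantity we want to maximize over p becomes

\ell(p) = \log L(p) = n \log(p) + \left( \sum_{i=1}^{n} W_i - n \right) \log(1 - p)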

What we get here is commonly referred to as the log-likelihood function. To maximize it, we can use calculus: take the first derivative, find where it equals zero, and verify with the second derivative that we have indeed found a maximum.

Figure 11: First derivative of log-likelihood
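Differentiating the log-likelihood with respect to p gives

\frac{d\ell}{dp} = \frac{n}{p} - \frac{\sum_{i=1}^{n} W_i - n}{1 - p}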

Doing a bit of algebra, we get that

Figure 12: Solving for the optimal value of p
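Setting the first derivative equal to zero gives

\frac{n}{p} = \frac{\sum_{i=1}^{n} W_i - n}{1 - p}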

We cross multiply and distribute

Figure 13: Algebra from Figure 12 continued
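That is,

n (1 - p) = p \left( \sum_{i=1}^{n} W_i - n \right) \;\Longrightarrow\; n - np = p \sum_{i=1}^{n} W_i - np \;\Longrightarrow\; n = p \sum_{i=1}^{n} W_i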

Therefore,

Figure 14: Optimal value of p
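In other words, the maximizing value is

\hat{p} = \frac{n}{\sum_{i=1}^{n} W_i} = \frac{1}{\bar{S}}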

where S-bar is the sample mean of the word counts. We then check that this value of p is indeed a maximum by looking at the second derivative. Observe that the second derivative of our log-likelihood function is

Figure 15: Second derivative of the log-likelihood function
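Differentiating once more,

\frac{d^2\ell}{dp^2} = -\frac{n}{p^2} - \frac{\sum_{i=1}^{n} W_i - n}{(1 - p)^2}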

The first term is always negative, and the second is never positive (every dialogue has at least one word, so the sum of the word counts is at least n), which means the second derivative is negative. Therefore, the log-likelihood function is concave down at the value of p we solved for, so that value of p is a maximum of the log-likelihood function.

Therefore, given a set of minion dialogues, we can find the word count of each dialogue, take the mean of the word counts, and then take the reciprocal. This gives us our estimate of the probability of hearing the end sound.
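If you'd like to try this on a computer, here is a minimal sketch in Python (the word counts below are made up for illustration; in practice they would come from the chief scientist's annotated recordings):

```python
import numpy as np

def estimate_end_sound_probability(word_counts):
    """MLE of p under a geometric model: the reciprocal of the mean word count."""
    word_counts = np.asarray(word_counts, dtype=float)
    return 1.0 / word_counts.mean()

# Hypothetical word counts for a handful of recorded minion dialogues.
sample_dialogue_lengths = [3, 7, 4, 6, 5, 2, 8, 4]

p_hat = estimate_end_sound_probability(sample_dialogue_lengths)
print(f"Estimated probability of hearing the end sound: {p_hat:.3f}")
```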

What happens now?

Through maximum likelihood estimation, you are able to engineer a well-oiled minion army, flock to Gru's lab, and dismantle the rocket Gru was about to use as part of his plan. You also have your minion army confiscate Gru's shrink ray. Now that you have saved humanity, your minion army is in the mood for ice cream!

Figure 16: Ice Cream Truck

Conclusion

Hopefully, you found this article to be a fun introduction to the concept of maximum likelihood estimation without getting bogged down in terminology. Now that you have the tools to estimate a distribution from data, be curious. Go out there, explore, and keep learning something new each and every day! If you have any questions or comments, feel free to post them.
