Parameter Inference — Maximum A Posteriori

Rahul Bohare
Towards Data Science
15 min read · Apr 6, 2017

I dare you to not laugh at this comic after going through the post. :D

In the previous post, we discussed the motivation behind the Maximum Likelihood Estimate and how to calculate it. We also picked up a useful trick: working with the log likelihood. Because the logarithm is a monotonic function, it preserves the critical points of the likelihood, which makes the entire process of estimating those critical points much easier.

Towards the end of the MLE post, I tried to motivate the move to MAP (Maximum A Posteriori) by asking a simple question:

What if the sequence looked like the following:

Img. 1: A sequence of two Heads

What do you think the probability of the third flip being Tails is?

Clearly, in this case,

Img. 2

Why? Plug #Tails (0) and #Heads (2) into the equation for theta_MLE,

Img. 3

and we arrive at the result. This result tells us that the probability of the next flip being Tails is 0, i.e., it predicts that no flip is ever going to turn up Tails and the coin will always show Heads. It is glaringly obvious that this is not the case (barring the extreme scenario where the coin is heavily loaded). This poses a big problem for parameter estimation, because it does not give us an accurate probability for the next flip. We know that even a fair coin has a 25% chance of showing two Heads in a row (0.5 x 0.5 = 0.25), so it is not that unlikely that the coin is fair.
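To make this concrete, here is a minimal Python sketch of the MLE computation for the HH sequence (the function and variable names are mine, purely for illustration):

    # theta is the probability of Tails; theta_MLE = #Tails / (#Tails + #Heads)
    def theta_mle(num_tails, num_heads):
        return num_tails / (num_tails + num_heads)

    print(theta_mle(0, 2))  # 0.0 -- the MLE claims Tails can never occur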

The Posterior

Although MLE is a strong tool in Machine Learning, it does not come without its flaws (as we just saw), and these flaws surface when we have a limited amount of data available. The problem with MLE is that it is a point estimate, i.e., it commits to one particular value of theta, and this leads to overfitting (for those who have not heard the term before, I'd refer you to this answer on Quora). Because it is a point estimate, it overfits to the data (2 Heads in a row) and does not take into account the possibility that the coin might still be fair (or might be only slightly biased towards Heads). The obvious question: how do we counteract this problem?

Typically, we have prior beliefs about the processes going on in the world that have nothing to do with math. Let's take a simple example: suppose you are betting in a game of "Guess the flip" against your friend. You might assume that your friend has rigged the coin slightly in their favor, maybe making it a bit loaded towards Heads: 55% Heads and 45% Tails (nobody is foolish enough to make a coin extremely biased towards one side, as that would be fairly easy to detect). These assumptions you make about a random process are called Priors. In technical terms:

A prior probability distribution, often simply called the prior, of an uncertain quantity is the probability distribution that would express one’s beliefs about this quantity before some evidence is taken into account.

Is there any way to incorporate these prior beliefs into our model mathematically?

YES: we can make the parameter theta itself a random variable, which also justifies the notation we have been working with, P(F = f|theta). [Note how nicely this assumption works out: conditioning on theta already implies that theta is a random variable.]

Now we can have distributions over theta, and incorporate the notion that even in extreme cases (HH), the coin being fair might still be a possibility. And as we'll see, this notion also helps us prevent Overfitting.

For any x, we want to be able to express the distribution of our parameter theta after seeing the data, and we do that by conditioning theta on the observed data sequence, like so:

Img. 4: Posterior notation

where D is a random variable that captures the observed sequence at hand, i.e., our data. And now that theta is a random variable, it can take on specific scalar values x. Because we are talking about coin flips, the above equation only makes sense for x in [0,1] (since x is a probability, it has to be between 0 and 1).

I realize that this is a lot to take in. So, let’s take a step back and try to understand what was going on in MLE and how it compares to our ongoing MAP estimation:

Img. 5: MLE - Likelihood is the probability of data given the parameter, and we were maximizing this w.r.t. theta.

And now we are interested in:

Img. 6: MAP — Maximizing the probability of theta after we have observed our data.

It’s not completely obvious right now what the MAP term intuitively means, but it will be by the time we are done with this post.

And in general,

Img. 7: MLE != MAP

We can apply the good old Bayes’ rule to demystify the MAP formulation a bit:

Img. 8: Bayes’ rule on MAP (Posterior)
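Written out explicitly (my reconstruction of the formula in Img. 8, using the notation from the surrounding text):

    p(\theta = x \mid D) = \frac{p(D \mid \theta = x) \; p(\theta = x)}{p(D)}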

The numerator consists of:

Img. 9: Numerator I — Likelihood

We know this part from MLE, with the only difference being that it is now evaluated at theta = x. It is called the Likelihood.

Img. 10: Numerator II — Prior

This is our prior belief about the value of theta before observing any data. It is called the Prior, and it prevents Overfitting. A little motivation on Priors in general:

In a lot of cases, we have reason to believe that some values of theta are more likely than others. For example, in our coin flip case, we'd expect the value to be somewhere around 0.5, with extreme values such as 0 or 1 being less likely. This is what the Prior captures, and it will prevent extreme results (Overfitting).

The denominator p(D), called the Evidence, is a normalizing constant and is not so important in our case. Without the denominator, the expression on the right-hand side is no longer a probability and will not range from 0 to 1. The normalizing constant allows us to get the probability of an event, rather than merely the relative likelihood of that event compared to another. Check out this discussion on StackExchange.
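For completeness, the Evidence can be written as the integral of the numerator over all possible parameter values (a standard identity, not spelled out in the images above):

    p(D) = \int_0^1 p(D \mid \theta = x) \, p(\theta = x) \, dx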

What we should take home from all this discussion is three terms:

  1. Prior
  2. Likelihood, and
  3. Posterior

and that they are connected by Bayes’ rule. So, the Prior is our belief about what theta looks like before we observe any data, and the Posterior is our updated belief about what theta looks like after we have observed some data.

Now we have established a link between the likelihood (which we used throughout the last post) and the posterior (which is of interest now). The link between the two is the Prior distribution, and the Prior distribution is part of our model. So it is we who get to choose the prior. The point of the prior is that we have to choose it without looking at the observations, and this leaves room for (possibly subjective) model assumptions. There is no particular rule that a prior needs to fulfill, except that it needs to be a valid probability distribution:

Img. 11: Prior is just a probability distribution.

Let’s see a couple of examples of Priors:

  1. We assume that possibilities other than 0 (which was the MLE for HH sequence) exist. Now this prior is extremely weak in the sense that it does not give much information about what those possibilities actually are.
  2. We can also assume that the parameter theta is most probably in the region between 0.4 and 0.5, which is around the true probability of 0.5 (despite the MLE solution telling us that it is 0). Since we are confining our assumption to a certain region, this is an example of a fairly strong prior.

These subjective assumptions are also called Inductive Biases. By introducing these subjective assumptions, we are biasing our analysis/model towards certain solutions. It’s something we must be aware of while writing down a model.

MAP Estimation

The following graphs illustrate a few choices we have for the prior on theta:

Img. 12: Prior choices (x-axis represents probability for theta. y-axis represents probability density)

The first graph shows a uniform distribution over all possible parameter values. It does not put any inductive bias on the model, and hence does not prefer any solutions for theta over others. In other words, maximizing the posterior under this prior gives back the Maximum Likelihood Estimate. So we can say that MLE is a special case of MAP, where the prior is uniform.

The other three options put some inductive bias on theta. In particular, we can see that all of them are centered around 0.5 with different widths (one cookie if you can deduce the decreasing order of strictness that the last three models impose on the prior ;)). This means that all three assume the real solution lies in a small interval around 0.5, with varying interval widths.

Okay, now we have some idea of how our prior should look. Can we find more motivation for the choice of prior model? One possible motivation is that we can choose a prior that keeps the subsequent computations from getting very extensive. This is one of the driving factors in choosing an appropriate prior for a specific model (in our case: the coin flip).

Since the prior is just another probability distribution, we'd like to choose it in such a way that it makes our subsequent calculations easier. And it turns out that whenever we are looking for a prior on a parameter that directly corresponds to a probability (as in our case, where theta corresponds to the probability of Tails in a coin flip), such well-fitting priors usually exist.

In our case, this prior is going to be a Beta distribution:

Img. 13: pdf of a Beta distribution

The Beta distribution can be understood as representing a distribution over probabilities; that is, it represents all the possible values of a probability when we don't know what that probability is. For a more intuitive understanding, see this.
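For reference, the density in Img. 13 has the standard Beta form (my transcription, with a and b the two shape parameters):

    p(\theta = x) = \frac{\Gamma(a+b)}{\Gamma(a)\,\Gamma(b)} \; x^{a-1} (1-x)^{b-1}, \qquad x \in [0, 1]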

Okay, this looks very byzantine and disheartening at first glance, but stay with me here. Let’s look at it closely; we start to notice terms which we have already seen before:

Img. 14: We have seen an extremely similar form in MLE.

We have seen the component in Img. 14 in MLE, except that the #Heads and #Tails in the exponents have been replaced with a-1 and b-1.

Along with the theta terms, we have a normalizing constant, which contains Gamma functions. The gamma function is nothing more than an extension of the factorial function, with its argument shifted down by 1:

Img. 15: Gamma function

The special characteristic of the gamma function is that it allows us to evaluate the "factorial" at any positive real value (the gamma function of 1.5 is well defined, whereas 1.5! is not).
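A quick sanity check in Python (this snippet is my illustration, using the standard library's math.gamma):

    import math

    # Gamma(n) equals (n - 1)! for positive integers n
    print(math.gamma(5), math.factorial(4))  # 24.0 24

    # Gamma is also defined between the integers...
    print(math.gamma(1.5))  # 0.8862... (= sqrt(pi)/2)

    # ...whereas the factorial is not: math.factorial(1.5) raises an error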

The fact that we have seen these pieces before is no coincidence: we've chosen a Conjugate Prior. Whenever a prior has a functional form similar to the likelihood, so that the two fit well together, we speak of so-called Conjugate Priors, and the Beta distribution is a conjugate prior for the likelihood of our coin flip experiment. The interpretation is the following: without seeing any data, we have a feeling about how our coin flip experiment is going to turn out. If we were to do a+b-2 coin flips, we can choose the prior as if a-1 of those flips showed Tails and the remaining b-1 showed Heads.

So, if we would assume the coin to be unbiased, we’d choose a and b to be equal. And the more certain we are about this assumption, the higher we’d choose a and b.

The following graphs illustrate a few choices of a and b for our Beta distribution:

Img. 16: Four different Beta distributions (x-axis represents the probability for theta. y-axis represents density)

If you wanna play around with different Beta distributions, check this iPython notebook here.

Interpretations of four graphs:

a. Choosing a and b to be 1 (equal) signifies that we have no assumptions about our experiment whatsoever. This is an important feature of the Beta distribution: it reduces to a uniform distribution when a = b = 1.

b. Choosing a < b expresses that we expect fewer Tails than Heads. The graph illustrates this clearly: the probability of Tails is likely to be less than 0.5.

c. Here we are twice as confident in the flip being Tails as in it being Heads. As we can see from the graph, the probability of the flip being Tails is biased to be greater than 0.5.

d. Choosing a and b to be smaller than 1 favors extreme solutions (the density piles up near 0 and 1). Such non-integer choices of a and b unfortunately have no viable physical interpretation as pseudo-counts, but they give us a lot of flexibility in modeling the prior distribution.

In a sense, the more certain we are about our prior assumption, the higher we can choose a and b, as is visible from graphs 'b' and 'c' in Img. 16.
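If you'd rather poke at this numerically than in the notebook linked above, here is a small sketch using scipy.stats.beta (the (a, b) pairs are my guesses in the spirit of the four panels, not the exact values behind Img. 16):

    from scipy.stats import beta

    # Hypothetical (a, b) pairs echoing the four panels
    priors = {
        "uniform": (1, 1),
        "few Tails": (2, 4),
        "leans Tails": (8, 4),
        "extremes": (0.5, 0.5),
    }

    for name, (a, b) in priors.items():
        dist = beta(a, b)
        # Compare the density at a fair coin vs. a heavily biased one
        print(f"{name:>12}: pdf(0.5) = {dist.pdf(0.5):.3f}, pdf(0.95) = {dist.pdf(0.95):.3f}")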

Now that we have reasoned through various arguments that Beta distribution is a suitable prior for our coin flip experiment, let’s plug everything we know into the Bayes’ formula we used in Img. 8:

Img. 17: Bayes formula for Posterior

We know:

Img. 18: MLE part in our MAP solution

Now you might be thinking that the likelihood we derived before had theta in it:

Img. 19 : Likelihood

The reason we now have x instead of theta is that we have assumed theta itself is a random variable, which can take on a specific value x.

Since we have already decided that the prior is a Beta distribution, it takes on the following form:

Img. 20: Prior

Plugging these values in the Bayes’ rule:

Img. 21: MAP formulation

The second statement introduces a proportionality sign by leaving out p(D) as well as the normalizing constant of the Beta distribution, as both are constants w.r.t. x. (Leaving out constants does not change the position of the maximum, only its value.)

Consider the second statement in the above image. We can see why the Beta distribution, and Conjugate Priors in general, are powerful: this term looks exactly like the likelihood, except that it carries the correcting terms a-1 and b-1. So no matter how extreme |T| and |H| are (as in the case where we only flipped the coin twice), we have these correcting terms a and b.

What we are essentially doing here is pulling the result away from the extreme outcomes we got in the maximum likelihood case with too little data. And the reason why conjugate priors are such an excellent choice of priors is that they have exactly this effect.
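Spelled out (my reconstruction of the proportionality in Img. 21, with |T| observed Tails and |H| observed Heads):

    p(\theta = x \mid D) \propto x^{|T|} (1-x)^{|H|} \cdot x^{a-1} (1-x)^{b-1} = x^{|T|+a-1} (1-x)^{|H|+b-1}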

Okay, now we are left with the following term:

Img. 22: MAP with proportionality

How do we find the multiplicative constant that turns the proportionality sign into an equality sign?

Let’s take a detour to figure out how to come up with solutions to certain integrals which are not solvable (or, at least, very difficult to solve) by conventional methods:

Img. 23: A not-so-easy integral

Do you think you can solve this integral on paper analytically? I am not a math major, but to me the answer is both yes and no. Yes, you can, but it would take a considerable amount of time to arrive at the result by conventional calculus rules, assuming you manage to find the right sequence of steps at all. And no, for essentially the same reason: it takes too long.

Now if you know your probability distributions very well, you would know that the following integral is:

Img. 24: Standard gaussian pdf

the integral of the pdf of a standard Gaussian with zero mean and unit variance, i.e., X ~ N(0,1), and it equals 1. If you want to brush up on Gaussians, I think the Wikipedia article is pretty solid.

If the above integral equals 1, it is just a matter of rearranging terms, multiplying both sides by sqrt(2*pi), to arrive at the answer to the integral we were actually talking about (Img. 23), and we see that:

Img. 25 : Answer to the not-so-easy integral

Hence, in a sense, we reverse-engineered our way to arrive at the solution to a considerably difficult integral.
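In symbols (my reconstruction of Imgs. 23-25):

    \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{-x^2/2} \, dx = 1
    \quad\Longrightarrow\quad
    \int_{-\infty}^{\infty} e^{-x^2/2} \, dx = \sqrt{2\pi}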

Now that we have a pretty handy trick under our belt, let’s see if we can apply it to find out the proportionality constant for our MAP estimate.

We know that our prior distribution integrates to 1:

Img. 26: The only constraint on prior

AND the RHS of the expression for MAP in Img. 22 is proportional to a Beta pdf, which integrates to 1. Notice the uncanny similarity to our trick?

Consequently, by the same reverse-engineering, the posterior must also be Beta distributed, and the only constant that works here is the normalizing constant of the corresponding Beta distribution:

Img. 27: Constant of Proportionality for our MAP

There is no other constant that lets the RHS of our MAP estimate integrate to 1; and since it has to integrate to 1, this is the only working constant.

Whenever we have to solve a difficult integral, we examine it closely and check whether it looks like a pdf. If it does, we can reverse-engineer the normalizing constant of that pdf and then easily solve the integral.

Img. 28: Note on the reverse engineering trick

An immediate and very nice consequence of this reverse-engineering trick is that we can directly read off the distribution of our posterior: we know that it has to be a Beta distribution, and we also know its parameters:

Img. 29: Posterior — Beta distribution
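Matching exponents with the Beta pdf above, the posterior is (my transcription of Img. 29):

    \theta \mid D \;\sim\; \mathrm{Beta}(|T| + a, \; |H| + b)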

Now we can find the maximizing value of theta just like we did in the MLE post (I am skipping the actual calculation; it is a straightforward matter of taking the log of the equation in Img. 21 and finding the maximum), and we arrive at the following result:

Img. 30: MAP Solution

Comparing this MAP estimate with the ML estimate, |T|/(|H|+|T|), we see that they are very similar. The term |T|/(|H|+|T|) pops up again in the MAP estimate, except that it is corrected by our prior belief.
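For reference, the two estimates side by side (my reconstruction of Img. 30, for a Beta(a, b) prior; theta_MAP is the mode of the Beta(|T|+a, |H|+b) posterior):

    \theta_{\mathrm{MLE}} = \frac{|T|}{|T| + |H|}, \qquad
    \theta_{\mathrm{MAP}} = \frac{|T| + a - 1}{|T| + |H| + a + b - 2}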

Under our prior belief and after seeing the (i.i.d.) data, theta_MAP is the best guess for theta.

It is called the Maximum A Posteriori (MAP) Estimate.

Notice the most important thing here: The Posterior works very intuitively. If we have very little data (like in our second experiment where we had only 2 coin flips, both of them being Heads), the effect of the prior is much stronger. On the other hand, for a fixed a & b, if we perform sufficiently many coin flips (for instance, a million), then the effect of a & b is almost negligible and we get very close to the MLE. This is very intuitive because in the low data regime, the prior prevents overfitting whereas if we have a lot of reliable information from the random process, we don’t really need the prior information anymore. We can get all of it from the data itself. In other words, MLE dominates when we are swimming in data.
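A quick numerical sketch of this behavior, reusing the MAP formula above with a hypothetical Beta(5, 5) prior (the prior strength is my choice for illustration):

    def theta_map(tails, heads, a=5, b=5):
        # Mode of the Beta(tails + a, heads + b) posterior
        return (tails + a - 1) / (tails + heads + a + b - 2)

    # Two flips, both Heads: the prior pulls us far away from the MLE of 0
    print(theta_map(0, 2))  # 0.4, whereas the MLE would say 0.0

    # A million flips at the true rate of 0.5: the prior barely matters
    print(theta_map(500_000, 500_000))  # ~0.500002, essentially the MLE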

Is this the end of the process? We have already incorporated our assumptions about the way the world works, the way random processes unfold. What more is left to accomplish? Now we can just sit back and watch the LOTR trilogy.

Or can we? As it turns out, when we incorporated our prior beliefs into the model, we implicitly favored some solutions over others (by assuming a Beta distribution over the coin flip process, we biased the MAP estimate towards a particular solution: a form that looks like a Beta pdf). What about all the other priors that could also describe the process, albeit in very subtle ways? Is there some way we can incorporate them into our model too? As you already know, the answer is YES, and we will discuss it in depth in the next post, when we talk about the Fully Bayesian Analysis of our model.

I realize that this post was not easy to follow at times; there was a lot of math involved. I hope I made the process of MAP Estimation a little easier and more intuitive to understand. If you find any specific part too esoteric, feel free to leave a comment, and I will attend to it as soon as I can.

Resource: the ML course taught at my grad school, TU Munich. Feel free to watch the lectures here.

If you find this post interesting, please recommend and share it so that others can benefit from it as well.


“I… a universe of atoms… an atom in the universe.” — Richard Feynman