Stats Series #2 — MCMC

Konark Jain · Published in IITG.ai · Nov 13, 2019 · 7 min read
MC Bayes — The dopest reverend there ever was.

Sampling

What is sampling? Sampling is the act of "picking n balls out of an urn". More mathematically, it is the process of drawing an observation from a population that has a well-defined probability distribution over the observations.

Now, this probability distribution of your data is almost never available to us. We would like to figure it out by sampling so many times that we come close to exhausting all the probable observations in the sample space, and can therefore build an approximate probability distribution. This is very readily doable in the univariate case, i.e. when only one variable is varying. When it comes to multiple variables, we kind of run with our tails between our legs to Reverend Bayes (shown above) and his famous Bayes Rule:
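Writing A for what we observe and B for what we want to infer, so that P(A|B) is the likelihood term in the next sentence, the rule reads:

P(B|A) = P(A|B) P(B) / P(A)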

For this, we now need to know the likelihood, i.e. P(A|B), which is, surprise surprise, very difficult!

Ancestral Sampling

So what if you had some information about the variables in the multivariate case? Let's say they were affecting each other directly and formed a network of ancestors.

An ancestral network for the variables x(i)

An example can be seen here. We can then calculate all the likelihoods easily using this diagram and just use sampling in a forward sense. For example, here we can sample x1 and x2, then x3, then x4 and x5, and finally x6. And we get:
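Reading the arrows off the figure (x3 depends on x1 and x2, x4 and x5 depend on x3, and x6 depends on x4 and x5), the joint factorizes as:

p(x1, …, x6) = p(x1) p(x2) p(x3|x1, x2) p(x4|x3) p(x5|x3) p(x6|x4, x5)

Just to make the forward pass concrete, here is a toy sketch in Python. The structure follows the network above, but every probability in it is made up purely for illustration:

```python
import random

# Ancestral (forward) sampling for the toy network above:
# x1, x2 are roots; x3 depends on x1, x2; x4, x5 on x3; x6 on x4, x5.
# All the Bernoulli probabilities below are invented for illustration.
def sample_network():
    x1 = random.random() < 0.7                             # p(x1)
    x2 = random.random() < 0.6                             # p(x2)
    x3 = random.random() < (0.9 if x1 and x2 else 0.1)     # p(x3 | x1, x2)
    x4 = random.random() < (0.8 if x3 else 0.05)           # p(x4 | x3)
    x5 = random.random() < (0.8 if x3 else 0.05)           # p(x5 | x3)
    x6 = random.random() < (0.95 if x4 and x5 else 0.02)   # p(x6 | x4, x5)
    return x1, x2, x3, x4, x5, x6

samples = [sample_network() for _ in range(10_000)]
print(sum(s[5] for s in samples) / len(samples))           # estimate of p(x6 = 1)
```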

Pretty simple, right? In many cases we are able to build such networks, popularly known as belief networks, but ancestral sampling has an obvious problem.

What if we had some conditions on a specific variable? For example, let's say you're planning a trip: the X3 guy will only come if X1 and X2 are coming, X4 and X5 won't come if X3 is not coming, and X6 wants X4 and X5 to make it to the trip before he even considers paying a penny. Now, what if we want to check the scenario where X6 is definitely going? What is the scene now? What will the trip bus look like?

Well to be short, it gets complicated.
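Written out for the network above, fixing x6 = 1 means we want

p(x1, …, x5 | x6 = 1) = p(x1) p(x2) p(x3|x1, x2) p(x4|x3) p(x5|x3) p(x6 = 1|x4, x5) / p(x6 = 1)

and the denominator p(x6 = 1) is a sum of the numerator over every possible configuration of x1, …, x5. Simple forward sampling no longer works, because the ancestors now depend on the value we clamped downstream.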

What a mess, right?

This situation is known as having evidence and X6 is known as an evidential variable. Ancestral sampling is plain stupid when sampling with evidential variables.

Gibbs Sampling

This method of sampling is named after Josiah Willard Gibbs, in honour of his seemingly endless contributions to physics and chemistry (ah, when true polymaths walked this earth!); the algorithm itself was introduced much later by Stuart and Donald Geman.

The idea is simple. We clamp all the variables but one and sample that one variable. We therefore get a new sample from the previous one. We then repeat this for each variable in turn, so successive samples differ in only one variable. How is this sampling possible? Well, if you clamp all the variables except one, you're essentially sampling from the distribution:
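In symbols, with x_i the un-clamped variable and all the others held fixed, each step draws from

p(x_i | x_1, …, x_{i−1}, x_{i+1}, …, x_n)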

which is a univariate distribution and easy to sample from!

It’s quite simple but brilliant. The problem of evidential variables is solved pretty sweetly: we just don’t do probabilistic sampling for the variables which have evidence.
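As a quick illustration (mine, not from the original), here is a Gibbs sampler for a two-dimensional Gaussian with correlation rho, where both full conditionals happen to be known in closed form:

```python
import random

def gibbs_bivariate_gaussian(rho, n_steps, x=0.0, y=0.0):
    """Gibbs sampling for a standard bivariate Gaussian with correlation rho.

    The full conditionals are x | y ~ N(rho * y, 1 - rho**2) and
    y | x ~ N(rho * x, 1 - rho**2): each step clamps one variable and
    samples the other from a univariate Gaussian.
    """
    sd = (1 - rho ** 2) ** 0.5
    samples = []
    for _ in range(n_steps):
        x = random.gauss(rho * y, sd)   # clamp y, sample x | y
        y = random.gauss(rho * x, sd)   # clamp x, sample y | x
        samples.append((x, y))
    return samples

chain = gibbs_bivariate_gaussian(rho=0.95, n_steps=5000)
print(chain[-1])
```

Try rho = 0.99 and watch how little the chain moves per step: that is exactly the complaint below.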

As you can imagine, Gibbs sampling is extremely slow to move around the sample space. It is also close to useless when the variables are highly correlated.

Markov Chain Monte Carlo (MCMC)

We generalize the above method into a much more powerful sampling technique known as MCMC.

I am writing this blog as a precursor to the research paper I published in the Neurocomputing journal some time ago. It can be viewed here on the journal's website and here on arXiv. It seeks to improve MCMC and I like to call it MCMCMC (for multi-chain MCMC), but officially it is called Parallel Tempering MCMC. Anyway, the following math is extremely clever and this is the most exciting part of the blog.

So! Let’s begin.

Let's say we know what our multivariate distribution is, in the sense that we know what it looks like.
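In symbols, with the normalizing constant spelled out:

p(x) = p*(x) / Z,   where Z = ∫ p*(x) dx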

So here it is. x is a vector. We know p*(x) but not Z, which is an intractable normalizing constant. We often have this situation in practice. We can just assume the variables come from a normal distribution and start building the conditional distributions from there. We will now work with another distribution q_∞ which will converge to p(x) at the end of eternity.
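What we ask of the chain's transition distribution q(x’|x) is that p is left untouched by one step of it:

p(x’) = ∫ q(x’|x) p(x) dx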

This is always possible since an irreducible Markov Chain has the property of having a stationary distribution. q(x’|x) is the conditional distribution that can be extracted from p(x) using this MCMC technique. So basically what we want is to get this q(x’|x) somehow and just use that to do forward sampling. Since the above equation holds, we can then conclude that the samples we get are indeed from p(x) and we have solved the problem at hand! We don’t need to calculate Z and we can still sample from p(x).

Now we further simplify the task of attaining q(x’|x) by noting that q(x|x’), being a conditional distribution, integrates to 1 over x. Therefore, instead of checking the integral condition above directly, it is enough to demand the pointwise condition:
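q(x’|x) p(x) = q(x|x’) p(x’)

Integrating both sides over x, the left-hand side becomes ∫ q(x’|x) p(x) dx and the right-hand side becomes p(x’) ∫ q(x|x’) dx = p(x’), which is exactly the stationarity equation above.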

This is known as the detailed balance condition and we will be using it next. Now let's take a step back and look at what we really want to do. Sampling, as explained earlier, is making an observation out of an intractable PDF. So in our hands we have an initial guess at the PDF's shape, p*(x); we have our detailed balance condition; and we know that sampling from q(x’|x) is equivalent to sampling from p(x). We now want to construct this q(x’|x) from a proposal distribution q̅(x’|x) that we very often get from another model. In my paper (referred to above), we use the likelihood function of a simple neural network as the proposal. Let's now construct q from q̅.

Often in real life, we need a person to pull our shit together for us since we are just so incapable of doing it ourselves. Say we have a positive function f(x’,x), taking values in [0,1], that does this shit-pulling for us. Let's take notation for a ride and, just for simplicity, I will propose this: q = q̅ f. But we need:
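∫ q(x’|x) dx’ = 1 for every x

On its own, q̅(x’|x) f(x’,x) will generally integrate to something less than 1, because f can only shrink the proposal, never inflate it.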

So let's modify our proposal a tad to achieve the above. Observe that, with some conditions on f(x’,x) which we will find later (in the detailed-balance paragraph below), p(x’) can still be kept stationary even when we sometimes reject the proposed move and simply stay put.

So let’s use this observation to satisfy our needs. What if we take this as our q(x’|x):
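Here it is, with δ the Dirac delta and x’’ just a dummy integration variable:

q(x’|x) = q̅(x’|x) f(x’,x) + δ(x’ − x) [ 1 − ∫ q̅(x’’|x) f(x’’,x) dx’’ ]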

It is derived from the above observation and also satisfies the stationarity condition. Now we need to find f. How do we do that? Well, we have our detailed balance condition left to exploit, don't we? We need
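∫ q̅(x’|x) f(x’,x) p(x) dx = ∫ q̅(x|x’) f(x,x’) p(x’) dx

(Plugging our q into the stationarity requirement, the Dirac delta part balances on its own since it only fires when x’ = x; the integral equality above is all that is left to satisfy.)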

If the integrands themselves are equal, the equation above certainly holds. The following definition of f, known as the Metropolis-Hastings acceptance function, makes exactly that happen:
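f(x’,x) = min( 1 , p*(x’) q̅(x|x’) / ( p*(x) q̅(x’|x) ) )

Notice that only the ratio p*(x’)/p*(x) appears, so the unknown Z cancels and we never have to compute it.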

since:
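q̅(x’|x) f(x’,x) p*(x) = min( p*(x) q̅(x’|x) , p*(x’) q̅(x|x’) ) = q̅(x|x’) f(x,x’) p*(x’)

The middle expression is symmetric under swapping x and x’, so the two integrands from before are indeed equal (dividing through by Z turns each p* back into p).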

Now we have formulated our f and q given q̅ and p*. Great. Now, what do we do with those? Where is the actual sampling happening? We can interpret the equation of q(x’|x) in this way.

The first half before the + sign can be read as drawing x’ from q̅(x’|x) and accepting it with probability f(x’,x). The second half covers the case where the sample is rejected: the Dirac delta term fires, in other words x’ = x, and we simply keep the previous value as our new one.
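To make that loop concrete, here is a minimal Metropolis-Hastings sketch in Python. The one-dimensional target p_star, the Gaussian random-walk proposal, and the step size are stand-ins of mine, not anything from the post or the paper:

```python
import math
import random

def p_star(x):
    # Unnormalised target: two Gaussian bumps (purely illustrative).
    return math.exp(-0.5 * (x - 2) ** 2) + 0.5 * math.exp(-0.5 * (x + 2) ** 2)

def metropolis_hastings(n_steps, x=0.0, step=1.0):
    samples = []
    for _ in range(n_steps):
        x_new = random.gauss(x, step)       # draw x' from the proposal q̅(x'|x)
        # The proposal is symmetric, so the q̅ ratio cancels in f.
        f = min(1.0, p_star(x_new) / p_star(x))
        if random.random() < f:             # accept with probability f(x', x)
            x = x_new
        samples.append(x)                   # on rejection we keep x' = x
    return samples

chain = metropolis_hastings(50_000)
print(sum(chain) / len(chain))              # crude estimate of E[x] under p
```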

And we’re done. Phew. It was an immense amount of effort to write this blog with all the math involved and everything. Anyway, this algorithm can be summarised in this figure below:

A programmer reading this blog can skip the beautiful math above this algorithm and just use this 13 line beauty for their MCMC application.

Reference: Bayesian Reasoning and Machine Learning by David Barber
