Mastering Bayesian Linear Regression from Scratch: A Metropolis-Hastings Implementation in Python

Fortunato Nucera
9 min read · Jun 4, 2022

--

A couple of months ago, before the beginning of the course in Bayesian Computation I am currently taking, I was looking to get a deeper understanding of Markov Chain Monte Carlo (MCMC) simulations. Back then, I used to fire up my Jupyter notebook, import PyMC3, and build the model, ending it all with a call to the Metropolis sampler, without really understanding what was going on.

The intrinsic beauty and elegance of the method lies beneath several dozen concepts and definitions, which makes the entire study an ordeal and prevents the average student like me from appreciating what is really happening under the hood. In this article, I would like to provide an easier introduction to and a practical use case of MCMC through the Metropolis-Hastings (MH) algorithm. In particular, I will focus on inferring the parameters of a simple linear regression model. Saying “simple” does not do justice to the depth of the idea behind it, however. Replacing the normal distribution with any other conditional distribution (Poisson, Gamma, Negative Binomial, or others) would allow for an almost identical implementation of GLMs through MCMC, changing 4 or 5 lines of code at most. Also, since the samples for the inference come from the posterior distribution of the parameters, we can use them to build confidence intervals for the parameters quite naturally (saying “credible intervals” is more correct in this case).

This article will briefly describe why MCMC methods are used, offer an implementation of the MH algorithm for a linear regression model, and conclude with a visualization of what is really happening while the algorithm looks for the set of parameters which generated the data.

I was inspired to write this article by a lecture about the subject by Professor Nick Heard.

1. Why MCMC?

Imagine you are a cat — well this is not really necessary but bear with me — and you are surrounded by grass. Something in the distance catches your attention: it’s an ice-cream. On top of a mountain. Well, not really on top of a mountain, but at the very top of a group of mountains, which is surrounded by a forest, and the mountains blend into each other and are hidden by some very opaque clouds that prevent you from seeing through. How do you get to the ice-cream?

@rextheartistecat on Instagram drew this beautiful illustration for me! He’s super talented, go check his page!

You look for the Aztec golden coin you found in the jungle. And then you decide that every time you want to move a step, you will flip your coin twice and follow the rules below:

  • If you flip head-head, you move a step forward.
  • If you flip tail-tail, you move a step backward.
  • If you flip head-tail, you move a step to your right.
  • If you flip tail-head, you move a step to your left.

Is this a reasonable strategy to reach the ice-cream? I would argue it is not. The problem here is that the strategy is not informed. That is, we do not have an estimate of how likely the next step will take us closer to the refreshing goal. MCMC methods, and in particular MH, use a likelihood function to determine whether to accept or reject a move, and introduce stochasticity so as to avoid accepting all the steps that seem to be getting us closer to the goal.

In our next example, which involves the implementation of a regression model, the ice-cream represents the value of the parameters we are trying to infer. The details are provided below.

2. Problem Setup

Let Y and X be a response and a predictor for the model, respectively. A linear model states that the conditional distribution of the response given such predictor is normal. That is:

Y | X ~ Normal(aX + b, σ²)

for suitable parameters a (the slope), b (the bias), and σ (the intensity of the noise).

Our task is inferring a, b, and σ.

“Well, we need to have prior knowledge of how these are distributed though”, you would say. Well, yes and no. You need to set some “ground rules” which the model needs to follow. In our case, a and b can be pretty much anything, positive or negative, but σ must be strictly positive (I have never heard of a normal distribution with a negative standard deviation, have you?). Other than that, no other rules are necessary. The asymptotic normality of the posterior distribution guarantees that, given enough data, the posterior samples will be approximately normally distributed almost regardless of the prior (provided the Markov chain attains its stationary distribution).

In this case, let’s generate synthetic data for the regression, using parameters a=3, b=20 and σ=5. This can be done in Python with the following code:
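A minimal sketch of the data generation (the variable names and the number of points are my own choices; the original code is not shown here):

```python
import numpy as np

rng = np.random.default_rng(42)

# True parameters we will later try to recover
a_true, b_true, sigma_true = 3.0, 20.0, 5.0

n = 100
x = np.linspace(0, 20, n)                                # predictor
y = a_true * x + b_true + rng.normal(0, sigma_true, n)   # noisy response
```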

The synthetic data can be seen in the image below:

Figure 1: The synthetic data which will be used to train the Linear model

The next section will define the functions for the Metropolis-Hastings algorithm and loop over a set number of iterations.

3. The Algorithm

Let θ=[a,b,σ] be the vector of parameters at the previous step of the algorithm and θ’ be a proposal for a new set of parameters. MH compares the unnormalized posterior (the product of likelihood and prior) of the two competing hypotheses for the parameters (θ’ and θ) and scales this ratio by the proposal ratio g(θ|θ’)/g(θ’|θ). The resulting acceptance ratio is then compared with the value of a uniformly distributed random variable. This adds stochasticity to the model and makes it possible for unlikely parameter vectors to be explored and for likely ones to be (rarely) discarded.

Now, this sounds a little convoluted, doesn’t it? Let’s start by building the ideas step by step.

3.1 The Proposal Distribution

First, we define a proposal distribution g(θ’|θ): this is a distribution over new parameter values, conditioned on the parameters at the previous step. In our example, a and b can be both positive and negative, so a natural choice is to sample them from a multivariate normal distribution centered at the values from the previous iteration. As for σ, we can, for example, sample it from a Gamma distribution. It is really up to the user to define these distributions. A better approach (which we will not cover here) would involve sampling σ from an inverse Gamma distribution instead.

As a result, we can define a proposal distribution in the following way:

Eq 1: Proposal Distribution for a and b

And:

Eq 2: Proposal Distribution for σ

The parameter k is used to control the “spread” of the distributions around their means. ω is an extra adjustment for the Gamma. The larger kω, the bigger the spread of the Gamma (we are using the shape-rate formulation of the Gamma distribution; be careful, as SciPy forces us to use the shape-scale formulation instead). The function is below:
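A sketch of such a proposal function. This is one plausible reading of the setup above, not the author's exact code: the argument names prec_theta and search_width match those referenced in the text, search_width plays the role of k, and the Gamma is parametrized so that its mean is the previous σ and its variance grows with kω:

```python
import numpy as np
from scipy import stats

def proposal(prec_theta, search_width, omega=1.0):
    """Draw a proposal theta' = [a', b', sigma'] given the previous theta."""
    a, b, sigma = prec_theta
    # a' and b': multivariate normal centered at the previous values,
    # with a diagonal covariance controlled by search_width (the k in the text)
    a_new, b_new = stats.multivariate_normal.rvs(
        mean=[a, b], cov=search_width * np.eye(2)
    )
    # sigma': Gamma with mean sigma and variance sigma * k * omega.
    # SciPy uses the shape-scale formulation: shape = sigma / (k*omega),
    # scale = k*omega, so that mean = shape * scale = sigma.
    k_omega = search_width * omega
    sigma_new = stats.gamma.rvs(a=sigma / k_omega, scale=k_omega)
    return np.array([a_new, b_new, sigma_new])
```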

You can notice how the function proposal takes two arguments: prec_theta represents the parameter vector at the previous step and search_width controls the region where we will be looking for proposals. Looking for a good set of parameters introduces a trade-off between exploration (the search for new parameter sets in unexplored areas) and exploitation (refining the search closer to the areas where good sets of parameters have already been found). As a result, search_width should be handled with extreme care. A large value of search_width will prevent the MCMC from converging to a stationary distribution. A small value may prevent the algorithm from finding the optimum (or optima) in a reasonable amount of time (more samples will need to be drawn and a longer burn-in period should be expected).

3.2 The Likelihood Function

The likelihood function is derived from the choice of linear model: the conditional distribution of the response given the parameters is normal. In other words, we will be calculating the likelihood for a normal distribution whose mean is a·x + b and whose standard deviation is σ. In particular, we will use the log-likelihood instead of the raw likelihood, in order to improve numerical stability. As a result, products simply turn into sums:
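A sketch of the log-likelihood under these assumptions (function and argument names are my own):

```python
import numpy as np
from scipy import stats

def log_likelihood(theta, x, y):
    """Log-likelihood of the data under y ~ Normal(a*x + b, sigma^2)."""
    a, b, sigma = theta
    if sigma <= 0:
        return -np.inf  # a non-positive noise level is impossible
    # Sum of pointwise log-densities: products of likelihoods become sums of logs
    return stats.norm.logpdf(y, loc=a * x + b, scale=sigma).sum()
```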

3.3 The Prior

We do not need to spend a lot of words on this. We can choose pretty much anything we like in terms of distribution for the prior. All we need to do is make sure that the zones where the parameters are likely to be found have non-zero prior density, and that the noise parameter σ is strictly positive. Again, the prior is expressed in terms of a log-pdf.
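One possible choice, in the spirit described above (the specific distributions and hyperparameters here are my own illustrative picks, not necessarily the author's): vague normals for a and b, and a Gamma for σ to enforce positivity.

```python
import numpy as np
from scipy import stats

def log_prior(theta):
    """Log prior density: vague normals for a and b, a Gamma for sigma."""
    a, b, sigma = theta
    if sigma <= 0:
        return -np.inf  # sigma must be strictly positive
    return (
        stats.norm.logpdf(a, loc=0, scale=100)          # weakly informative
        + stats.norm.logpdf(b, loc=0, scale=100)
        + stats.gamma.logpdf(sigma, a=1.0, scale=10.0)  # mild prior on the noise
    )
```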

3.4 The Proposal Ratio

The proposal ratio is the density of proposing the old parameter vector given the new one, divided by the density of proposing the new parameter vector given the old one. That is, from section 3.1, g(θ|θ’)/g(θ’|θ). Again, we will employ the log-pdf so as to have uniformity of scales across probabilities and to achieve better numerical stability.
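A sketch of the log proposal ratio, assuming the proposal described in section 3.1 (names are my own). Note that the normal part of the proposal is symmetric, so it cancels in the ratio; only the asymmetric Gamma part for σ contributes:

```python
import numpy as np
from scipy import stats

def log_proposal_ratio(theta_new, theta_old, search_width, omega=1.0):
    """Compute log g(theta_old | theta_new) - log g(theta_new | theta_old)."""
    sigma_old, sigma_new = theta_old[2], theta_new[2]
    k_omega = search_width * omega
    # Gamma density of the forward move (old sigma -> new sigma)...
    forward = stats.gamma.logpdf(sigma_new, a=sigma_old / k_omega, scale=k_omega)
    # ...and of the reverse move (new sigma -> old sigma)
    backward = stats.gamma.logpdf(sigma_old, a=sigma_new / k_omega, scale=k_omega)
    return backward - forward
```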

3.5 The Final Touch

We are almost there. All we need to do now is put it all together. In pseudocode:
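One way to write the loop (my own rendering of the standard MH recipe, with the functions named as in the previous subsections):

```
initialize theta = [a0, b0, sigma0]
for i = 1 .. N:
    theta' = proposal(theta, search_width)
    log_alpha = log_likelihood(theta') + log_prior(theta')
              - log_likelihood(theta)  - log_prior(theta)
              + log g(theta | theta')  - log g(theta' | theta)
    draw u ~ Uniform(0, 1)
    if log(u) < log_alpha:
        theta = theta'          # accept the proposal
    store theta                 # otherwise keep the old value
```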

And, in Python:
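A self-contained sketch of the full loop under the assumptions made so far. To keep it short I use a flat (improper) prior on a and b with only the σ>0 constraint, and a mean-preserving Gamma step for σ; all names, widths, and iteration counts are my own choices:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic data (true parameters a=3, b=20, sigma=5)
x = np.linspace(0, 20, 100)
y = 3.0 * x + 20.0 + rng.normal(0, 5.0, x.size)

def log_post(theta):
    """Unnormalized log-posterior (flat prior on a, b; sigma > 0)."""
    a, b, sigma = theta
    if sigma <= 0:
        return -np.inf
    return stats.norm.logpdf(y, loc=a * x + b, scale=sigma).sum()

def propose(theta, width):
    """Normal random walk for (a, b); mean-preserving Gamma step for sigma."""
    a, b, sigma = theta
    a_new, b_new = rng.normal([a, b], np.sqrt(width))
    sigma_new = rng.gamma(shape=sigma / width, scale=width)
    return np.array([a_new, b_new, sigma_new])

def log_g(sig_to, sig_from, width):
    """Log-density of the (asymmetric) Gamma part of the proposal."""
    return stats.gamma.logpdf(sig_to, a=sig_from / width, scale=width)

N, width = 20_000, 0.1
theta = np.array([0.0, 0.0, 1.0])   # arbitrary starting point
lp = log_post(theta)                # cache the current log-posterior
samples = np.empty((N, 3))
for i in range(N):
    theta_new = propose(theta, width)
    lp_new = log_post(theta_new)
    log_alpha = (lp_new - lp
                 + log_g(theta[2], theta_new[2], width)    # g(theta | theta')
                 - log_g(theta_new[2], theta[2], width))   # g(theta' | theta)
    if np.log(rng.uniform()) < log_alpha:
        theta, lp = theta_new, lp_new   # accept the move
    samples[i] = theta                  # store (old theta if rejected)

burned = samples[N // 2:]               # discard the first half as burn-in
print(burned.mean(axis=0))              # roughly [3, 20, 5] if the chain converged
```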

Awesome! We have made it! Mind that the code above will take some time to run. You can try to reduce N to draw fewer samples, or increase the width in order to speed up the exploration (even though you will exploit less and your search may fail to converge).

4. Results

The results are included in the picture below.

Figure 2: a(red), b(blue) and σ(purple)
Figure 3: The histograms of the samples for the parameters.

As we can see, the history of the samples is pretty stable. The mean and standard deviation for the samples’ histories are (after discarding the first half of the history as burn-in):

Table 1: mean and standard deviation for the parameters.

Well, there is one problem, which I will only mention here. The samples are highly correlated and, as a result, extra care might be needed when estimating credible intervals. One solution consists in thinning the histories, keeping only a small fraction of the samples (for example, keeping one in every 10 adjacent samples and discarding the rest).
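Assuming the chain history is stored in an (N, 3) array called samples (the stand-in array below replaces a real MH history), burn-in removal and thinning might look like:

```python
import numpy as np

# Stand-in for a real (N, 3) MH history of [a, b, sigma] samples
samples = np.random.default_rng(1).normal(size=(20_000, 3))

burned = samples[samples.shape[0] // 2:]  # drop the first half as burn-in
thinned = burned[::10]                    # keep 1 sample in 10 to cut autocorrelation
print(thinned.shape)                      # (1000, 3)
```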

5. How do these results compare with traditional Linear Regression?

We can perform the exact same calculation using statsmodels.api.OLS:

With the following result:

Table 2: statsmodels.api.OLS output

The means for the coefficients a and b are very close to the MCMC estimates found in Table 1. The same can be said for the standard errors, further reinforcing our ideas about the validity of our MCMC implementation.

Mind that this is not the best solution, but just a solution. Better choices of prior and proposals do exist and are recommended.

6. What do the iterations look like?

I have generated a GIF of the MH iterations. Unfortunately, in 3D the visualization is rather messy, so I decided to focus only on the slope a and on the bias b.

In the visualization:

  • the red dot represents the current proposal
  • the gray area around the red dot represents the proposal distribution (normal) at 3 standard deviations from the mean. The proposal has a diagonal covariance matrix, which is why we obtain a circle instead of an ellipse.
  • The electric blue lines represent the rejected moves.
  • The red lines represent the accepted moves
  • The sea blue dot represents the average value for the parameters obtained from statsmodels.api.OLS.

Medium does not allow uploading large GIFs, so I had to compress the amount of information and show the behavior of MH only once every 10 steps.

I hope you enjoyed the article! Cheers!

Please follow me if you liked this article 🙏
