Mathematics and Machine Learning

Gentle Introduction to Bayesian Linear Regression

The mathematical background of Bayesian linear regression with a Gaussian prior, and a practical implementation

Yuki Shizuya
Intuition


Bayesian linear regression visualization adapted from Bishop, Pattern Recognition and Machine Learning

When we apply linear regression in the frequentist view, we try to find the single most plausible set of model parameters. However, since real-world data is very complex, fixed parameters are sometimes not enough to explain it. What if we treat the model parameters as more flexible? Doesn't that sound good? We can implement this idea by applying Bayes' theorem to linear regression [1]. In this blog, I will introduce the mathematical background of Bayesian linear regression with visualizations and Python code.

Table of Contents

  1. Overview of Bayesian linear regression
  2. Mathematical background of Bayesian linear regression
  3. PyMC implementation using real-world dataset

1. Overview of Bayesian linear regression

Bayesian linear regression is linear regression reformulated through Bayes' theorem. The main difference between Bayesian and frequentist linear regression is that the former infers a distribution over the parameters, while the latter derives fixed parameters. Let's look at a visualization to understand this intuitively. We assume the simple linear regression setting shown below.

Simple linear regression setting

We can derive the parameters in either the frequentist or the Bayesian way.

Visualization to derive linear regression parameters in Frequentist and Bayesian ways

As you can see, frequentist linear regression has fixed parameters, so we can get an exact solution based on the input data. On the other hand, in Bayesian linear regression, we define a probability distribution for each parameter. In the example case, we use a Gaussian distribution for all parameters and predicted values. Bayesian linear regression thus considers various plausible assumptions about how the input data is generated. The advantages of Bayesian linear regression, and of Bayesian analysis in general, are as follows.

  • Just like ensembles, we can infer predictions by averaging over many sampled parameters (see the sketch after this list).
  • We can quantify the confidence of our predictions.
  • We can identify the regions where the model is uncertain and collect additional data there.
  • We can make more robust estimations, especially with smaller datasets.
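
To make the first two points concrete, here is a minimal NumPy sketch of the ensemble-like averaging; the parameter samples and their means below are purely hypothetical stand-ins for a real posterior.

import numpy as np

# Hypothetical posterior samples for an intercept and two slopes (stand-ins for a fitted model)
rng = np.random.default_rng(0)
beta0_samples = rng.normal(loc=0.5, scale=0.1, size=1000)
beta_samples = rng.normal(loc=[0.8, -0.3], scale=0.1, size=(1000, 2))

x_new = np.array([1.5, 2.0])                        # a new input with two features
y_samples = beta0_samples + beta_samples @ x_new    # one prediction per parameter sample

print(y_samples.mean())                             # averaged, ensemble-like prediction
print(np.percentile(y_samples, [2.5, 97.5]))        # a 95% interval expressing the confidence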

In contrast, the common disadvantages are as follows.

  • It tends to need a lot of computation.
  • If we are not careful about selecting the probability distributions (prior and likelihood), the conclusions may be wrong.

When we have a lot of data, we can use other methods like deep learning. However, in some industries, such as manufacturing and health care, we usually don't have enough data to use such techniques. In those cases, Bayesian analysis becomes one good answer.

Now that we understand the usefulness of Bayesian linear regression, let's dive into its mathematical background in the next section.

2. Mathematical background of Bayesian linear regression

Before we discuss Bayesian linear regression, let's briefly recap frequentist linear regression. The formula for frequentist (multiple) linear regression is shown below.

Multiple linear regression formula

We can derive the parameters by minimizing the sum of squared errors:

Parameters derived by least squares

Note that we assume XᵀX is a full-rank matrix, which means it has an inverse. The least-squares method is a very intuitive approach that minimizes the difference between the given data and the predicted values. The last equation gives the derived parameters for frequentist linear regression: a single fixed parameter vector. If you want more details, you can check my previous blog [2].
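
As a quick sanity check, the closed-form least-squares solution can be computed directly with NumPy; below is a minimal sketch on synthetic data (the variable names are my own).

import numpy as np

# Synthetic data: y = X @ true_beta + noise
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_beta = np.array([2.0, -1.0, 0.5])
y = X @ true_beta + rng.normal(scale=0.1, size=100)

# Solve (X^T X) beta = X^T y instead of forming the inverse explicitly
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)  # close to [2.0, -1.0, 0.5]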

Based on this, how can we interpret linear regression in the Bayesian view? Here is Bayes' theorem.

Bayes theorem

When we apply Bayes' theorem to linear regression, the formula becomes:

Bayes formula applied to linear regression

The likelihood refers to the probability density of the residuals given the parameters, the prior refers to the probability density of the parameters, and the posterior refers to the probability density of the parameters given the residuals. To calculate this formula, we need to define in advance the probability distribution that each term follows. Although we can define any probability distribution for each term, there is no guarantee that the posterior will have a convenient closed form. So, we generally assume a Gaussian distribution for each term. We will see why this is convenient later in this section.
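
To keep the terms straight, here is the same relation written in symbols (my notation, with β the parameter vector and (X, y) the data):

p(\beta \mid y, X) = \frac{p(y \mid X, \beta)\, p(\beta)}{p(y \mid X)} \propto p(y \mid X, \beta)\, p(\beta)

That is, posterior ∝ likelihood × prior, since the denominator does not depend on β.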

Likelihood and prior of Bayesian linear regression

For the prior distribution, we assume a Gaussian distribution as a weakly informative prior. For the likelihood, we want the residuals to be small, so we assume they follow a Gaussian distribution with mean 0 and standard deviation σ. Now, let's derive the posterior distribution based on these assumptions. We take the logarithm of all terms to make the computation easier.
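
Before doing so, it helps to write the assumptions out in symbols (again my notation, which may differ slightly from the figures):

\text{prior: } \beta \sim \mathcal{N}(0, \sigma_\beta^2 I), \qquad \text{likelihood: } y \mid X, \beta \sim \mathcal{N}(X\beta, \sigma^2 I)

so each residual y_i - x_i^\top \beta is Gaussian with mean 0 and standard deviation σ.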

The derivation of posterior distribution

The last equation is a multivariate Gaussian distribution. The normalization term is included in the constant term.

The notation of Posterior distribution

Based on this derivation, a Gaussian prior leads to a Gaussian posterior, so the Gaussian distribution is the conjugate prior for linear regression. Conjugate priors are very convenient because they give a closed-form posterior. You can check more details in this post [3]. When we get more data, we use the posterior as the new prior and update the distribution. When we want to make an inference on new data, we just sample the parameters and substitute them into the linear regression equation.
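
For the Gaussian case, the closed-form update can be computed directly. Below is a minimal NumPy sketch assuming a zero-mean prior β ~ N(0, σ_β² I) and a known noise standard deviation σ (both are my assumptions, made for illustration).

import numpy as np

# Synthetic data for illustration
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=100)

sigma = 0.1        # assumed known noise standard deviation
sigma_beta = 10.0  # prior standard deviation (weakly informative)

# Conjugate Gaussian update: posterior precision, covariance, and mean
precision = np.eye(X.shape[1]) / sigma_beta**2 + (X.T @ X) / sigma**2
post_cov = np.linalg.inv(precision)
post_mean = post_cov @ (X.T @ y) / sigma**2

print(post_mean)  # close to the least-squares solution because the prior is broad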

In this case, since we assume a Gaussian prior, we can calculate the posterior distribution analytically. However, in general we cannot compute the posterior in closed form. In such cases, we need to use MCMC (Markov chain Monte Carlo) methods to sample from the posterior. There are already libraries that handle these general situations, so we should use them in practical use cases. One of them is the PyMC library.

So far, we have covered the mathematical background of Bayesian linear regression. In the next section, we will learn how to apply PyMC to Bayesian linear regression.

3. PyMC implementation using real-world dataset

In this section, we will use PyMC to implement Bayesian linear regression. Following this guide [4], you can install PyMC via Conda. I ran the code below on a Mac M1 Pro (Ventura 13.4).

conda create -c conda-forge -n pymc_env "pymc>=5"
conda activate pymc_env

We will use the student performance dataset [5]. Here are the first five rows of this dataset.

The head lines of the student performance dataset

This dataset contains five independent variables ("Hours Studied", "Previous Scores", "Extracurricular Activities", "Sleep Hours", and "Sample Question Papers Practiced") and one dependent variable ("Performance Index"). The graph below shows the scatter matrix.
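
A minimal pandas sketch to load the data and draw the scatter matrix is shown below; the CSV file name is a placeholder for wherever you saved the Kaggle download.

import pandas as pd

# Load the student performance data (file name is an assumption)
df = pd.read_csv("Student_Performance.csv")

# If "Extracurricular Activities" is stored as Yes/No, encode it as 0/1
df["Extracurricular Activities"] = (df["Extracurricular Activities"] == "Yes").astype(int)

pd.plotting.scatter_matrix(df, figsize=(10, 10))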

Scatter matrix plot of the student performance dataset

The performance index follows a Gaussian-like distribution, so a Bayesian linear regression model with Gaussian assumptions is a reasonable choice. We therefore assume that the parameters and the dependent variable follow Gaussian distributions.

The setting for the student performance dataset

Note that in the previous section we assumed the residuals follow a Gaussian distribution with mean 0. In this case, for convenience, we assume the dependent variable follows a Gaussian distribution whose mean is the predicted value from the linear regression. It looks different but is equivalent (you can check this by writing out both formulations).
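
Concretely, the two formulations are (in my notation):

y_i = \beta_0 + x_i^\top \beta + \varepsilon_i,\quad \varepsilon_i \sim \mathcal{N}(0, \sigma^2) \quad\Longleftrightarrow\quad y_i \sim \mathcal{N}(\beta_0 + x_i^\top \beta,\; \sigma^2)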

When you use PyMC, you can implement Bayesian linear regression with only a few lines of code.

import pymc as pm

# X_train, y_train, and columns (the list of predictor names) are defined beforehand
test_score_bayesian = pm.Model(coords={"predictors": columns})

with test_score_bayesian:
    # prior on the observation noise (residual standard deviation)
    sigma = pm.HalfNormal("sigma", 25)

    # priors on the coefficients and the intercept
    beta = pm.Normal("beta", 0, 10, dims="predictors")
    beta0 = pm.Normal("beta0", 0, 10)

    # linear predictor
    mu = beta0 + pm.math.dot(X_train, beta)

    # likelihood: observed scores centered on the linear predictor
    y_hat = pm.Normal("y_hat", mu=mu, sigma=sigma, observed=y_train)

Isn't it easy? You can also draw samples from the model you implemented and plot the distributions of the variables.
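
For example, a minimal sketch of sampling and plotting with PyMC and ArviZ might look like the following (the sampler arguments here are just reasonable defaults, not necessarily the exact settings used for the figures below).

import arviz as az

# Draw posterior samples from the model defined above, then summarize and plot them
with test_score_bayesian:
    idata = pm.sample(2000, tune=1000, random_seed=42)

az.plot_trace(idata, var_names=["beta0", "beta", "sigma"])
print(az.summary(idata, var_names=["beta0", "beta", "sigma"]))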

The distribution of each parameter

Here are the coefficients obtained from the Bayesian linear regression model and from the frequentist one.

Coefficients obtained from Bayesian linear regression
Coefficients obtained from the frequentist linear model using statsmodels

As you can see, we obtain almost the same results. The reason is that we use a broad (weakly informative) Gaussian prior, so the posterior means stay close to the least-squares estimates of the frequentist setting. When you have additional prior information, you can freely change the prior distribution and get a more plausible result.
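
For reference, the frequentist coefficients in the second table can be reproduced with statsmodels roughly as follows (a sketch assuming the same X_train and y_train arrays as before).

import statsmodels.api as sm

# Ordinary least squares with an intercept, for comparison with the posterior means
ols_result = sm.OLS(y_train, sm.add_constant(X_train)).fit()
print(ols_result.params)
print(ols_result.summary())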

Finally, here is the link to a gist so you can reproduce the results.

Bayesian linear regression Notebook

In this blog, we have covered the mathematical background and a practical implementation of Bayesian linear regression. With PyMC, we can quickly implement our own Bayesian linear regression model. If I missed something, please let me know. Thank you for taking the time to read this blog!

References

[1] CSC Lecture 19: Bayesian Linear Regression, University of Toronto

[2] Shizuya, Y., Gentle introduction of Multiple linear regression, Intuition

[3] Ms Aerin, Conjugate prior explained, Towards Data Science

[4] https://www.pymc.io/projects/docs/en/stable/installation.html

[5] Student Performance dataset, Kaggle
