Linear Regression

From A Bayesian View

Rina Buoy
The Startup
7 min read · Jan 4, 2020


Kevin Murphy, the author of ‘Machine Learning: A Probabilistic Perspective’, refers to linear regression as a ‘workhorse’ of statistics and supervised machine learning. In its original form, it models linear relationships. When augmented with kernels or basis functions, it can capture non-linear relationships. It can also be used as a classifier by replacing the Gaussian output with a Bernoulli or multinoulli distribution. Before proceeding, let’s review the notation used here.

y : a scalar output

y: an output vector

x : a feature vector

X: a feature matrix

1. General Form

The general form of linear regression is, compactly, given by:
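
y = w^T x = w_0 + w_1 x_1 + \dots + w_D x_D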

w is the weight vector, the first element of which is the intercept (w0).

x is the augmented feature vector, the first element of which is 1.

To model a non-linear relationship, we can replace the feature vector with a non-linear basis function:
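
y = w^T \phi(x)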

φ(x) is a basis function.

Polynomial regression of arbitrary degree d is an example of linear regression with a non-linear basis function, which has the following form:
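
\phi(x) = [1, x, x^2, \dots, x^d]^T, \quad y = w_0 + w_1 x + w_2 x^2 + \dots + w_d x^d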

2. Ordinary Least Squares (OLS)

The goal is to find the best weight vector which can explain the outputs. We can formulate the cost function as the residual sum of squares (RSS) between the predicted and the actual outputs:
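
RSS(w) = \sum_{i=1}^{N} (y_i - w^T x_i)^2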

N is the number of data points.

If we divide RSS(w) by N, we obtain the well-known mean squared error (MSE):
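
MSE(w) = \frac{1}{N} RSS(w) = \frac{1}{N} \sum_{i=1}^{N} (y_i - w^T x_i)^2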

By setting the derivative of RSS(w) with respect to w to zero, we obtain the analytical expression of weight vector (w) below:
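
\hat{w}_{OLS} = (X^T X)^{-1} X^T y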

X is the feature matrix (NxD). N is the number of data points while D is the number of feature dimensions.

y is the observed output vector (N).

The above solution is known as the ‘ordinary least squares’ (OLS) solution.

3. Geometric Interpretation

From the geometric point of view, linear regression is nothing but an orthogonal projection. Suppose we have N = 3 data points in D = 2 dimensions.

All vectors have been converted to unit norm.

Geometrically, x1 and x2 are vectors in 3D space, and together they span a 2D plane. y is also a vector in 3D space. y_hat is the orthogonal projection of y onto the 2D plane spanned by x1 and x2.
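
In equation form, the fitted output is this projection of y onto the column space of X:

\hat{y} = X \hat{w}_{OLS} = X (X^T X)^{-1} X^T y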

4. Probabilistic Linear Regression

From the above figure, the OLS line does not fit all the points exactly. At each point, there is a residual error (ε) between our linear prediction and the true observed output. We can rewrite the general form of linear regression to include the residual error as follows:
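
y = w^T x + \epsilon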

Assume that the residual error follows a Gaussian distribution with zero mean and standard deviation σ. The output y is then Gaussian with standard deviation σ and mean μ given by μ = w^T x.

Therefore, we can write a probabilistic linear regression as follows:
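
p(y | x, \theta) = N(y \,|\, w^T x, \sigma^2)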

θ denotes the parameters, which in this case include w and σ. N stands for the Gaussian (normal) distribution.

Similarly, we can write the following expression for the case of a non-linear basis function:
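
p(y | x, \theta) = N(y \,|\, w^T \phi(x), \sigma^2)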

4.1 Maximum Likelihood Estimate (MLE)

A common way to estimate the parameters of a statistical model is to compute the MLE, which is defined as:
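
\hat{\theta} = \arg\max_{\theta} \log p(D | \theta) = \arg\max_{\theta} \sum_{i=1}^{N} \log p(y_i | x_i, \theta)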

We use log probability for mathematical convenience.

Instead of maximizing the log-likelihood, we can equivalently minimize the negative log-likelihood (NLL).

By inserting the definition of Gaussian distribution into the log-likelihood function:
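
\ell(\theta) = \sum_{i=1}^{N} \log \left[ \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(y_i - w^T x_i)^2}{2\sigma^2} \right) \right] = -\frac{1}{2\sigma^2} RSS(w) - \frac{N}{2} \log(2\pi\sigma^2)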

By assuming that σ is known, maximizing log-likelihood is equivalent to minimizing RSS(w). Therefore, MLE is equivalent to the ordinary least squares (OLS) solution.

However, by working with the probabilistic form, we can capture the uncertainty associated with the parameter estimates, handle outliers in the observed outputs, and regularize the model. All of these can be achieved by Bayesian learning.

4.2 Bayesian Learning

Bayesian learning for parameters is succinctly given below:
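
p(\theta | D) = \frac{p(D | \theta)\, p(\theta)}{p(D)}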


In the context of linear regression, the parameter (θ) is the weight vector, w. D is the observed data {(x1, y1), (x2, y2), …, (xN, yN)}.

p(D|θ) is p(y|X,θ).

p(θ) is the prior over θ .

p(θ|D) is the posterior over θ after observing the data (D).

p(D) is the marginal likelihood, which is defined as:
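
p(D) = \int p(D | \theta)\, p(\theta)\, d\theta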

The analytical expression for the integral in p(D) exists only for limited forms of likelihood functions and priors. Oftentimes, evaluating the integral is intractable, and we have to resort to approximate methods such as MCMC sampling or variational inference.

For a Gaussian likelihood and a conjugate Gaussian prior, the posterior is also Gaussian. In the context of linear regression, the posterior distribution can be written analytically, as below. For detailed derivations, the reader is referred to here.
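
With a Gaussian prior p(w) = N(w | w_0, V_0), where w_0 and V_0 are the prior mean and covariance, the posterior is Gaussian with mean w_N and covariance V_N:

p(w | X, y, \sigma^2) = N(w | w_N, V_N)

w_N = V_N V_0^{-1} w_0 + \frac{1}{\sigma^2} V_N X^T y

V_N = \sigma^2 (\sigma^2 V_0^{-1} + X^T X)^{-1}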


Consider the special case of a zero-mean isotropic Gaussian prior, w_0 = 0 and V_0 = τ²I.

We can derive the posterior mean as follows:
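
w_N = (X^T X + \lambda I)^{-1} X^T y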


where λ = σ²/τ² is the ratio of the noise variance to the prior variance.

The posterior mean, w_N, is a regularized weight vector. Therefore, by placing a Gaussian prior over the weights, we recover ridge regression, which is regularized linear regression. On top of the posterior mean, we also obtain the posterior variance from Bayesian learning.

Here are the posterior mean and variance of the weight vector of linear regression.
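
w_N = \frac{1}{\sigma^2} V_N X^T y, \quad V_N = \left( \frac{1}{\sigma^2} X^T X + \frac{1}{\tau^2} I \right)^{-1}

These are the expressions implemented in Section 5.3 below.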


4.3 Prediction

Unlike OLS or MLE, which give only a point prediction of y for a given feature vector x, the posterior mean and variance from Bayesian learning allow us to quantify the uncertainty associated with the prediction.

To obtain the prediction for a given feature vector x, Bayesians marginalize over the posterior of θ. For given training data D, the posterior predictive distribution is:
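
p(y | x, D) = \int N(y \,|\, w^T x, \sigma^2)\, N(w \,|\, w_N, V_N)\, dw = N(y \,|\, w_N^T x, \sigma^2 + x^T V_N x)

The predictive variance has two parts: the observation noise σ² and the uncertainty in the weights, x^T V_N x.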


5. Code Demo

5.1 1D Simulated Data

We generate a noisy 1D dataset.

The code to generate the noisy data is given below. The number of dimensions is set to 2 because the intercept (w0) is included in the weight vector (w). Similarly, we augment the feature matrix (X) with a column of ones.

import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

np.random.seed(0)
ndims = 2   # intercept + one feature
ndata = 30

# Feature matrix with a column of ones for the intercept
X = np.random.randn(ndata, 1)
one_array = np.ones_like(X)
X = np.hstack((one_array, X))

# True weights and noisy observations (noise std = 0.1)
w_ = np.random.randn(ndims)
sigma = 0.1
noise_ = sigma * np.random.randn(ndata)
y_obs = X.dot(w_) + noise_

fig, axes = plt.subplots(sharey=True, ncols=1, figsize=(15, 3))
axes.plot(X[:, 1], y_obs, 'o', label='noisy data')
axes.set_xlabel('x')
axes.set_ylabel('y')
axes.legend()
fig.suptitle('Simulated Data');

5.2 Ordinary Least Squares (OLS)

Estimating the weight vector using OLS is rather straightforward, as shown below.

# OLS solution: w_ols = (X^T X)^{-1} X^T y
w_ols = np.linalg.solve(X.transpose() @ X, np.dot(X.transpose(), y_obs))
axes.plot(X[:, 1], X.dot(w_ols), 'r', label='OLS Fit')
axes.legend()
plt.show()

5.3 Posterior Mean and Variance

Using the Gaussian likelihood with noise variance σ² (σ = 0.1, as used to simulate the data) and a zero-mean Gaussian prior over the weights with variance τ² = 1, we can use the above expressions to estimate the posterior mean and variance (and standard deviation). To visualize the joint posterior distribution of the weights, we draw 1000 samples and plot the joint distribution.

tau = 1.
# Posterior precision: X^T X / sigma^2 + I / tau^2
tmp = (X.transpose() @ X) / sigma**2 + np.eye(ndims) / tau**2
# Posterior mean and covariance of the weights
w_mean = np.linalg.solve(tmp, np.dot(X.transpose(), y_obs)) / sigma**2
w_var = np.linalg.inv(tmp)
# Draw 1000 samples from the posterior and plot their joint distribution
w_mvn = np.random.multivariate_normal(w_mean, w_var, 1000)
h = sns.jointplot(x=w_mvn[:, 0], y=w_mvn[:, 1], kind='kde');

The joint posterior distribution of the weight vector (w1, w0)

5.4 Posterior Prediction

With the posterior mean and variance, we are now ready to make predictions by marginalizing over the weights. Then, we plot the mean prediction along with a +/- 1 std uncertainty band.

# Posterior predictive mean and variance: w_N^T x and sigma^2 + x^T V_N x
ypost_mean = X.dot(w_mean)
ypost_var = sigma**2 + np.sum(X @ w_var * X, axis=1)

# Sort by x so the mean line and uncertainty band are drawn cleanly
order = np.argsort(X[:, 1])
axes.plot(X[:, 1], y_obs, 'o', label='noisy data')
axes.plot(X[order, 1], ypost_mean[order], 'r', label='posterior mean')
axes.fill_between(X[order, 1], (ypost_mean - np.sqrt(ypost_var))[order],
                  (ypost_mean + np.sqrt(ypost_var))[order],
                  color="orange", alpha=0.5, label='+/- std')
axes.set_xlabel('x')
axes.set_ylabel('y')
fig.suptitle('Posterior Mean/Standard Deviation')

Mean predictions along with a +/- 1 std uncertainty band

6. Wrap-up

We cover the ordinary least squares (OLS) solution, the geometric interpretation, and Bayesian learning for linear regression. We show that by choosing a Gaussian likelihood function and a Gaussian prior over the parameters, we can express the posterior over the parameters analytically. The posterior mean is equivalent to the regularized weight estimate of ridge regression.
