The Ultimate Guide to Linear Regression Theory

Luckeciano Melo
11 min read · Mar 5, 2019


Linear Regression is commonly the first machine learning technique that people interested in the area study. For those who already have some experience with ML, it can seem boring due to its simplicity (and, of course, its limitations). I held this opinion until I decided to revisit the math behind ML models to build a solid base of knowledge and faced the Linear Regression technique again. To my surprise, it has a solid and interesting mathematical formulation that we often do not see. In this post, I want to detail these ideas, and I hope it will help you understand the whole formulation instead of just applying it.

First of all, for the sake of conciseness, I will not describe the problem itself: I assume the reader already has a basic understanding of the technique and of its usual treatment as a squared-loss optimization problem solved by gradient descent. If you do not have such knowledge, I recommend this lecture (and the subsequent ones) from the Machine Learning MOOC by Andrew Ng. In this post, I will describe the mathematical formulation of Linear Regression from the perspective of Maximum Likelihood Estimation (MLE), Maximum a Posteriori (MAP) estimation, and Bayesian Linear Regression (which means we will not use gradient descent, but closed-form analytical solutions to the linear regression problem).

Problem Formulation

Given the supervised learning problem of fitting a dataset consisting of pairs (x, y), we consider a regression problem with the likelihood function

p(y ∣ x) = 𝒩(y ∣ f(x), σ²)

It means that we have the following functional relationship between inputs and targets:

y = f(x) + ε

where ε is a Gaussian random variable with zero mean and variance σ², modeling i.i.d. measurement noise. The objective is to find a predictor function that is close to this true data-generating function.

In the Linear Regression formulation, as a parametric model, we consider that this function is a linear combination of the parameters and the data vector:

f(x) = xᵀθ

It is important to mention that we treat the bias parameter as an element of the parameter vector θ (and, hence, we concatenate a “1” to the end of the data vector). This formulation simplifies the math.

Therefore, our linear regression model is given by:

p(y ∣ x, θ) = 𝒩(y ∣ xᵀθ, σ²)

where this probability density function is called the likelihood, and the whole uncertainty of the model is due to the observation noise ε.
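
To make this formulation concrete, here is a minimal NumPy sketch of how data is generated under this model. The true parameters (slope -4, bias 7) and the noise scale 2.0 mirror the test case used later in the post; everything else (sample size, input range) is an arbitrary illustrative choice.

import numpy as np

rng = np.random.default_rng(42)

theta_true = np.array([-4.0, 7.0])   # [slope, bias]; the bias is absorbed into theta
sigma = 2.0                          # standard deviation of the observation noise epsilon

x = rng.uniform(-5.0, 5.0, size=100)
X = np.column_stack([x, np.ones_like(x)])   # concatenate a "1" to each data vector

y = X @ theta_true + rng.normal(0.0, sigma, size=x.shape)   # y = x^T theta + epsilon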

Based on this formulation, it is clear that our mission is to find the best parameter vector θ, which spans the best predictor function. We will derive the linear regression equations from the viewpoint of parameter estimation and then using Bayesian statistics.

Parameter Estimation

Considering our training set 𝔻 as composed of inputs 𝕏 and targets 𝕐, the likelihood of the training set can be mathematically described as:

p(𝕐 ∣ 𝕏, θ)

By formulation, our data is i.i.d. In a probabilistic view, this means that we can factorize the expression above as a product over the individual data samples:

p(𝕐 ∣ 𝕏, θ) = ∏ᵢ p(yᵢ ∣ xᵢ, θ) = ∏ᵢ 𝒩(yᵢ ∣ xᵢᵀθ, σ²)

From the parameter estimation viewpoint, the training process aims to estimate a single point value for each parameter using the knowledge contained in our training set. In this post, we will describe parameter estimation with two techniques: Maximum Likelihood Estimation (MLE) and Maximum a Posteriori (MAP) estimation.

Maximum Likelihood Estimation

In MLE, we find the parameters θₘₗₑ by maximizing the likelihood:

θₘₗₑ = argmax_θ p(𝕐 ∣ 𝕏, θ)

Intuitively, maximizing the likelihood means that we vary the parameter vector so as to “enlarge” the probability that the predictor function outputs the observation y for each corresponding input vector x, which is exactly what we want.

Instead of using the product form of the likelihood, we apply a log transformation, for two good reasons: it simplifies the derivative calculations and it avoids underflow caused by multiplying many small decimal values. Since the log function is strictly monotonically increasing, it does not change the critical points of the optimization. Additionally, we are more familiar with minimizing a cost function, so let us convert the problem into a minimization as well. Therefore:

θₘₗₑ = argmax_θ log p(𝕐 ∣ 𝕏, θ) = argmin_θ ( − log p(𝕐 ∣ 𝕏, θ) )

This is the all-important idea of “minimizing the negative log-likelihood” in machine learning problems! More than that: if we use our formulation of the likelihood as a Gaussian, we can derive:

L(θ) := − log p(𝕐 ∣ 𝕏, θ) = (1 / 2σ²) Σᵢ (yᵢ − xᵢᵀθ)² + const = (1 / 2σ²) ‖y − Xθ‖² + const

where X is the matrix that stacks the input vectors xᵢᵀ as rows and y is the vector of targets.

Yes! In the context of linear regression, if we formulate our likelihood as a Gaussian distribution, maximizing it is equivalent to minimizing the squared error! Very nice, isn’t it? This is also the reason why we use the MSE loss in gradient descent.
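
As a quick numerical illustration (a minimal NumPy sketch with synthetic data and an arbitrary known σ, not code from the repository), the Gaussian negative log-likelihood is just a rescaled and shifted MSE, so both functions share the same minimizer:

import numpy as np

rng = np.random.default_rng(0)
sigma = 2.0
x = rng.uniform(-5.0, 5.0, size=200)
X = np.column_stack([x, np.ones_like(x)])                  # bias absorbed into theta
y = X @ np.array([-4.0, 7.0]) + rng.normal(0.0, sigma, size=200)
n = len(y)

def neg_log_likelihood(theta):
    resid = y - X @ theta
    return 0.5 * n * np.log(2 * np.pi * sigma**2) + 0.5 * np.sum(resid**2) / sigma**2

def mse(theta):
    return np.mean((y - X @ theta) ** 2)

theta = np.array([-3.5, 6.0])                              # any candidate parameter vector
# NLL(theta) = n / (2 sigma^2) * MSE(theta) + a constant that does not depend on theta
lhs = neg_log_likelihood(theta)
rhs = n / (2 * sigma**2) * mse(theta) + 0.5 * n * np.log(2 * np.pi * sigma**2)
print(np.isclose(lhs, rhs))                                # True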

We can find an analytic solution for this problem. This loss function is quadratic in θ, so we can conclude that there is only one global optimum. Therefore, we can find the global optimum by computing the gradient and setting it to zero:

∇L(θ) = (1/σ²) (XᵀXθ − Xᵀy) = 0   ⇒   θₘₗₑ = (XᵀX)⁻¹Xᵀy

It is worth mentioning that a zero gradient is a necessary and sufficient condition for this critical point to be a global minimum, by the Hessian analysis. In fact, we have:

∇²L(θ) = (1/σ²) XᵀX

As long as X is a full-rank matrix (its columns are linearly independent), the Hessian above is a positive-definite matrix.

Before we dive into the code, it is important to mention that “linear” regression means that the model is linear in the parameters, not necessarily in the inputs. Hence, a polynomial regression can also be considered a linear regression, where:

ϕ(x) = [1, x, x², …, xᴷ]ᵀ

and:

f(x) = ϕ(x)ᵀθ
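
In code, such polynomial features can be built with a Vandermonde matrix and then plugged into exactly the same closed-form machinery (a tiny NumPy sketch; the degree K = 3 is an arbitrary choice):

import numpy as np

x = np.linspace(-1.0, 1.0, 50)
K = 3
Phi = np.vander(x, K + 1, increasing=True)   # columns: 1, x, x^2, ..., x^K
print(Phi.shape)                             # (50, 4): the model is still linear in theta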

I implemented this analytic solution for Linear Regression in ml-room (give a star if you like it!):
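
The repository code is embedded in the original post; a minimal sketch of what such an analytic MLE solver can look like is shown below (class and method names are illustrative, not necessarily the ones used in ml-room):

import numpy as np

class LinearRegressionMLE:
    """Closed-form linear regression: theta_mle = (X^T X)^{-1} X^T y."""

    def fit(self, X, y):
        X = np.column_stack([X, np.ones(len(X))])        # append the bias column
        # lstsq solves the least-squares problem in a numerically stable way,
        # equivalent to the pseudo-inverse / normal-equations solution.
        self.theta, *_ = np.linalg.lstsq(X, y, rcond=None)
        return self

    def predict(self, X):
        X = np.column_stack([X, np.ones(len(X))])
        return X @ self.theta

# Usage on synthetic data of the same form as the post's test (y = -4x + 7 + noise*2.0):
rng = np.random.default_rng(0)
x = rng.uniform(-5.0, 5.0, size=1000).reshape(-1, 1)
y = -4.0 * x[:, 0] + 7.0 + rng.normal(0.0, 2.0, size=1000)

model = LinearRegressionMLE().fit(x, y)
print(model.theta)                            # close to [-4, 7]
print(np.mean((y - model.predict(x)) ** 2))   # close to the noise variance (4.0) on this data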

I also coded a gradient-descent linear regression to compare the results of the two approaches. You may see all the test cases in the main.py file.

For a simple case where we would like to fit a linear function, 1000 epochs of gradient descent give the following result:

Epoch: 999
Cost: 24.748796723940572
Gradients (W, b): 0.007573801056179228, -0.03123069273204225
Weights: [[-4.10382742]], [[7.29646987]]
train: 0.41993188858 seconds.
predict: 1.19209289551e-05 seconds.
Squared Mean Error: 24.748685809688233

On the other hand, for the MLE analytic solution:

Test Linear Regression MLE: 
train: 0.000187158584595 seconds.
predict: 2.40802764893e-05 seconds.
Squared Mean Error: 27.063798011530853

Well, in both cases the error is similar (the variation is probably due to the high noise). However, the analytic solution is much faster to train! Of course, we could tune the hyperparameters of the gradient descent optimization better, but it would not surpass the performance of the MLE solution during training.

Linear Regression MLE solution for y = -4*x + 7 + noise*2.0

We could also estimate the noise variance of the dataset analytically. We would just need to consider σ² as part of the model, differentiate with respect to it, and find the global maximum again…

Finally, there is a geometric view of Maximum Likelihood Estimation. If we look at the closed form of the MLE for Linear Regression, we have:

θₘₗₑ = (XᵀX)⁻¹Xᵀy

We are looking for an approximate solution of Xθ = y, and the reconstruction of the training targets is given by:

ŷ = Xθₘₗₑ = X(XᵀX)⁻¹Xᵀy

This formula corresponds to an orthogonal projection of y onto the subspace spanned by the columns of X (a one-dimensional subspace in the single-feature case). In fact,

X(XᵀX)⁻¹Xᵀ

is the projection matrix. Therefore, the MLE provides the geometrically optimal solution by finding, in the subspace spanned by X, the vector with the minimal squared distance to the corresponding targets: this is exactly the orthogonal projection. The figure below illustrates this idea.

MLE as orthogonal projection. Source: mml-book.
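
A small NumPy check of this geometric view (on illustrative synthetic data): the matrix P = X(XᵀX)⁻¹Xᵀ is symmetric and idempotent, and applying it to y gives exactly the MLE reconstruction of the targets.

import numpy as np

rng = np.random.default_rng(1)
X = np.column_stack([rng.uniform(-5.0, 5.0, size=50), np.ones(50)])
y = X @ np.array([-4.0, 7.0]) + rng.normal(0.0, 2.0, size=50)

theta_mle = np.linalg.solve(X.T @ X, X.T @ y)
P = X @ np.linalg.inv(X.T @ X) @ X.T          # projection onto the column space of X

print(np.allclose(P, P.T))                    # True: symmetric
print(np.allclose(P @ P, P))                  # True: idempotent (projecting twice changes nothing)
print(np.allclose(P @ y, X @ theta_mle))      # True: P y is the reconstructed y_hat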

Maximum a Posteriori Estimation

MLE is a great parameter estimation technique for linear regression problems. However, it is prone to overfitting. The problem becomes clear when we talk about polynomial regression: if we have enough parameters (i.e., enough representational capacity in our prediction function), MLE will produce parameters with high magnitude, which span a function that oscillates wildly and passes through each data point, yielding a poor representation of the data-generating function. In ML, we do not just need to minimize the error; we also need to achieve good generalization.

In order to mitigate the effect of huge parameter values, we can place a prior distribution p(θ) on the parameters, which encodes the range of plausible values. From Bayes’ theorem:

p(θ ∣ 𝕏, 𝕐) = p(𝕐 ∣ 𝕏, θ) p(θ) / p(𝕐 ∣ 𝕏)

In MAP estimation, instead of maximizing the likelihood, we maximize the posterior p(θ ∣ 𝕏, 𝕐). This function combines the information from the likelihood and from the prior distribution. In order to find the solution, we follow the same log transformation:

− log p(θ ∣ 𝕏, 𝕐) = − log p(𝕐 ∣ 𝕏, θ) − log p(θ) + const

We collect into “const” everything that does not depend on the parameters. Again, we turn this into a minimization:

θₘₐₚ = argmin_θ ( − log p(𝕐 ∣ 𝕏, θ) − log p(θ) )

We also consider the prior to be a conjugate Gaussian distribution (here, a zero-mean isotropic Gaussian p(θ) = 𝒩(0, b²I)), which means that the posterior distribution will also be a Gaussian (look up conjugate priors). Therefore:

− log p(θ ∣ 𝕏, 𝕐) = (1 / 2σ²) ‖y − Xθ‖² + (1 / 2b²) ‖θ‖² + const

Calculating the gradient and setting it to zero (the positive-definiteness conditions remain the same):

θₘₐₚ = (XᵀX + (σ²/b²) I)⁻¹ Xᵀy

Comparing with the MLE solution, we can see that the difference between the two estimates is the extra term (σ²/b²) I, which relates the variance of the observation noise to the variance of our prior distribution.

Now, focus on the “loss” function of MAP. In comparison to the MLE loss, we just added one term:

(1 / 2b²) ‖θ‖²

We basically added a regularization term! Hence, when we use L2 regularization in a loss function, we are placing a Gaussian prior on our parameters! This prior constrains where the parameters can live, which avoids huge parameter values by shrinking them toward zero. It is also possible to derive that the λ term commonly used in (ridge) linear regression is the ratio between the observation-noise variance and the prior variance (λ = σ²/b² in the notation above).
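
As a sanity check, here is a minimal sketch (assuming scikit-learn is available; σ² and b² are arbitrary illustrative values) showing that the MAP closed form with λ = σ²/b² coincides with plain ridge regression:

import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.3, size=100)

sigma2 = 0.3 ** 2           # observation-noise variance (assumed known here)
b2 = 1.0                    # prior variance of the parameters
lam = sigma2 / b2           # implied L2 regularization strength

theta_map = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)
theta_ridge = Ridge(alpha=lam, fit_intercept=False).fit(X, y).coef_

print(np.allclose(theta_map, theta_ridge, atol=1e-6))   # True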

You can see the implementation of MAP estimation below.
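
The MAP implementation is embedded in the original post; below is a minimal sketch of the closed-form MAP estimate on a degree-9 polynomial fit to noisy sinusoidal data (the scenario described in the next paragraph). The prior is an assumed zero-mean isotropic Gaussian, and the input range and noise values are illustrative.

import numpy as np

def map_estimate(Phi, y, sigma2, b2):
    """Closed-form MAP with prior N(0, b^2 I): theta = (Phi^T Phi + (sigma^2/b^2) I)^{-1} Phi^T y."""
    lam = sigma2 / b2
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(Phi.shape[1]), Phi.T @ y)

rng = np.random.default_rng(3)
x = rng.uniform(-np.pi, np.pi, size=30)
y = -4.0 * np.sin(x) + rng.normal(0.0, 0.5, size=30)

Phi = np.vander(x, 10, increasing=True)                  # features 1, x, ..., x^9

theta_mle = np.linalg.lstsq(Phi, y, rcond=None)[0]
theta_map = map_estimate(Phi, y, sigma2=0.25, b2=1.0)

# The MAP weights are shrunk toward zero relative to the MLE weights.
print(np.linalg.norm(theta_mle), np.linalg.norm(theta_map))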

As a test scenario, I tried to fit a sinusoidal function with a polynomial of degree 9. As you can see, the MLE gives a poor representation because it has high variance (you can check this by looking at the weird shape trying to pass through the data points). On the other hand, the MAP estimate has a shape much closer to the trigonometric function: that is the regularization acting!

Linear Regression for y(x) = -4.0sin(x) + noise*0.5

This finalizes the parameter estimation section. But before we finish our Linear Regression post, let us derive another solution from the Bayesian viewpoint.

Bayesian Linear Regression

In Bayesian Linear Regression, instead of computing a point estimate of the parameter vector, we treat the parameters as random variables and compute a full posterior over them, which means that predictions average over all plausible parameter settings.

We consider the model:

prior: p(θ) = 𝒩(m₀, S₀)
likelihood: p(y ∣ x, θ) = 𝒩(y ∣ ϕ(x)ᵀθ, σ²)

where ϕ(x) is the generalized input vector (for example, the polynomial feature vector used above), and m₀ and S₀ are the prior mean and covariance.

We will derive the posterior distribution directly (instead of first talking about inference with the prior) and then look at this model in practice. Using Bayes’ theorem, the posterior over the parameters is:

p(θ ∣ 𝕏, 𝕐) = p(𝕐 ∣ 𝕏, θ) p(θ) / p(𝕐 ∣ 𝕏)

From our model, we already have the prior and the likelihood distributions. The distribution p(𝕐 ∣ 𝕏) is called the marginal likelihood and is calculated by:

p(𝕐 ∣ 𝕏) = ∫ p(𝕐 ∣ 𝕏, θ) p(θ) dθ

This distribution is independent of the parameters and ensures that the posterior is normalized. We can think of the marginal likelihood as the likelihood averaged over all possible parameter settings, according to the prior distribution.
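
In fact, for this Gaussian model the marginal likelihood is available in closed form. Writing Φ for the matrix whose rows are the feature vectors ϕ(xᵢ)ᵀ and y for the vector of targets:

p(𝕐 ∣ 𝕏) = 𝒩(y ∣ Φm₀, ΦS₀Φᵀ + σ²I)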

Using those three distributions and Bayes’ theorem, we find that the posterior distribution is:

p(θ ∣ 𝕏, 𝕐) = 𝒩(θ ∣ m_N, S_N)
S_N = (S₀⁻¹ + ΦᵀΦ / σ²)⁻¹
m_N = S_N (S₀⁻¹m₀ + Φᵀy / σ²)

where Φ again denotes the matrix whose rows are the feature vectors ϕ(xᵢ)ᵀ.

I prefer not to derive this result mathematically here, for the sake of conciseness. The process consists of applying the log transformation to the Bayes expression, doing some matrix algebra, and then identifying the mean and the covariance matrix. Furthermore, we exploit the fact that the marginal likelihood is independent of θ and, therefore, becomes just a constant after the log transformation. For a detailed derivation, check here.

Now that we have the posterior distribution, we need to compute the posterior predictive distribution for a new input x*:

p(y* ∣ 𝕏, 𝕐, x*)

This expression gives us the distribution over the observation space, i.e., the uncertainty over the prediction: basically, it is the region where the prediction lies. To calculate this distribution, we integrate over the posterior distribution:

p(y* ∣ 𝕏, 𝕐, x*) = ∫ p(y* ∣ x*, θ) p(θ ∣ 𝕏, 𝕐) dθ = 𝒩(y* ∣ ϕ(x*)ᵀm_N, ϕ(x*)ᵀS_N ϕ(x*) + σ²)

The derivation of this formula follows from the conjugacy of Gaussians and the marginalization property of Gaussian distributions.

I implemented the code for this analytical solution:
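
The embedded implementation does not appear here, so below is a minimal sketch of the closed-form Bayesian solution (illustrative names; a zero-mean isotropic prior and the same sinusoidal test data are assumed):

import numpy as np

class BayesianLinearRegression:
    """Bayesian linear regression with prior N(m0, S0) and known noise variance sigma^2."""

    def __init__(self, sigma2, m0, S0):
        self.sigma2, self.m0, self.S0 = sigma2, m0, S0

    def fit(self, Phi, y):
        S0_inv = np.linalg.inv(self.S0)
        # Posterior:  S_N = (S0^{-1} + Phi^T Phi / sigma^2)^{-1}
        #             m_N = S_N (S0^{-1} m0 + Phi^T y / sigma^2)
        self.S_N = np.linalg.inv(S0_inv + Phi.T @ Phi / self.sigma2)
        self.m_N = self.S_N @ (S0_inv @ self.m0 + Phi.T @ y / self.sigma2)
        return self

    def predict(self, Phi):
        # Predictive mean and variance for each row phi(x*):
        # mean = phi^T m_N,  var = phi^T S_N phi + sigma^2
        mean = Phi @ self.m_N
        var = np.einsum("ij,jk,ik->i", Phi, self.S_N, Phi) + self.sigma2
        return mean, var

rng = np.random.default_rng(4)
x = rng.uniform(-np.pi, np.pi, size=30)
y = -4.0 * np.sin(x) + rng.normal(0.0, 0.5, size=30)
Phi = np.vander(x, 10, increasing=True)

model = BayesianLinearRegression(sigma2=0.25, m0=np.zeros(10), S0=np.eye(10)).fit(Phi, y)
mean, var = model.predict(Phi)
lower, upper = mean - 1.96 * np.sqrt(var), mean + 1.96 * np.sqrt(var)   # 95% predictive bounds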

As a test case, I use the same polynomial regression scenario analyzed previously. As a result, we have:

Bayesian Linear Regression for y(x) = -4.0sin(x) + noise*0.5

In this plot, the scattered points are the input data. The blue line is the mean of the predictive distribution estimated for each input. The blue shaded area forms the 95% confidence bounds. As we can see, our result is not just a single line of predictions (resulting from point estimates of the parameter vector); it is actually a “region” estimate (resulting from distributions over the parameters).

In this post, I explained some of the most important statistical concepts in ML theory using Linear Regression. This technique is great for such an explanation because of its closed analytical form when using Gaussian distributions. In other cases, we need to solve harder optimization problems, and for that we commonly use gradient descent optimization.

If you would like more details about what has been covered here, I recommend the Mathematics for Machine Learning book, which is a great source of knowledge on the topic!
