Linear Regression

Nil ☿
9 min read · May 26, 2022


This is pt. 2 of a series that will guide you through the linear models of machine learning, from the math theory to writing the algorithms from scratch in Python. Check out pt. 1 for a brief introduction to linear functions!

Here we start looking at some frightening math to thoroughly understand: 1) the general math behind linear regression, and 2) one of the reasons why data scientists and statisticians are obsessed with matrices. Looking at the math may seem pointless for a self-taught beginner data scientist, but it isn’t… and you’ll see why when we apply all these math notions in the code!

Remember:

A linear model is nothing more than a linear function whose parameters are estimated from data by minimizing a loss function, typically with the gradient descent algorithm.

Definitions

Let’s start with some jargon…

Linear regression can be classified according to:

The number of independent variables x:

  • if number of x = 1: simple linear regression
  • if number of x > 1: multiple linear regression

The number of dependent variables y:

  • if number of y = 1: univariate linear regression
  • if number of y > 1: multivariate linear regression

… and remember:

  • x = independent variable = explanatory variable
  • y = dependent variable = response variable

N.B.: since bold notation is often hard to read on screen, I’ll use arrow notation for vectors. Moreover, hat notation will be used for predicted values; therefore:
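a vector will be written as $\vec{v}$ (rather than bold $\mathbf{v}$), and a predicted value of $y$ will be written as $\hat{y}$.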

Let’s analyze the math to see some patterns…

Simple Linear Regression

(function with a single explanatory variable and a single response variable)

The (univariate) simple linear regression is a simple function which maps a single independent variable to a single dependent variable:
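$$y = \beta_0 + \beta_1 x + \epsilon$$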

You should already be familiar with it from pt. 1 of this series. It’s just a line! Now, for each observation i, the function takes the form:
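$$y_i = \beta_0 + \beta_1 x_i + \epsilon_i$$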

Where:

  • β₀,β₁ are the parameters (aka weights): that is, the variables that we need to “tweak” in order to fit the model.
  • ϵ is the disturbance term (aka error variable): that is, an unobserved random variable that adds “noise” to the model.

This formula should look familiar, too! If we call the β₀ term b, call the β₁ term m, and get rid of ϵ, it becomes y = mx + b: the very same equation of a line, just written in a different way.

Until now we have spoken about a single observation, but the model is actually fitted to many observations, so we can stack all the observations using matrix notation. This also has a practical meaning in computer science (vectorized computation performs way better than endless serial iterations!). The matrix notation is:
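$$\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} = \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{bmatrix} \begin{bmatrix} \beta_0 \\ \beta_1 \end{bmatrix} + \begin{bmatrix} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_n \end{bmatrix}$$

(here $n$ is the number of observations)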

If you feel confused about the ones… just review the dot product rules:
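$$\begin{bmatrix} 1 & x_i \end{bmatrix} \begin{bmatrix} \beta_0 \\ \beta_1 \end{bmatrix} = 1 \cdot \beta_0 + x_i \cdot \beta_1 = \beta_0 + \beta_1 x_i$$

Each 1 in the first column gets multiplied by $\beta_0$: that’s how the intercept ends up in every row.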

The matrices above can be simplified in another compact (and human-readable) matrix notation:
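$$\vec{y} = X \vec{\beta} + \vec{\epsilon}$$

where $X$ is the matrix holding the column of 1s and the column of $x$ values.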

We’ve seen a lot of math, but what does this actually mean? What do those vectors and matrices represent? Well, the vector notation is just used to group all the predictions, each of which is mapped to its own explanatory variable. In other words, we’re still just talking about lines! But… shouldn’t they be many lines? Well, no. We use multiple observations to fit the single best line, which means that even if we “try” multiple lines during fitting, in the end simple linear regression gives us only one line in a 2D space.

Take a look at this picture. The dots are the actual values, the line is just the best fit to those data.
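To see the stacked notation at work in code, here is a minimal NumPy sketch (with made-up data, and with NumPy’s least-squares solver standing in for the gradient descent we’ll write in the next chapter):

```python
import numpy as np

# Made-up data just for illustration: 50 noisy points around the line y = 1 + 2x
rng = np.random.default_rng(42)
x = rng.uniform(0, 10, size=50)
y = 1 + 2 * x + rng.normal(scale=1.0, size=50)

# Stack the observations: a design matrix X with a column of 1s (for beta_0)
X = np.column_stack([np.ones_like(x), x])   # shape (50, 2)

# Solve y = X @ beta in the least-squares sense (stand-in for gradient descent)
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta)  # roughly [1, 2]: the fitted intercept (beta_0) and slope (beta_1)
```

One call, no loops over the single observations: that’s the practical payoff of the matrix notation.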

Multiple Linear Regression

(function with multiple explanatory variables and a single response variable)

The (univariate) multiple linear regression is a function which maps many independent variables to a single dependent variable:
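$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p + \epsilon$$

(here $p$ is the number of explanatory variables, i.e. the number of features)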

It may seem intimidating, but it really just means that we have multiple x values, that is, multiple features, which, once multiplied by the parameters and summed up, give us our prediction!

For each observation i, the function assumes the form of:
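$$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_p x_{ip} + \epsilon_i$$

where $x_{ij}$ is the value of the $j$-th feature for observation $i$.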

If you remember a bit of linear algebra (the same bit as above ;) ), you’ll know that we can write this as a dot product, in particular:
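$$y_i = \begin{bmatrix} x_{i0} \\ x_{i1} \\ \vdots \\ x_{ip} \end{bmatrix}^{T} \begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_p \end{bmatrix} + \epsilon_i$$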

Notice that, to avoid having to write a column of 1s manually, I started indexing x from 0, with the convention that x₀ = 1.

Wait… what’s that “T”? It just means that we take the transpose: a fancy way of saying that we swap the rows with the columns. This is needed so that the dot product between the two vectors is defined.

Therefore, each observation i can be rewritten in the more compact notation:
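$$y_i = \vec{x}_i^{\,T} \vec{\beta} + \epsilon_i$$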

Now, as for simple linear regression, we stack all the observations in matrix notation:
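$$\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} = \begin{bmatrix} \vec{x}_1^{\,T} \\ \vec{x}_2^{\,T} \\ \vdots \\ \vec{x}_n^{\,T} \end{bmatrix} \vec{\beta} + \begin{bmatrix} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_n \end{bmatrix}$$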

But since the x terms are just transposed column vectors, this means they’re just the rows of the new matrix!

Luckily, we can use compact matrix notation here too:
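$$\vec{y} = X \vec{\beta} + \vec{\epsilon}$$

where $X$ is now the $n \times (p+1)$ matrix whose rows are the $\vec{x}_i^{\,T}$ vectors.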

Yes, yes, we got all this strange matrix stuff, but what’s the actual meaning here? It cannot be a line with all those different βx terms! Indeed it isn’t. Let’s take the simplest example of multiple linear regression:
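$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \epsilon$$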

It’s almost the same as simple linear regression; it just has one extra term. In simple linear regression we had 2 variables, x and y, which means we were in a 2D space. Here we have 3 variables, x₁, x₂ and y, which means we are in a 3D space. In simple regression we got a line; here we get, guess what? A plane:

But… what if we add one more term?
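$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \epsilon$$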

Here we need to introduce another concept in order to generalize to any number of extra terms. It’s another fancy word defined with even stranger words, but it’s actually pretty simple: the hyperplane.

Hyperplane: a subspace of dimension n − 1 relative to its ambient space of dimension n.

In common words, that just means an object having one dimension less than the space it lives in, such as: a point on a number line, a line on a 2D plane, a plane in a 3D space, etc.

So what we actually get with linear regression, whether simple or multiple, is just a hyperplane of n − 1 dimensions, where n is the number of dimensions of its ambient space. 2 variables x, y? 2D space, 1D hyperplane (a line). 3 variables x₁, x₂, y? 3D space, 2D hyperplane (a plane).
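Just to make the hyperplane idea concrete in code, here is the same NumPy sketch as before, only with more columns in X (again made-up data, again using the least-squares solver as a placeholder for gradient descent):

```python
import numpy as np

# Made-up data: 100 noisy points around the plane y = 3 + 2*x1 - 1*x2
rng = np.random.default_rng(0)
X_feat = rng.uniform(0, 10, size=(100, 2))              # two features: x1, x2
y = 3 + 2 * X_feat[:, 0] - 1 * X_feat[:, 1] + rng.normal(scale=0.5, size=100)

# Same trick: prepend a column of 1s so beta_0 is handled by the dot product
X = np.column_stack([np.ones(len(X_feat)), X_feat])     # shape (100, 3)

beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta)  # roughly [3, 2, -1]: the parameters of the fitted hyperplane
```

The code doesn’t care how many features there are: 2 columns fit a line, 3 fit a plane, 50 fit a 49-dimensional hyperplane.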

Multivariate Linear Regression

(function with multiple explanatory variables and multiple response variables)

Multivariate (multiple) linear regression is a function which maps many independent variables to many correlated dependent variables:
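$$(x_1, x_2, \dots, x_p) \longmapsto (y_1, y_2, \dots, y_q)$$

(here $q$ is the number of response variables, while $p$ is still the number of features)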

For each observation i, the function assumes the form of:
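$$\begin{cases} y_{i1} = \beta_{01} + \beta_{11} x_{i1} + \dots + \beta_{p1} x_{ip} + \epsilon_{i1} \\ y_{i2} = \beta_{02} + \beta_{12} x_{i1} + \dots + \beta_{p2} x_{ip} + \epsilon_{i2} \\ \quad \vdots \\ y_{iq} = \beta_{0q} + \beta_{1q} x_{i1} + \dots + \beta_{pq} x_{ip} + \epsilon_{iq} \end{cases}$$

Notice the second index on the parameters: each response variable gets its own set of βs.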

As seen above, each observation i can be rewritten as a dot product, but since there are multiple dependent variables, we end up with a lot of vectors:
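$$y_{i1} = \vec{x}_i^{\,T} \vec{\beta}_1 + \epsilon_{i1}, \qquad y_{i2} = \vec{x}_i^{\,T} \vec{\beta}_2 + \epsilon_{i2}, \qquad \dots, \qquad y_{iq} = \vec{x}_i^{\,T} \vec{\beta}_q + \epsilon_{iq}$$

one parameter vector $\vec{\beta}_j$ per response variable.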

Stacking all the i observations in matrix notation:
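Collecting the parameter vectors $\vec{\beta}_1, \dots, \vec{\beta}_q$ as the columns of a matrix $B$:

$$\begin{bmatrix} \vec{y}_1^{\,T} \\ \vec{y}_2^{\,T} \\ \vdots \\ \vec{y}_n^{\,T} \end{bmatrix} = \begin{bmatrix} \vec{x}_1^{\,T} \\ \vec{x}_2^{\,T} \\ \vdots \\ \vec{x}_n^{\,T} \end{bmatrix} B + \begin{bmatrix} \vec{\epsilon}_1^{\,T} \\ \vec{\epsilon}_2^{\,T} \\ \vdots \\ \vec{\epsilon}_n^{\,T} \end{bmatrix}$$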

This seems to be the same as the stacked matrix notation of multiple regression. However, notice that all y terms here are vectors! So, for each vector y in the first column vector, we have:
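$$\vec{y}_i^{\,T} = \begin{bmatrix} y_{i1} & y_{i2} & \cdots & y_{iq} \end{bmatrix} = \vec{x}_i^{\,T} B + \vec{\epsilon}_i^{\,T}$$

that is, each component $y_{ij}$ is exactly the multiple linear regression $y_{ij} = \vec{x}_i^{\,T} \vec{\beta}_j + \epsilon_{ij}$ we saw before.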

Indeed, just as multiple linear regression was a generalization of simple linear regression to multiple explanatory variables, multivariate linear regression is a generalization of univariate linear regression to multiple response variables. That is, it isn’t a separate statistical linear model, but just a compact way of writing several multiple linear regression models simultaneously. In exactly the same way as we stacked multiple observations in the models above, we can stack models!

Therefore, considering all the observations and writing everything out in way-too-long explicit matrix notation (I believe in finding patterns in the math notation, rather than just studying all the formal stuff), we have:
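$$\underbrace{\begin{bmatrix} y_{11} & \cdots & y_{1q} \\ y_{21} & \cdots & y_{2q} \\ \vdots & & \vdots \\ y_{n1} & \cdots & y_{nq} \end{bmatrix}}_{Y} = \underbrace{\begin{bmatrix} 1 & x_{11} & \cdots & x_{1p} \\ 1 & x_{21} & \cdots & x_{2p} \\ \vdots & & & \vdots \\ 1 & x_{n1} & \cdots & x_{np} \end{bmatrix}}_{X} \underbrace{\begin{bmatrix} \beta_{01} & \cdots & \beta_{0q} \\ \beta_{11} & \cdots & \beta_{1q} \\ \vdots & & \vdots \\ \beta_{p1} & \cdots & \beta_{pq} \end{bmatrix}}_{B} + \underbrace{\begin{bmatrix} \epsilon_{11} & \cdots & \epsilon_{1q} \\ \epsilon_{21} & \cdots & \epsilon_{2q} \\ \vdots & & \vdots \\ \epsilon_{n1} & \cdots & \epsilon_{nq} \end{bmatrix}}_{\Xi}$$

Each column of $Y$, paired with the matching column of $B$, is one multiple linear regression model: that’s the sense in which we’re stacking models.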

Luckily, also this can be conveniently compacted to:
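$$Y = X B + \Xi$$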

…wait, why did the lowercase epsilon (ϵ) become an uppercase xi (Ξ)? Because uppercase epsilon is used for the matrix analogue of the SSE (don’t worry about this). Blame statisticians for the inconsistent notation!
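To see the “stacking models” trick in code: NumPy’s least-squares solver happily takes a whole matrix Y and returns the whole parameter matrix B, one column per response (again made-up data, and again a solver standing in for the gradient descent of the next chapter):

```python
import numpy as np

# Made-up data: 200 observations, 2 features, 2 response variables
rng = np.random.default_rng(1)
X_feat = rng.uniform(0, 10, size=(200, 2))
B_true = np.array([[1.0, -2.0],    # beta_0 for y1, y2
                   [2.0,  0.5],    # beta_1 for y1, y2
                   [-1.0, 3.0]])   # beta_2 for y1, y2

X = np.column_stack([np.ones(len(X_feat)), X_feat])      # shape (200, 3)
Y = X @ B_true + rng.normal(scale=0.5, size=(200, 2))     # shape (200, 2)

# One call fits both stacked models at once
B_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
print(B_hat)  # roughly B_true: column j holds the parameters of response j
```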

Ok, ok, a lot of matrices, but what’s the visual insight here? Well, the fact that it is multiple means that we’re in more than 2D. The fact that it is multivariate simply means that there’s more than one hyperplane!

So, for a simple and viewable example, in the case of:
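$$\begin{cases} y_1 = \beta_{01} + \beta_{11} x_1 + \beta_{21} x_2 \\ y_2 = \beta_{02} + \beta_{12} x_1 + \beta_{22} x_2 \end{cases}$$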

We have two 2D hyperplanes in a 3D ambient space:

We spoke about multivariate (multiple) linear regression, but doesn’t that imply that multivariate simple linear regression exists too? Yes, it does. The reason nobody talks about it is that it is rarely used and, just like simple linear regression, it is only a particular case of a more general model. But just for completeness, this would be its form for a single observation i:
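$$\begin{cases} y_{i1} = \beta_{01} + \beta_{11} x_i + \epsilon_{i1} \\ y_{i2} = \beta_{02} + \beta_{12} x_i + \epsilon_{i2} \\ \quad \vdots \\ y_{iq} = \beta_{0q} + \beta_{1q} x_i + \epsilon_{iq} \end{cases}$$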

Therefore, if we take the simplest case:
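$$\begin{cases} y_1 = \beta_{01} + \beta_{11} x \\ y_2 = \beta_{02} + \beta_{12} x \end{cases}$$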

Its graphical meaning is:

Do you have questions? Is something not clear? Did you spot an error? :’) Any suggestions? Just reach out to me: https://linktr.ee/voidpunk

You can find the whole code of this series as a Jupyter notebook on my GitHub: https://github.com/voidpunk/ML-Notes :)

That’s all for this chapter. In the next one we’ll dive into the optimization algorithms, from the loss function to the gradient descent algorithm! Stay tuned :)
