Why Linear Regression is a Projection

9 min readDec 17, 2017

Over the last half a year, I’ve had to learn a fair bit of linear algebra in order to understand the machine learning I’ve been studying. I have mainly done so through the brilliant online lectures of Professor Gilbert Strang from MIT.

In this post, I’d like to explain one of the most beautiful ideas that I came across in that course: that linear regression in statistics is actually a projection. More precisely, that least-squares linear regression is equivalent to an orthogonal projection.

I wanted this to require no background beyond basic arithmetic, however explaining the basics of linear algebra, and I mean really explaining, would require me to cover a lot of ground that I rather wouldn’t, at least not in this post. In particular, I am forced to expect basic familiarity with the following ideas:

Vectors and Matrices, and some of their basic operations (multiplication, transposes)
Vectors as points and lines in n-dimensional space
What is a derivative, and how to find minima of quadratic functions with them

A great resource to learn about the first two ideas is this series of videos. The first three videos or so will suffice (≈30 minutes), but I recommend watching the whole thing if learning about maths is your thing (and if you’re here, it probably is).

I will not be assuming anything else, though, so this post will be divided into three parts: two parts explaining the prerequisites, and one that will pull them together, like so:

What is a projection
A quick introduction to linear regression
Why linear regression is a projection

Feel free to skip either of the former two if you already feel comfortable with the ideas. I won’t be doing anything unorthodox in them.

1. What Is a Projection

Let’s say we have a 2-dimensional vector, v. This vector defines a line on the coordinate plane:

The line is simply the set of all points that are expressible as multiples of v, so naturally the resulting line is parallel to v itself.

In addition to v, let’s say we also have another 2-dimensional vector, x, this time defining the coordinates of a point in the same plane:

We wish to find the closest point on the v line to x. Let’s call that point x̄:

Algebraically, to say that x̄ lies on the v line, we say x̄ = cv, for some currently unknown constant c. Then, the vector representing the difference in location between x and x̄ is x̄–x. So, in order to find x̄, we need to find a c that minimises ‖x̄–x‖, the length of that difference vector and thus the distance from the original point to the new point.

The length of a vector can be measured in different ways, but we are going to use Euclidean distance, also known as the L² norm. The Euclidean distance is calculated by taking the square root of the sum of squares of each dimension of a vector:

where n is the dimension of the vector. In our case, n = 2.

The Euclidean norm is perhaps the most natural way to define distance on vectors, and follows straight from Pythagoras’ theorem. It is an odd idea that distance can be measured in other ways, but it is in the nature of mathematics to see what we can get away with by changing such things as how distance is defined. Other norms might feel weird, but can sometimes have useful properties depending on the problem at hand. We won’t make use of them, but it’s important to be aware they exist.

With the L² norm chosen, we can now find x̄. To do that, we just need to find a c which would minimise ‖x̄–x‖:

On the way, we have shown that the minimum occurs exactly when x̄–x is orthogonal to the line defined by v, that is, when their dot product is zero. This makes sense – After all, the shortest distance to a line is intuitively going to be at a right angle to that line, so when finding the closest point on line v, we should not be surprised to find it when the angle at x̄ is 90º.

So, to solve our problem, all we have to do is to move our point orthogonally onto the line defined by v.

(It is intriguing that this fact only holds for L², and no other L-norm. In other norms, the angle between v and x at x̄ may not be 90º, meaning the closest distance is not what we intuitively think it should be. There will be a follow-up post explaining how that works.)

There are two things to note here:

To find x̄, all we have to do is multiply x by P. And, since we picked x arbitrarily, this should work for any point we choose! This means that P represents a kind of general object rather than just a solution to the problem in the case of x – it is a transformation that takes points and produces the location of their closest points in the line defined by v. In fact, since it is expressed by a matrix, it is a linear transformation.
You might notice, purely algebraically, that
PP = [v(vᵀv)⁻¹vᵀ][v(vᵀv)⁻¹vᵀ] = v(vᵀv)⁻¹[vᵀv(vᵀv)⁻¹]vᵀ = v(vᵀv)⁻¹vᵀ = P.
That is, applying this transformation twice is the same as applying it once. Can you see why that should be the case, given how we derived it?

These two properties define something called a projection:

Wikipedia: a projection is a linear transformation P from a vector space to itself such that P²= P.

This means our P is a projection. In particular, an orthogonal projection, as we found above that x̄–x is orthogonal to v.

Generally, a projection is an operation that moves points into a subspace. You can think of it as being similar to ‘flattening’ an object, or better, casting a shadow – hence the name, projection.

A projection of x into the subspace defined by v — a line, in this case.

In summary: Given a point x, finding the closest (by the Euclidean norm) point to x on a line can be solved by applying a linear transformation. That linear transformation is called an orthogonal projection, and can be thought of as casting a shadow directly onto the line.

2. Linear Regression

Sometimes, we have a bunch of data and we want to draw a line of best fit through that data:

Each data point is a pair of an x-coordinate and a y-coordinate. We want to draw a line capturing that data, represented by an equation of the form y=ax (for simplicity, we’ll assume the line goes through the origin; that assumption does not bear on anything significant, just makes things a bit simpler to visualise).

Linear regression is simply the process of picking that line in such a way that the error between the line and the y-coordinate of each data point is minimal:

One problem we can encounter in doing this is that we need the positive magnitude of the error, whereas simply subtracting the data-y from the predicted y can give both positive and negative results. A way to circumvent this is to square the errors, and minimise those squares, which are guaranteed to be positive. This is called least-squares linear regression.

3. Why Least-Squares is an Orthogonal Projection

By now, you might be a bit confused. After all, in orthogonal projection, we’re trying to project stuff at a right angle onto our target space. Wouldn’t that do something more like this to the data?

This certainly is not linear regression. In linear regression, the errors are parallel to the y axis, which is clearly not the case here. So, how can linear regression be an orthogonal projection?

In this example, we have five points of data. Let’s arrange them in two columns: one column of all x coordinates, in order, and another of all the y coordinates, in the same order:

These are two 5-dimensional vectors, x and y. And, since the line we’re fitting is of the form y = ax , we can calculate the entire vector of predicted ys for each data point just by multiplying: ȳ = ax. We can also compute all the errors in the same form: e = ȳ – y.

Now, we’d like to find an a such that the square of each element in e is minimal. Does that sound familiar? We’re trying to project y into the line defined by x!

Here, we are operating in a 5-dimensional space, since there are 5 data points. We are trying to find the point on the line defined by x, which is closest to y, which does not itself lie on that line. This time, we aren’t interested in ȳ itself, but in a, but the problem remains equivalent.

We have already shown that minimising ‖e‖ (in the L² norm) is equivalent to an orthogonal projection. Since squaring is monotonic, that is equivalent to minimising ‖e‖² – which is the least squares objective! (Using other norms gets you other kinds of regression, and different projections – more on that in the next post!)

In other words, when we perform the orthogonal projection of y into the space defined by x, we are solving least-squares linear regression!

I find this fact amazing.

Firstly, because it generalises to multiple linear regression, which means the kind of mental picture you have to draw to visualise this becomes quite involved – not only are you trying to reason about n-dimensional space, but you’re also trying to imagine a line that’s perpendicular to a, say, three-dimensional space. That’s hard to do. My tip is to use a plane to visualise any higher-dimensional space; that has worked relatively well for me.

Secondly, because it elegantly explains the analytic solution to linear regression, which was given without explanation in both machine learning courses I took. You can see why they’d skip that, but it’s always unsatisfying to use maths on faith alone, so I’m glad I found this in the Strang lectures.

Thirdly, this problem (and especially the multiple linear regression version) ties together a whole bunch of concepts across calculus, linear algebra and statistics, and the way in which they come together is very satisfying. Sometimes maths can be a bit dry when you’re stuck working on technicalities in the same area – having ideas connect to each other like this is just fun.

But I think the biggest reason why I like this is the overall shape of the thought: We start with a problem that is inherently visual and geometric: plotting a line to fit a bunch of points. We take that problem, and reformulate it into a different inherently visual problem: finding a projection of a point onto a space. Super satisfying.

Thanks for reading! I hope you learned something new, or at least were reminded of a cool thing you already knew. Let me know if I made any mistakes or if there were confusing parts – I can tell my writing is quite clumsy here, so comments on where I’m being unclear are especially welcome.

Some thoughts I had as I wrote this:

Writing about maths is really hard to do on Medium. I will move somewhere else, where I could use LaTeX, before writing anything more.(Although I might write the post on different norms here, still, as that will mostly be graphs, I think.)
Writing about maths is really hard to do, period. On one hand you want to not skim over anything or use confusing terminology, but on the other you want to be concise – especially when adding another clarifying equation means uploading another screenshot from your iPad. It’s hard to judge what terms will be understandable, and (as someone whose knowledge of pure mathematics is quite lacking) it is a bit intimidating to be talking about topics on the edge of your knowledge.
Defining projections, for example, is kind of hard, since on one hand they’re this abstract idea in linear algebra, and on the other hand you really want to give a brief and intuitive grasp of the concept, as the abstract definition isn’t the important one in this situation.

Why Linear Regression is a Projection

1. What Is a Projection

2. Linear Regression

3. Why Least-Squares is an Orthogonal Projection

Written by Vladimir Mikulik