Why Is the Mean So Important in Linear Regression?

Avinash Dubey
Published in The Startup · 3 min read · Jul 4, 2020

A tutorial on linear regression typically starts with some dependent variable “y” and one independent variable “x”. This is usually followed by an OLS (Ordinary Least Squares) derivation to find the (clichéd) line of best fit.

But hold on. Why do we start with one independent variable “x”? What if we have only “y” and no other information? Let’s find out, and in the process hopefully rediscover the things we already know as facts.

Let’s take a contrived example of 6 independent and untrained hobbyists who set out to find the number of coronavirus particles a person needs to inhale to be infected with the dreaded Covid-19.

Here is the unscientific dataset they come up with: {7, 3, 9, 1, 6, 4}. However, we can report only one number to the WHO to help them find a cure. Let’s call it m.

We don’t know what m is, but we do know OLS (Ordinary Least Squares). To put it a little more mathematically: We need to find m such that this cost is minimized:

cost(m) = (7-m)² + (3-m)² + (9-m)² + (1-m)² + (6-m)² + (4-m)²

Let’s plot this on a chart and see what it looks like. The cost traces out a beautiful orange curve (a parabola) which clearly has its lowest value at m = 5.
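If you want to reproduce the chart yourself, here is a minimal sketch (mine, not from the original post), assuming NumPy and Matplotlib are installed:

```python
# Recompute the orange cost curve over a grid of candidate m values
# and mark its minimum with a blue vertical line.
import numpy as np
import matplotlib.pyplot as plt

data = np.array([7, 3, 9, 1, 6, 4])

# cost(m) = sum over the data of (x_i - m)^2, evaluated on a grid.
m = np.linspace(0, 10, 201)
cost = ((data[:, None] - m[None, :]) ** 2).sum(axis=0)

plt.plot(m, cost, color="orange", label="cost(m)")
plt.axvline(m[np.argmin(cost)], color="blue", label="minimum")
plt.xlabel("m")
plt.ylabel("sum of squared distances")
plt.legend()
plt.show()

print(m[np.argmin(cost)])  # prints 5.0
```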

It’s easy enough to set the derivative of the cost with respect to m to zero, i.e. d(cost)/dm = 0, and arrive at the same conclusion. I’ll leave the full derivation to you, but the key step is d(cost)/dm = -2[(7-m) + (3-m) + (9-m) + (1-m) + (6-m) + (4-m)] = 0, and if you follow through, here is where we finally arrive.

m = (7 + 3 + 9 + 1 + 6 + 4) / 6, or m = 5
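As a quick sanity check (my addition, assuming SciPy is available), a general-purpose numerical minimizer lands on the same answer as the arithmetic mean:

```python
# Verify that the closed-form answer, np.mean, and a numerical
# minimizer of the cost all agree.
import numpy as np
from scipy.optimize import minimize_scalar

data = np.array([7, 3, 9, 1, 6, 4])
cost = lambda m: np.sum((data - m) ** 2)

result = minimize_scalar(cost)
print(np.mean(data))  # 5.0
print(result.x)       # ~5.0, agreeing with the derivation above
```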

And no prizes for guessing — this is nothing but the arithmetic mean of the numbers. Let’s add this to our chart with a blue line and (sure enough) it intersects the orange cost curve at the lowest point.

The mean of the values minimizes the sum of the squared distances. The corollary is that, among all single-number guesses, the mean is also the one around which the variance (the average squared distance) is smallest.

So, in the absence of any other information, the mean of the values is the best guess we have for the number of coronavirus particles one has to inhale to get infected.

This is the reason why, when we bring an independent variable “x” (say age or pre-existing conditions) into the picture, we still refer back to the mean for calculations such as R² = SSR/SST, where SST (the total sum of squares) is measured around the mean of “y”.

In plain English, what we are trying to say with a regression of the dependent variable “y” on the independent variable “x” is:

If the mean of the values is the best model we can build with “y” alone, how much of an improvement is a model that includes “x” as well? In other words, how much more of the variance in “y” is explained by the variance in “x” than by the mean of “y” alone?
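To make that concrete, here is a hedged sketch of the R² calculation. The “x” values below are invented purely for illustration, paired with our dataset as “y”; the fitted line comes from NumPy’s polyfit rather than any method from the original post:

```python
# R² compares a model that uses "x" against the mean-only baseline:
# 0 means no better than the mean, 1 means a perfect fit.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])  # hypothetical predictor
y = np.array([7.0, 3.0, 9.0, 1.0, 6.0, 4.0])  # our dataset

# Fit a simple OLS line y_hat = a*x + b.
a, b = np.polyfit(x, y, deg=1)
y_hat = a * x + b

sst = np.sum((y - y.mean()) ** 2)      # total variation around the mean
ssr = np.sum((y_hat - y.mean()) ** 2)  # variation explained by the line
print(ssr / sst)  # R² = SSR/SST
```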

PS: The more advanced readers among you will have realized that these values are just a sample, and a more robust approach needs much more in terms of confidence intervals and p-values. That’s coming up soon.
