Building Linear Regression in Python

Rhett Allain
Published in The Startup
7 min read · Aug 31, 2020

Yes, I know — there is a built-in function in the Python numpy module that does linear (and other polynomial) fitting. It looks like this.

from numpy import polyfit

m, b = polyfit(x, y, 1)

But what if you are using VPython (GlowScript)? This is the online version of Python with the built-in visual module. It has awesome stuff for making graphs, vector calculations, and even 3D visualizations. But what if you want to use GlowScript AND find a linear function that best fits some points or data? What then?

Oh yeah — I’m going to show you how to do this.

What Is a Linear Regression Fit?

Let’s start with an example. Suppose I have some data and I want to find a linear function that best matches this data. Something like this:

That’s six data points and one linear function. Let’s say the line has the mathematical form of:

y = A + Bx
But what values for A and B produce a line that best fits the data? Yes, this is why it’s called a “best fit” line. If you made this line on actual graph paper, you could just “eyeball” it. You could use a straight edge and just approximate the best fit line. But you can’t do this with python (well, maybe you could — that would be a fun experiment for later).

Other than estimating, the most common method to find these parameters (A and B) is a Least Squares Fit. The idea is to pick some line and find the square of the y-difference between each data point and the line. Then you do this for ALL the data points and sum the squares of those differences between the data and the line.

Maybe a diagram will help.

It’s the square of this vertical distance that I will add up. The y-value on the line depends on the x-value of the data point. This gives a change in y value of:

Δy = y_data − (A + Bx_data)

Squaring this and finding the sum for all data points, I get the following:

S = Σ(y_i − A − Bx_i)²

You could have any values for A and B and get a line along with a sum of the squares of differences. However, we want the values of A and B that give the minimum sum. That’s why it’s called “Least Squares Fit” — get it?

OK, before we go further let’s start playing with some data. I wanted to use real data — so this is the “water bottles saved” value as a function of day for the water fountain in the hallway (I’ve been collecting data for a while now). For the line, I just calculated the slope and y-intercept using the first and last data points.

Now I want to take each of these data points and find the difference between the vertical value and the line. Square those and add them up. Here’s the code that does that (here is the full code with the plot and EVERYTHING). Oh, I should note that my time values are in a list called “t” and the water values are in a list called “v”.

sum = 0
for j in range(len(t)):
    sum = sum + (v[j] - b - m*t[j])**2

There’s a lot going on here, so let me make some comments.

  • The “for” loop runs once for each element of the t list. That’s what the second line says.
  • The third line calculates the difference between the v value and the corresponding point on the line, squares it, and adds it to the running total.

Running this calculation gives a square difference of 9591.26 — but that doesn’t really mean much. What if I change the slope of my “best fit” line? The plot above has a slope of 10.72 bottles per day (remember, slopes have units). Here are some slopes along with their sum of difference squared. (Technically, the sum has units too — but I’m too lazy to include those)

  • Slope = 10 bottle/day, sum = 14,599
  • Slope = 10.72 bottle/day, sum = 9591.26
  • Slope = 12 bottle/day, sum = 11,927

You can see that a slope of 10.72 bottles/day gives a lower sum than the other two slopes. But is it the lowest? Oh, let’s find out. Here is a plot of the sum of the squares of differences for different values of the slope. (here’s the code)
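Here’s a sketch of what that slope scan might look like. The data lists below are made up for illustration; the real t and v lists come from the water fountain data.

```python
# Sweep the slope m over a range (intercept b held fixed) and record
# the sum of squared differences for each value.
t = [0, 1, 2, 3, 4, 5]       # hypothetical day values
v = [3, 14, 24, 36, 45, 57]  # hypothetical bottle counts

b = 3                        # fixed y-intercept for this scan
best_m, best_s = None, None
m = 8.0
while m <= 13.0:
    s = 0
    for j in range(len(t)):
        s = s + (v[j] - b - m*t[j])**2
    if best_s is None or s < best_s:
        best_m, best_s = m, s
    m = m + 0.01

print(best_m, best_s)        # best_m ends up near 10.73 for this fake data
```

With GlowScript you would plot s against m with a gcurve instead of just printing the minimum.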

But wait! That might not be the best-fit slope even though it’s the minimum for the function. What about changing the vertical intercept? Yup, we would have to change this value also (and at the same time) to find the real best fit line.

Just for fun, here is a variation of the y-intercept with the slope that gives the minimum sum.

Ok, but clearly there’s a better way to find these coefficients A and B — I just wanted to make some plots so you could really see what’s going on.

Really, we want to do a minimization of a function with two variables. This means that we need to take the partial derivative of the sum with respect to A and set that equal to zero (the slope at the minimum is zero). Then we need to also take the partial derivative of the sum with respect to B and set it equal to zero. With these two equations and two unknowns (A and B), we can solve for the stuff.
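Written out, that minimization looks like this (just a sketch of the algebra, using N for the number of data points and x_i, y_i for the data):

```latex
S(A,B) = \sum_{i=1}^{N} (y_i - A - B x_i)^2

\frac{\partial S}{\partial A} = -2\sum_{i=1}^{N}(y_i - A - B x_i) = 0
\quad\Rightarrow\quad \sum y_i = N A + B \sum x_i

\frac{\partial S}{\partial B} = -2\sum_{i=1}^{N} x_i (y_i - A - B x_i) = 0
\quad\Rightarrow\quad \sum x_i y_i = A \sum x_i + B \sum x_i^2
```

That leaves two linear equations in A and B to solve.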

I’m not going to go through the details (because it gets slightly messy with all the summations) — so I’ll just give you the answer. Oh, but here is a video with more details. Here is how to calculate the two coefficients:

B = (N Σx_iy_i − Σx_i Σy_i) / (N Σx_i² − (Σx_i)²)

A = (Σy_i − B Σx_i) / N

I like this format since it’s a little bit shorter. However, notice that you have to calculate the B coefficient first as it’s used in the A calculation. Also, these are all summations over all the data points.

Here is the code (I cut off the top) to calculate this least squares fit.

Now for the comments:

  • Line 11: t is a list of the days in the data. So, len(t) returns the length of this list — that would be N.
  • Lines 13–16: these just set up the initial values for all the sums that I need to calculate.
  • Lines 18–22: this makes a loop that goes through all the data in the two lists (t and v in this case) and adds all the sums.
  • Ignore line 23 — that was left over from a different way to calculate this.
  • Lines 25–27: Calculate and print the two coefficients.
  • Lines 30–31: Plot the actual data points.
  • Lines 33–38: Plot the best fit line.
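Since the screenshot isn’t reproduced here, based on the comments above, the calculation part of that script probably looks something like this (t and v are hypothetical stand-ins for the real day and bottle lists, and the plotting lines are omitted):

```python
# Sketch of the least squares calculation described in the comments above.
t = [0, 1, 2, 3, 4, 5]       # hypothetical day values
v = [3, 14, 24, 36, 45, 57]  # hypothetical bottle counts

N = len(t)                   # number of data points

sumx = 0                     # initial values for the four sums
sumy = 0
sumxy = 0
sumx2 = 0

for j in range(N):           # go through both lists and add up the sums
    sumx = sumx + t[j]
    sumy = sumy + v[j]
    sumxy = sumxy + t[j]*v[j]
    sumx2 = sumx2 + t[j]**2

B = (N*sumxy - sumx*sumy)/(N*sumx2 - sumx**2)  # slope
A = (sumy - B*sumx)/N                          # intercept
print("A = ", A, ", B = ", B)
```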

Boom. That’s it. It looks like it works. Here is the full code.

TL;DR

Oh, you just want to get this to work in GlowScript. Here is a version that should work. The key is to put this function at the beginning of your code.

def fitstuff(x, y):
    # this function takes two lists of data
    # returns A and B
    # where y = A + Bx
    # just to be clear - B is the slope
    N = len(x)
    sumx = 0
    sumx2 = 0
    sumxy = 0
    sumy = 0
    for j in range(N):
        sumx = sumx + x[j]
        sumx2 = sumx2 + x[j]**2
        sumxy = sumxy + x[j]*y[j]
        sumy = sumy + y[j]
    B = (N*sumxy - sumx*sumy)/(N*sumx2 - sumx**2)
    A = (sumy - B*sumx)/N
    return (A, B)

If you want to give it a name other than fitstuff — well, I guess that would be OK.

Now, to use this you would just put something like this in your code.

A,B=fitstuff(t,v)
print("A = ",A,", B = ",B)

That works. It does the same thing as the previous program. However, you can use whatever data you want and still get a linear fit. Also, I sort of like this better than the numpy version since you can see exactly what’s going on.
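If you do have numpy handy (outside of GlowScript), here’s a quick sanity check that the two approaches agree. The data lists are made up for illustration:

```python
import numpy as np

def fitstuff(x, y):
    # same least squares fit as above: returns (A, B) for y = A + Bx
    N = len(x)
    sumx = sum(x)
    sumy = sum(y)
    sumxy = sum(xi*yi for xi, yi in zip(x, y))
    sumx2 = sum(xi**2 for xi in x)
    B = (N*sumxy - sumx*sumy)/(N*sumx2 - sumx**2)
    A = (sumy - B*sumx)/N
    return (A, B)

t = [0, 1, 2, 3, 4, 5]       # hypothetical data
v = [3, 14, 24, 36, 45, 57]

A, B = fitstuff(t, v)
m, b = np.polyfit(t, v, 1)   # numpy returns slope first, then intercept

print(A, b)                  # these should match
print(B, m)                  # and these too
```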

Homework

Just for fun.

  • There are other ways to find a least squares fit. This method finds the difference in just the vertical direction. What about a difference in the x-direction? What about shortest distance (perpendicular distance) to the line?
  • Try some data and find the parameters with a least squares fit (like with this program). Does MS Excel give the same values with its “trendline” function? What about Google Sheets?
  • Here’s the one I want to do myself. What about a program that just picks a linear function and finds the sum of the squares of differences? It then changes the slope and the intercept to see if this sum gets smaller. This would be a numerical method for finding these coefficients. It would be fun.
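That last idea can be sketched in just a few lines: start from a guess, nudge A and B by a step, keep any change that lowers the sum, and shrink the step when nothing helps. The data here is made up again.

```python
# Crude numerical search for the best fit line: repeatedly try small
# changes to A and B and keep any change that lowers the sum of squares.
t = [0, 1, 2, 3, 4, 5]       # hypothetical data
v = [3, 14, 24, 36, 45, 57]

def sumsq(A, B):
    s = 0
    for j in range(len(t)):
        s = s + (v[j] - A - B*t[j])**2
    return s

A, B = 0.0, 1.0              # starting guess
step = 1.0
while step > 1e-6:
    improved = False
    for dA, dB in [(step, 0), (-step, 0), (0, step), (0, -step)]:
        if sumsq(A + dA, B + dB) < sumsq(A, B):
            A, B = A + dA, B + dB
            improved = True
    if not improved:
        step = step/2        # shrink the step once we're stuck
print(A, B)
```

For this fake data the search lands on the same A and B as the closed-form formulas, which is a nice check that both are working.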


Rhett Allain

Physics faculty, science blogger of all things geek. Technical Consultant for CBS MacGyver and MythBusters. WIRED blogger.