Building Linear Regression in Python

Rhett Allain
Aug 31, 2020 · 7 min read

Yes, I know — there is a built in function in the python numpy module that does linear (and other powers) fitting. It looks like this.

m,b = polyfit(x, y, 1)

But what if you are using VPython (Glowscript)? This is the online version of python with the built in visual module. It has awesome stuff for making graphs, vector calculations, and even 3D visualizations. But what if you want to use Glowscript AND find a linear function that best fits some points or data? What then?

Oh yeah — I’m going to show you how to do this.

What Is a Linear Regression Fit?

Let’s start with an example. Suppose I have some data and I want to find a linear function that best matches this data. Something like this:

That’s 6 data points and one linear function. Let’s say the line has the mathematical form of:

But what values for A and B produce a line that best fits the data? Yes, this is why it’s called a “best fit” line. If you made this line on actual graph paper, you could just “eyeball” it. You could use a straight edge and just approximate the best fit line. But you can’t do this with python (well, maybe you could — that would be a fun experiment for later).

Other than estimating, the most common method to find these parameters (A and B) is with a Least Squares Fit. The idea is to pick some line and find the square of the y-difference between each data point and the line. Then you do this for ALL the data points and sum the square of there differences between the line.

Maybe a diagram will help.

It’s the square of this vertical distance that I will add up. The y-value on the line depends on the x-value of the data point. This gives a change in y value of:

Squaring this and finding the sum for all data points, I get the following.

You could have any values for A and B and get a line along with a sum of the squares of differences. However, we want the values of A and B that gives the minimum sum. That’s why it’s called “Least Squares Fit” — get it?

OK, before we go further let’s start playing with some data. I wanted to use real data — so this is the “water bottles saved” value as a function of day for the water fountain in the hallway (I’ve been collecting data for a while now). For the line, I just calculated the slope and y-intercept using the first and last data points.

Now I want to take each of these data points and find the difference between the vertical value and the line. Square those and add them up. Here’s the code that does that (here is the full code with the plot and EVERYTHING). Oh, I should note that my time values are in a list called “t” and the water is in a list
“v”.

sum=0
for j in range(len(t)):
sum=sum+(v[j]-b-m*t[j])**2

There’s a lot going on here, so let me make some comments.

Running this calculation gives a square difference of 9591.26 — but that doesn’t really mean much. What if I change the slope of my “best fit” line? The plot above has a slope of 10.72 bottles per day (remember, slopes have units). Here are some slopes along with their sum of difference squared. (Technically, the sum has units too — but I’m too lazy to include those)

You can see that a slope of 10.72 b/d is lower than the other two slopes. But is it the lowest? Oh, let’s find out. Here is a plot of the sum of the square of differences for different values of the slope. (here’s the code)

But wait! That might not be the best-fit slope even though it’s the minimum for the function. What about changing the vertical intercept? Yup, we would have to change this value also (and at the same time) to find the real best fit line.

Just for fun, here is a variation of the y-intercept with the slope that gives the minimum sum.

Ok, but clearly there’s a better way to find these coefficients A and B — I just wanted to make some plots so you could really see what’s going on.

Really, we want to do a minimization of a function with two variables. This means that we need to take the partial derivative of the sum with respect to A and set that equal to zero (the slope at the minimum is zero). Then we need to also take the partial derivative of the sum with respect to B and set it equal to zero. With these two equation and two unknowns (A and B), we can solve for the stuff.

I’m not going to go through the details (because it gets slightly messy because of the summations) — so I’ll just give you the answer. Oh, but here is a video with more details. Here is how to calculate the two coefficients.

I like this format since it’s a little bit shorter. However, notice that you have to calculate the B coefficient first as it’s used in the A calculation. Also, these are all summations over all the data points.

Here is the code (I cut off the top) to calculate this least squares fit.

Now for the comments:

Boom. That’s it. It looks like it works. Here is the full code.

TL;DR

Oh, you just want to get this to work in Glowscript. Here is a version that should work. The key is to put this function at the beginning of your code.

def fitstuff(x,y):
#this function takes two lists of data
#returns A and B
#where y = A + Bx
#just to be clear - B is the slope.

N=len(x)
sumx=0
sumx2=0
sumxy=0
sumy=0
for j in range(N):
sumx=sumx+x[j]
sumx2=sumx2+x[j]**2
sumxy=sumxy+x[j]*y[j]
sumy=sumy+y[j]
B =(N*sumxy-sumx*sumy)/(N*sumx2-sumx**2)
A=(sumy-B*sumx)/N
return(A,B)

If you want to give it a name other than fitstuff — well, I guess that would be OK.

Now, to use this you would just put something like this in your code.

A,B=fitstuff(t,v)
print("A = ",A,", B = ",B)

That works. It does the same thing as the previous program. However, you use whatever data you want and still get a linear fit. Also, I sort of like this better than the numpy version since you can see exactly what’s going on.

Homework

Just for fun.

The Startup

Get smarter at building your thing. Join The Startup’s +800K followers.

Rhett Allain

Written by

Physics faculty, science blogger of all things geek. Technical Consultant for CBS MacGyver and MythBusters. Former WIRED blogger.

The Startup

Get smarter at building your thing. Follow to join The Startup’s +8 million monthly readers & +800K followers.

Rhett Allain

Written by

Physics faculty, science blogger of all things geek. Technical Consultant for CBS MacGyver and MythBusters. Former WIRED blogger.

The Startup

Get smarter at building your thing. Follow to join The Startup’s +8 million monthly readers & +800K followers.

Medium is an open platform where 170 million readers come to find insightful and dynamic thinking. Here, expert and undiscovered voices alike dive into the heart of any topic and bring new ideas to the surface. Learn more

Follow the writers, publications, and topics that matter to you, and you’ll see them on your homepage and in your inbox. Explore

If you have a story to tell, knowledge to share, or a perspective to offer — welcome home. It’s easy and free to post your thinking on any topic. Write on Medium

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store