Understanding Linear Regression with Python

Simple linear regression: what's the deal? This article aims to give you a better understanding through examples and Python code snippets.

Jonas Bostoen
Apr 4, 2018

Machine learning is all the rage these days. It sounds complicated and futuristic, but a lot of it is actually really simple. And I mean high-school-math simple. One example of this is regression. In this series I'll go over three popular kinds of regression: simple linear regression, multiple regression, and polynomial regression. We'll use examples to make everything crystal clear, and I'm including code snippets in Python for those of you who want to code along.

Simple Linear Regression

Imagine you meet a guy on the street, and he's a real bragging type. He tells you that he got a job at a law firm and makes $100k a year. You're not really surprised or interested, but you ask him how long he's been working at law firms, and he says something like a year and a half. Now you're curious, and you decide to do some research to find out whether the guy was lying.

The document you all of a sudden get your hands on

You get your hands on a document of that firm that states how much experience employees have and how much those employees make. You decide to plot it out and it looks something like this:

As you can see, the data is linear and continuous, which makes it perfect for linear regression.

So what is linear regression? It all comes down to drawing a line that’s closest to all the observations. As you might remember, a straight line’s equation looks like this: y = mx + b

If we knew m and b, we could predict, pretty accurately, what y would be for a future value of x. This is perfect for finding out whether that guy had been lying or not.
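To see the idea in plain numbers, here's a minimal sketch. The slope and intercept below are made up purely for illustration (the real ones come out of the fitting step later):

```python
# Hypothetical values for illustration only: suppose each year of experience
# adds $9,000 to the salary, and the base salary is $26,000.
m = 9_000   # slope: salary increase per year of experience
b = 26_000  # intercept: salary at zero experience

def predict_salary(years):
    """Evaluate the line y = m*x + b at a given x."""
    return m * years + b

print(predict_salary(1.5))  # 39500.0
```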

To find that best-fitting line, there are a number of methods that can be used. I won't go into the maths (click here if you want to, though), but the most widely used is the method of least squares.

This is the method we'll use. We'll also use a library called scikit-learn to make everything go smoother. First, we have to divide our data into a matrix of features and a vector of labels. The features are what we'll base our predictions on: in this case, years of experience. The labels, or answers, are the things we want to predict. If you're coding along, this is how we do it in Python:
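A sketch of that step with pandas. The numbers below are made up to stand in for the firm's document; the original tutorial loads a CSV instead (something like `pd.read_csv("Salary_Data.csv")`, where the filename and column names are my assumptions):

```python
import pandas as pd

# Made-up stand-in for the firm's document of experience vs. salary.
dataset = pd.DataFrame({
    "YearsExperience": [1.1, 2.0, 3.2, 4.5, 5.9, 7.1],
    "Salary": [40000, 45000, 55000, 62000, 80000, 97000],
})

# X: the matrix of features (every column but the last), shape (n_samples, 1).
X = dataset.iloc[:, :-1].values
# y: the vector of labels (the last column), shape (n_samples,).
y = dataset.iloc[:, -1].values

print(X.shape, y.shape)  # (6, 1) (6,)
```

Note that scikit-learn expects the features as a 2-D matrix even when there's only one feature, which is why X keeps its column dimension.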

The purpose of our machine learning algorithm is to be able to make predictions after learning. This is why we'll split our data into a training set and a test set. Scikit-learn splits these randomly, so we can't cheat and make our model look better than it actually is. The training set will be used to calculate the regression line, while the test set will show how accurately that regression line aligns with reality. Since there are 30 observations, we'll put 20 of them in the training set and use the other 10 for the test set:
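A sketch of the split with scikit-learn's `train_test_split`. The data here is made up (the real dataset has 30 rows); `test_size=1/3` mirrors the article's 20/10 proportion, and `random_state` pins the shuffle so the split is reproducible:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up experience/salary pairs standing in for the 30 observations.
X = np.array([[1.1], [2.0], [2.9], [3.2], [4.5], [5.9], [7.1], [8.7], [10.3]])
y = np.array([40000, 45000, 52000, 55000, 62000, 80000, 97000, 108000, 121000])

# One third held out for testing; the rest is used to fit the line.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1/3, random_state=0
)

print(len(X_train), len(X_test))  # 6 3
```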

Now it’s time to make the regression line or regressor:
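Fitting the regressor takes two lines with scikit-learn. The training data below is made up for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up training data: years of experience -> salary.
X_train = np.array([[1.1], [2.0], [3.2], [4.5], [5.9], [7.1]])
y_train = np.array([40000, 45000, 55000, 62000, 80000, 97000])

regressor = LinearRegression()   # ordinary least squares under the hood
regressor.fit(X_train, y_train)  # learns m (coef_) and b (intercept_)

print(regressor.coef_[0], regressor.intercept_)  # the fitted m and b
```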

First, we initialized the linear regression model, and with .fit we fitted it to our training data. Time to visualize what we came up with:

The linear regression line

Looks pretty accurate to me! If we were to compare that line to the test set:

Since these results are pretty accurate, we can now see whether the guy was telling the truth or lying by using the .predict method on our regressor:
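The moment of truth, sketched on the same made-up data. `.predict` takes a 2-D array, so the braggart's 1.5 years goes in as `[[1.5]]`:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up training data for illustration.
X_train = np.array([[1.1], [2.0], [3.2], [4.5], [5.9], [7.1]])
y_train = np.array([40000, 45000, 55000, 62000, 80000, 97000])
regressor = LinearRegression().fit(X_train, y_train)

# The braggart claimed 1.5 years of experience -- what does the model say?
predicted = regressor.predict(np.array([[1.5]]))
print(round(predicted[0]))  # roughly $40,000 -- nowhere near $100k
```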

Okay, we got him. He makes $40,000 a year, not $100k. Next time we see him on the streets, we’ll show him our machine learning model. That’ll show him!

If you coded along, good job. I'll write another tutorial on multiple regression, which is linear regression in more dimensions, where we'll analyze some bitcoin price data. After that, we're going to do some polynomial regression, which lets us fit curves instead of straight lines. Stay tuned, and if you enjoyed the story, clap!
