
Simple Linear Regression with Python

Valentina Alto
Aug 17, 2019

In my previous article, I talked about Simple Linear Regression as a statistical model to predict continuous target values. I also showed the optimization strategy the algorithm employs to compute the regression’s coefficients α and β.

Here, I'm going to provide a practical demonstration of what I've been talking about, using Python and one of its available datasets. Since this is a regression task, I'm going to use the Boston House Price dataset (you can see the copy on UCI ML here). It contains 14 variables: the median value of an owner-occupied home in $1000s (the 14th variable) is the target we want to predict (that is, the price of the house), while the remaining 13 variables will be our predictors (also called features or independent variables).

Let's have a first look at our dataset:

from sklearn import datasets
import pandas as pd

# load the Boston dataset and arrange it into a DataFrame
boston = datasets.load_boston()
boston_df = pd.DataFrame(boston.data, columns=boston.feature_names)
boston_df['target'] = boston.target
boston_df.head()

For the sake of clarity, I’m attaching here the meaning of each feature (you can find this and further information on the official documentation of scikit-learn):

  • CRIM per capita crime rate by town
  • ZN proportion of residential land zoned for lots over 25,000 sq.ft.
  • INDUS proportion of non-retail business acres per town
  • CHAS Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
  • NOX nitric oxides concentration (parts per 10 million)
  • RM average number of rooms per dwelling
  • AGE proportion of owner-occupied units built prior to 1940
  • DIS weighted distances to five Boston employment centers
  • RAD index of accessibility to radial highways
  • TAX full-value property-tax rate per $10,000
  • PTRATIO pupil-teacher ratio by town
  • B 1000(Bk − 0.63)² where Bk is the proportion of blacks by town
  • LSTAT % lower status of the population
  • MEDV Median value of owner-occupied homes in $1000’s

Since we are going to demonstrate Simple Linear Regression, we will use only one of those features, so that we can visualize our results in a 2D graph. But how can we decide which feature contains the greatest amount of information?

One possible approach could be that of Principal Component Analysis (you can read more about this technique here), which creates new variables (as combinations of the original ones) so that the first ones contain the greatest possible amount of information.
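For reference, a minimal sketch of how such an analysis could be run with scikit-learn's PCA (not used in what follows): the features are standardized first, and the explained variance ratio shows how much information each new component carries.

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# standardize the 13 features, then inspect the variance captured by each component
X_scaled = StandardScaler().fit_transform(boston.data)
pca = PCA().fit(X_scaled)
print(pca.explained_variance_ratio_)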

However, here I will use one of the original variables, and I'd like to pick the most representative one. Hence, I'm going to examine the correlation between each feature and the target, and then pick the one with the highest correlation:
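A minimal sketch of that check, assuming the boston_df built above: pandas' corr() returns the Pearson correlation of every column with every other, so we can slice out the target column.

# Pearson correlation of each feature with the target
correlations = boston_df.corr()['target'].drop('target')
print(correlations.sort_values())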


Well, it seems that the feature which most affects the price is LSTAT (the percentage of lower-status population), with a Pearson coefficient of -0.74.

Let’s visualize this relationship graphically:
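One way to draw it with matplotlib (a sketch, assuming the boston_df from above):

import matplotlib.pyplot as plt

# scatter plot of LSTAT against the target (MEDV)
plt.scatter(boston_df['LSTAT'], boston_df['target'])
plt.xlabel('LSTAT (% lower status of the population)')
plt.ylabel('MEDV (median value in $1000s)')
plt.show()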


As you can see, there is an evident negative correlation between the feature and the target.

Now it's time to train our model. To do so, we will use the ML libraries available in Python. Keep in mind that the optimization procedure followed by this model is the same as in statistical Simple Linear Regression: the Ordinary Least Squares method.

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# hold out a test set, using LSTAT as the single feature (split parameters assumed; the original post does not show them)
X_train, X_test, y_train, y_test = train_test_split(boston_df[['LSTAT']], boston_df['target'], random_state=0)
lm = LinearRegression()
lm.fit(X_train, y_train)
y_pred = lm.predict(X_test)

We can visualize our results graphically:
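A sketch of such a plot, assuming the X_test, y_test and y_pred objects from the code above:

import matplotlib.pyplot as plt

# test points together with the fitted regression line
plt.scatter(X_test, y_test, color='grey', label='actual values')
plt.plot(X_test, y_pred, color='blue', linewidth=2, label='fitted line')
plt.xlabel('LSTAT')
plt.ylabel('MEDV')
plt.legend()
plt.show()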


Now it's time to evaluate the performance of our algorithm. To do so, we will use the Mean Squared Error evaluation metric, which is defined as follows:

MSE = (1/n) Σᵢ (yᵢ − ŷᵢ)²

where n is the number of observations, yᵢ the actual values and ŷᵢ the predicted ones.

Basically, it computes the distance between the regression line (that is, the straight blue line plotted above) and each data point, squares it (so that negative errors are not compensated by positive ones, and vice versa), and then averages these squared distances over all observations.
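As a quick sanity check, the formula can be applied directly with numpy (a sketch, assuming the y_test and y_pred from above):

import numpy as np

# mean of the squared residuals
mse_manual = np.mean((np.asarray(y_test) - y_pred) ** 2)
print(mse_manual)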

Furthermore, we will also compute the R Squared metric: it provides an indication of the goodness of fit of a set of predictions to the actual values. In statistical literature, this measure is called the coefficient of determination.

This is a value between 0 and 1, for no fit and perfect fit respectively. It represents the proportion of the variance in the dependent variable that is predictable from the independent variable(s).

If we consider the following quantities:

SS_tot = Σᵢ (yᵢ − ȳ)² (the total sum of squares)
SS_reg = Σᵢ (ŷᵢ − ȳ)² (the explained sum of squares)
SS_res = Σᵢ (yᵢ − ŷᵢ)² (the residual sum of squares)

and knowing that the following relation is true:

SS_tot = SS_reg + SS_res

we can compute our R Squared as follows:

R² = 1 − SS_res / SS_tot = SS_reg / SS_tot
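Translated into code, the same computation might look like this (again a sketch, assuming the y_test and y_pred from above):

import numpy as np

y_true = np.asarray(y_test)
ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
print(1 - ss_res / ss_tot)                      # R Squared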

So let’s compute these two metrics:
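With scikit-learn's built-in helpers, this reduces to two calls (assuming the y_test and y_pred from the fit above):

from sklearn.metrics import mean_squared_error, r2_score

print("MSE: {:.2f}".format(mean_squared_error(y_test, y_pred)))
print("R2: {:.2f}".format(r2_score(y_test, y_pred)))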


Because of the way our coefficients have been computed, we know that the MSE has been minimized. Furthermore, the R Squared coefficient is not bad either: it tells us that our model explains almost half of the total variance. Considering that we are using only one feature out of 13, that is surprisingly good (keep in mind that, in real scenarios, it is hard to find models with an R Squared higher than 50%).

As you can see, we approximated the relationship between one feature and our target with Linear Regression which, despite being the simplest model for regression, gave a decent result.

For the sake of completeness, I'm attaching here the Multiple Linear Regression version of our task, where I'm going to consider all 13 features of our problem:
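A sketch of that fit, reusing the same split-and-train pattern but now with all 13 columns (split parameters assumed, as before):

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# all 13 features as predictors
X = boston_df.drop('target', axis=1)
X_train, X_test, y_train, y_test = train_test_split(X, boston_df['target'], random_state=0)
lm = LinearRegression()
lm.fit(X_train, y_train)
y_pred = lm.predict(X_test)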

# let's evaluate our results
from sklearn.metrics import mean_squared_error, r2_score

print("MSE: {:.2f}".format(mean_squared_error(y_test, y_pred)))
print("R2: {:.2f}".format(r2_score(y_test, y_pred)))

As you can see, with Multiple Linear Regression we improved both our MSE and R Squared. However, keep in mind that making your model far too complex (that is, building it with too many variables) might be counterproductive, leading to overfitting: hence, a good balance between simplicity and accuracy of predictions has to be found.
