
Simple Linear Regression with Python
In my previous article, I talked about Simple Linear Regression as a statistical model to predict continuous target values. I also showed the optimization strategy the algorithm employs to compute the regression’s coefficients α and β.
Here, I’m going to provide a practical illustration of what I’ve been talking about, using Python and one of its readily available datasets. Since this is a regression task, I’m going to use the Boston House Price dataset (you can see the copy on UCI ML here). It contains 14 variables: the median value of an owner-occupied home in $1000s (the 14th variable) is the target we want to predict (that is, the price of the house), while the remaining 13 variables will be our predictors, also called features or independent variables.
Let’s have a first look at our dataset:
import pandas as pd
from sklearn import datasets

boston=datasets.load_boston()
boston_df=pd.DataFrame(boston.data, columns=boston.feature_names)
boston_df['target']=boston.target
boston_df.head()

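Before moving on, a quick check that the DataFrame has the structure described above, that is, 13 feature columns plus the target column:

print(boston_df.shape)  # (number of rows, 14 columns)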
For the sake of clarity, I’m attaching here the meaning of each feature (you can find this and further information on the official documentation of scikit-learn):
- CRIM per capita crime rate by town
- ZN proportion of residential land zoned for lots over 25,000 sq.ft.
- INDUS proportion of non-retail business acres per town
- CHAS Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
- NOX nitric oxides concentration (parts per 10 million)
- RM average number of rooms per dwelling
- AGE proportion of owner-occupied units built prior to 1940
- DIS weighted distances to five Boston employment centers
- RAD index of accessibility to radial highways
- TAX full-value property-tax rate per $10,000
- PTRATIO pupil-teacher ratio by town
- B 1000(Bk − 0.63)² where Bk is the proportion of blacks by town
- LSTAT % lower status of the population
- MEDV Median value of owner-occupied homes in $1000’s
Since we are going to explain the Simple Linear Regression, we are going to use only one of those features, so that we can visualize our results in a 2D graph. However, how can we decide which feature is the one containing the greatest amount of information?
One possible approach is Principal Component Analysis (you can read more about this technique here), which creates new variables (as linear combinations of the original ones) such that the first components capture the greatest possible amount of information.
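For instance, here is a minimal sketch of how one could check how much information (variance) the first principal components would capture, assuming we standardize the 13 features first:

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# standardize the features so that PCA is not dominated by scale differences
X_scaled = StandardScaler().fit_transform(boston_df.drop(columns='target'))

# fit PCA and inspect the share of variance explained by each component
pca = PCA()
pca.fit(X_scaled)
print(pca.explained_variance_ratio_)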
However, here I will use one of the original variables, but I’d like to pick the most representative one. Hence, I’m going to examine the correlation between each feature and the target, then I’ll pick that with the highest correlation:
import seaborn as sns
import matplotlib.pyplot as plt
correlation_matrix = boston_df.corr().round(2)
plt.figure(figsize=(14, 12))
sns.heatmap(data=correlation_matrix, annot=True)

Well, it seems that the feature most strongly correlated with the price is LSTAT (percentage of lower status of the population), with a Pearson coefficient of -0.74.
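Rather than reading the heatmap by eye, we can also rank the features by the absolute value of their correlation with the target; a quick sketch:

# rank features by the strength of their linear relationship with the target
# (the target itself appears first, with correlation 1.0)
print(correlation_matrix['target'].abs().sort_values(ascending=False))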
Let’s visualize this relationship graphically:
plt.scatter(boston_df['LSTAT'], boston_df['target'], color='black')
plt.xlabel('LSTAT')
plt.ylabel('price')

As you can see, there is an evident negative correlation between the feature and the target.
Now it’s time to train our model. To do so, we will use scikit-learn, one of Python’s machine learning libraries. Keep in mind that the optimization procedure followed by this model is the same as in statistical Simple Linear Regression, that is, the Ordinary Least Squares method.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(boston_df[['LSTAT']], boston_df['target'], random_state=0)

from sklearn.linear_model import LinearRegression
lm = LinearRegression()
lm.fit(X_train,y_train)
y_pred = lm.predict(X_test)
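Before plotting, it can be instructive to look at the fitted coefficients and compare them with the closed-form OLS solution for simple linear regression (assuming the usual convention that α is the intercept and β the slope, so that β = cov(x, y) / var(x) and α = ȳ − β·x̄); a minimal sketch:

import numpy as np

# coefficients learned by scikit-learn
print("alpha (intercept):", lm.intercept_)
print("beta (slope):", lm.coef_[0])

# closed-form OLS estimates computed directly on the training data
x = X_train['LSTAT'].values
y = y_train.values
beta = np.cov(x, y)[0, 1] / np.var(x, ddof=1)
alpha = y.mean() - beta * x.mean()
print("alpha (by hand):", alpha)
print("beta (by hand):", beta)

The two pairs of values should coincide, since LinearRegression minimizes the same squared error.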
We can visualize our results graphically:
plt.scatter(X_test, y_test, color='black')
plt.plot(X_test, y_pred, color='blue', linewidth=3)
plt.xticks(())
plt.yticks(())
plt.xlabel('LSTAT')
plt.ylabel('price')
plt.show()

Now it’s time to evaluate the performance of our algorithm. To do so, we will use the Mean Squared Error evaluation metric, which is defined as follows:
MSE = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2
Basically, it computes the difference between each data point and the model’s prediction (that is, the corresponding point on the blue straight line plotted above), squares it (so that negative errors are not compensated by positive ones, and vice versa), and then averages these squared errors over all points.
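As a quick sketch, the same quantity can be computed by hand with NumPy on our test predictions:

import numpy as np

# mean of the squared differences between actual and predicted prices
mse_by_hand = np.mean((y_test.values - y_pred) ** 2)
print(mse_by_hand)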
Furthermore, we will also compute the R Squared metric: it provides an indication of the goodness of fit of a set of predictions to the actual values. In statistical literature, this measure is called the coefficient of determination.
It is a value between 0 and 1, for no fit and perfect fit respectively, and it represents the proportion of the variance in the dependent variable that is predictable from the independent variable(s).
If we consider the following quantities:

SS_{res} = \sum_{i}(y_i - \hat{y}_i)^2, \qquad SS_{tot} = \sum_{i}(y_i - \bar{y})^2, \qquad SS_{reg} = \sum_{i}(\hat{y}_i - \bar{y})^2

and knowing that the following relation is true:

SS_{tot} = SS_{reg} + SS_{res}

we can compute our R Squared as follows:

R^2 = \frac{SS_{reg}}{SS_{tot}} = 1 - \frac{SS_{res}}{SS_{tot}}
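Following these definitions, R Squared can also be computed by hand from our test predictions; a minimal sketch:

import numpy as np

y_true = y_test.values

# residual sum of squares and total sum of squares
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)

# coefficient of determination
r2_by_hand = 1 - ss_res / ss_tot
print(r2_by_hand)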
So let’s compute these two metrics:
from sklearn.metrics import mean_squared_error, r2_score
print("MSE: {:.2f}".format(mean_squared_error(y_test, y_pred)))
print("R2: {:.2f}".format(r2_score(y_test, y_pred)))

Because of the way our coefficients have been computed, we know that the MSE has been minimized on the training data. Furthermore, the R Squared coefficient is not bad either: it tells us that our model explains almost half of the total variance. Considering that we are using only one feature out of 13, that is surprisingly good; keep in mind that, in real scenarios, it is often hard to find models with an R Squared higher than 50%.
As you can see, we approximated the relationship between one feature and our target with Linear Regression which, despite being the simplest regression model, gave a decent result.
For the sake of completeness, I’m attaching here the Multiple Linear Regression version of our task, where I consider all 13 features of our problem:
X_train, X_test, y_train, y_test= train_test_split(boston_df[['CRIM','ZN','INDUS','CHAS','NOX','RM','AGE', 'DIS','RAD','TAX','PTRATIO','B','LSTAT']], boston_df['target'],random_state=0)
from sklearn.linear_model import LinearRegression
lm = LinearRegression()
lm.fit(X_train,y_train)
y_pred = lm.predict(X_test)

# let's evaluate our results
from sklearn.metrics import mean_squared_error, r2_score
print("MSE: {:.2f}".format(mean_squared_error(y_test, y_pred)))
print("R2: {:.2f}".format(r2_score(y_test, y_pred)))

As you can see, with Multiple Linear Regression we improved both our MSE and our R Squared. However, keep in mind that making your model too complex (that is, building it with too many variables) might be counterproductive and lead to overfitting: hence, a good balance between the simplicity of the model and the accuracy of its predictions has to be found.
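One simple way to keep an eye on this is to compare the score on the training set with the score on the test set: if the former is much higher than the latter, the model is probably overfitting. A minimal sketch, using the fact that LinearRegression’s score method returns the R Squared:

# compare R Squared on training and test data to spot overfitting
print("R2 on training set: {:.2f}".format(lm.score(X_train, y_train)))
print("R2 on test set: {:.2f}".format(lm.score(X_test, y_test)))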