Predicting Boston Housing Prices : Step-by-step Linear Regression tutorial from scratch in Python

9 min readJul 28, 2019

“Artificial Intelligence, deep learning, machine learning — whatever you’re doing if you don’t understand it — learn it. Because otherwise you’re going to be a dinosaur within 3 years.”, said Marc Cuban (head of the NBA).

https://www.pexels.com/search/machine%20learning/

Machine Learning provides systems the ability to automatically learn and improve from experience without being explicitly programmed.
In this first blog of my Machine Learning Series, I will focus on a simple but still fundamental model : Linear Regression.
While many libraries are available and optimized to speed up both optimization and development, we will go through each steps to implement it in Python from scratch using Numpy/Pandas.
In a second part, we will compare our implementation with another one involving Scikit-learn library. If you are not familiar with Python yet, I recommend this tutorial to get you on track : https://www.learnpython.org/

The full code source is available in my github repository.

Outline :
1. Linear Regression explained
2. Exploratory Analysis and Preprocessing
3. Numpy/Pandas Implementation
4. Scikit-learn Implementation
5. Conclusion

1. Linear Regression explained

The objective of Linear regression is to model a linear relationship between explanatory variables/features (independant variables) and a target (dependant variable). This is usually represented by the following equation :

Y is the predicted variable or target
w0 is the intercept or bias term
w1, …, wn are our model parameters
x1, …, xn are the features

Or, in matrix form :

W : parameter vector with intercept
X : feature vector

2.1 Exploratory Analysis

We start as usual by importing necessary libraries and loading the data set. Through this tutorial we will use the Boston housing data set, integrated in scikit-learn library. It is also available in Kaggle.

%matplotlib inline
import numpy as np
import pandas as pd
import seaborn as sns
sns.set(rc={'figure.figsize':(15,10)})
import matplotlib.pyplot as plt
from sklearn.datasets import load_boston
import datetime
import mathboston = load_boston()
boston_dataset = pd.DataFrame(boston.data, columns=boston.feature_names)
boston_dataset['MEDV'] =  boston.target

Here is the list of all features in the data set and their description.

Based on the first 13 features, we want to find a parameter vector W to predict our target variable Y, i.e, “mdev” which will represent the prices.
Now, you may wonder “Are all these features relevant for our model ? “.
Not necessarily.
Actually, Linear regression should satisfy the Ordinary Least Squares Assumptions (OLS) .
Among them, one states that the independant variables (features) must be uncorrelated with each other.
Furthermore, for our model to be effective, it is better to only include features which can explain the predicted variable (features highly correlated with the target), otherwise they are just as good as adding noise in the model.

Hence, thanks to a correlation matrix we decide to keep only one feature for each highly correlated pairs of features.
We also filter out those which don’t have enough explanatory power (not highly correlated with the target).

corr_matrix = boston_dataset.corr().round(2)
sns.heatmap(data=corr_matrix, annot=True, cmap="BrBG")
plt.show()#Relevant features are features highly correlated with our target variable
relevant_features = abs(corr_matrix["MEDV"])
relevant_features = relevant_features[relevant_features > 0.5]#Because of Least Squares Assumptions (LSA), the independant variables (features) must be uncorrelated with each other
boston_dataset[["RM", "PTRATIO", "LSTAT"]].corr().round(2)

RM, PTRATIO and LSTAT are highly correlated with our target MDEV.
However, RM and LSTAT are highly correlated with each other, hence, we should select only one of them.
We choose to take the one which is more correlated with the target variable.
Finally, we end up with LSTAT and PTRATIO as our set of features.

X = boston_dataset[['PTRATIO', 'LSTAT']]
y = boston_dataset.MEDV

Note : we chosed an arbitrary threshold of 0.5 (in absolute value) to make the decision between highly correlated or not.

2.2 Preprocessing

When given a dataset, best practices in Machine Learning are to split the data set in training set (bigger one) and test set (smaller one).
The first one is used to train/fit our model to the data, while the second will be used as a test to determine how well our model behave on new/unseen data.
We choose to split them into 90% for the training set and 10% for the test set.
Other methods will be covered in another post.

def split_data(x, y, ratio, seed=1):
    """split the dataset based on the split ratio."""
    
    # set seed to have reproducible/consistent results
    np.random.seed(seed)
    
    # generate random indices
    num_row = len(y)
    indices = np.random.permutation(num_row)
    index_split = int(np.floor(ratio * num_row))
    index_tr = indices[: index_split]
    index_te = indices[index_split:]
    
    # create split
    x_tr = x.iloc[index_tr]
    x_te = x.iloc[index_te]
    y_tr = y.iloc[index_tr]
    y_te = y.iloc[index_te]    return x_tr, x_te, y_tr, y_te

Each subset is then pre-processed independantly of the other. For each of them, we normalize it by removing the mean and dividing the resulting set by the standard deviation in each dimension.
Finally, we add a column filled of ones to include the bias term (or intercept).

def standardize(x):
    """Standardize the original data set."""
    return (x - x.mean(axis=0))/ x.std(axis=0)def build_model_data(x):
    """Form tX to get regression data in matrix form. tX is X with intercept"""
    num_samples = y.shape[0]
    x["INTERCEPT"] = pd.Series(np.ones(num_samples))
    return x

As a summary, the data set is pre-processed as follow:

def prepare_model(X, y, r, s) :
    """Apply all previous helper functions to preprocess data"""
    
    X_train, X_test, y_train, y_test = split_data(X, y, r, seed=s)
    
    X_train = standardize(X_train)
    X_train = build_model_data(X_train)
    
    X_test = standardize(X_test)
    X_test = build_model_data(X_test)
    
    return X_train, X_test, y_train, y_test

3. Pandas/Numpy implementation

We will use a mean-square-error (MSE) loss function, defined below, as measure of goodness/fit for our model. It tells us how well our model is at making predictions given a set of parameters/coefficients.
We aim at minimizing this cost by finding the optimal W*.

Optimization algorithm

In order to find the optimal set of parameters, we will use gradient descent (GD), an optimization algorithm used to minimize some function by iteratively moving in the negative direction of the gradient.
At each step, we compute the gradient of the cost function with the current weights/coefficients and take a step in the opposite direction specified.
We repeat this process until convergence.

γ is the learning rate. It specifies the size of the step taken in the direction of the negative gradient.
The higher the step, the faster our algorithm converges, however, we face the risk of missing the minimum by going “too far”.
Lower step-sizes are more confident but are more time-consuming.
The learning rate is called an hyperparameter and can be tuned.
In this tutorial, we choose an usual value of 0.7 for the learning rate without hyperparameter tuning process.

Gradient Descent : update rule

Please find below the partial derivatives to obtain the gradient of the loss function with respect to the set of parameters W.

Gradient derivation : partial derivatives

Now that we derived the maths, we are ready to implement Gradient Descent. We use efficient built-in numpy/pandas operations to speed up computation time.

def mse(e):
    """Compute the mse for the error vector e."""
    return 1/2*np.mean(e**2)def compute_gradient(y, tx, w):
    """Compute the gradient."""
    err = y - tx.dot(w)
    grad = -tx.T.dot(err) / len(err)
    return grad, errdef gradient_descent(y, tx, initial_w, max_iters, gamma, epsilon = 10e-5):
    """Gradient descent algorithm."""
    # Define parameters to store w and loss
    ws = [initial_w]
    losses = []
    w = initial_w
    prev = math.inf
    for n_iter in range(max_iters):
        # compute loss, gradient and rmse(actual loss)
        grad, err = compute_gradient(y, tx, w) 
        loss = np.sqrt(2 * mse(err))
        # gradient w by descent update
        w = w - gamma * grad
        # store w and loss
        ws.append(w)
        losses.append(loss)
        #Stop earlier if we reached convergence
        if(abs(loss - prev) < epsilon) :
            print("Reached Convergence !")
            break
        prev = loss
        print("Gradient Descent({bi}/{ti}): loss={l}, w0={w0}, w1={w1}, w2={w2}".format(
              bi=n_iter, ti=max_iters - 1, l=loss, w0=w[0], w1=w[1], w2=w[2]))return losses, ws

Let’s run our algorithm for a maximum of 50 iterations of Gradient Descent. We use an epsilon value of 10e-5 to stop the algorithm if we reach convergence before the end.

# Define the parameters of the algorithm.
max_iters = 50
gamma = 0.7# Initialization
w_initial = np.array([0, 0, 0])# Start gradient descent.
start_time = datetime.datetime.now()
gradient_losses, gradient_ws = gradient_descent(y_train, X_train, w_initial, max_iters, gamma)
end_time = datetime.datetime.now()#Get final weights
params = gradient_ws[-1]#Model is y = w2 + w0*PTRATIO + w1*LSTAT
pred = params[2] + params[0]*X_test.PTRATIO + params[1]*X_test.LSTAT

After convergence, our model should have learned optimal parameters to make quite good prediction on new/unseen data.
Let’s see how our model perform !

#Evaluate performanceexection_time_np = (end_time - start_time).total_seconds()
print("Gradient Descent: execution time={t:.3f} seconds".format(t=exection_time_np))
train_loss_np = gradient_losses[-1]
print("Train loss : {}".format(train_loss_np))
test_loss_np = np.sqrt(2 * mse(y_test - pred))
print("Test loss : {}".format(test_loss_np))OUTPUT :
Gradient Descent: execution time=0.036 seconds
Train loss : 5.882134247497131
Test loss : 5.146806236501715

Note : We use here the RMSE loss instead of the MSE.

RMSE loss function

4. Scikit-Learn Implementation

Let’s see now how well an optimized version using scikit-lean library performs in comparison to our implementation from scratch.

We start by defining a Machine Learning model, i.e, Linear Regression.
The method fit is responsible to train the model and find optimal parameters for the weights.
Predict, as its name suggest, aims at making predictions given a set of features for a given model.

from sklearn.linear_model import LinearRegressionmodel = LinearRegression()
start_time = datetime.datetime.now()
model.fit(X_train, y_train)
end_time = datetime.datetime.now()exection_time_sk = (end_time - start_time).total_seconds()
print("Gradient Descent: execution time={t:.3f} seconds".format(t=exection_time_sk))
train_loss_sk = np.sqrt(2 * mse(y_train - model.predict(X_train)))
print("Train loss : {}".format(train_loss_sk))
test_loss_sk = np.sqrt(2 * mse(y_test - model.predict(X_test)))
print("Test loss : {}".format(test_loss_sk))Gradient Descent: execution time=0.012 seconds
Train loss : 5.882101189744941
Test loss : 5.149021510727107

Finally, we got comparable results in terms of performance on train and test set with respect to the RMSE .
Actually, scikit-learn source code uses Least squares method along with other optimization to make its training slightly more efficient. (training in 0.012 seconds with their implementation compared to ours with 0.036 seconds).
Furthermore, their implementation leads to a very efficient and compact code, as just 3 lines are needed to train and test.

If you are interested in their implementation, feel free to check it out here.

5. Conclusion

In this tutorial, we learned how to mimic known libraries such as scikit-learn to run a linear regression model and compared our implementation from scratch with theirs. We achieved comparable results regarding the goodness of fit, even though their implementation is more efficient in terms of computing time and code simplicity.

In the next blog of this Series, we will learn how to implement Logistic Regression for classification. Similarly, we will use a step-by-step approach implemented in pandas/numpy and compare perfomance with scikit-learn.

Thanks for reading ! Please make sure to like/share this blog and comment your reading experience.

Have any questions ? Feel free to contact me via Linkedin or follow me on Github for other interesting projects.