House Price Prediction using Linear Regression from Scratch

Tanvi Penumudy · Published in Analytics Vidhya · Jan 11, 2021

Today, let’s try solving the classic house price prediction problem using the Linear Regression algorithm, implemented from scratch.

For more on Linear Regression, do not forget to check out my previous blog — Everything You Need to Know About Linear Regression

We’ll be needing the House Prices — Advanced Regression Techniques dataset to do this job.

Image Source: Ritu Yadav (GitHub)

Make sure to have a look at the data description as stated in the GitHub Gist below:

Importing Kaggle Dataset via Kaggle Temporary Token (On Google Colab):

For more on Google Colab — A Beginner’s Guide for Getting Started with Machine Learning

from google.colab import files
# Upload the Kaggle API token (kaggle.json) downloaded from your Kaggle account to your local device
files.upload()
Out:
Saving kaggle.json to kaggle.json
{'kaggle.json': b'{"username":"xxx","key":"yyy"}'}

Download your dataset:

!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json
!kaggle competitions download -c house-prices-advanced-regression-techniques

Check if your files are downloaded:

!ls
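
Depending on your Kaggle CLI version, the competition files may land as a single zip archive rather than loose CSVs. If so, extract it before reading the data (the filename below is assumed from the competition slug):

!unzip -o house-prices-advanced-regression-techniques.zip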

Import libraries:

import pandas as pd 
import numpy as np
from sklearn.preprocessing import MinMaxScaler, LabelEncoder

Read the dataset:

data = pd.read_csv("train.csv")
data
data.describe()

Shuffle the data:

data = data.sample(frac=1).reset_index(drop=True)
data

Check for NaN values:

data.isna().sum()

Checking the NaN count for individual columns:

print(data['Alley'].isna().sum())
print(data['FireplaceQu'].isna().sum())
print(data['PoolQC'].isna().sum())
print(data['Fence'].isna().sum())
print(data['MiscFeature'].isna().sum())
Out:
1369
690
1453
1179
1406

data.shape
Out:
(1460, 81)

You can also choose to drop the columns with excess NaN values, or impute them:

#data.drop(['Alley', 'FireplaceQu', 'PoolQC', 'Fence', 'MiscFeature'], axis = 1, inplace=True)

#data.drop(['LotFrontage'], axis=1, inplace=True)

Dropping Id, since it isn’t of any significance:

data.drop(['Id'], axis=1, inplace=True)
data.shape
Out:
(1460, 80)

Imputing Missing Values:

# Imputing Missing Values
from sklearn.base import TransformerMixin
class DataFrameImputer(TransformerMixin):

    def __init__(self):
        """Columns of dtype object are imputed with the most frequent value
        in the column. Columns of other types are imputed with the mean of
        the column."""

    def fit(self, X, y=None):
        self.fill = pd.Series(
            [X[c].value_counts().index[0] if X[c].dtype == np.dtype('O')
             else X[c].mean() for c in X],
            index=X.columns)
        return self

    def transform(self, X, y=None):
        return X.fillna(self.fill)
X = pd.DataFrame(data)
data = DataFrameImputer().fit_transform(X)
data.isna().sum()

Label Encoding the Categorical Variables:

LE = LabelEncoder()
CateList = data.select_dtypes(include="object").columns
print(CateList)
data.head()
for i in CateList:
    data[i] = LE.fit_transform(data[i])
data.head()

Scale the values using MinMaxScaler or StandardScaler:

#from sklearn.preprocessing import StandardScaler
df = data.iloc[:,:-1]
mm = MinMaxScaler()
df[:]= mm.fit_transform(df[:])
df.head()
X = df.values
y = data['SalePrice'].values
X_shape = X.shape
X_type = type(X)
y_shape = y.shape
y_type = type(y)
print(f'X: Type-{X_type}, Shape-{X_shape}')
print(f'y: Type-{y_type}, Shape-{y_shape}')
Out:
X: Type-<class 'numpy.ndarray'>, Shape-(1460, 79)
y: Type-<class 'numpy.ndarray'>, Shape-(1460,)
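
If you would rather standardize the features (zero mean, unit variance) instead of squashing them into [0, 1], a minimal alternative sketch using StandardScaler would look like this; it is not part of the original walkthrough, and you would use it in place of, not in addition to, the MinMaxScaler step above:

from sklearn.preprocessing import StandardScaler

ss = StandardScaler()
df_std = data.iloc[:, :-1].copy()
df_std[:] = ss.fit_transform(df_std)
X = df_std.values  # swap these features in for the MinMax-scaled ones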

Splitting our data into Training and Testing data:

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=123)
print(X_train.shape, X_test.shape)
print(y_train.shape, y_test.shape)
Out:
(1095, 79) (365, 79)
(1095,) (365,)

Writing our predict function that returns our hypothesis:

def predict(X, weights):
    y_pred = np.dot(X, weights)
    assert (y_pred.shape == (X.shape[0], 1))
    return y_pred
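
For reference, the hypothesis this function returns is just the matrix product of the feature matrix and the weight vector (with m examples and n features):

\hat{y} = Xw, \quad X \in \mathbb{R}^{m \times n}, \; w \in \mathbb{R}^{n \times 1}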

Defining a mean_squared_error function that returns the loss/cost value for the given predictions:

Loss Function: When you’re only considering a single training example.

Cost Function: When you’re considering the entire batch/mini-batch.
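
Concretely, the cost that the function below computes over all m training examples is the (half) mean squared error:

J(w) = \frac{1}{2m} \sum_{i=1}^{m} (y_i - \hat{y}_i)^2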

def mean_squared_error(y_true, y_pred):
    # cost = (1 / 2m) * sum of squared errors
    loss = (1 / (2 * y_true.shape[0])) * np.sum((y_true - y_pred) ** 2)
    return loss

Defining our gradient function (the gradient vector is initialized to zeros):

def gradient(X, y_true, y_pred):
    grad = np.zeros((X.shape[1], 1))
    diff = y_pred - y_true
    for i in range(X.shape[1]):
        # partial derivative of the cost with respect to the i-th weight
        grad[i][0] = (2 / X.shape[0]) * np.sum(np.dot(X[:, i], diff))
    return grad
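
For each weight, the loop above computes the partial derivative

\frac{\partial J}{\partial w_j} = \frac{2}{m} \sum_{i=1}^{m} x_{ij} (\hat{y}_i - y_i)

The 2/m factor corresponds to a plain mean-squared-error objective; with the 1/(2m) cost defined above it would be 1/m, but the difference is only a constant scaling that gets absorbed by the learning rate.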

Defining our gradient descent function (initializing our weights to random numbers — can also be initialized to 0):

def gradient_descent(X, y, learning_rate=0.01, max_iterations=100):

    weights = np.random.rand(X.shape[1], 1)
    losses = []

    y_true = y.reshape(-1, 1)
    for i in range(max_iterations):
        y_pred = predict(X, weights)
        losses.append(mean_squared_error(y_true, y_pred))
        grad = gradient(X, y_true, y_pred)

        # update each weight by stepping against its gradient
        for j in range(X.shape[1]):
            weights[j][0] = weights[j][0] - learning_rate * grad[j][0]

    return weights, losses

Let’s see the optimal weights that our model has learnt on our training data:

optimal_weights, losses = gradient_descent(X_train, y_train, 0.001, 200)

Tune your hyperparameters, alpha (the learning rate) and max_iterations, to see their impact on accuracy; a small tuning sketch follows the output below.

print("Root mean-squared error:", losses[-1]**(1/2))Out:
Root mean-squared error: 5116.780901005974
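
As a rough sketch of the tuning suggested above (the grid values here are arbitrary, not from the original post), you could loop over a few learning rates and iteration counts and compare the final training RMSE:

for lr in [0.0005, 0.001, 0.005]:
    for iters in [100, 200, 500]:
        _, tune_losses = gradient_descent(X_train, y_train, lr, iters)
        print(f"lr={lr}, iterations={iters}, final RMSE={tune_losses[-1]**0.5:.2f}")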

Checking if your gradient descent is actually working (if the losses are decreasing):

for i in range(len(losses)):
    print(losses[i]**(1/2))
………

As you can see, our losses are continuously decreasing with each successive iteration, meaning our gradient descent is working fine!

Seeing our predictions on our training data:

train_pred = predict(X_train, optimal_weights)
train_pred
Out:
array([[191931.99404727],
[172596.37961212],
[211588.65076176],
...,
[204967.81704443],
[166346.22710897],
[192415.28850303]])
#actual values
y_train
Out:
array([196000, 147500, 253293, ..., 235000, 167500, 250000])

Running inference with our trained model on the testing set:

test_pred = predict(X_test, optimal_weights)
test_pred
Out:
array([[162556.88304305],
[195496.11789844],
[180261.96458508],
...,
[163731.79340151],
[201160.3300839 ],
[175285.52648482]])
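
To get a feel for how well the scratch model generalizes, you can also compute a plain RMSE on the test set (a quick check that is not in the original walkthrough):

test_rmse = np.sqrt(np.mean((y_test.reshape(-1, 1) - test_pred) ** 2))
print("Test RMSE:", test_rmse)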

Plotting our loss curve:

import matplotlib.pyplot as plt
plt.plot([i for i in range(len(losses))], losses)
plt.title("Loss curve")
plt.xlabel("Iteration num")
plt.ylabel("Loss")
plt.show()

You can compare against Sklearn’s built-in LinearRegression on the same problem:

from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

model = LinearRegression().fit(X_train, y_train)
pred = model.predict(X_train)
r2_score(y_train, pred)
Out:
0.8540160463212708
pred2 = model.predict(X_test)
r2_score(y_test, pred2)
Out:
0.8431209747429238

You can evaluate your models on different Regression Metrics such as MSE, RMSE, MAE, R2 Score etc.
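
For instance, a few of these metrics applied to the sklearn model’s test predictions might look like the sketch below; sklearn’s mean_squared_error is imported under an alias so it does not clash with the function we wrote from scratch:

from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.metrics import mean_squared_error as sk_mse

print("MAE :", mean_absolute_error(y_test, pred2))
print("MSE :", sk_mse(y_test, pred2))
print("RMSE:", sk_mse(y_test, pred2) ** 0.5)
print("R2  :", r2_score(y_test, pred2))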

For more on this, do check out my previous blog — Everything You Need to Know About Linear Regression

For complete code, do check my GitHub Repository —

Do not forget to CLAP and FOLLOW if you’ve enjoyed this article :)
