Simple Linear Regression: Kaggle House Prices Prediction

Nazira Shaikh
Mar 15, 2019 · 6 min read


Image source: activerain.com

Introduction

If you are interested in building predictive models but are confused about where to start, this article will help you build a Simple Linear Regression model in Python. I have used the House Prices competition dataset available on Kaggle. If you are new to the field of data science like me, then Kaggle is a good place to start. Here you can:

  • Build predictive models
  • Compete with other participants
  • Learn new insights on various kinds of data.

The flow for building any predictive model (not only linear regression), and let me repeat, “any predictive model”, is the same. It can be divided into two steps:

  1. Exploratory Data analysis (EDA)
  2. Model Building

Exploratory Data Analysis: Before building a model we should make sure that the data we are about to model contains only numeric columns. We should also get insights into the nature of each feature (i.e. column), and much more (it’s a continuous learning process).

If this does not make sense to you right now, don’t worry; just understand that every column of the data should be made numeric (this conversion is part of data cleaning).
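For example, once the data is loaded you can quickly see how many columns are numeric and how many are not. A minimal sketch, assuming the training file is named House_prices_train.csv as in the code further below:

# Quick check of column types (a sketch)
import pandas as pd

df = pd.read_csv("House_prices_train.csv")
print(df.dtypes.value_counts())              # how many columns of each dtype
print(df.select_dtypes(['object']).columns)  # the non-numeric columns that need cleaning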

Model Building: Once the data is clean we can build a model on it. This model can be:

  • Binary class Classification model (when the target is binary, i.e having only two classes)
  • Multi-class Classification model (when the target has multiple classes)
  • Regression model (when the target is continuous numeric variable)

In our case we are asked to predict the “SalePrice” of the house, a continuous numeric variable, so we will be building a Regression model.

Reading the Data:

We are provided with two data sets in CSV format: one for training the model (“train.csv”) and one for validating the model (“test.csv”).

# Importing the essential modules/packages

import numpy as np
import pandas as pd
hp_train = pd.read_csv("House_prices_train.csv")
hp_test = pd.read_csv("House_prices_test.csv")
# Sneak peek of the train data
print("Shape of training data: ",hp_train.shape)
hp_train.head()
# Sneak peek of the test data
print("Shape of test data: ",hp_test.shape)
hp_test.head()

Observe that the test data has one column fewer than the train data; this is the target column we have to predict.

Let’s quickly identify which column is missing from the test data:

hp_train.columns.difference(hp_test.columns)

Data Cleaning:

Let’s combine the train and test data before cleaning. While doing this we will:

  1. Insert a new column named ‘data’, so that after data cleaning we can separate the two sets exactly as they were given.
  2. Insert a target column into the test dataframe and fill it with NaNs; this ensures that both data sets have the same columns.
# Combining train and test data
hp_train["data"] = 'train'
hp_test['data'] = 'test'
hp_test['SalePrice'] = np.nan
hp_test = hp_test[hp_train.columns]
hp = pd.concat([hp_train,hp_test],axis = 0)
print("Shape of combined Data: ",hp.shape)
hp.tail()

Here we go. We have combined the data sets, but is the data clean?

cat_cols = hp.select_dtypes(['object']).columns
list(zip(hp[cat_cols].columns,hp[cat_cols].dtypes,hp[cat_cols].nunique()))

There are quite a few non-numerical columns (categorical columns). We can create dummies for each categorical column using the ‘pandas.get_dummies’ function. Before that, let’s also check whether they have any missing values.

# Lets check the missing value percentage for categorical column
(hp[cat_cols].isnull().sum())*100/hp.shape[0]

Here I have imputed all the missing values in each categorical column with its mode, and in some places created a new ‘unknown’ category, depending on the details given in the data description for each column. For the detailed code click here. Similarly, I have imputed all the missing values in each numeric column with its mean.
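A rough sketch of this imputation idea looks like the following (the columns receiving an explicit ‘unknown’ category are only illustrative examples of features where the data description says a missing value means the feature is absent):

# Sketch of the imputation logic (illustrative, not the exact notebook code)
unknown_cols = ['Alley','Fence','PoolQC','MiscFeature']     # illustrative list only
for col in cat_cols:
    if col == 'data':
        continue
    if col in unknown_cols:
        hp[col] = hp[col].fillna('unknown')                 # NaN means the feature is absent
    else:
        hp[col] = hp[col].fillna(hp[col].mode()[0])         # impute with the most frequent value

# Numeric columns: impute missing values with the column mean (target left untouched)
num_cols = hp.select_dtypes(include=[np.number]).columns.drop('SalePrice')
hp[num_cols] = hp[num_cols].fillna(hp[num_cols].mean())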

The categorical columns are still non-numerical, but now have no missing values :) So let’s convert all of the categorical columns to numerical dummy columns using the ‘pandas.get_dummies’ function and remove the original categorical columns.

# Creating Dummies (cat_cols[:-1] leaves out the 'data' flag column)
dummy_matrix = pd.get_dummies(hp[cat_cols[:-1]])
# Removing original categorical cols
hp.drop(cat_cols[:-1],axis = 1,inplace = True)
hp.shape
print("Shape of dummy matrix: ",dummy_matrix.shape)
dummy_matrix.head()

Let’s concatenate this dummy matrix with our data.

# Concatenating dummies with data
hp = pd.concat([hp,dummy_matrix],axis = 1)
del dummy_matrix
hp.shape
cat_cols = hp.select_dtypes(['object']).columns
list(zip(hp[cat_cols].columns,hp[cat_cols].dtypes,hp[cat_cols].nunique()))

Voila! No more categorical columns left, except ‘data’ which we are going to drop soon.

Let’s split our clean train and test data

hp_train = hp[hp["data"]=='train'].copy()
hp_test = hp[hp["data"]=='test'].copy()
del hp_train['data']
hp_test.drop(['data','SalePrice'],axis = 1,inplace = True)
hp_train.shape,hp_test.shape

Model Building:

The test data does not contain the target variable, so if we build the model on the entire training data, how would we know how far our predictions are from the actual values? To get a tentative measure of performance, let’s first work with the training data alone by splitting it into two parts. I am using an 80%–20% split; you can go with 90%–10% or 70%–30%.

from sklearn.model_selection import train_test_split
hp_train_train,hp_train_test = train_test_split(hp_train,test_size = 0.2,random_state = 2)

Now let’s separate the predictor variables (independent variables) from the response variable (target variable).

# Separating the predictors and response
x_train = hp_train_train.drop("SalePrice",axis = 1)
y_train = hp_train_train['SalePrice']

Time to fit the model. We are building the most basic model (i.e. a linear regression model), which is available in the ‘sklearn.linear_model’ module.

from sklearn.linear_model import LinearRegression
lm = LinearRegression()
lm.fit(x_train,y_train)

That’s it! We are done with model building, but what does this model tell us?

lm.intercept_
list(zip(x_train.columns,lm.coef_))

This will give us a huge list showing the dependency of the target on each predictor variable. We have built the model on 80% of the data; now let us predict the target for the remaining 20%.
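Before running the prediction on the held-out 20%, it can be helpful to sort these coefficients by absolute value to see which predictors the model weighs most heavily. A small sketch built on the fitted lm above:

# Inspecting the coefficients with the largest absolute values (a sketch)
coefs = pd.Series(lm.coef_, index=x_train.columns)
print(coefs.reindex(coefs.abs().sort_values(ascending=False).index).head(10))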

x_test = hp_train_test.drop("SalePrice",axis = 1)
y_test = hp_train_test['SalePrice']
p_test = lm.predict(x_test)

p_test holds the predicted values.

Tentative Performance Measure:

from sklearn.metrics import mean_absolute_error,r2_score
print("Linear_regression_mean_absolute_error:",mean_absolute_error(y_test,p_test))
residual = p_test - y_test
rmse = np.sqrt(np.dot(residual,residual)/len(p_test))
print("Root_mean_square_error : ",rmse)
r_square = r2_score(y_test,p_test)
print("R-Square of the model : ",r_square)
Output:
Linear_regression_mean_absolute_error: 19251.392477943875
Root_mean_square_error : 37607.52732126701
R-Square of the model : 0.786428655695136
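As a quick sanity check, R-square is simply one minus the ratio of the residual sum of squares to the total sum of squares, i.e. the fraction of the target’s variance that the model explains. A small sketch using the arrays computed above:

# R-square = 1 - SS_residual / SS_total
ss_res = np.sum((y_test - p_test)**2)
ss_tot = np.sum((y_test - y_test.mean())**2)
print("Manual R-Square:", 1 - ss_res/ss_tot)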

We now have a rough idea of how our model is behaving: an R-square of 0.786 says that the available predictor variables explain about 78.6% of the variation in the target. Now we can build the model on the entire training data and predict for the test data.

# Now let us build the model on entire training data
x_train = hp_train.drop("SalePrice",axis = 1)
y_train = hp_train["SalePrice"]
lm.fit(x_train,y_train)
p_test = lm.predict(hp_test)
p_test = pd.DataFrame(p_test)
p_test.to_csv("Simple_Linear_regression.csv",index= False)
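One caveat: the submission file for this competition is expected to contain two columns, Id and SalePrice. A minimal sketch of writing the predictions in that format, assuming the Id column is still present in hp_test:

# Writing the submission in the Id / SalePrice format Kaggle expects (a sketch)
submission = pd.DataFrame({"Id": hp_test["Id"], "SalePrice": lm.predict(hp_test)})
submission.to_csv("Simple_Linear_regression.csv", index=False)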

This gave me a score of 0.2009 and a rank of 3389 on the Kaggle leaderboard.

In my next article we will discuss improving this model further. This improvement helped me take my rank from 3389 to 1299.


Nazira Shaikh

Data Science Enthusiast | Post-Graduate (Electronics and Telecommunication)