Applying Multiple Linear Regression in house price prediction

Antony Christopher · Published in Analytics Vidhya · Jan 7, 2021 · 7 min read

Multiple linear regression refers to a statistical technique that is used to predict the outcome of a variable based on the value of two or more variables. It is sometimes known simply as multiple regression, and it is an extension of linear regression. The variable that we want to predict is known as the dependent variable, while the variables we use to predict the value of the dependent variable are known as independent or explanatory variables.

Multiple linear regression is used to estimate the relationship between two or more independent variables and one dependent variable. You can use multiple linear regression when you want to know:

  1. How strong the relationship is between two or more independent variables and one dependent variable (e.g. how rainfall, temperature, and amount of fertilizer added affect crop growth).
  2. The value of the dependent variable at a certain value of the independent variables (e.g. the expected yield of a crop at certain levels of rainfall, temperature, and fertilizer addition).

Multiple Linear Regression Formula

yi = β0 + β1xi1 + β2xi2 + … + βpxip + ϵ

Where:

  • yi is the dependent (predicted) variable
  • β0 is the y-intercept, i.e., the value of y when all the independent variables are 0
  • β1 and β2 are the regression coefficients that represent the change in y for a one-unit change in xi1 and xi2, respectively
  • βp is the slope coefficient of the p-th independent variable, xip
  • ϵ is the model’s random error (residual) term
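As a toy illustration of this formula and of the crop example above, here is a tiny multiple regression fitted with scikit-learn. The numbers are made up purely for illustration and have nothing to do with the housing dataset used below.

# Toy sketch of multiple linear regression (made-up crop numbers, illustrative only)
import numpy as np
from sklearn.linear_model import LinearRegression
# columns: rainfall, temperature, fertilizer
X_toy = np.array([[100, 20, 5], [120, 22, 6], [90, 19, 4], [130, 24, 7], [110, 21, 5]])
y_toy = np.array([3.1, 3.6, 2.8, 4.0, 3.3])  # crop yield
toy_model = LinearRegression().fit(X_toy, y_toy)
print(toy_model.intercept_, toy_model.coef_)  # β0 and β1, β2, β3
print(toy_model.predict([[115, 23, 6]]))      # expected yield at given levels of the inputs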

Assumptions of Multiple Linear Regression

Multiple linear regression is based on the following assumptions (a quick way to check them in code is sketched after this list):

1. A linear relationship between the dependent and independent variables

2. The independent variables are not highly correlated with each other

3. The variance of the residuals is constant

4. Independence of observations
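As a rough, illustrative sketch of how these checks might look in code (assuming a numeric predictor DataFrame named X and a fitted statsmodels OLS result named model; both names are placeholders here, not objects from this post):

# Rough, illustrative assumption checks (X and model are assumed names)
import seaborn as sns
from matplotlib import pyplot as plot
from statsmodels.stats.stattools import durbin_watson
# Assumption 2: the predictors should not be highly correlated with each other
sns.heatmap(X.corr(), annot=True)
plot.show()
# Assumption 3: residuals vs fitted values should show no funnel shape (constant variance)
plot.scatter(model.fittedvalues, model.resid)
plot.xlabel('Fitted values')
plot.ylabel('Residuals')
plot.show()
# Assumption 4: a Durbin-Watson statistic near 2 suggests independent residuals
print(durbin_watson(model.resid))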

Going forward, let’s see how to implement house price prediction using the multiple linear regression algorithm.

Import libraries

# Import numpy and pandas packages
import pandas as pd
import numpy as np

# Data visualization
from matplotlib import pyplot as plot
import seaborn as sns

# Statistical modelling
import statsmodels.api as sm

Reading the dataset

data = pd.read_csv(r'Housing.csv')

Data Inspection

data.head(5)     # display the first 5 records
data.info()      # datatype and non-null count of each column
data.describe()  # descriptive statistics for the numeric columns
data.shape       # total number of rows & columns (shape is an attribute, not a method)

Data Cleaning

Check whether any null values are present in the dataset:

data.isnull().sum()
Null check

There are no null values in the dataset, so there is no need to impute or replace any values.

Detect Outliers

Outliers are extreme values that fall a long way outside of the other observations.

Create a separate function to detect outliers in the dataset, using boxplots from the Seaborn library.

def detectOutliers():
    fig, axs = plot.subplots(2, 3, figsize=(10, 5))
    plt1 = sns.boxplot(data['price'], ax=axs[0, 0])
    plt2 = sns.boxplot(data['area'], ax=axs[0, 1])
    plt3 = sns.boxplot(data['bedrooms'], ax=axs[0, 2])
    plt1 = sns.boxplot(data['bathrooms'], ax=axs[1, 0])
    plt2 = sns.boxplot(data['stories'], ax=axs[1, 1])
    plt3 = sns.boxplot(data['parking'], ax=axs[1, 2])
    plot.tight_layout()

detectOutliers()
Outlier Detection

Price and area have considerable outliers. The next step is to drop them using the interquartile range (IQR) rule.

# Outlier reduction for price
plot.boxplot(data.price)
Q1 = data.price.quantile(0.25)
Q3 = data.price.quantile(0.75)
IQR = Q3 - Q1
data = data[(data.price >= Q1 - 1.5*IQR) & (data.price <= Q3 + 1.5*IQR)]

# Outlier reduction for area
plot.boxplot(data.area)
Q1 = data.area.quantile(0.25)
Q3 = data.area.quantile(0.75)
IQR = Q3 - Q1
data = data[(data.area >= Q1 - 1.5*IQR) & (data.area <= Q3 + 1.5*IQR)]

To verify whether any outliers still remain, call the function again:

detectOutliers()
Outliers removed in Price & Area

Data Visualization

sns.pairplot(data)
plot.show()
Pairplot

The next step is to visualize the categorical variables:

plot.figure(figsize=(20, 12))
plot.subplot(3,3,1)
sns.boxplot(x='mainroad', y='price', data=data)
plot.subplot(3,3,2)
sns.boxplot(x='guestroom', y='price', data=data)
plot.subplot(3,3,3)
sns.boxplot(x='basement', y='price', data=data)
plot.subplot(3,3,4)
sns.boxplot(x='hotwaterheating', y='price', data=data)
plot.subplot(3,3,5)
sns.boxplot(x='airconditioning', y='price', data=data)
plot.subplot(3,3,6)
sns.boxplot(x='furnishingstatus', y='price', data=data)
plot.show()
Boxplot

Data Preparation

As you can see, the categorical variables hold values such as yes/no and furnished/semi-furnished/unfurnished.

To fit the data to a regression line, we need numeric data, not strings. So those string values need to be converted to integers.

def toNumeric(x):
    return x.map({"no": 0, "yes": 1})

def convert_binary():
    for column in list(data.select_dtypes(['object']).columns):
        if column != 'furnishingstatus':
            data[[column]] = data[[column]].apply(toNumeric)

convert_binary()
Mapped yes/no values

Next, split the furnishingstatus column, which holds values at three levels: furnished, semi-furnished, and unfurnished. To do this we need dummy variables.

status = pd.get_dummies(data['furnishingstatus'])
status
Dummy variables

Now, you don’t need three columns. You can drop the furnished column, as the type of furnishing can be identified with just the last two columns where

  • 00 will correspond to furnished
  • 01 will correspond to unfurnished
  • 10 will correspond to semi-furnished
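As a quick, purely illustrative check of this encoding on a made-up three-value series (not part of the housing data):

# Illustrative only: get_dummies on a toy 3-level series, dropping the first level
import pandas as pd
toy = pd.Series(['furnished', 'semi-furnished', 'unfurnished'])
print(pd.get_dummies(toy, drop_first=True))
# The remaining columns are 'semi-furnished' and 'unfurnished';
# a row where both are 0 corresponds to 'furnished'.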

To drop the first column (furnished):

status = pd.get_dummies(data['furnishingstatus'], drop_first=True)

Concatenate the status dummies with the main data frame:

data = pd.concat([data, status], axis=1)

Remove the furnishingstatus column, which is no longer needed.

data.drop(columns='furnishingstatus',inplace=True)

After all the changes, the data frame looks like this:

After data cleaning

For the X data frame, include all fields other than price. For Y, include the price field alone.

Y = data.price        # the target variable
X = data.iloc[:, 1:]  # all fields other than price

Multicollinearity

Multicollinearity occurs when two or more independent variables are highly correlated with one another in a regression model.

Let me take a simple example from our everyday life to explain this. Colin loves watching television while munching on chips. The more television he watches, the more chips he eats and the happier he gets!

Now, if we could quantify happiness and measure Colin’s happiness while he’s busy doing his favourite activity, which do you think would have a greater impact on his happiness? Having chips or watching television? That’s difficult to determine because the moment we try to measure Colin’s happiness from eating chips, he starts watching television. And the moment we try to measure his happiness from watching television, he starts eating chips.

Eating chips and watching television are highly correlated in the case of Colin and we cannot individually determine the impact of the individual activities on his happiness. This is the multicollinearity problem!
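Multicollinearity is usually quantified with the variance inflation factor (VIF). For each feature, VIF = 1 / (1 − R²), where R² is obtained by regressing that feature on all the other features. As a rough sketch of the idea (the actual check below uses statsmodels’ variance_inflation_factor; the helper name here is hypothetical):

# Illustrative sketch: VIF of a single column computed "by hand"
# (X is assumed to be a numeric pandas DataFrame of predictors)
from sklearn.linear_model import LinearRegression

def manual_vif(X, column):
    others = X.drop(columns=[column])
    r2 = LinearRegression().fit(others, X[column]).score(others, X[column])
    return 1.0 / (1.0 - r2)

# Example (hypothetical): manual_vif(X, 'area')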

from sklearn.preprocessing import MinMaxScaler
from statsmodels.stats.outliers_influence import variance_inflation_factor
def preprocessing(X):
    scaler = MinMaxScaler()
    X_scaled = scaler.fit_transform(X)
    variables = X_scaled
    vif = pd.DataFrame()
    vif["VIF"] = [variance_inflation_factor(variables, i) for i in range(variables.shape[1])]
    vif["Features"] = X.columns
    print(vif)

Pass the variables to check whether multicollinearity exists:

preprocessing(X)
MultiCollinearity — I

As a rule of thumb, a VIF value greater than 5 indicates severe multicollinearity. From the above results, area and bedrooms show severe collinearity.

We need to drop those columns and confirm whether any collinearity still exists.

X.drop(['area','bedrooms'], axis=1, inplace=True)
preprocessing(X)
Multicollinearity-II

Finally, no multicollinearity remains in the dataset, so we can proceed to the next step: splitting the data into training and testing sets.

Splitting the Data into Training and Testing Sets

Split X and Y into training and test sets, giving x_train, x_test, y_train, and y_test.

from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test = train_test_split(X,Y,test_size = 0.25,random_state=355)

Create a LinearRegression model instance and fit it on the training data. The dataset is now well prepared, so fitting the regression model to the training set is straightforward.

from sklearn.linear_model import LinearRegression
regression = LinearRegression()
regression.fit(x_train,y_train)
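To tie the fitted model back to the formula from earlier, the learned intercept (β0) and slope coefficients can be inspected. This small sketch is illustrative and not part of the original walkthrough:

# Inspect β0 (intercept) and the β coefficient of each feature
print('Intercept:', regression.intercept_)
print(pd.DataFrame({'Feature': x_train.columns, 'Coefficient': regression.coef_}))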

Make prediction

y_predict = regression.predict(x_test)

Plot y_test against y_predict to understand the spread.

plot.scatter(y_test,y_predict)
plot.title('y_test vs y_pred', fontsize=20)
plot.xlabel('y_test', fontsize=18)
plot.ylabel('y_pred', fontsize=16)
y_test vs y_pred
model_1 = sm.OLS(y_train, x_train).fit()
model_1.summary()
Statistics Summarization

This model has a high R-squared value (0.954), which means it explains most of the variance and provides a good fit to the training data.
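The summary above is computed on the training data. As a complementary check (a sketch using the same variable names, not part of the original walkthrough), the scikit-learn model can also be scored on the held-out test set:

# Illustrative: evaluate predictions on the held-out test set
import numpy as np
from sklearn.metrics import r2_score, mean_squared_error
print('Test R-squared:', r2_score(y_test, y_predict))
print('Test RMSE:', np.sqrt(mean_squared_error(y_test, y_predict)))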

I hope everyone got some ideas on how to implement a linear regression model with multiple independent variables.

We will catch up with another interesting topic in the coming days.

Happy learning :)
