Applying Multiple Linear Regression to House Price Prediction
Multiple linear regression refers to a statistical technique that is used to predict the outcome of a variable based on the value of two or more variables. It is sometimes known simply as multiple regression, and it is an extension of linear regression. The variable that we want to predict is known as the dependent variable, while the variables we use to predict the value of the dependent variable are known as independent or explanatory variables.
Multiple linear regression is used to estimate the relationship between two or more independent variables and one dependent variable. You can use multiple linear regression when you want to know:
- How strong the relationship is between two or more independent variables and one dependent variable (e.g. how rainfall, temperature, and amount of fertilizer added affect crop growth).
- The value of the dependent variable at a certain value of the independent variables (e.g. the expected yield of a crop at certain levels of rainfall, temperature, and fertilizer addition).
Multiple Linear Regression Formula
yi = β0 + β1xi1 + β2xi2 + ... + βpxip + ϵ
Where:
- yi is the dependent or predicted variable
- β0 is the y-intercept, i.e., the value of y when xi1 and xi2 are both 0.
- β1 and β2 are the regression coefficients that represent the change in y for a one-unit change in xi1 and xi2, respectively, with the other variable held fixed.
- βp is the slope coefficient for the p-th (last) independent variable, xip
- ϵ is the model’s random error (residual) term.
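To make the formula concrete, here is a minimal sketch that generates data from y = β0 + β1x1 + β2x2 + ϵ and then recovers the coefficients by ordinary least squares. The coefficient values, sample size, and noise level are made up purely for illustration; they have nothing to do with the housing dataset used later.

```python
import numpy as np

# Hypothetical coefficients, chosen only for illustration
beta0, beta1, beta2 = 50.0, 3.0, -2.0

rng = np.random.default_rng(0)
n = 200
x1 = rng.uniform(0, 10, n)
x2 = rng.uniform(0, 5, n)
eps = rng.normal(0, 0.5, n)  # random error term

# y_i = beta0 + beta1 * x_i1 + beta2 * x_i2 + eps_i
y = beta0 + beta1 * x1 + beta2 * x2 + eps

# Design matrix with a leading column of ones for the intercept
A = np.column_stack([np.ones(n), x1, x2])

# Ordinary least squares estimate of (beta0, beta1, beta2)
beta_hat, *_ = np.linalg.lstsq(A, y, rcond=None)
print(beta_hat)
```

The estimates should land close to the coefficients used to generate the data, with small differences caused by the error term.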
Assumptions of Multiple Linear Regression
Multiple linear regression is based on the following assumptions:
1. A linear relationship between the dependent and independent variables
2. The independent variables are not highly correlated with each other
3. The variance of the residuals is constant
4. Independence of observations
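Two of these assumptions can be checked roughly with a few lines of NumPy. The sketch below uses synthetic data (not the housing dataset). It looks at the correlation between the independent variables, then compares the residual variance for low and high fitted values. The variable names and the split into halves are illustrative choices, not a formal statistical test.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 500
X_demo = rng.uniform(0, 10, size=(n, 2))
y_demo = 4 + X_demo @ np.array([1.5, -0.7]) + rng.normal(0, 1.0, n)

# Fit by ordinary least squares with an intercept column
A = np.column_stack([np.ones(n), X_demo])
coef, *_ = np.linalg.lstsq(A, y_demo, rcond=None)
fitted = A @ coef
residuals = y_demo - fitted

# Assumption 2: pairwise correlation between the independent variables
corr_x1_x2 = np.corrcoef(X_demo[:, 0], X_demo[:, 1])[0, 1]

# Assumption 3: residual variance should be similar for low and high fitted values
order = np.argsort(fitted)
low, high = residuals[order[: n // 2]], residuals[order[n // 2:]]
variance_ratio = high.var() / low.var()

print(f"corr(x1, x2) = {corr_x1_x2:.3f}")
print(f"residual variance ratio (high/low) = {variance_ratio:.3f}")
```

A correlation near 0 and a variance ratio near 1 are what you would hope to see. Values far from these suggest the corresponding assumption deserves a closer look.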
Next, let's see how to implement house price prediction using the multiple linear regression algorithm.
Import libraries
# Import numpy and pandas
import pandas as pd
import numpy as np
# Data visualization
from matplotlib import pyplot as plot
import statsmodels.api as sm
import seaborn as sns
Reading the dataset
data = pd.read_csv(r'Housing.csv')
Data Inspection
data.head(5)
data.info()
data.describe()
data.shape
Data Cleaning
Check whether the dataset contains any null values.
data.isnull().sum()
The dataset contains no null values, so no values need to be replaced.
Detect Outliers
Outliers are extreme values that fall a long way outside of the other observations.
A separate function detects outliers in the dataset by drawing a box plot for each numeric column with the Seaborn library.
def detectOutliers():
    # One box plot per numeric column, arranged in a 2x3 grid
    fig, axs = plot.subplots(2, 3, figsize=(10, 5))
    sns.boxplot(x=data['price'], ax=axs[0, 0])
    sns.boxplot(x=data['area'], ax=axs[0, 1])
    sns.boxplot(x=data['bedrooms'], ax=axs[0, 2])
    sns.boxplot(x=data['bathrooms'], ax=axs[1, 0])
    sns.boxplot(x=data['stories'], ax=axs[1, 1])
    sns.boxplot(x=data['parking'], ax=axs[1, 2])
    plot.tight_layout()

detectOutliers()
Price and area have a considerable number of outliers. The next step is to drop them.
# Outlier reduction for price
plot.boxplot(data.price)
Q1 = data.price.quantile(0.25)
Q3 = data.price.quantile(0.75)
IQR = Q3 - Q1
data = data[(data.price >= Q1 - 1.5*IQR) & (data.price <= Q3 + 1.5*IQR)]
# Outlier reduction for area
plot.boxplot(data.area)
Q1 = data.area.quantile(0.25)
Q3 = data.area.quantile(0.75)
IQR = Q3 - Q1
data = data[(data.area >= Q1 - 1.5*IQR) & (data.area <= Q3 + 1.5*IQR)]
To check whether any outliers remain:
detectOutliers()
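The same interquartile-range (IQR) rule can be wrapped in a small reusable helper. This is a sketch only: the function name remove_outliers_iqr and the toy data frame are hypothetical and not part of the original walkthrough.

```python
import pandas as pd

def remove_outliers_iqr(df, column, k=1.5):
    """Keep rows whose `column` value lies within [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    iqr = q3 - q1
    mask = df[column].between(q1 - k * iqr, q3 + k * iqr)  # inclusive bounds
    return df[mask].copy()

# Toy data: one price is far away from the rest
toy = pd.DataFrame({
    "price": [10, 12, 11, 13, 12, 11, 100],
    "area": [50, 52, 51, 49, 50, 53, 51],
})
cleaned = remove_outliers_iqr(toy, "price")
print(cleaned)
```

Returning a copy rather than a view means later in-place changes to the cleaned frame do not trigger pandas chained-assignment warnings.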
Data Visualization
sns.pairplot(data)
plot.show()
The next step is to visualize the categorical variables.
plot.figure(figsize=(20, 12))
plot.subplot(3,3,1)
sns.boxplot(x='mainroad', y='price', data=data)
plot.subplot(3,3,2)
sns.boxplot(x='guestroom', y='price', data=data)
plot.subplot(3,3,3)
sns.boxplot(x='basement', y='price', data=data)
plot.subplot(3,3,4)
sns.boxplot(x='hotwaterheating', y='price', data=data)
plot.subplot(3,3,5)
sns.boxplot(x='airconditioning', y='price', data=data)
plot.subplot(3,3,6)
sns.boxplot(x='furnishingstatus', y='price', data=data)
plot.show()
Data Preparation
As you can see, the categorical variables hold string values: yes/no, or furnished/semi-furnished/unfurnished.
To fit a regression line we need numeric data, not strings, so these string values have to be converted to integers.
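def toNumeric(x):
    # Map the yes/no strings to 1/0
    return x.map({"no": 0, "yes": 1})

def convert_binary():
    # Convert every string column except furnishingstatus, which has three levels
    for column in list(data.select_dtypes(['object']).columns):
        if column != 'furnishingstatus':
            data[[column]] = data[[column]].apply(toNumeric)

convert_binary()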
Next, the furnishingstatus column has to be split, because it holds three levels: furnished, semi-furnished, and unfurnished. This is done with dummy variables.
status = pd.get_dummies(data['furnishingstatus'])
status
Now, you don't need three columns. You can drop the furnished column, because the type of furnishing can be identified from the remaining two columns (semi-furnished, unfurnished), where:
- 00 corresponds to furnished
- 01 corresponds to unfurnished
- 10 corresponds to semi-furnished
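To drop the first column (furnished), pass drop_first=True:
# dtype=int keeps the dummies as 0/1 integers; recent pandas versions return booleans by default
status = pd.get_dummies(data['furnishingstatus'], drop_first=True, dtype=int)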
Concatenate the status columns with the main data frame:
data = pd.concat([data, status], axis=1)
Remove the furnishingstatus column, which is no longer needed.
data.drop(columns='furnishingstatus',inplace=True)
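The two encoding steps can be seen in isolation on a tiny frame. The sketch below uses a made-up four-row frame (the column names match the dataset, the values are invented) to show the yes/no mapping and the effect of drop_first=True.

```python
import pandas as pd

# Invented rows, for illustration only
toy = pd.DataFrame({
    "mainroad": ["yes", "no", "yes", "yes"],
    "furnishingstatus": ["furnished", "semi-furnished", "unfurnished", "furnished"],
})

# Binary column: yes/no becomes 1/0
toy["mainroad"] = toy["mainroad"].map({"no": 0, "yes": 1})

# Three-level column: two dummy columns are enough once the first level is dropped
dummies = pd.get_dummies(toy["furnishingstatus"], drop_first=True, dtype=int)

encoded = pd.concat([toy.drop(columns="furnishingstatus"), dummies], axis=1)
print(encoded)
```

pandas orders the levels alphabetically, so furnished is the level that gets dropped, and a furnished house shows up as 0 in both remaining columns.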
After all the changes, the data frame looks like
The X data frame holds every field except price, and Y holds the price field alone.
# Y is the target: the price column
Y = data.price
# X includes every field other than price (price is the first column)
X = data.iloc[:, 1:]
Multicollinearity
Multicollinearity occurs when two or more independent variables are highly correlated with one another in a regression model.
Let me take a simple example from our everyday life to explain this. Colin loves watching television while munching on chips. The more television he watches, the more chips he eats and the happier he gets!
Now, if we could quantify happiness and measure Colin’s happiness while he’s busy doing his favourite activity, which do you think would have a greater impact on his happiness? Having chips or watching television? That’s difficult to determine because the moment we try to measure Colin’s happiness from eating chips, he starts watching television. And the moment we try to measure his happiness from watching television, he starts eating chips.
Eating chips and watching television are highly correlated in the case of Colin and we cannot individually determine the impact of the individual activities on his happiness. This is the multicollinearity problem!
from sklearn.preprocessing import MinMaxScaler
from statsmodels.stats.outliers_influence import variance_inflation_factor
def preprocessing(X):
    # Scale every feature to the [0, 1] range before computing the VIF
    scaler = MinMaxScaler()
    X_scaled = scaler.fit_transform(X)
    variables = X_scaled
    vif = pd.DataFrame()
    vif["VIF"] = [variance_inflation_factor(variables, i) for i in range(variables.shape[1])]
    vif["Features"] = X.columns
    print(vif)
Pass the variables to the function to check whether multicollinearity exists:
preprocessing(X)
As a rule of thumb, a VIF (variance inflation factor) greater than 5 indicates severe multicollinearity. In the results above, area and bedrooms show severe collinearity.
We need to drop those columns and check whether any collinearity remains.
X = X.drop(['area', 'bedrooms'], axis=1)
preprocessing(X)
No multicollinearity remains in the dataset, so we can proceed to the next step: splitting the data into training and testing sets.
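For intuition about what a VIF measures, the sketch below computes it from the textbook definition, VIF_j = 1 / (1 - R_j²), where R_j² comes from regressing feature j on the other features. It uses synthetic data modelled on the Colin example (television hours and chips strongly related, sleep unrelated). The helper name vif is hypothetical. Because this version includes an intercept in each auxiliary regression, its numbers may differ from those printed by the statsmodels call used above.

```python
import numpy as np

def vif(X):
    """VIF_j = 1 / (1 - R_j^2), with R_j^2 from regressing column j on the others."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    out = []
    for j in range(p):
        target = X[:, j]
        # Other columns plus an intercept column
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        r2 = 1 - resid.var() / target.var()
        out.append(1.0 / (1.0 - r2))
    return np.array(out)

rng = np.random.default_rng(1)
tv_hours = rng.uniform(0, 5, 300)
chips = 2 * tv_hours + rng.normal(0, 0.2, 300)  # almost determined by tv_hours
sleep = rng.uniform(5, 9, 300)                   # unrelated to the other two

vifs = vif(np.column_stack([tv_hours, chips, sleep]))
print(vifs)
```

The two correlated features get very large VIF values, while the unrelated one stays close to 1.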
Splitting the Data into Training and Testing Sets
Split X and Y into training and test sets: x_train and y_train for fitting, x_test and y_test for evaluation.
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test = train_test_split(X,Y,test_size = 0.25,random_state=355)
With the dataset prepared, create a LinearRegression model and fit it to the training set.
from sklearn.linear_model import LinearRegression
regression = LinearRegression()
regression.fit(x_train,y_train)
Make prediction
y_predict = regression.predict(x_test)
Plot y_test against y_predict to see how the predictions spread around the actual values.
plot.scatter(y_test, y_predict)
plot.suptitle('y_test vs y_pred', fontsize=20)
plot.xlabel('y_test', fontsize=18)
plot.ylabel('y_pred', fontsize=16)
plot.show()
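A scatter plot gives a visual impression. Numeric metrics on the held-out test set make it easier to compare models. The sketch below runs the same split, fit, and predict steps on synthetic data (not the housing dataset) and reports RMSE and R-squared with scikit-learn's metrics. The demo variable names and the data-generating coefficients are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic features and target with known structure
rng = np.random.default_rng(42)
X_demo = rng.uniform(0, 10, size=(400, 3))
y_demo = 5 + X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(0, 1.0, 400)

Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.25, random_state=355)

demo_model = LinearRegression().fit(Xtr, ytr)
pred = demo_model.predict(Xte)

# Evaluate only on data the model has not seen during fitting
rmse = np.sqrt(mean_squared_error(yte, pred))
r2 = r2_score(yte, pred)
print(f"RMSE: {rmse:.3f}, R^2: {r2:.3f}")
```

On the housing data, the equivalent calls would be r2_score(y_test, y_predict) and mean_squared_error(y_test, y_predict).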
# statsmodels was already imported as sm above
model_1 = sm.OLS(y_train, x_train).fit()
model_1.summary()
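This model reports a high R-squared (0.954). Note, however, that no intercept column was added to x_train before calling OLS. In that case statsmodels reports the uncentered R-squared, which tends to be much higher than the usual (centered) value. Treat 0.954 with caution rather than as evidence of a near-perfect fit. Adding an intercept with sm.add_constant(x_train) gives the conventional R-squared.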
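The size of the gap between the two R-squared definitions is easy to see on synthetic data. The sketch below uses plain NumPy rather than statsmodels and assumes a made-up target with a large intercept and a weak slope. It fits the data once without and once with an intercept column, then computes the uncentered and centered R-squared respectively.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 300
x = rng.uniform(0, 1, n)
y = 100 + 2 * x + rng.normal(0, 5, n)  # large intercept, weak slope, noisy

def fit_ssr(A, y):
    """Sum of squared residuals of the least squares fit of y on the columns of A."""
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return float(resid @ resid)

# No intercept: R^2 is measured against the raw sum of squares of y (uncentered)
ssr_no_const = fit_ssr(x.reshape(-1, 1), y)
r2_uncentered = 1 - ssr_no_const / float(y @ y)

# With intercept: R^2 is measured against the variation of y around its mean (centered)
ssr_const = fit_ssr(np.column_stack([np.ones(n), x]), y)
r2_centered = 1 - ssr_const / float(((y - y.mean()) ** 2).sum())

print(f"uncentered R^2 (no intercept): {r2_uncentered:.3f}")
print(f"centered R^2 (with intercept): {r2_centered:.3f}")
```

The same weakly predictive feature looks far better under the uncentered definition, which is why the two numbers should not be compared directly.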
I hope this gave you some ideas on how to implement a linear regression model with multiple independent variables.
We will catch up with another interesting topic in the coming days.
Happy learning :)