šŸ  House Prices Prediction using Random Forest

Sidharth Pandita · Published in hackerdawn · May 11, 2021

A houseā€™s price can depend on surprisingly weird features. We will try to predict a houseā€™s price through its 79 features. For this purpose, weā€™ll be using the House Prices dataset from Kaggle.

Importing Libraries

Letā€™s first import the required libraries. If you donā€™t have a particular library installed, run the command ā€˜pip install <package_name>ā€™ to install it.

import os
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.stats import skew
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor

Loading the Datasets

Weā€™ll have downloaded the data from Kaggle and unzipped it in a directory named house_prices. Letā€™s see what went inside the directory where we unzipped the data.

for dirname, _, filenames in os.walk('./house_prices'):
    for filename in filenames:
        print(filename)
Filenames

We will load the train and test dataset. Further, weā€™ll print their shapes.

train = pd.read_csv('./house_prices/train.csv')
test = pd.read_csv('./house_prices/test.csv')
print(f'Train shape : {train.shape}')
print(f'Test shape : {test.shape}')
Shapes of train & test datasets

Visualizing the Dataset

Weā€™ll print the heads of train and test datasets to see how they actually look.

train.head(10)
Train Head (Truncated)
test.head(10)
Test Head (Truncated)

Weā€™ll use the describe function to get a description of the ā€˜SalesPriceā€™ column.

print(train['SalePrice'].describe())
Describing ā€˜SalesPriceā€™ Column

Let’s plot the distribution of SalePrice using a histogram. We can clearly observe that the distribution is right-skewed.

sns.histplot(train['SalePrice'],kde=True)
Histogram for SalePrice Distribution

Weā€™ll use the log transformation to remove the skewness from the distribution.

train['SalePrice'] = np.log1p(train['SalePrice'])
sns.histplot(train['SalePrice'],kde=True)
After the log transformation
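As a quick sanity check (a small optional sketch, not part of the original walkthrough), we can compare the skewness statistic before and after the transform; values closer to 0 indicate a more symmetric distribution.

#Optional check: skewness before vs. after the log1p transform
raw_prices = np.expm1(train['SalePrice'])  #undo log1p to recover the raw prices
print(f'Skewness before: {skew(raw_prices):.2f}')
print(f'Skewness after : {skew(train["SalePrice"]):.2f}')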

We will plot a heatmap of the features that are highly correlated with SalePrice. The higher the number in a block, the stronger the correlation.

corr = train.corr()
highly_corr_features = corr.index[abs(corr["SalePrice"])>0.5]
plt.figure(figsize=(10,10))
heatmap = sns.heatmap(train[highly_corr_features].corr(), annot=True, cmap="RdYlGn")
Heatmap showing highly Correlated features

Let’s see the top 10 features most correlated with SalePrice.

corr["SalePrice"].sort_values(ascending=False).head(10)
Correlated features in descending order

Let’s plot some features against SalePrice to see how they vary with it.

fig = plt.figure(figsize=(12,10))
#GarageArea
plt.subplot(321)
sns.scatterplot(data=train, x='GarageArea', y="SalePrice")
#YearBuilt
plt.subplot(322)
sns.scatterplot(data=train, x='YearBuilt', y="SalePrice")
#WoodDeckSF
plt.subplot(323)
sns.scatterplot(data=train, x='WoodDeckSF', y="SalePrice")
#OverallQual
plt.subplot(324)
sns.scatterplot(data=train, x='OverallQual', y="SalePrice")
#BsmtUnfSF
plt.subplot(325)
sns.scatterplot(data=train, x='BsmtUnfSF', y="SalePrice")
#TotalBsmtSF
plt.subplot(326)
sns.scatterplot(data=train, x='TotalBsmtSF', y="SalePrice")
Scatter plots for different features vs SalePrice

We will concatenate the train and test datasets to make the preprocessing easy. Later, we will divide them again.

data = pd.concat([train,test], axis=0)
y_train = train['SalePrice']
data = data.drop(['Id', 'SalePrice'], axis=1)
print(data.shape)
Shape of combined data

Letā€™s get information about the concatenated dataframe.

data.info()
Info of Combined data (Truncated)

To see the distribution of data types across the columns, we will plot a pie chart.

data.dtypes.value_counts().plot.pie()
Pie chart for data type distribution

Letā€™s print the number of unique values in each column.

print('UNIQUE VALUES\n')
for col in data.columns:
    print(f'{col}: {len(data[col].unique())}\n')
Unique value Count (Truncated)

Weā€™ll describe each column in data apart from the ones with the object data type.

data[data.select_dtypes(exclude='object').columns].describe()
Description of non-object columns (Truncated)

Letā€™s plot a heatmap so that we can clearly see the null values present across all the columns.

#Visualizing the null values in all columns
plt.figure(figsize=(30,8));
sns.heatmap(data.isnull(), cmap='flare');
Column-wise Null count heatmap

Weā€™ll print the total count and percentage of null values in the columns.

#Columns containing most null values
total = data.isnull().sum().sort_values(ascending=False)
percent = (data.isnull().sum() / data.isnull().count()*100).sort_values(ascending=False)
missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
print(missing_data.head(10))
Columns containing Most null values

Feature Engineering

If we look at the features with more than 5 missing values, we’ll notice that they are not important: none of them correlates with SalePrice above 0.5. So, we can drop them without losing any significant detail.
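As a quick check on that claim (a sketch reusing the correlation matrix corr and the missing_data table we computed earlier), we can print the absolute SalePrice correlations for the numeric columns among these heavily-missing features.

#Checking SalePrice correlation of the numeric heavily-missing columns
cols_many_missing = missing_data[missing_data['Total'] > 5].index
numeric_cols = [col for col in cols_many_missing if col in corr.index]
print(corr.loc[numeric_cols, 'SalePrice'].abs().sort_values(ascending=False))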

Weā€™ll then see, which remaining columns have null values.

#Dropping columns with > 5 null values
data.drop((missing_data[missing_data['Total'] > 5]).index, axis=1, inplace=True)
#Sorting columns w.r.t null values
total = data.isnull().sum().sort_values(ascending=False)
total.head(20)
Remaining columns with null values

We will now fill the missing values in numeric columns with 0ā€™s and the missing values in categorical columns with the most frequently occurring value (mode).

Weā€™ll also delete the column ā€˜Utilitiesā€™ as it contains only one value, i.e ā€˜AllPubā€™.

#Filling the numeric data
numeric_missed = ['BsmtFinSF1',
'BsmtFinSF2',
'BsmtUnfSF',
'TotalBsmtSF',
'BsmtFullBath',
'BsmtHalfBath',
'GarageArea',
'GarageCars']
for feature in numeric_missed:
    data[feature] = data[feature].fillna(0)
#Filling the categorical data
categorical_missed = ['Exterior1st',
'Exterior2nd',
'SaleType',
'MSZoning',
'Electrical',
'KitchenQual',
'Functional']
for feature in categorical_missed:
    data[feature] = data[feature].fillna(data[feature].mode()[0])
#Deleting 'Utilities' column
data.drop(['Utilities'], axis=1, inplace=True)

Now, letā€™s see if we still have any columns with null values. As shown in the output, there are no more null values left.

#Checking for any remaining null values
data.isnull().sum().max()
Max count of null values

Letā€™s find the most skewed columns in data.

#Top skewed columns
numeric_features = data.dtypes[data.dtypes != 'object'].index
skewed_features = data[numeric_features].apply(lambda x: skew(x)).sort_values(ascending=False)
high_skew = skewed_features[abs(skewed_features) > 0.5]
print(high_skew)
Top Skewed Columns

Letā€™s apply log transformations to all these skewed columns.

#Transforming skewed columns
for feature in high_skew.index:
    data[feature] = np.log1p(data[feature])

Let’s convert the categorical variables into numeric ones. We’ll do this using the get_dummies() method.

#Converting categorical data to numerical
data = pd.get_dummies(data)
data.head()
After applying get_dummies (Truncated)
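The reason we encode the combined data rather than train and test separately is that a category appearing in only one of the two would otherwise produce mismatched column sets. Here’s a tiny illustration with made-up toy values (not from the dataset):

#Toy example: encoding separately gives different column sets
toy_train = pd.DataFrame({'MSZoning': ['RL', 'RM']})
toy_test = pd.DataFrame({'MSZoning': ['RL', 'C (all)']})
print(pd.get_dummies(toy_train).columns.tolist())
print(pd.get_dummies(toy_test).columns.tolist())
#Encoding the concatenated frame keeps one consistent set of columns
print(pd.get_dummies(pd.concat([toy_train, toy_test], axis=0)).columns.tolist())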

Weā€™ll split back the data back into train and test. We will take into use the y_train which we had created in the beginning.

#Dividing data back into train & test
train = data[:len(y_train)]
test = data[len(y_train):]
#Printing their shapes
print(train.shape, test.shape)
Train & Test shapes

Weā€™ll further break the train data into x_train, x_test, y_train, and y_test so that we can measure our modelā€™s performance.

x_train, x_test, y_train, y_test = train_test_split(train, y_train, test_size=0.2, random_state=42)

Creating the Model

Letā€™s define our model now. For this, weā€™ll use RandomForestRegressor from sklearn.ensemble.

clf = RandomForestRegressor(n_estimators=300)
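Here n_estimators=300 means the forest builds 300 trees. If you want reproducible results across runs, you can optionally fix the random seed and parallelize training (a small variation on the line above, not part of the original):

#Optional variant: fixed seed for reproducibility, all CPU cores for training
clf = RandomForestRegressor(n_estimators=300, random_state=42, n_jobs=-1)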

Weā€™ll use x_train and y_train to fit our model.

clf.fit(x_train, y_train)
Model with Parameters
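As an optional diagnostic (not in the original post), we can inspect which features the fitted forest relies on most:

#Optional: top 10 feature importances of the fitted forest
importances = pd.Series(clf.feature_importances_, index=x_train.columns)
print(importances.sort_values(ascending=False).head(10))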

Let’s calculate our model’s score on x_test and y_test. For a regressor, score returns the R² coefficient of determination.

clf.score(x_test,y_test)
Test score
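Since the Kaggle competition evaluates submissions on the RMSE of the log-transformed prices, we can also compute that metric on our held-out split (a quick sketch; the targets we’ve been using are already log1p-transformed):

from sklearn.metrics import mean_squared_error
#RMSE on the held-out split (targets are on the log scale)
val_preds = clf.predict(x_test)
rmse = np.sqrt(mean_squared_error(y_test, val_preds))
print(f'Validation RMSE (log scale): {rmse:.4f}')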

Prediction

We have fitted the model and seen its performance. Let us predict the prices for the houses in the actual test data.

#Making a prediction
prediction = clf.predict(test)
print(prediction)
Prediction on Actual Test data

Since we applied a log1p transformation to the SalePrice column earlier, we’ll use the inverse transform, expm1, to convert the predictions back into real-world prices.

#Applying the inverse of log1p, i.e. expm1
np.expm1(prediction)
After exponential transformation

We have completed the prediction of house prices. If you liked this tutorial, do leave a clap!
