🏠 House Prices Prediction using Random Forest

Published in hackerdawn · 6 min read · May 11, 2021

A house’s price can depend on surprisingly weird features. We will try to predict a house’s price through its 79 features. For this purpose, we’ll be using the House Prices dataset from Kaggle.

Importing Libraries

Let’s first import the required libraries. If you don’t have a particular library installed, run the command ‘pip install <package_name>’ to install it.

import os
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.stats import skew
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor

Loading the Datasets

We have downloaded the data from Kaggle and unzipped it into a directory named house_prices. Let’s list the files in that directory.

for dirname, _, filenames in os.walk('./house_prices'):
    for filename in filenames:
        print(filename)

We will load the train and test dataset. Further, we’ll print their shapes.

train = pd.read_csv('./house_prices/train.csv')
test = pd.read_csv('./house_prices/test.csv')
print(f'Train shape : {train.shape}')
print(f'Test shape : {test.shape}')

Shapes of train & test datasets

Visualizing the Dataset

We’ll print the heads of train and test datasets to see how they actually look.

train.head(10)

test.head(10)

We’ll use the describe function to get summary statistics for the ‘SalePrice’ column.

print(train['SalePrice'].describe())

Let’s plot the distribution of SalePrice using a histogram. The distribution is clearly right-skewed.

sns.histplot(train['SalePrice'],kde=True)

We’ll apply a log transformation (np.log1p, i.e. log(1 + x)) to reduce the skewness of the distribution.

train['SalePrice'] = np.log1p(train['SalePrice'])
sns.histplot(train['SalePrice'],kde=True)
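To put a number on what the two histograms show, here is a minimal sketch that measures skewness before and after the log transformation. It uses synthetic log-normal "prices" rather than the Kaggle data, so the series and its parameters are assumptions made purely for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed "prices" (log-normal); NOT the Kaggle data
rng = np.random.default_rng(0)
prices = pd.Series(rng.lognormal(mean=12.0, sigma=0.4, size=5000))

# Skewness of the raw values and of the log-transformed values
skew_before = prices.skew()
skew_after = np.log1p(prices).skew()

print(f"skew before log1p: {skew_before:.2f}")
print(f"skew after  log1p: {skew_after:.2f}")
```

A skewness close to 0 indicates a roughly symmetric distribution, which is what we are aiming for with the transformed target.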

We will plot a heatmap of the features that are highly correlated with SalePrice (absolute correlation above 0.5). The higher the number in a cell, the stronger the correlation.

#Correlation is only defined for numeric columns
corr = train.select_dtypes(include='number').corr()
highly_corr_features = corr.index[abs(corr["SalePrice"])>0.5]
plt.figure(figsize=(10,10))
ax = sns.heatmap(train[highly_corr_features].corr(),annot=True,cmap="RdYlGn")

Heatmap showing highly Correlated features

Let’s see the top 10 correlations with SalePrice (the first entry is SalePrice itself, with a correlation of 1).

corr["SalePrice"].sort_values(ascending=False).head(10)
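The threshold-based selection can be tried out on a small frame where we know the answer in advance. The sketch below is only an illustration: the toy columns (Area, Noise, Street, Price) are made up, with Price built to depend strongly on Area and not at all on Noise.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical columns, not the Kaggle dataset)
rng = np.random.default_rng(1)
n = 500
area = rng.normal(1500, 300, n)
toy = pd.DataFrame({
    "Area": area,                                   # drives the price
    "Noise": rng.normal(0, 1, n),                   # unrelated to the price
    "Street": rng.choice(["Pave", "Grvl"], n),      # non-numeric column
    "Price": 100 * area + rng.normal(0, 5000, n),
})

# Keep numeric columns only, then select those whose absolute
# correlation with the target exceeds 0.5
corr = toy.select_dtypes(include="number").corr()
strong = corr.index[corr["Price"].abs() > 0.5]
print(list(strong))
```

The target always passes the threshold because its correlation with itself is 1, so it shows up in the selection alongside the genuinely related features.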

Let’s plot some features against SalePrice to see how they vary with it.

fig = plt.figure(figsize=(12,10))
#GarageArea
plt.subplot(321)
sns.scatterplot(data=train, x='GarageArea', y="SalePrice")
#YearBuilt
plt.subplot(322)
sns.scatterplot(data=train, x='YearBuilt', y="SalePrice")
#WoodDeckSF
plt.subplot(323)
sns.scatterplot(data=train, x='WoodDeckSF', y="SalePrice")
#OverallQual
plt.subplot(324)
sns.scatterplot(data=train, x='OverallQual', y="SalePrice")
#BsmtUnfSF
plt.subplot(325)
sns.scatterplot(data=train, x='BsmtUnfSF', y="SalePrice")
#TotalBsmtSF
plt.subplot(326)
sns.scatterplot(data=train, x='TotalBsmtSF', y="SalePrice")

Scatter plots of different features vs. SalePrice

We will concatenate the train and test datasets to make the preprocessing easy. Later, we will divide them again.

data = pd.concat([train,test], axis=0)
y_train = train['SalePrice']
data = data.drop(['Id', 'SalePrice'], axis=1)
print(data.shape)

Shape of combined data

Let’s get information about the concatenated dataframe.

data.info()

To see the distribution of data types across the columns, we will plot a pie chart.

data.dtypes.value_counts().plot.pie()

Let’s print the number of unique values in each column.

print('UNIQUE VALUES\n')
for col in data.columns:
    print(f'{col}: {len(data[col].unique())}\n')

We’ll describe each column in data apart from the ones with the object data type.

data[data.select_dtypes(exclude='object').columns].describe()

Description of non-object columns (Truncated)

Let’s plot a heatmap so that we can clearly see the null values present across all the columns.

#Visualizing the null values in all columns
plt.figure(figsize=(30,8));
sns.heatmap(data.isnull(), cmap='flare');

We’ll print the total count and percentage of null values in the columns.

#Columns containing most null values
total = data.isnull().sum().sort_values(ascending=False)
percent = (data.isnull().sum() / data.isnull().count()*100).sort_values(ascending=False)
missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
print(missing_data.head(10))
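To see how the count and percentage are computed, here is a minimal sketch on a tiny hand-made frame. The values are invented for illustration. It uses isnull().mean() as a shorthand for the percentage, which is equivalent to dividing the null count by the row count.

```python
import numpy as np
import pandas as pd

# Tiny hand-made frame (values are made up for illustration)
toy = pd.DataFrame({
    "PoolQC": [np.nan, np.nan, np.nan, "Gd"],
    "GarageArea": [400.0, np.nan, 520.0, 0.0],
    "LotArea": [8450, 9600, 11250, 9550],
})

# Count of nulls per column, and the share of rows that are null
missing = pd.DataFrame({
    "Total": toy.isnull().sum(),
    "Percent": toy.isnull().mean() * 100,
}).sort_values("Total", ascending=False)
print(missing)
```

Sorting by Total puts the sparsest columns first, which makes it easy to decide where a cut-off for dropping columns should go.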

Feature Engineering

If we look at the features with more than 5 missing values, we’ll note that they are not important: none of them has a correlation with SalePrice above 0.5. So, we can delete them without losing any significant detail.

We’ll then see which of the remaining columns still have null values.

#Dropping columns with > 5 null values
data.drop((missing_data[missing_data['Total'] > 5]).index, axis=1, inplace=True)
#Sorting columns w.r.t null values
total = data.isnull().sum().sort_values(ascending=False)
total.head(20)

We will now fill the missing values in numeric columns with 0’s and the missing values in categorical columns with the most frequently occurring value (mode).

We’ll also delete the column ‘Utilities’, as almost every row holds the same value, i.e. ‘AllPub’.

#Filling the numeric data
numeric_missed = ['BsmtFinSF1',
'BsmtFinSF2',
'BsmtUnfSF',
'TotalBsmtSF',
'BsmtFullBath',
'BsmtHalfBath',
'GarageArea',
'GarageCars']
for feature in numeric_missed:
    data[feature] = data[feature].fillna(0)
#Filling the categorical data
categorical_missed = ['Exterior1st',
'Exterior2nd',
'SaleType',
'MSZoning',
'Electrical',
'KitchenQual',
'Functional']
for feature in categorical_missed:
    data[feature] = data[feature].fillna(data[feature].mode()[0])
#Deleting 'Utilities' column
data.drop(['Utilities'], axis=1, inplace=True)
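The same fill rule (0 for numeric columns, mode for categorical ones) can also be driven by each column's dtype instead of hand-written lists. The sketch below is a variation on the approach above, not a replacement for it, and runs on a made-up three-row frame.

```python
import numpy as np
import pandas as pd

# Made-up frame with one numeric and one categorical gap
toy = pd.DataFrame({
    "GarageArea": [400.0, np.nan, 520.0],
    "KitchenQual": ["TA", None, "TA"],
})

for col in toy.columns:
    if not toy[col].isnull().any():
        continue  # nothing to fill in this column
    if pd.api.types.is_numeric_dtype(toy[col]):
        # Numeric gap: assume a missing value means "none", i.e. 0
        toy[col] = toy[col].fillna(0)
    else:
        # Categorical gap: use the most frequent value (mode)
        toy[col] = toy[col].fillna(toy[col].mode()[0])

print(toy)
```

Filling with 0 is only reasonable when a missing value plausibly means the feature is absent (for example, no garage). For other numeric columns, the median may be a safer choice.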

Now, let’s see if we still have any columns with null values. As shown in the output, there are no more null values left.

#Checking for any remaining null values
data.isnull().sum().max()

Max count of null values

Let’s find the most skewed columns in data.

#Top skewed columns
numeric_features = data.dtypes[data.dtypes != 'object'].index
skewed_features = data[numeric_features].apply(lambda x: skew(x)).sort_values(ascending=False)
high_skew = skewed_features[abs(skewed_features) > 0.5]
print(high_skew)

Let’s apply log transformations to all these skewed columns.

#Transforming skewed columns
for feature in high_skew.index:
    data[feature] = np.log1p(data[feature])
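One thing to keep in mind is that log1p is only defined for values greater than -1, so it is worth checking a column's minimum before transforming it. Here is a hedged sketch of such a guard on synthetic data. The column names are borrowed from the dataset but the values are generated, and it uses pandas' skew() rather than scipy's, which applies a bias correction and so differs slightly from scipy's default estimate.

```python
import numpy as np
import pandas as pd

# Synthetic columns (generated values, not the Kaggle data)
rng = np.random.default_rng(2)
toy = pd.DataFrame({
    "LotArea": rng.lognormal(9.0, 0.6, 2000),      # strongly right-skewed
    "YearBuilt": rng.integers(1900, 2010, 2000),   # roughly symmetric
})

skewness = toy.skew()

# Transform only columns that are clearly skewed AND non-negative,
# so that log1p never receives a value it cannot handle
to_transform = [
    col for col in toy.columns
    if abs(skewness[col]) > 0.5 and toy[col].min() >= 0
]
toy[to_transform] = np.log1p(toy[to_transform])

print(to_transform)
print(toy.skew())
```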

Let’s convert the categorical variables into numeric variables. We’ll do this using the get_dummies() method, which creates one indicator column per category.

#Converting categorical data to numerical
data = pd.get_dummies(data)
data.head()

We’ll split the data back into train and test. Since the train rows come first in the combined dataframe, we can use the length of y_train, which we created earlier, as the split point.

#Dividing data back into train & test
train = data[:len(y_train)]
test = data[len(y_train):]
#Printing their shapes
print(train.shape, test.shape)

Train & Test shapes
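Encoding the combined dataframe, rather than train and test separately, matters because get_dummies only creates columns for the categories it actually sees. The following sketch shows the difference on two tiny made-up frames (the Roof column and its values are hypothetical).

```python
import pandas as pd

# Made-up frames: 'Flat' appears only in test, 'Hip' only in train
toy_train = pd.DataFrame({"Roof": ["Gable", "Hip", "Gable"]})
toy_test = pd.DataFrame({"Roof": ["Gable", "Flat"]})

# Encoding separately: the two frames end up with different columns
sep_train = pd.get_dummies(toy_train)
sep_test = pd.get_dummies(toy_test)
print(list(sep_train.columns), list(sep_test.columns))

# Encoding together, then splitting by position: columns match
combined = pd.get_dummies(pd.concat([toy_train, toy_test], axis=0))
enc_train = combined[:len(toy_train)]
enc_test = combined[len(toy_train):]
print(list(enc_train.columns), list(enc_test.columns))
```

A model fitted on enc_train can predict on enc_test directly, whereas the separately encoded frames would not line up column by column.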

We’ll further break the train data into x_train, x_test, y_train, and y_test so that we can measure our model’s performance.

x_train, x_test, y_train, y_test = train_test_split(train, y_train, test_size=0.2, random_state=42)

Creating the Model

Let’s define our model now. For this, we’ll use RandomForestRegressor from sklearn.ensemble.

clf = RandomForestRegressor(n_estimators=300)

We’ll use x_train and y_train to fit our model.

clf.fit(x_train, y_train)

Let’s calculate our model’s score on x_test and y_test. For a regressor, score returns the R² coefficient of determination.

clf.score(x_test,y_test)

Test score
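R² is not the only useful number here. To our knowledge, the Kaggle competition ranks submissions by the RMSE between the logarithms of predicted and actual prices, so an RMSE computed on our log-transformed target is a closely related metric. Below is a minimal sketch on synthetic data. The features, target and model settings are assumptions chosen for illustration only.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic "log price" target that depends on the first two features
rng = np.random.default_rng(3)
X = rng.normal(size=(400, 5))
y = 12.0 + 0.3 * X[:, 0] + 0.1 * X[:, 1] + rng.normal(0, 0.05, 400)

X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=42)
model = RandomForestRegressor(n_estimators=50, random_state=42).fit(X_tr, y_tr)

pred = model.predict(X_val)
r2 = r2_score(y_val, pred)                        # same value as model.score(X_val, y_val)
rmse = np.sqrt(mean_squared_error(y_val, pred))   # error in log-price units

print(f"R2: {r2:.3f}  RMSE (log scale): {rmse:.3f}")
```

Passing random_state to the regressor also makes the score reproducible from run to run, which helps when comparing different preprocessing choices.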

Prediction

We have fitted the model and seen its performance. Let us predict the prices for the houses in the actual test data.

#Making a prediction
prediction = clf.predict(test)
print(prediction)

Prediction on Actual Test data

Since we applied a log transformation (log1p) to the SalePrice column earlier, the model predicts log prices. We’ll use the inverse function, expm1, to convert the predictions back into real-world values.

#Applying the inverse of log1p, i.e. expm1
np.expm1(prediction)

After exponential transformation
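As a quick sanity check that expm1 really undoes log1p, here is a small round-trip sketch on a few example prices (the numbers are arbitrary).

```python
import numpy as np

# A few arbitrary example prices
prices = np.array([34900.0, 163000.0, 755000.0])

log_prices = np.log1p(prices)     # forward transform: log(1 + x)
restored = np.expm1(log_prices)   # inverse transform: exp(x) - 1

print(restored)
```

Using plain np.exp instead would leave every value off by exactly 1, because log1p adds 1 before taking the logarithm.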

We have completed the prediction of house prices. If you liked this tutorial, do leave a clap!