House Prices Prediction using Random Forest
A house's price can depend on surprisingly weird features. We will try to predict a house's price from its 79 features. For this purpose, we'll be using the House Prices dataset from Kaggle.
Importing Libraries
Let's first import the required libraries. If you don't have a particular library installed, run the command "pip install <package_name>" to install it.
import os
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.stats import skew
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
Loading the Datasets
We have downloaded the data from Kaggle and unzipped it into a directory named house_prices. Let's see what is inside that directory.
for dirname, _, filenames in os.walk('./house_prices'):
    for filename in filenames:
        print(filename)
We will load the train and test datasets and print their shapes.
train = pd.read_csv('./house_prices/train.csv')
test = pd.read_csv('./house_prices/test.csv')
print(f'Train shape : {train.shape}')
print(f'Test shape : {test.shape}')
Visualizing the Dataset
We'll print the heads of the train and test datasets to see how they actually look.
train.head(10)
test.head(10)
We'll use the describe function to get a description of the "SalePrice" column.
print(train['SalePrice'].describe())
Let's plot the distribution of SalePrice using a histogram. We can clearly observe that the distribution is right-skewed.
sns.histplot(train['SalePrice'],kde=True)
We'll use a log transformation (np.log1p, i.e. log(1 + x)) to reduce the skewness of the distribution.
train['SalePrice'] = np.log1p(train['SalePrice'])
sns.histplot(train['SalePrice'],kde=True)
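To see what the transformation does, here is a minimal sketch on synthetic right-skewed data. The `prices` array is made up for illustration and is not taken from the Kaggle file. It compares the skewness before and after np.log1p and checks that np.expm1 undoes the transformation.

```python
import numpy as np
from scipy.stats import skew

# Synthetic right-skewed "prices" (illustrative only, not the Kaggle data)
rng = np.random.default_rng(0)
prices = rng.lognormal(mean=12.0, sigma=0.4, size=1000)

log_prices = np.log1p(prices)      # log(1 + x)
restored = np.expm1(log_prices)    # exp(x) - 1, the inverse of log1p

skew_before = skew(prices)
skew_after = skew(log_prices)
print(f'Skew before: {skew_before:.2f}, after: {skew_after:.2f}')
```

A skewness close to 0 indicates a roughly symmetric distribution, which is what we are aiming for here.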
We will plot a heatmap of the features that are highly correlated with SalePrice (absolute correlation above 0.5). The higher the number in a cell, the stronger the correlation.
#Correlations are computed on the numeric columns only
corr = train.select_dtypes(include='number').corr()
highly_corr_features = corr.index[abs(corr["SalePrice"])>0.5]
plt.figure(figsize=(10,10))
heatmap = sns.heatmap(train[highly_corr_features].corr(),annot=True,cmap="RdYlGn")
Let's see the 10 features most correlated with SalePrice.
corr["SalePrice"].sort_values(ascending=False).head(10)
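The filtering step can be tried out on a small made-up frame. This is only a sketch: the column names `Area`, `Noise`, `Street` and `Price` are hypothetical, and the 0.5 threshold is the same one used above.

```python
import numpy as np
import pandas as pd

# Toy frame (made-up values) with one text column and three numeric ones
rng = np.random.default_rng(1)
area = rng.uniform(50, 250, size=200)
toy = pd.DataFrame({
    'Area': area,
    'Noise': rng.normal(size=200),
    'Street': ['Pave'] * 200,
    'Price': 1000 * area + rng.normal(scale=5000, size=200),
})

# Keep only numeric columns before computing correlations
corr_toy = toy.select_dtypes(include='number').corr()

# Features whose absolute correlation with the target exceeds 0.5
strong = corr_toy.index[corr_toy['Price'].abs() > 0.5]
print(list(strong))
```

The target itself always passes the filter, since its correlation with itself is 1.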
Let's plot some features against SalePrice to see how they relate to it.
fig = plt.figure(figsize=(12,10))
#GarageArea
plt.subplot(321)
sns.scatterplot(data=train, x='GarageArea', y="SalePrice")
#YearBuilt
plt.subplot(322)
sns.scatterplot(data=train, x='YearBuilt', y="SalePrice")
#WoodDeckSF
plt.subplot(323)
sns.scatterplot(data=train, x='WoodDeckSF', y="SalePrice")
#OverallQual
plt.subplot(324)
sns.scatterplot(data=train, x='OverallQual', y="SalePrice")
#BsmtUnfSF
plt.subplot(325)
sns.scatterplot(data=train, x='BsmtUnfSF', y="SalePrice")
#TotalBsmtSF
plt.subplot(326)
sns.scatterplot(data=train, x='TotalBsmtSF', y="SalePrice")
We will concatenate the train and test datasets to make the preprocessing easy. Later, we will divide them again.
data = pd.concat([train,test], axis=0)
y_train = train['SalePrice']
data = data.drop(['Id', 'SalePrice'], axis=1)
print(data.shape)
Let's get information about the concatenated dataframe.
data.info()
To see the distribution of data types across the columns, we will plot a pie chart.
data.dtypes.value_counts().plot.pie()
Let's print the number of unique values in each column.
print('UNIQUE VALUES\n')
for col in data.columns:
    print(f'{col}: {len(data[col].unique())}\n')
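As a possible shortcut, pandas can produce the same counts in one call. The sketch below uses a made-up two-column frame; `dropna=False` is passed because `unique()` in the loop above also counts a missing value as a distinct entry.

```python
import pandas as pd

# Toy frame (made-up values) with one missing entry
toy = pd.DataFrame({
    'Street': ['Pave', 'Grvl', 'Pave', None],
    'Cars': [1, 2, 2, 2],
})

# dropna=False counts a missing value as its own category
with_nan = toy.nunique(dropna=False)
# The default ignores missing values
without_nan = toy.nunique()
print(with_nan, without_nan, sep='\n')
```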
We'll describe each column in data apart from the ones with the object data type.
data[data.select_dtypes(exclude='object').columns].describe()
Let's plot a heatmap so that we can clearly see the null values present across all the columns.
#Visualizing the null values in all columns
plt.figure(figsize=(30,8));
sns.heatmap(data.isnull(), cmap='flare');
We'll print the total count and percentage of null values in the columns.
#Columns containing most null values
total = data.isnull().sum().sort_values(ascending=False)
percent = (data.isnull().sum() / data.isnull().count()*100).sort_values(ascending=False)
missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
print(missing_data.head(10))
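The same kind of summary can be written a little more compactly, since the mean of a boolean mask is the fraction of True values. A minimal sketch on a made-up frame (the values are invented; only the column names echo the dataset):

```python
import numpy as np
import pandas as pd

# Toy frame (made-up values) with different amounts of missing data
toy = pd.DataFrame({
    'PoolQC': [np.nan, np.nan, np.nan, 'Gd'],
    'GarageArea': [400.0, np.nan, 520.0, 300.0],
    'LotArea': [8450, 9600, 11250, 9550],
})

# isnull().mean() is the share of missing rows per column
missing_toy = pd.DataFrame({
    'Total': toy.isnull().sum(),
    'Percent': toy.isnull().mean() * 100,
}).sort_values('Total', ascending=False)
print(missing_toy)
```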
Feature Engineering
If we look at the features with more than 5 missing values, we'll note that they are not important and none of them has a correlation above 0.5. So, we can delete them without losing any significant detail.
We'll then see which of the remaining columns have null values.
#Dropping columns with > 5 null values
data.drop((missing_data[missing_data['Total'] > 5]).index, axis=1, inplace=True)
#Sorting columns w.r.t null values
total = data.isnull().sum().sort_values(ascending=False)
total.head(20)
We will now fill the missing values in numeric columns with 0s and the missing values in categorical columns with the most frequently occurring value (mode).
We'll also delete the column "Utilities", as almost every house has the same value, "AllPub".
#Filling the numeric data
numeric_missed = ['BsmtFinSF1',
                  'BsmtFinSF2',
                  'BsmtUnfSF',
                  'TotalBsmtSF',
                  'BsmtFullBath',
                  'BsmtHalfBath',
                  'GarageArea',
                  'GarageCars']
for feature in numeric_missed:
    data[feature] = data[feature].fillna(0)
#Filling the categorical data
categorical_missed = ['Exterior1st',
                      'Exterior2nd',
                      'SaleType',
                      'MSZoning',
                      'Electrical',
                      'KitchenQual',
                      'Functional']
for feature in categorical_missed:
    data[feature] = data[feature].fillna(data[feature].mode()[0])
#Deleting 'Utilities' column
data.drop(['Utilities'], axis=1, inplace=True)
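The two filling strategies can be seen in isolation on a tiny made-up frame. This is a sketch only: filling with 0 assumes that a missing area or count means the feature is absent (for example, no garage), which is a modelling choice rather than a certainty.

```python
import numpy as np
import pandas as pd

# Toy frame (made-up values) with one missing entry per column
toy = pd.DataFrame({
    'GarageArea': [400.0, np.nan, 520.0],
    'KitchenQual': ['TA', None, 'TA'],
})

# Numeric column: treat "missing" as 0
toy['GarageArea'] = toy['GarageArea'].fillna(0)

# Categorical column: use the most frequent value; mode() returns a Series
most_common = toy['KitchenQual'].mode()[0]
toy['KitchenQual'] = toy['KitchenQual'].fillna(most_common)
print(toy)
```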
Now, let's see if we still have any columns with null values. As shown in the output, there are no more null values left.
#Checking for any remaining null values
data.isnull().sum().max()
Let's find the most skewed columns in data.
#Top skewed columns
numeric_features = data.dtypes[data.dtypes != 'object'].index
skewed_features = data[numeric_features].apply(lambda x: skew(x)).sort_values(ascending=False)
high_skew = skewed_features[abs(skewed_features) > 0.5]
print(high_skew)
Let's apply log transformations to all these skewed columns.
#Transforming skewed columns
for feature in high_skew.index:
data[feature] = np.log1p(data[feature])
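Here is a small sketch of the same select-and-transform step on synthetic columns (the values are generated, not taken from the dataset). Note that np.log1p is only defined for values greater than -1, so this approach assumes the columns are non-negative.

```python
import numpy as np
import pandas as pd
from scipy.stats import skew

# Synthetic columns: one right-skewed, one roughly uniform
rng = np.random.default_rng(2)
toy = pd.DataFrame({
    'LotArea': rng.lognormal(mean=9.0, sigma=0.8, size=500),
    'YearBuilt': rng.integers(1950, 2010, size=500),
})

# Skewness per column, then keep those above the 0.5 threshold
skewness = toy.apply(skew)
high = skewness[skewness.abs() > 0.5]

# Transform only the skewed columns
for col in high.index:
    toy[col] = np.log1p(toy[col])

print(list(high.index), skew(toy['LotArea']))
```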
Let's convert the categorical variables into numeric variables. We'll do this using the get_dummies() method.
#Converting categorical data to numerical
data = pd.get_dummies(data)
data.head()
We'll split the data back into train and test. We will make use of the y_train which we created earlier.
#Dividing data back into train & test
train = data[:len(y_train)]
test = data[len(y_train):]
#Printing their shapes
print(train.shape, test.shape)
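One reason the concatenate, encode, split pattern is convenient: encoding train and test separately can yield different dummy columns when a category appears in only one of them. A minimal sketch with a made-up `Roof` column:

```python
import pandas as pd

# Toy frames (made-up values); 'Hip' and 'Flat' each appear in only one part
train_toy = pd.DataFrame({'Roof': ['Gable', 'Hip', 'Gable']})
test_toy = pd.DataFrame({'Roof': ['Gable', 'Flat']})

# Encoding separately gives different column sets
separate_train = pd.get_dummies(train_toy)
separate_test = pd.get_dummies(test_toy)

# Encoding the concatenated frame gives both parts the same columns
combined = pd.get_dummies(pd.concat([train_toy, test_toy], axis=0))
enc_train = combined[:len(train_toy)]
enc_test = combined[len(train_toy):]

print(list(separate_train.columns), list(separate_test.columns))
print(list(enc_train.columns))
```

With identical columns in both parts, a model fitted on the train rows can be applied to the test rows directly.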
We'll further split the train data into x_train, x_test, y_train, and y_test so that we can measure our model's performance on data it has not seen during training.
x_train, x_test, y_train, y_test = train_test_split(train, y_train, test_size=0.2, random_state=42)
Creating the Model
Let's define our model now. For this, we'll use RandomForestRegressor from sklearn.ensemble.
clf = RandomForestRegressor(n_estimators=300)
We'll use x_train and y_train to fit our model.
clf.fit(x_train, y_train)
Let's calculate our model's score using x_test and y_test. For a regressor, score returns the coefficient of determination (R²).
clf.score(x_test,y_test)
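To make the meaning of the score concrete, here is a self-contained sketch on synthetic data (the features and target are generated, and `n_estimators=50` is chosen only to keep it fast). It shows that score matches sklearn's r2_score, and computes the RMSE as a complementary error measure in the units of the target.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic regression problem (illustrative only)
rng = np.random.default_rng(3)
X = rng.uniform(0, 10, size=(300, 3))
y = 2.0 * X[:, 0] + X[:, 1] + rng.normal(scale=0.5, size=300)

X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# random_state makes the forest reproducible between runs
model = RandomForestRegressor(n_estimators=50, random_state=42)
model.fit(X_tr, y_tr)

val_pred = model.predict(X_val)
r2 = model.score(X_val, y_val)                        # R² on held-out data
rmse = np.sqrt(mean_squared_error(y_val, val_pred))   # error in target units
print(f'R2: {r2:.3f}, RMSE: {rmse:.3f}')
```

In the tutorial the target is the log of the price, so an RMSE computed this way would be on the log scale.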
Prediction
We have fitted the model and seen its performance. Let us predict the prices for the houses in the actual test data.
#Making a prediction
prediction = clf.predict(test)
print(prediction)
Since we applied a log transformation (np.log1p) to the SalePrice column earlier, we'll use its inverse, np.expm1, to convert the predictions back into real-world prices.
#Applying the inverse of log1p, i.e. expm1
np.expm1(prediction)
We have completed the prediction of house prices. If you liked this tutorial, do leave a clap!