A simple deep learning example of house price prediction using the Sequential model and Dense layers of Keras.

Given data

House sale prices for King County homes sold between May 2014 and May 2015.

Problem

Build a model that predicts the price of a house, given a set of features of the house.

Dataset Link

https://www.kaggle.com/harlfoxem/housesalesprediction

Solution Implementation

Tensorflow 2.0 with Keras — ReLU

1. Loading all the Libraries
import pandas as pd
import numpy as np
import seaborn as sns
from matplotlib import pyplot as plt
from sklearn.model_selection import train_test_split
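from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import mean_squared_error, mean_absolute_error, explained_variance_score # used in Model Evaluation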
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

2. Exploratory Data Analysis

data = pd.read_csv('DATA/kc_house_data.csv')

(i) Peek at the Data

data.head()

Observations :

  • Each house is given a unique ID, which will not be useful for our modelling.
  • The date of sale is given; the day may not be useful, but we can analyze the month and year.

(ii) Check stats and any nulls in data

data.isnull().sum()
data.describe().transpose()

Observations :

  • No null values are present in the dataset.
  • There are 21 columns in total, including the id, the date and the target price.
  • The summary statistics alone do not tell us much, but the price ranges from 78,000 USD to 7,700,000 USD.
  • Visualizing the distribution of the price should help us understand it better.

(iii) Examine the distribution of Price

plt.figure(figsize=(15,8))
sns.histplot(data['price'],kde=True) # distplot is deprecated in recent seaborn releases
plt.show()

Observations :

  • Most houses fall roughly in the 78,000 to 3,000,000 USD price range; the few houses priced above that can be considered outliers.
  • The peak is somewhere around 500,000 USD, meaning a typical house in the area sells for about 500K USD.
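To put numbers on the skew seen in the plot, a quick look at percentiles can help. The sketch below is only illustrative: `prices` is a synthetic, right-skewed sample (an assumption, not the King County data); with the real dataset you would use `data['price']` instead.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed prices standing in for data['price'].
rng = np.random.default_rng(0)
prices = pd.Series(rng.lognormal(mean=13.0, sigma=0.5, size=10_000))

# Percentiles show where the bulk of the values sit and how long the tail is.
summary = prices.quantile([0.5, 0.95, 0.99])
print(summary)

# In a right-skewed distribution the mean is pulled above the median.
print("mean > median:", prices.mean() > prices.median())
```

A large gap between the 95th and 99th percentiles is a hint that a small number of very expensive houses dominate the tail.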

(iv) Correlation of Features with respect to Price

data.corr(numeric_only=True)['price'].sort_values() # numeric_only skips the text 'date' column

Observations — Crucial Features :

  • Living-area square footage (sqft_living) is highly correlated with price. Intuitively, even in India, flats are priced per square foot, with amenities accounting for the rest.
  • The number of bedrooms is another factor to consider, as it generally matters to buyers.
  • Latitude and longitude are also crucial, since the area or locality influences prices.
  • The zip code (pincode) helps identify the location, such as the city or town.
  • Year renovated is another key factor, as recently renovated houses normally sell for more than houses that were never renovated (excluding newly built houses).
  • One more interesting feature is waterfront, which marks the water/sea-facing houses.
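The ranking above comes from the Pearson correlation of each column with the target. As a minimal sketch of how that ranking behaves, the example below builds a made-up frame (`toy`, with an invented relationship between living area and price) and sorts the correlations the same way; none of the numbers come from the real dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 500
sqft = rng.uniform(500, 4000, n)
toy = pd.DataFrame({
    "sqft_living": sqft,
    "bedrooms": rng.integers(1, 6, n),
    # Price driven mostly by living area, plus noise (made-up relationship).
    "price": 200 * sqft + rng.normal(0, 50_000, n),
})

# Pearson correlation of every column with the target, weakest first.
corr_with_price = toy.corr()["price"].drop("price").sort_values()
print(corr_with_price)
```

Keep in mind that correlation only captures linear relationships, so a feature such as latitude can matter a lot even if its coefficient looks modest.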

(v) Analysis of Living Area SFT with respect to Price

plt.figure(figsize=(8,4))
sns.scatterplot(x='price',y='sqft_living',data=data)
plt.show()

Observations

  • As examined previously, most houses fall in the 78,000 to 3,000,000 USD price range, with notable outliers above roughly 4,000,000 USD.

Analysis of Bedrooms with respect to Price

plt.figure(figsize=(15,8))
sns.boxplot(x='bedrooms',y='price',data=data)
plt.show()
sns.countplot(x='bedrooms',data=data) # pass the column by keyword; newer seaborn no longer accepts it positionally
plt.show()

Observations

  • We cannot infer much from the box plot, so we draw a count plot to understand the distribution of bedrooms.
  • Most houses have 3 bedrooms, followed by 4 and 2.
  • Although the bedroom count is numeric, it takes discrete values.

(vi) Latitude and Longitude with Price ranges

plt.figure(figsize=(12,8))
sns.scatterplot(x='long',y='lat',data=data,hue='price')
plt.show()

Observations :

  • Though the graph is a bit unclear, we can see dark spots roughly at (-122.2, 47.6), indicating higher prices around that area.
  • The outliers may be making the plot harder to read, so let us remove the very expensive houses.

Check top 20 highly priced values :

data.sort_values('price',ascending=False).head(20)['price']

Observations :

  • Within just the top 20 observations, the price drops from 7,700,000 USD to 3,640,000 USD, roughly a 50% reduction, which needs to be addressed.
  • We can remove the most expensive 1% of houses from the data.
# Sort by price (highest first) and skip the top 1%, keeping the remaining 99% of the data
trimmed_data = data.sort_values('price',ascending=False).iloc[int(np.ceil(len(data) * 0.01)) :]
plt.figure(figsize=(12,8))
sns.scatterplot(x='long',y='lat',data=trimmed_data,hue='price',alpha=0.1,palette='RdYlGn')
plt.show()
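The trimming step above sorts by price and skips the first 1% of rows. A similar result can be had with a quantile cutoff; the sketch below compares the two on a toy frame. The names `toy`, `n_drop` and `cutoff` are illustrative assumptions, and the two approaches can differ slightly on real data when prices are tied.

```python
import numpy as np
import pandas as pd

# Toy prices 1..1000 standing in for data['price'].
toy = pd.DataFrame({"price": np.arange(1, 1001, dtype=float)})

# Approach used above: sort descending and skip the first 1% of rows.
n_drop = int(np.ceil(len(toy) * 0.01))
trimmed_sorted = toy.sort_values("price", ascending=False).iloc[n_drop:]

# Alternative: keep rows at or below the 99th percentile.
cutoff = toy["price"].quantile(0.99)
trimmed_quantile = toy[toy["price"] <= cutoff]

print(len(trimmed_sorted), len(trimmed_quantile))
```

The quantile version keeps the original row order, which can be convenient if later steps rely on the index.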

Observations :

  • Now we can see green patches/dots clustered around a water body.
  • Let us look at how price relates to the waterfront feature.

(vii) Impact of Water Front House with Price

sns.boxplot(x='waterfront',y='price',data=data)

Observations :

  • As expected, waterfront houses are somewhat pricier than those without a waterfront.

3. Feature Engineering

  • The house ID is not going to influence the house price, so we drop it.
  • The date column can be broken down into year and month, in the hope of finding a trend in housing prices.
  • We also need to examine the zipcode and yr_renovated columns and check whether they should be one-hot encoded.
data = data.drop('id',axis=1)
data['date'] = pd.to_datetime(data['date'])
data['year'] = data['date'].dt.year   # vectorised datetime accessor
data['month'] = data['date'].dt.month
data = data.drop('date',axis=1)

(i) Impact of Month and Year on Price

plt.figure(figsize=(12,8))
sns.boxplot(x='month',y='price',data=data)
plt.show()
data.groupby('month').mean()['price'].plot()
plt.show()
data.groupby('year').mean()['price'].plot()
plt.show()

Observations :

  • The mean price fluctuates with the month. Even though it only varies between 510K and 560K USD, we will keep the feature.
  • The mean price rises from 2014 to 2015, which may indicate growing demand for houses as more people move into the area.

(ii) Zipcode with Price

data['zipcode'].value_counts()
data = data.drop('zipcode',axis=1)

Observations :

  • There are 70 zip codes in total. If we kept them, we would need to one-hot encode them, since they should not be treated as numeric, and that would result in roughly 90 features. For simplicity, we drop the column here.
  • Another option is to manually group the zip codes geographically into fewer than 10 groups.
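For readers who want to keep the zip codes instead of dropping them, one-hot encoding could look roughly like the sketch below. It is not part of the model built in this post; `toy` is a tiny made-up sample and the `zip` prefix is an arbitrary choice.

```python
import pandas as pd

# Tiny made-up sample; the real column has 70 distinct zip codes.
toy = pd.DataFrame({
    "zipcode": [98178, 98125, 98028, 98178],
    "sqft_living": [1180, 2570, 770, 1960],
})

# Treat zipcode as a category and expand it into one indicator column per code.
encoded = pd.get_dummies(toy, columns=["zipcode"], prefix="zip")
print(encoded.columns.tolist())
```

Each distinct zip code becomes its own indicator column, which is why 70 codes push the feature count to around 90.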

(iii) Year Renovated with Price

data['yr_renovated'].value_counts()
print("Median Price of Homes renovated in 2014",data[(data['yr_renovated']==2014)]['price'].median())
print("Median Price of Homes renovated in 2013" ,data[(data['yr_renovated']==2013)]['price'].median())
print("Median Price of Homes renovated in 2012" ,data[(data['yr_renovated']==2012)]['price'].median())

Observations :

  • Intuitively, recently renovated homes are priced higher than the others.
  • We use the median to limit the effect of outliers; to compare means fairly, we would need to account for all the other features of the houses (or compare similarly featured houses).
  • So, the later the renovation year, the higher the price.
  • There is no need to encode the years as categories, since the number itself has a direct relationship with price (houses that were never renovated are recorded as 0).
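The choice of median over mean can be illustrated with a few invented numbers. In the sketch below, `prices` is a hypothetical set of sale prices for one renovation year, with a single extreme sale added to mimic an outlier.

```python
import pandas as pd

# Hypothetical prices for homes renovated in one year, with one extreme sale.
prices = pd.Series([400_000, 450_000, 500_000, 550_000, 7_000_000])

# The single outlier drags the mean far above what a typical home costs,
# while the median stays at the middle value.
print("mean  :", prices.mean())
print("median:", prices.median())
```

With so few renovated homes per year, one luxury sale can easily distort the mean, so the median is the safer summary here.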

4. Model Building

X = data.drop('price',axis=1).values
y = data['price'].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101)
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
model = Sequential()
model.add(Dense(19,activation='relu')) # We are using 19 units for 19 Features
model.add(Dense(19,activation='relu'))
model.add(Dense(19,activation='relu'))
model.add(Dense(19,activation='relu'))
model.add(Dense(1))
model.compile(optimizer='adam',loss='mse')
model.fit(X_train,y_train,validation_data=(X_test,y_test),batch_size=128,epochs=400,verbose=0)
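Note that the scaler above is fitted on the training split only and then reused on the test split, so no information from the test data leaks into training. The NumPy sketch below shows the arithmetic that min-max scaling to the default [0, 1] range performs; it is an illustration with made-up matrices, not scikit-learn's implementation, and `col_min` / `col_max` are names chosen for this example.

```python
import numpy as np

# Made-up feature matrices: 2 features with very different ranges.
X_train = np.array([[1000.0, 1.0], [2000.0, 3.0], [3000.0, 5.0]])
X_test = np.array([[1500.0, 2.0], [4000.0, 6.0]])

# "fit": learn the per-column min and max from the training split only.
col_min = X_train.min(axis=0)
col_max = X_train.max(axis=0)

# "transform": apply the same training statistics to both splits.
X_train_scaled = (X_train - col_min) / (col_max - col_min)
X_test_scaled = (X_test - col_min) / (col_max - col_min)

print(X_train_scaled)
print(X_test_scaled)  # test values can fall outside [0, 1]
```

Scaled test values above 1 or below 0 are expected whenever the test split contains values outside the range seen in training.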

5. Model Evaluation

losses = pd.DataFrame(model.history.history)
losses.plot()
plt.show()
predictions = model.predict(X_test)
print("Mean Squared Error =",mean_squared_error(y_test,predictions))
print("Mean Absolute Error= ",mean_absolute_error(y_test,predictions))
print("Explained Variance = ",explained_variance_score(y_test,predictions))
plt.figure(figsize=(12,6))
plt.scatter(y_test,predictions)
plt.plot(y_test,y_test,'r')
plt.show()

Given a mean price of about 540K USD, our model's predictions are off by about 101K USD, roughly 19% of the mean. That makes it an acceptable model, but not good enough for real-world ML scenarios.
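To make the metrics printed above more concrete, here is a small NumPy sketch of what they compute and of the error-to-mean comparison. The arrays `y_true` and `y_pred` are invented for illustration; only the 101K and 540K figures in the last line come from this post.

```python
import numpy as np

# Made-up true prices and predictions, in USD.
y_true = np.array([300_000.0, 450_000.0, 600_000.0, 800_000.0])
y_pred = np.array([350_000.0, 400_000.0, 650_000.0, 700_000.0])

errors = y_true - y_pred
mae = np.mean(np.abs(errors))          # mean absolute error
rmse = np.sqrt(np.mean(errors ** 2))   # root of the mean squared error
explained_var = 1 - np.var(errors) / np.var(y_true)

# Error as a share of the mean price, the comparison made in the text above.
relative_mae = mae / y_true.mean()
print(mae, rmse, explained_var, relative_mae)

# The post's own figures: a 101K error against a 540K mean price.
print(101_000 / 540_000)
```

Expressing the error relative to the mean price makes it easier to judge whether a given error is acceptable for the price range at hand.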
