## Given data

House sale prices for King County homes sold between May 2014 and May 2015.

## Problem

Build a model that predicts the price of a house, given a set of features of the house.

https://www.kaggle.com/harlfoxem/housesalesprediction

## Solution Implementation

TensorFlow 2.0 with Keras, using ReLU activations

`import pandas as pd`
`import numpy as np`
`import seaborn as sns`
`from matplotlib import pyplot as plt`
`from sklearn.model_selection import train_test_split`
`from sklearn.preprocessing import MinMaxScaler`
`from sklearn.metrics import mean_squared_error, mean_absolute_error, explained_variance_score`
`from tensorflow.keras.models import Sequential`
`from tensorflow.keras.layers import Dense`

2. Exploratory Data Analysis

`data = pd.read_csv('DATA/kc_house_data.csv')`

(i) Peek at the Data

`data.head()`

Observations:

• Each house has a unique ID, which is not useful for modelling.
• The date of sale is given. The day is unlikely to be useful, but the month and year can be analyzed.

(ii) Check stats and any nulls in the data

`data.isnull().sum()`
`data.describe().transpose()`

Observations:

• There are no null values anywhere in the dataset.
• There are 21 columns in total, including the target `price`.
• The summary statistics alone do not tell us much, but price ranges from 78,000 USD to 7,700,000 USD.
• Next, visualize the distribution of price to understand it better.

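(iii) Examine the distribution of Price

`plt.figure(figsize=(15,8))`
`sns.distplot(data['price'])  # deprecated since seaborn 0.11; sns.histplot(data['price'], kde=True) is the newer equivalent`
`plt.show()`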

Observations:

• Most houses fall roughly in the 78,000 to 3,000,000 USD range; the few houses priced above that can be treated as outliers.
• The peak is somewhere around 500,000 USD, meaning a typical house in the area costs about 500K USD.
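To put numbers on what the plot suggests, quantiles and a mean-versus-median comparison are a quick check. The sketch below uses a small hand-made `prices` series (hypothetical values, not taken from the King County file) purely to show the idea.

```python
import pandas as pd

# Hypothetical, hand-made prices in USD; NOT values from the real dataset.
prices = pd.Series(
    [220_000, 310_000, 350_000, 420_000, 450_000,
     480_000, 560_000, 640_000, 900_000, 5_200_000],
    name="price",
)

# Quantiles summarise where the bulk of the distribution sits.
summary = prices.quantile([0.25, 0.50, 0.75, 0.99])

# A long right tail (a few very expensive houses) pulls the mean above the median.
mean_price = prices.mean()
median_price = prices.median()

print(summary)
print(f"mean={mean_price:,.0f}  median={median_price:,.0f}")
```

On the real data, the same two calls on `data['price']` would show how far the tail stretches beyond the typical house.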

(iv) Correlation of Features with respect to Price

`data.select_dtypes('number').corr()['price'].sort_values()  # the date column is still a string here, so keep numeric columns only`

Observations (crucial features):

• Square footage is highly correlated with price. Intuitively this makes sense: in India, too, flats are priced per square foot, with amenities making up the rest.
• The number of bedrooms is another factor to consider, as it generally matters to buyers.
• Latitude and longitude are also crucial, since the area or locality strongly influences price.
• The zip code (pin code) helps identify the location, such as the city or town.
• Year renovated is another key factor: recently renovated houses normally sell for more than houses that were never renovated (newly built houses excluded).
• One more interesting feature is waterfront, which marks houses facing the water.
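As an illustration of how the correlation ranking is produced, here is a minimal sketch on a made-up frame (the `toy` name and all values are hypothetical). It keeps only numeric columns first, because newer pandas versions raise an error when `corr()` meets a string column such as the unparsed date.

```python
import pandas as pd

# Hypothetical toy data; the column names mirror the dataset, the values do not.
toy = pd.DataFrame({
    "price": [300_000, 450_000, 500_000, 650_000, 900_000],
    "sqft_living": [1000, 1500, 1800, 2300, 3200],
    "bedrooms": [2, 3, 3, 4, 4],
    "date": ["20141013T000000", "20141209T000000", "20150225T000000",
             "20140512T000000", "20150115T000000"],  # still plain strings
})

# Pearson correlation of every numeric column with price, weakest first.
ranking = toy.select_dtypes("number").corr()["price"].sort_values()
print(ranking)
```

The last entry is always `price` itself with a correlation of 1. Pearson correlation only captures linear relationships, so a feature such as latitude or longitude can matter for price even when its coefficient looks modest.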

(v) Analysis of Living Area SFT with respect to Price

`plt.figure(figsize=(8,4))`
`sns.scatterplot(x='price',y='sqft_living',data=data)`
`plt.show()`

Observations:

• As seen in the price distribution, most houses sit in the lower part of the price range, with a few notable outliers above roughly 4,000,000 USD.

Analysis of Bedrooms with respect to Price

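`plt.figure(figsize=(15,8))`
`sns.boxplot(x='bedrooms',y='price',data=data)`
`plt.show()`
`sns.countplot(x='bedrooms',data=data)  # pass the column by name; recent seaborn treats the first positional argument as data`
`plt.show()`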

Observations:

• The box plot alone does not tell us much, so we also draw a count plot to see how the number of bedrooms is distributed.
• Most houses have 3 bedrooms, followed by 4 and 2.
• Although the number of bedrooms is numeric, it is a discrete value.

(vi) Latitude and Longitude with Price ranges

`plt.figure(figsize=(12,8))`
`sns.scatterplot(x='long',y='lat',data=data,hue='price')`
`plt.show()`

Observations:

• Although the plot is a bit unclear, dark spots are visible at roughly (-122.2, 47.6), indicating higher prices around that area.
• The outliers may be making the plot harder to read, so let us remove the very expensive houses.

Check the 20 highest prices:

`data.sort_values('price',ascending=False).head(20)['price']`

Observations:

• Within just the top 20 sales, the price drops from 7,700,000 USD to 3,640,000 USD, a reduction of roughly 50%, which needs to be addressed.
• We can remove the most expensive 1% of the houses.

`# Drop the top 1% by price; the remaining 99% stay sorted by decreasing price`
`trimmed_data = data.sort_values('price',ascending=False).iloc[int(np.ceil(len(data) * 0.01)):]`
`plt.figure(figsize=(12,8))`
`sns.scatterplot(x='long',y='lat',data=trimmed_data,hue='price',alpha=0.1,palette='RdYlGn')`
`plt.show()`
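The slicing arithmetic in the trimming step can be checked on a small example. The sketch below uses a hypothetical frame of 250 prices (random values plus three extreme ones, none from the real data) and applies the same sort-then-`iloc` pattern.

```python
import numpy as np
import pandas as pd

# Hypothetical prices: 247 ordinary houses plus 3 extreme ones.
rng = np.random.default_rng(0)
prices = np.concatenate([
    rng.uniform(200_000, 900_000, size=247),
    [3_500_000, 5_000_000, 7_700_000],
])
toy = pd.DataFrame({"price": prices})

# Number of rows that make up the top 1 %, rounded up: ceil(250 * 0.01) = 3.
n_drop = int(np.ceil(len(toy) * 0.01))

# Sort from most to least expensive, then skip the first n_drop rows.
trimmed = toy.sort_values("price", ascending=False).iloc[n_drop:]

print(n_drop, len(trimmed), trimmed["price"].max())
```

An alternative with a similar effect is to filter on a quantile, for example keeping rows where `price` is at most `toy["price"].quantile(0.99)`; the two approaches can differ by a row or so when prices tie at the cut-off.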

Observations:

• Now we can see green patches of dots (green marks the higher prices in the `RdYlGn` palette) around a body of water.
• Let us look at how price relates to the waterfront feature.

(vii) Impact of Water Front House with Price

`sns.boxplot(x='waterfront',y='price',data=data)`

Observations:

• As expected, waterfront houses are somewhat more expensive than houses without a waterfront.
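The box plot gives a visual impression; the group sizes and medians can also be read off directly with a `groupby`. This is a minimal sketch on hypothetical values (the `toy` frame is invented for illustration).

```python
import pandas as pd

# Hypothetical toy data: 1 marks a waterfront house, 0 a non-waterfront one.
toy = pd.DataFrame({
    "waterfront": [0, 0, 0, 0, 0, 1, 1, 1],
    "price": [300_000, 420_000, 450_000, 510_000, 640_000,
              900_000, 1_400_000, 2_100_000],
})

# Count and median price per group; the count shows how small a group is.
by_waterfront = toy.groupby("waterfront")["price"].agg(["count", "median"])
print(by_waterfront)
```

Looking at the count alongside the median helps judge how much weight a difference between the groups deserves.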

3. Feature Engineering

• The house ID is not going to influence the house price, so it is dropped.
• The date column can be broken down into year and month, in the hope of finding a trend in housing prices.
• We also need to look at the zip code and year renovated columns and decide whether they should be one-hot encoded.

`data = data.drop('id',axis=1)`
`data['date'] = pd.to_datetime(data['date'])`
`data['year'] = data['date'].apply(lambda date : date.year)`
`data['month'] = data['date'].apply(lambda date : date.month)`
`data = data.drop('date',axis=1)`
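The same year and month extraction can also be written with the vectorized `.dt` accessor instead of `apply`. The sketch below runs on three hypothetical date strings; the explicit `format` argument is an assumption about how the dates are stored and can be omitted if pandas infers the format correctly.

```python
import pandas as pd

# Hypothetical date strings in a compact ISO-like layout (an assumption).
toy = pd.DataFrame({
    "date": ["20141013T000000", "20150225T000000", "20140502T000000"],
})

# Parse the strings into datetimes, then pull out the parts we want.
toy["date"] = pd.to_datetime(toy["date"], format="%Y%m%dT%H%M%S")
toy["year"] = toy["date"].dt.year
toy["month"] = toy["date"].dt.month

# The raw date is no longer needed once its parts are separate columns.
toy = toy.drop("date", axis=1)
print(toy)
```

Both versions produce integer `year` and `month` columns; the `.dt` form avoids calling a Python function once per row.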

(i) Impact of Month, Year on Price

`plt.figure(figsize=(12,8))`
`sns.boxplot(x='month',y='price',data=data)`
`plt.show()`
`data.groupby('month').mean()['price'].plot()`
`plt.show()`
`data.groupby('year').mean()['price'].plot()`
`plt.show()`

Observations:

• The mean price fluctuates with the month. Although it only varies between about 510K and 560K USD, we will keep the feature.
• The mean price rises from 2014 to 2015, which may reflect more people moving into the area and a growing demand for houses.

(ii) Zipcode with Price

`data['zipcode'].value_counts()`
`data = data.drop('zipcode',axis=1)`

Observations:

• There are 70 distinct zip codes. They should not be treated as numeric, so using them would require one-hot encoding, which would grow the feature set to roughly 90 columns.
• An alternative is to group the zip codes manually by geography into fewer than 10 groups.
• For simplicity, the column is dropped here.
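To show what one-hot encoding the zip codes would look like, here is a minimal sketch with `pd.get_dummies` on a hypothetical four-row frame (the values are illustrative only). Each distinct zip code becomes its own indicator column, which is why 70 codes would add about 70 columns.

```python
import pandas as pd

# Hypothetical toy data with three distinct zip codes.
toy = pd.DataFrame({
    "sqft_living": [1180, 2570, 770, 1960],
    "zipcode": [98178, 98125, 98028, 98178],
    "price": [221_900, 538_000, 180_000, 604_000],
})

# Replace the zipcode column with one indicator column per distinct code.
encoded = pd.get_dummies(toy, columns=["zipcode"], prefix="zip")

print(encoded.columns.tolist())
```

Passing `drop_first=True` would drop one indicator per encoded column, a common choice for linear models to avoid redundant columns.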

(iii) Year Renovated with Price

`data['yr_renovated'].value_counts()`
`print("Median Price of Homes renovated in 2014",data[(data['yr_renovated']==2014)]['price'].median())`
`print("Median Price of Homes renovated in 2013" ,data[(data['yr_renovated']==2013)]['price'].median())`
`print("Median Price of Homes renovated in 2012" ,data[(data['yr_renovated']==2012)]['price'].median())`

Observations:

• Intuitively, recently renovated homes are priced higher than the others.
• We use the median to limit the effect of outliers. A comparison of means would only be fair between houses with otherwise similar features.
• So the later the renovation year, the higher the price tends to be.
• There is no need to encode the years as categories, since the number itself has a direct relationship with price.
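Why the median is the safer summary here can be seen with a small worked example. The numbers below are hypothetical; they only illustrate how a single very expensive house moves the mean far more than the median.

```python
from statistics import mean, median

# Hypothetical prices for homes renovated in the same year.
prices = [400_000, 450_000, 500_000, 550_000, 600_000]

# The same group with one extreme sale added.
with_outlier = prices + [5_000_000]

mean_before, median_before = mean(prices), median(prices)
mean_after, median_after = mean(with_outlier), median(with_outlier)

print(f"mean:   {mean_before:,.0f} -> {mean_after:,.0f}")
print(f"median: {median_before:,.0f} -> {median_after:,.0f}")
```

In this example the mean jumps from 500,000 to 1,250,000, while the median only moves from 500,000 to 525,000.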

4. Model Building

`X = data.drop('price',axis=1).values`
`y = data['price'].values`
`X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101)`
`scaler = MinMaxScaler()`
`X_train = scaler.fit_transform(X_train)`
`X_test = scaler.transform(X_test)`
`model = Sequential()`
`model.add(Dense(19,activation='relu')) # We are using 19 units for 19 features`
`model.add(Dense(19,activation='relu'))`
`model.add(Dense(19,activation='relu'))`
`model.add(Dense(19,activation='relu'))`
`model.add(Dense(1))`
`model.compile(optimizer='adam',loss='mse')`
`model.fit(X_train,y_train,validation_data=(X_test,y_test),batch_size=128,epochs=400,verbose=0)`
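The scaler above is fitted on the training split only and then reused on the test split, so no information from the test data leaks into preprocessing. The sketch below reproduces, by hand in NumPy, what min-max scaling with the default 0 to 1 range computes; the helper `min_max_scale` and the tiny arrays are hypothetical stand-ins, not part of the original code.

```python
import numpy as np

# Hypothetical features: column 0 = living area, column 1 = bedrooms.
X_train = np.array([[1000.0, 2.0], [2000.0, 3.0], [3000.0, 5.0]])
X_test = np.array([[1500.0, 4.0], [3500.0, 1.0]])

# "Fit": learn per-column minimum and maximum from the training data only.
col_min = X_train.min(axis=0)
col_max = X_train.max(axis=0)

def min_max_scale(X):
    """Map each column so that the training min -> 0 and training max -> 1."""
    return (X - col_min) / (col_max - col_min)

# "Transform": apply the training statistics to both splits.
X_train_scaled = min_max_scale(X_train)
X_test_scaled = min_max_scale(X_test)

print(X_train_scaled)
print(X_test_scaled)
```

Because the statistics come from the training split, test values can land slightly outside the 0 to 1 range (here 3500 sq ft maps to 1.25), which is expected and harmless.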

5. Model Evaluation

`losses = pd.DataFrame(model.history.history)`
`losses.plot()`
`plt.show()`
`predictions = model.predict(X_test)`
`print("Mean Squared Error =",mean_squared_error(y_test,predictions))`
`print("Mean Absolute Error= ",mean_absolute_error(y_test,predictions))`
`print("Explained Variance = ",explained_variance_score(y_test,predictions))`
`plt.figure(figsize=(12,6))`
`plt.scatter(y_test,predictions)`
`plt.plot(y_test,y_test,'r')`
`plt.show()`
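To make the three reported metrics concrete, the sketch below computes them by hand on four hypothetical price pairs (`y_true` and `y_pred` are made-up values, not model output). It also adds the root mean squared error, which is in USD and therefore easier to compare with the typical house price than the raw MSE.

```python
import numpy as np

# Hypothetical true and predicted prices in USD.
y_true = np.array([300_000.0, 450_000.0, 500_000.0, 800_000.0])
y_pred = np.array([320_000.0, 430_000.0, 540_000.0, 700_000.0])

errors = y_true - y_pred

mse = np.mean(errors ** 2)       # mean squared error (USD squared)
mae = np.mean(np.abs(errors))    # mean absolute error (USD)
rmse = np.sqrt(mse)              # root mean squared error (USD)

# Explained variance: 1 - Var(residuals) / Var(targets).
explained_variance = 1 - np.var(errors) / np.var(y_true)

print(f"MSE={mse:,.0f}  MAE={mae:,.0f}  RMSE={rmse:,.0f}")
print(f"explained variance={explained_variance:.3f}")
```

A useful sanity check is to compare the MAE or RMSE with the mean house price: an error of a few tens of thousands of USD means something very different for a 500K house than for a 5M one.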
