Helping a Gym Rat Choose a Neighborhood to Live in Manhattan - Part 2

Sudeepa Nadeeshan
8 min readApr 30, 2020


Photo by NASA on Unsplash

This is the second part of the analysis. In the first part, we identified the neighborhood “Tribeca” based on its gym data. If you are interested, please read the first part at this link.

Average Price Prediction

The downloaded Manhattan dataset was in .xlsx format, with a separate sheet for each year and six features per sheet. Those features are marked as Number 2 in Fig 1. The dataset covers the years 2005 to 2018.

Figure 1 — Manhattan Price Data Summary

The sheets were manually combined into a single sheet, adding a ‘YEAR’ column to the data (don’t worry, there are only a few rows and sometimes manual is the fastest 😉). Let’s load the data using pandas.
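A minimal sketch of that loading step, assuming a hypothetical file name (manhattan_sales.xlsx) and sheet name for the manually combined data:

import pandas as pd

# Hypothetical file and sheet names for the manually combined data
df = pd.read_excel('manhattan_sales.xlsx', sheet_name='combined')
print(df.shape)
df.head()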

Data Understanding and Preprocessing

Let’s remove the unnecessary augmented data and the null values from the dataframe.
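A sketch of this cleanup, assuming dropping the rows with missing values is enough here (which extra rows or columns to drop depends on the combined sheet):

# Drop rows with missing values and reset the index
df = df.dropna()
df = df.reset_index(drop=True)
print(df.shape)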

Now let’s analyze our ‘Target Variable’, ‘AVERAGE SALE PRICE’. First, we can see the summarized description of the target as follows.
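For example, pandas’ describe() gives the count, mean, standard deviation and quartiles of the target column:

# Summary statistics of the target variable
df['AVERAGE SALE PRICE'].describe()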

Visualizing the statistics helps us understand the data better. The following histogram shows the average prices on the x-axis and their count/frequency on the y-axis (the x values are binned into 50 bins).
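A sketch of the histogram, assuming matplotlib via pandas’ built-in plotting:

import matplotlib.pyplot as plt

# Histogram of the target with 50 bins
df['AVERAGE SALE PRICE'].plot(kind='hist', bins=50)
plt.xlabel('AVERAGE SALE PRICE')
plt.ylabel('Frequency')
plt.show()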

We can see that most of the values lie between 0 and 1.5 × 10⁷. Let’s use a box plot analysis to identify the outliers more clearly.

A box plot is nothing but a simple representation of the distribution of the data that you have. The following image describes the features in the box plot.

Figure 2 — Box Plot

Let’s do some grade-seven maths here. The range between the 25th (Q1) and 75th (Q3) percentiles is the interquartile range (IQR), and the maximum (upper fence) and minimum (lower fence) values are calculated as below in this framework.

IQR = Q3 - Q1
Maximum = Q3 + 1.5*IQR
Minimum = Q1 - 1.5*IQR

There can be many reasons for values to fall above the maximum or below the minimum. These observations are significantly different from the others, and removing such outliers helps the models predict better. So let’s remove them.
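A sketch of that removal step, applying the fences above to the target column:

# Compute the interquartile range of the target
q1 = df['AVERAGE SALE PRICE'].quantile(0.25)
q3 = df['AVERAGE SALE PRICE'].quantile(0.75)
iqr = q3 - q1

# Keep only the rows inside the [Q1 - 1.5*IQR, Q3 + 1.5*IQR] fences
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
df = df[(df['AVERAGE SALE PRICE'] >= lower) & (df['AVERAGE SALE PRICE'] <= upper)]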

We have successfully removed the outliers. Let’s explore the “Neighborhood” data in this data set. We can list the unique neighborhoods as follows.
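For example, assuming the column is named 'NEIGHBORHOOD' (as in the prediction inputs later on):

# List the distinct neighborhood labels and count them
print(df['NEIGHBORHOOD'].unique())
print(df['NEIGHBORHOOD'].nunique())      # 55 at this point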

The highlighted items are the same, but unwanted white spaces have resulted in multiple unique values for the same neighborhoods. We have to remove those spaces.
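A sketch of the fix:

# Trim stray whitespace so duplicated labels collapse into one
df['NEIGHBORHOOD'] = df['NEIGHBORHOOD'].str.strip()
print(df['NEIGHBORHOOD'].nunique())      # now 34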

Now we can see that the number of neighborhoods has been reduced from 55 to 34!

The neighborhood field is categorical. We need to represent it in a numerical format in order to analyze it further. We will use one-hot encoding on the neighborhood data to get the numerical representation.
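A minimal sketch using pandas’ get_dummies (scikit-learn’s OneHotEncoder would work just as well); the 'NBHD' prefix is my own naming choice:

# One-hot encode the neighborhood names into 0/1 indicator columns
neighborhood_dummies = pd.get_dummies(df['NEIGHBORHOOD'], prefix='NBHD')
df = pd.concat([df.drop(columns=['NEIGHBORHOOD']), neighborhood_dummies], axis=1)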

There is another field that also has a significant impact on the price: ‘TYPE OF HOME’. Let’s explore it.

Again we can see the same issue as in the neighborhood feature: white spaces! Also, keep in mind that the price increases as the number of families increases. With that in mind, let’s encode this field using label encoding (a sketch follows the mapping below).

We have encoded the data as follows.

01 ONE FAMILY HOMES - 1
02 TWO FAMILY HOMES - 2
03 THREE FAMILY HOMES - 3
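A sketch of this encoding step, assuming the category strings match the mapping above once the white spaces are stripped:

# Strip whitespace, then map the ordered categories to 1/2/3
df['TYPE OF HOME'] = df['TYPE OF HOME'].str.strip()
home_type_map = {
    '01 ONE FAMILY HOMES': 1,
    '02 TWO FAMILY HOMES': 2,
    '03 THREE FAMILY HOMES': 3,
}
df['TYPE OF HOME'] = df['TYPE OF HOME'].map(home_type_map)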

Data Scaling

Data scaling helps models train faster, reduces overfitting and improves the accuracy of machine learning models. It is considered an important preprocessing step in deep learning. There are two main scaling methods (you may find plenty of articles about these techniques on the internet).

  1. Data Normalization
  2. Data Standardization

We are going to use both Normalization and Standardization techniques on the data.

Let’s normalize the input data using MinMaxScaler from the scikit-learn library. We can scale the ‘YEAR’, ‘TYPE OF HOME’ and ‘NUMBER OF SALES’ features into the 0–1 range.

The target variable is Standardized using StandardScaler() in the scikit-learn library. This helps the models to focus more on the variations in the target variable.
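A sketch of both scaling steps with scikit-learn; keeping a reference to the fitted target scaler is my own addition, but it makes it easy to transform predictions back to dollars later:

from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Normalize the input features to the 0-1 range
input_cols = ['YEAR', 'TYPE OF HOME', 'NUMBER OF SALES']
minmax = MinMaxScaler()
df[input_cols] = minmax.fit_transform(df[input_cols])

# Standardize the target (zero mean, unit variance) and keep the scaler for later inversion
target_scaler = StandardScaler()
df['AVERAGE SALE PRICE'] = target_scaler.fit_transform(df[['AVERAGE SALE PRICE']]).ravel()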

Train/Test Split

We have to divide the dataset for training and testing. I am going to create two sets of train/test data. The first dataset (Set 1) is designed to test the predictability of this specific problem with the given data features. Set 2 keeps the 2018 data for validation and splits the remaining data in a 7/3 ratio for training and testing.

Set 1

Set 2

A scaled value of 1 represents the maximum of the year feature, which is the year 2018. We keep that data for validation purposes.
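A sketch of Set 2, assuming the 2018 rows can be selected by the scaled YEAR value of 1:

from sklearn.model_selection import train_test_split

# Hold out the 2018 rows (scaled YEAR == 1) for validation
validation = df[df['YEAR'] == 1]
rest = df[df['YEAR'] < 1]

X = rest.drop(columns=['AVERAGE SALE PRICE'])
y = rest['AVERAGE SALE PRICE']

# Split the remaining years 70/30 for training and testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)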

Modeling

Well, now all the wrappers are off. Let’s enjoy the pie. It’s time to build the prediction models. I tried linear and nonlinear regression models for the prediction. Please note that the models were kept at their default configurations and evaluated using root mean squared error (note that the target is standardized).

Linear Models

  • Linear Regression — 2449866754.0338902
  • Lasso Linear Regression — 1.0075
  • Ridge Regression — 0.57
  • Elastic Net Regression — 1.00
  • Huber Regression — 0.58
  • Lasso Lars Linear Regression — 1.00
  • Passive Aggressive Regression — 1.40
  • Stochastic Gradient Descent Regression — 0.60

Nonlinear Models

  • k-Nearest Neighbors — 0.69
  • Classification and Regression Trees — 0.80
  • Extra Tree — 0.81
  • Support Vector Regression — 0.71

Trying out each and every possible model and comparing the results is not good practice. It’s always better to have the theoretical background and to know why a model performs better on a specific type of problem (especially in deep learning). Yet our goal here is to find the best model without spending much time. That’s why I used a brute-force approach for this problem.
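A sketch of that brute-force loop with default scikit-learn models; the exact model list is an assumption, and the numbers reported above come from the author’s own run:

import numpy as np
from sklearn.linear_model import LinearRegression, Lasso, Ridge, ElasticNet, HuberRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error

# Default-configuration models, compared by RMSE on the test set
models = {
    'Linear Regression': LinearRegression(),
    'Lasso': Lasso(),
    'Ridge': Ridge(),
    'Elastic Net': ElasticNet(),
    'Huber': HuberRegressor(),
    'k-Nearest Neighbors': KNeighborsRegressor(),
    'Regression Tree': DecisionTreeRegressor(),
    'Support Vector Regression': SVR(),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    rmse = np.sqrt(mean_squared_error(y_test, model.predict(X_test)))
    print(f'{name}: {rmse:.2f}')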

In addition to the above models, I tried a two-layer (no hidden layers) neural network for our regression problem. It resulted in an RMSE value of 0.563, the best so far!
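A minimal sketch of such a network in Keras; the optimizer, epoch count and batch size are my assumptions rather than the article’s settings:

import numpy as np
from tensorflow import keras
from sklearn.metrics import mean_squared_error

# Two layers with no hidden layer: the inputs feed a single linear output unit
X_tr = np.asarray(X_train, dtype='float32')
X_te = np.asarray(X_test, dtype='float32')

nn = keras.Sequential([
    keras.Input(shape=(X_tr.shape[1],)),
    keras.layers.Dense(1, activation='linear'),
])
nn.compile(optimizer='adam', loss='mse')
nn.fit(X_tr, y_train, epochs=100, batch_size=32, verbose=0)

print(np.sqrt(mean_squared_error(y_test, nn.predict(X_te))))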

Please note that I did not apply any hyperparameter optimization techniques to the NN model (e.g. grid search, random search, trial and error, or analyzing the literature) or any techniques to prevent overfitting (e.g. cross-validation). Our aim here is to see whether the target value can be predicted from the given features.

We can also try an AutoML approach (an automated brute-force approach) to find the best model and hyperparameters without performing an exhaustive manual search. I will leave that as future work.

Visualization

Let’s try to visualize the prediction for the year 2018 vs the actual average prices.
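A sketch of the plot, assuming the held-out 2018 rows from Set 2 and the neural network above; plotting the points in simple row order is my own choice:

import numpy as np
import matplotlib.pyplot as plt

# Predicted vs actual (standardized) average prices for the held-out 2018 rows
X_val = np.asarray(validation.drop(columns=['AVERAGE SALE PRICE']), dtype='float32')
y_val = validation['AVERAGE SALE PRICE'].values

y_pred = nn.predict(X_val).ravel()

plt.plot(y_val, label='Actual')
plt.plot(y_pred, label='Predicted')
plt.ylabel('Standardized AVERAGE SALE PRICE')
plt.legend()
plt.show()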

Figure 3 — Prediction Visualization

It seems the model is capable of following the general pattern in the target while missing the spikes. That could be one reason for the higher RMSE values.

So what will the cost be for Sam?

Prediction

We previously modeled the data with YEAR, TYPE OF HOME and NUMBER OF SALES (to see the predictability of this problem). But when we are predicting the average price for a new year, we only have the YEAR and the TYPE OF HOME we are looking for; we don’t have the NUMBER OF SALES. So let’s train the model with the YEAR and TYPE OF HOME features. We will use the model type with the best accuracy (i.e. the NN), assuming it can still predict well even with one feature removed.

I forgot to mention that my friend Sam is single 😀. He is looking for a ‘One Family Home’. So our input variables will be,

YEAR = 2020
TYPE OF HOME = 1
NEIGHBORHOOD= 'Tribeca'

We should transform the first two features to their scaled values and apply one-hot encoding to the ‘NEIGHBORHOOD’ feature (using scikit-learn pipelines for this process would be the better option, but we are doing it manually since we don’t have to worry about reproducibility at the moment).

Now let’s predict the price and transform back to the initial scale.
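A sketch of that final step, with several assumptions of mine: the retrained network is called price_model, its feature order is [YEAR, TYPE OF HOME, one-hot neighborhoods], the scaling ranges are the ones seen in the data (YEAR 2005–2018, TYPE OF HOME 1–3), and target_scaler is the StandardScaler fitted earlier:

import numpy as np

# Manually scale YEAR and TYPE OF HOME with the ranges seen in the data;
# 2020 extrapolates slightly beyond the 0-1 range
year_scaled = (2020 - 2005) / (2018 - 2005)
home_scaled = (1 - 1) / (3 - 1)

# One-hot vector for Tribeca, in the same column order used for training
nbhd_cols = [c for c in X_train.columns if c.startswith('NBHD_')]
nbhd_vec = [1.0 if 'TRIBECA' in c.upper() else 0.0 for c in nbhd_cols]

x_new = np.array([[year_scaled, home_scaled] + nbhd_vec], dtype='float32')

# Predict the standardized price, then invert the StandardScaler to get dollars
pred_std = price_model.predict(x_new)      # price_model: the NN retrained without NUMBER OF SALES
print(target_scaler.inverse_transform(pred_std.reshape(-1, 1))[0, 0])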

OK, Sam, you will have to pay $4,093,886.2 on average to buy a one-family home in ‘Tribeca’ in 2020. Is that too much compared to past years? Let’s compare it with the past prices.

Figure 4 — Price Comparison

Conclusion

I think we have now done our part. Sam now knows where he should live and what the average cost of a house would be. Let’s keep my imaginary friend Sam aside. This story was written as a part of the ‘Applied Data Science Capstone’ course offered by IBM. I defined the problem and analyzed the relevant data sources. The code is available on GitHub. Please do mention any issues, problems or suggestions you have. I used my free time during the COVID-19 quarantine period to write this. So guys, #StaySafe and #StayHome! Use your time to do some analysis.
