Model comparison using a noisy dataset, Part 1

Comparison of model performance on the Inside AirBnb dataset for New York City, featuring linear and tree-based models.

Harsha Goonewardana
Jul 18, 2018
XKCD.com

Overview

As part of a project to predict the yields on AirBnb properties in New York City, I scored the following models:

  1. Linear Regression
  2. Ridge Regression
  3. Lasso Regression
  4. XGBoost (gradient-boosted CART model)

Data

Following the lead from a number of researchers in the field, I used the dataset from insideairbnb.com for New York City. A number of publications have suggested that Inside AirBnb data is more accurate than official AirBnb data as there have been issues with the fidelity of the official data in the past.[1]

The dataset contains 47,542 unique listing locations from 20th January 2009 to 15th May 2018. There are 95 separate features of a variety of types, and 35 were selected for further examination. The data was scaled before modelling.
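A minimal sketch of this preprocessing step, assuming `X` holds the 35 selected features and `y` holds the nightly price:

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hold out a test set, then standardize the features.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit the scaler on the training split only
X_test_scaled = scaler.transform(X_test)        # apply the same scaling to the test split
```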

The features were tested for normality using the Shapiro-Wilk test[2] to rank the features.

Shapiro-Wilk test for feature normality

I used the Yellowbrick extension to the scikit-learn library for better visualization. The Rank1D tool evaluates single features or pairs of features using a variety of metrics that score the features on the scale [-1, 1] or [0, 1], allowing them to be ranked.[3]
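A sketch of the ranking step with Yellowbrick's Rank1D, again assuming `X` and `y` as above:

```python
from yellowbrick.features import Rank1D

# Rank each feature by its Shapiro-Wilk normality score.
visualizer = Rank1D(algorithm="shapiro")
visualizer.fit(X, y)      # compute the per-feature scores
visualizer.transform(X)   # draw the horizontal bar ranking
visualizer.show()
```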

The feature correlation to the dependent variable, Price, is shown below:
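One way to reproduce such a plot is Yellowbrick's FeatureCorrelation visualizer; a sketch, assuming `X` is a DataFrame of the selected features and `y` is the price:

```python
from yellowbrick.target import FeatureCorrelation

# Pearson correlation of each feature with the target price.
visualizer = FeatureCorrelation(method="pearson", labels=list(X.columns))
visualizer.fit(X, y)
visualizer.show()
```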

Model Performance

Measurement

R² Score, also known as the Coefficient of Determination, is a measurement of the goodness of fit of the predicted values to the observations. The score takes values between 0 and 1 (it can turn negative for models worse than simply predicting the mean), and the closer the score is to one, the better the fit.

Root Mean Square Error (RMSE) measures the typical size of a model's prediction errors and quantifies the part of the observed values not explained by the model. In our case, this is the number of dollars of rental price not captured by each model. The lower the value, the more accurate the model.
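In symbols, with y_i the observed prices, ŷ_i the predictions, and ȳ the mean price:

```latex
R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2},
\qquad
\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}
```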

The baseline value is the mean of the price variable: $178.60

The models were measured via five-fold cross-validation with R² and MSE as the measurement metrics. The scores of both train and test sets were compared to assess overfitting.
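A sketch of that evaluation loop for the linear model, using scikit-learn's cross_validate; the other models are scored the same way:

```python
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate

# Five-fold cross-validation, reporting R² and (negated) MSE
# on both the training and validation folds.
cv_results = cross_validate(
    LinearRegression(), X, y, cv=5,
    scoring=("r2", "neg_mean_squared_error"),
    return_train_score=True)

print("train R²:", cv_results["train_r2"].mean())
print("test R²: ", cv_results["test_r2"].mean())
print("test MSE:", -cv_results["test_neg_mean_squared_error"].mean())
```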

Model Evaluation

Linear Regression (LR) is the true workhorse of modelling and is the first choice of many data scientists and graduate students due to its robustness and ease of use.

LR train score: 0.56

LR test score: 0.56

MSE $: 7457.74

RMSE $: 86.36

We see that the model explains only 56% of the variation in the observations, which is sub-optimal. The RMSE, the square root of the MSE, shows that the prices predicted by the model are typically within $86.36 of the true value. This is a significant improvement on the baseline.

The residuals have a slight positive correlation with the predicted values.
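The residual plot can be reproduced with Yellowbrick's ResidualsPlot; a sketch, assuming the train/test splits from the scaling step above:

```python
from sklearn.linear_model import LinearRegression
from yellowbrick.regressor import ResidualsPlot

# Plot residuals against predicted values for the train and test sets.
visualizer = ResidualsPlot(LinearRegression())
visualizer.fit(X_train_scaled, y_train)   # fit the model, plot training residuals
visualizer.score(X_test_scaled, y_test)   # add test residuals and the test R²
visualizer.show()
```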

Ridge Regression/Tikhonov regularization is used to regularize linear models by adding a penalty that constrains the size of the feature coefficients.

source: http://www.thefactmachine.com/ridge-regression/

where the β values are the coefficients of the features and λ is a tuning parameter that penalizes large values of β, shrinking the coefficients towards zero.[4]
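Written out, the ridge estimator minimizes the penalized least-squares objective:

```latex
\hat{\beta}^{\,\text{ridge}} = \arg\min_{\beta}
\left\{ \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \Big)^2
+ \lambda \sum_{j=1}^{p} \beta_j^2 \right\}
```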

alpha and MSE correlation
Ridge train score: 0.56

Ridge test score: 0.56

MSE $: 7461.21

RMSE $: 86.38

The Ridge R² scores are not very different from the simple Linear Regression scores, indicating that, given the low overfitting seen here, parameter tuning will not contribute much further improvement to the score.
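For completeness, a sketch of the alpha sweep behind the alpha/MSE plot above, assuming `X` and `y` as before:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Sweep the regularization strength and record the cross-validated MSE.
alphas = np.logspace(-3, 3, 25)
mse_per_alpha = [
    -cross_val_score(Ridge(alpha=a), X, y, cv=5,
                     scoring="neg_mean_squared_error").mean()
    for a in alphas
]
best_alpha = alphas[int(np.argmin(mse_per_alpha))]
print("alpha with the lowest cross-validated MSE:", best_alpha)
```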

XGBoost is an ensemble of decision trees built with gradient boosting, engineered for speed and performance.

Attrib: https://towardsdatascience.com/decision-trees-in-machine-learning-641b9c4e8052

A decision tree for regression subdivides the dataset into progressively smaller subsets while building an incremental, diagram-like representation of the splits. This diagram contains nodes and leaves, where each node comprises branches representing values of a feature and each leaf holds the final predicted value.

A decision node in a regression setting has two or more branches, each representing values for the attribute tested. A leaf node represents a decision on the numerical target. [5]

The decision on splitting at a node is predicated on the decrease in standard deviation after a split on an attribute. The algorithm focuses on identifying the attribute that returns the highest standard deviation reduction.[6]
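A toy illustration of that criterion follows (note that XGBoost itself optimizes a gradient-based gain rather than raw standard deviation reduction; the split and prices below are hypothetical):

```python
import numpy as np

def std_reduction(y, left_mask):
    """Standard deviation reduction when target values y are split
    into a left group (left_mask True) and a right group."""
    left, right = y[left_mask], y[~left_mask]
    weighted_std = (len(left) * left.std() + len(right) * right.std()) / len(y)
    return y.std() - weighted_std

# Hypothetical example: splitting nightly prices on "accommodates > 4".
prices = np.array([80.0, 95.0, 110.0, 250.0, 300.0, 320.0])
split = np.array([True, True, True, False, False, False])
print(round(std_reduction(prices, split), 2))
```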

The parameters for the best performing XGB model were:

'colsample_bytree': 0.994491623591146,
'gamma': 2.6897757151708213,
'learning_rate': 0.15588658625852608,
'max_depth': 37,
'min_child_weight': 35.44318769130126,
'n_estimators': 31,
'reg_alpha': 12.027891241452048,
'subsample': 0.9637123264136668
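A sketch of fitting XGBoost with these tuned parameters on the scaled splits from earlier; the parameter names map directly to XGBRegressor keyword arguments:

```python
from xgboost import XGBRegressor

best_params = {
    "colsample_bytree": 0.994491623591146,
    "gamma": 2.6897757151708213,
    "learning_rate": 0.15588658625852608,
    "max_depth": 37,
    "min_child_weight": 35.44318769130126,
    "n_estimators": 31,
    "reg_alpha": 12.027891241452048,
    "subsample": 0.9637123264136668,
}

xgb = XGBRegressor(**best_params)
xgb.fit(X_train_scaled, y_train)
print("train R²:", xgb.score(X_train_scaled, y_train))  # score() returns R²
print("test R²: ", xgb.score(X_test_scaled, y_test))
```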

The XGBoost performance:

XG Boost Train Score: 0.82

XG Boost Test Score: 0.70

MSE $: 5089.33

RMSE $: 71.34

This is the best score and the lowest prediction error so far. However, the model shows more propensity for overfitting than a linear model, as indicated by the larger gap between the training and test scores.

In conclusion, the XGB model provided the lowest RMSE and the best predictions, while the linear models were more robust in terms of reducing overfitting.

Conclusions and next steps

Both classes of model were fairly inaccurate on this dataset, possibly due to the high noise-to-signal ratio. In the next installment of the series, I will deploy a neural network to see whether it can improve predictive performance.


Harsha Goonewardana

I am interested in the intersection of data science and international development. Better development outcomes through analysis.