Boston Airbnb Price Prediction

Data Mining- Machine Learning Models with Hyperparameter Tuning

Vishwajit Chaure
Geek Culture
9 min read · Aug 31, 2021


Source: https://unsplash.com/photos/mIgsuhokVio

Introduction

Since its inception, Airbnb has helped address the shortage of low-cost lodging by opening an ancillary revenue stream for residents who could not otherwise afford their rent. It now provides a solid intermediary platform connecting supply and demand in the short-term housing and accommodation market. As Airbnb has grown in popularity, we want to explore the data behind the listings of the various properties inside Boston and answer specific questions that could assist both Airbnb and potential owners.

We approach the data from the owner's perspective: can we help property owners decide whether to advertise their home, and help them obtain the best price, benefiting both the landlord and Airbnb? This project tackles that problem by using machine learning and data mining techniques to predict a base price for properties in Boston. The project can be found in my GitHub repo if you're interested.

Data Description

The dataset used for this project is available on insideairbnb.com.

The dataset consists of 74 columns and approximately 3146 rows.

Dataset Name: listings.csv

A few important variables:

Accommodates: the number of guests the rental can accommodate. 

Bedrooms: number of bedrooms included in the rental.

Bathrooms: number of bathrooms included in the rental.

Price: nightly price for the rental. 

First Review: the date of the first review. 

Last Review: the date of the most recent review. 

Review Score Rating: guests can score properties overall from 1 to 5 stars.
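
The article's full code lives in the GitHub repo, but as a rough sketch, loading the data with pandas might look like this (the file name listings.csv comes from the description above, and the column names follow the Inside Airbnb schema):

```python
import pandas as pd

# Load the Inside Airbnb listings file.
df = pd.read_csv("listings.csv")

print(df.shape)  # roughly (3146, 74)

# Peek at a few of the key variables described above.
df[["accommodates", "bedrooms", "price", "first_review",
    "last_review", "review_scores_rating"]].head()
```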

Data Processing

Data Exploration

Fig 1: Boolean and Categorical variables plot

The first task is to check whether the boolean and categorical features contain enough instances in each category to be worth including. Several columns contain only one category and can be dropped during pre-processing.

Columns like 'calculated host listings count shared rooms', 'review scores location', and 'calendar updated' have very few values in them. These columns will be dropped during pre-processing (Fig 1).
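
A minimal sketch of this check, continuing with the df loaded earlier: flag any column with only one distinct value for removal.

```python
# Columns with a single distinct value carry no signal for prediction.
single_valued = [c for c in df.columns if df[c].nunique(dropna=True) <= 1]
print(single_valued)

df = df.drop(columns=single_valued)
```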

Fig 2: Missing values ratio

Many of the columns with the highest missing ratios are free-text description columns that will not be useful for prediction. In addition, a few columns like 'neighbourhood_group_cleansed', 'bathrooms', and 'calendar_updated' have more than 75% missing values (Fig 2).

Fig 3: Top 10 High Correlated Features with Target Feature ‘Price’

The features 'accommodates', 'bedrooms', and 'beds' are highly correlated with the target variable 'price'. Since they are also strongly correlated with one another, we can keep just one of these columns during pre-processing (Fig 3).
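
As a sketch, the correlations behind Fig 3 can be computed directly, once 'price' is converted from a "$1,234.00"-style string to a number:

```python
# Strip '$' and ',' from the price strings so price becomes numeric.
df["price"] = df["price"].str.replace(r"[$,]", "", regex=True).astype(float)

# Rank numeric features by absolute correlation with the target.
numeric = df.select_dtypes(include="number")
top_corr = numeric.corr()["price"].abs().sort_values(ascending=False)
print(top_corr.head(10))
```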

From Fig 4, we can observe that the 'Dorchester' neighborhood has the most listings. Listings are also concentrated in central Boston, since neighborhoods like Downtown, South End, Back Bay, and South Boston are among the top ten by listing count.

Fig 4: Number of listings by neighborhood

In Fig 5, the peak in hosts joining Airbnb around late 2020 might reflect the relaxation of COVID restrictions. By the same reasoning, there is a corresponding increase in listings receiving their first review.

Fig 5: Time Series comparing Host Joining & First Review

Data Pre-processing

The first task we performed here was finding the number of missing values in the dataset and deciding how to handle them. We computed the percentage of missing values in each column and decided to drop columns with more than 70% missing values.
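
A minimal sketch of that 70% rule:

```python
# Fraction of missing values per column.
missing_ratio = df.isnull().mean().sort_values(ascending=False)
print(missing_ratio.head(10))

# Drop any column that is more than 70% missing.
df = df.drop(columns=missing_ratio[missing_ratio > 0.70].index)
```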

NLP will not be used in the initial model (although it could be used to augment the model later, e.g., through sentiment analysis). Free-text columns will therefore be dropped for now, as will other columns that are not useful for predicting price (e.g., URLs, host name, and other host-related features unrelated to the property).

The 'host_listings_count' and 'host_total_listings_count' columns are identical in all but 78 cases, and those are cases where the value is NaN. Therefore, one of these columns can be dropped. Columns that split these counts by property type will also be dropped, since they are highly correlated (one is the total of the others).

There are multiple columns for the property location, including an attempt by the site that originally scraped the data to clean up the neighborhood values. Because all the listings are in Boston, columns relating to city and country can be dropped. Only one area column, 'neighbourhood_cleansed', will be kept.

There are multiple columns for minimum and maximum night stays, but only the two main ones will be kept, as there is little difference between, e.g., minimum_nights and minimum_minimum_nights. The extra variants presumably exist because min/max night stays can vary over the year; the default (i.e., most frequently applied) min/max night stay values will be used.

'Host Since' is a DateTime column and will be converted into the number of days a host has been on the platform, measured from the date the data was scraped (30 April 2021). The original 'Host Since' column is then dropped.
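
A sketch of that conversion (the new column name host_days_active is my own; the scrape date is the one given above):

```python
# Convert 'host_since' to days on the platform as of the scrape date.
scrape_date = pd.Timestamp("2021-04-30")
df["host_days_active"] = (scrape_date - pd.to_datetime(df["host_since"])).dt.days

df = df.drop(columns=["host_since"])
```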

For 'First Review' and 'Last Review', about a quarter of listings have never had a review written for them. This is too large a proportion of the dataset to drop, and dropping the columns would lose a lot of useful information: reviews are very important in people's booking decisions, and therefore in price.

This is also too large a proportion to simply replace with median/mean values, as that would skew the distribution substantially. Moreover, these are not really missing values; the fact that they are NaN is itself meaningful, telling us that these are new or previously unbooked listings that have no reviews yet. To make the resulting model able to predict prices for any Airbnb listing, including brand-new listings, it is actually beneficial to keep them in. Therefore, these will be kept as an 'unknown' category, and the feature will be treated as categorical (and therefore one-hot encoded) rather than numerical.
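
One way to implement this, as a sketch: bucket the review dates into coarse recency bins and send NaNs to an explicit 'unknown' category. The bin edges below are illustrative; the article doesn't state the exact groups.

```python
def bin_review_date(dates, scrape_date=pd.Timestamp("2021-04-30")):
    """Turn a review-date column into a categorical recency feature."""
    days = (scrape_date - pd.to_datetime(dates)).dt.days
    binned = pd.cut(days, bins=[-1, 182, 365, 730, 100_000],
                    labels=["0-6 months", "6-12 months", "1-2 years", "2+ years"])
    # Listings with no reviews stay informative as their own category.
    return binned.cat.add_categories("unknown").fillna("unknown")

df["first_review"] = bin_review_date(df["first_review"])
df["last_review"] = bin_review_date(df["last_review"])
```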

Data Transformation

For property type, some cleaning is required, as there are many categories with only a few listings. The categories 'apartment', 'house', and 'other' will be used, since most properties can be classified as either apartments or houses.
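
A sketch of that grouping (the substring matching is my own choice; the article only names the three target buckets):

```python
def simplify_property_type(pt: str) -> str:
    """Collapse the long tail of property types into three buckets."""
    pt = str(pt).lower()
    if "apartment" in pt or "condo" in pt:
        return "apartment"
    if "house" in pt or "townhouse" in pt:
        return "house"
    return "other"

df["property_type"] = df["property_type"].apply(simplify_property_type)
```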

Among amenities, some matter more than others (e.g., a balcony is more likely to raise the price than a fax machine), and some are likely to be rare (e.g., 'Electric profiling bed'). Based on previous experience working in the Airbnb property management industry, and on research into which amenities guests consider most important, a selection of the more important amenities will be extracted. Conversely, if almost all properties have (or lack) a particular amenity, that feature will not be very useful in explaining price differences.
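
As a sketch, the amenities column (a JSON-style list stored as a string in the Inside Airbnb data) can be unpacked into binary flags; the shortlist below is illustrative, not the article's actual selection.

```python
# Hand-picked amenities to extract as 0/1 features (illustrative list).
key_amenities = ["Free parking", "Gym", "Air conditioning", "Patio or balcony"]

for amenity in key_amenities:
    col = "amenity_" + amenity.lower().replace(" ", "_")
    df[col] = (df["amenities"]
               .str.contains(amenity, case=False, regex=False, na=False)
               .astype(int))

df = df.drop(columns=["amenities"])
```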

For host_response_rate, about a third of the values are null. These nulls will be kept as their own 'unknown' category, transforming the feature into a categorical rather than numerical one. Because about 70% of hosts respond 100% of the time, a 100% response rate will also be its own category, and the remaining values will be grouped into bins.
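
A sketch of that binning, with illustrative bin edges:

```python
# Strip the '%' sign; NaNs pass through untouched.
rate = df["host_response_rate"].str.rstrip("%").astype(float)

# 100% responders form their own group; other values are binned coarsely,
# and NaNs become an explicit 'unknown' category.
df["host_response_rate"] = (
    pd.cut(rate, bins=[-1, 49, 99, 100], labels=["0-49%", "50-99%", "100%"])
      .cat.add_categories("unknown")
      .fillna("unknown")
)
```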

Encoding and Transformation

One-hot encoding is applied to all the categorical variables so that they can be used for modeling later.

Because the numerical features are on very different scales, standardization is necessary for good model performance. We chose standardization over min-max normalization because it is much less sensitive to outliers.
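
Putting the two steps together, as a sketch (in a production pipeline the scaler should be fit on the training split only; it is fit on the full frame here just to keep the sketch short, and it assumes remaining missing numeric values have already been imputed):

```python
from sklearn.preprocessing import StandardScaler

# One-hot encode every remaining categorical column.
categorical_cols = df.select_dtypes(include=["object", "category"]).columns
df = pd.get_dummies(df, columns=list(categorical_cols))

# Standardize the feature matrix; keep the target on its original scale.
features = df.drop(columns=["price"])
scaler = StandardScaler()
X = scaler.fit_transform(features)
y = df["price"]
```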

Models and Performance Evaluation

Train Test Split

We used the hold-out method for the train/test split, with 70% of the data for training and 30% for testing.
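
In code, a sketch of the split (random_state is my own addition for reproducibility):

```python
from sklearn.model_selection import train_test_split

# Hold-out split: 70% train, 30% test.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42
)
```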

KNN

For KNN, hyperparameters like n_neighbors, metric, and weights were tuned. We varied n_neighbors over the range 1 to 21, the metric between Euclidean and Manhattan, and the weights between uniform and distance.
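
The article doesn't show how the search was run; one common way is a grid search with cross-validation, sketched below over exactly the search space described above.

```python
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor

param_grid = {
    "n_neighbors": list(range(1, 22)),      # 1 to 21
    "metric": ["euclidean", "manhattan"],
    "weights": ["uniform", "distance"],
}
knn_search = GridSearchCV(
    KNeighborsRegressor(), param_grid,
    scoring="neg_root_mean_squared_error", cv=5,
)
knn_search.fit(X_train, y_train)
print(knn_search.best_params_)
```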

Fig 6: KNN — Best Performing Model
Fig 7: KNN — Correlation line between Predicted and Actual Values

With the hyperparameter-tuned KNN model, the best performer used 10 neighbors, the Manhattan metric, and distance weighting. Its RMSE is 112.625 and its MAE is 52.70.
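
For reference, a sketch of how those two numbers are computed on the held-out 30%:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

pred = knn_search.best_estimator_.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, pred))
mae = mean_absolute_error(y_test, pred)
print(f"RMSE: {rmse:.3f}, MAE: {mae:.2f}")
```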

Decision Tree

For the Decision Tree, parameters like max_depth and criterion were tuned to find the best performing model: max_depth ranged from 1 to 31, and the split-quality criterion was varied among MSE, Friedman MSE, and MAE.
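
A sketch of that search (note that these criterion names were renamed to 'squared_error' and 'absolute_error' in scikit-learn 1.0; the strings below match the pre-1.0 API in use when the article was written):

```python
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

param_grid = {
    "max_depth": list(range(1, 32)),            # 1 to 31
    "criterion": ["mse", "friedman_mse", "mae"],
}
tree_search = GridSearchCV(
    DecisionTreeRegressor(random_state=42), param_grid,
    scoring="neg_root_mean_squared_error", cv=5,
)
tree_search.fit(X_train, y_train)
```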

Fig 8: Decision Tree — Best Performing Model
Fig 9: Decision Tree — Purity Method Plot
Fig 10: Decision Tree — Correlation Plot

The best hyperparameter-tuned Decision Tree used a depth of 5 with the MAE purity criterion, achieving an RMSE of 109.158.

From Fig 9, we can see that the MAE purity criterion gives the lowest RMSE compared to MSE and Friedman MSE.

Fig 10 shows the correlation line between the actual and predicted values.

Random Forest

Fig 11: Random Forest — Purity Method Plot

For Random Forest, we tuned parameters like max_depth, criterion, and n_estimators. The max_depth was varied from 1 to 31, and the split-quality criterion between MSE and MAE.
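
A sketch; the article doesn't give the n_estimators candidates, so those values are illustrative.

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

param_grid = {
    "max_depth": list(range(1, 32)),
    "criterion": ["mse", "mae"],      # pre-1.0 scikit-learn names
    "n_estimators": [100, 200],       # illustrative; not stated in the article
}
rf_search = GridSearchCV(
    RandomForestRegressor(random_state=42), param_grid,
    scoring="neg_root_mean_squared_error", cv=5,
)
rf_search.fit(X_train, y_train)
```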

Fig 12: Random Forest — Correlation Plot

The best model used a depth of 25 with the MAE purity criterion, achieving an RMSE of 98.5266.

XGB

For XGB, we tuned parameters such as learning_rate, max_depth, min_child_weight, subsample, colsample_bytree, n_estimators, and the objective.
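
The article lists the tuned parameters but not their candidate values, so the grid below is purely illustrative:

```python
from sklearn.model_selection import GridSearchCV
from xgboost import XGBRegressor

param_grid = {
    "learning_rate": [0.01, 0.1, 0.3],
    "max_depth": [3, 5, 7],
    "min_child_weight": [1, 5],
    "subsample": [0.8, 1.0],
    "colsample_bytree": [0.8, 1.0],
    "n_estimators": [200, 500],
}
xgb_search = GridSearchCV(
    XGBRegressor(objective="reg:squarederror", random_state=42),
    param_grid, scoring="neg_root_mean_squared_error", cv=5,
)
xgb_search.fit(X_train, y_train)
```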

Fig 13: XGB — MAE and RMSE
Fig 14: XGB — Correlation Plot

XGB is the best-performing model, with an RMSE of 96.2021.

Results

Fig 15: Important Features-XGB model

The most important features for the price of Airbnb accommodation in Boston are whether the entire apartment/home is available, the number of bathrooms, and the number of people the listing can accommodate, with feature weights of 0.20, 0.09, 0.08, and 0.03 respectively.
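
As a sketch, the weights in Fig 15 can be read straight off the fitted model (assuming xgb_search and the features frame from the earlier snippets):

```python
import matplotlib.pyplot as plt

best_xgb = xgb_search.best_estimator_
importances = pd.Series(best_xgb.feature_importances_, index=features.columns)
importances.sort_values(ascending=False).head(10).plot.barh()
plt.xlabel("feature weight")
plt.tight_layout()
plt.show()
```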

Using the XGB model, we are able to predict prices with an RMSE of 96.

A few other important features for predicting price are parking space availability, the number of nights the listing is available, and the host's response rate.

Returning to the questions we set out to answer: in Boston, features and amenities such as availability of the entire apartment, the number of bathrooms, the number of guests a property can accommodate, and exercise facilities clearly influence a property's price. Parking space, ratings, and other amenities have an effect as well. Using the model together with these feature associations can make a listing stand out from the crowd and help hosts, as well as Airbnb, predict the best price range for any property in Boston.

References

https://towardsdatascience.com/predicting-airbnb-prices-with-machine-learning-and-location-data-5c1e033d0a5a#fb28

https://www.kaggle.com/hrbzkm9898/seattle-airbnb-data-preprocessing
