Airbnb data exploration, analysis and feature engineering

Rafay Ullah Choudhary
Feb 26 · 5 min read

This article lays a foundation for analysing the data published by Airbnb. It demonstrates how to turn scraped data into features that help a model predict a listing’s price.
Airbnb, Inc. is an American vacation rental online marketplace company. Airbnb maintains and hosts a marketplace, accessible to consumers on its website or via an app.

The GitHub repository below provides the Python notebook and step-by-step documentation for setting up the project:
https://github.com/rafayullah/Airbnb

1. Data acquisition

The platform hosts up-to-date data on Airbnb listings. Data from any location can be used for this project; this analysis was performed on Boston’s data. Data from multiple locations can also be used and merged with pandas.
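Merging several locations could look like the minimal sketch below. The small in-memory frames stand in for per-city listings files, and the column names (`id`, `price`), the added `city` column and the helper `merge_locations` are assumptions made for illustration, not part of the original notebook.

```python
import pandas as pd

# Hypothetical stand-ins for per-city listings files (e.g. loaded with pd.read_csv).
boston = pd.DataFrame({"id": [1, 2], "price": [120.0, 85.0]})
seattle = pd.DataFrame({"id": [3], "price": [150.0]})

def merge_locations(frames_by_city):
    """Stack per-city listing frames and tag each row with its city."""
    tagged = [df.assign(city=city) for city, df in frames_by_city.items()]
    return pd.concat(tagged, ignore_index=True)

listings = merge_locations({"Boston": boston, "Seattle": seattle})
print(listings)
```

Keeping a `city` column makes it possible to analyse each location separately after the merge.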

Please follow the step-by-step guide in the GitHub repository for details on how to download the data for this project.

2. Data preprocessing

To use the data effectively, relevant preprocessing techniques must be applied.
As part of this preprocessing, the following actions were performed:

  • Converting date to pandas DateTime format
  • Removing price outliers for every room type
The outlier-removal function drops listings whose price lies above the 0.999 quantile or below the 0.001 quantile. This ensures there are no boundary cases such as fake ads with a $0 listing price or ads with abnormally high prices. This step is performed separately for every room type.
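The two preprocessing steps can be sketched roughly as follows. This is not the author's function: the helper name `trim_price_outliers`, the column names `room_type`, `price` and `last_review`, and the choice to keep the boundary values themselves are all assumptions.

```python
import pandas as pd

def trim_price_outliers(df, lower=0.001, upper=0.999):
    """Keep rows whose price lies within the [lower, upper] quantiles of its room type."""
    grouped = df.groupby("room_type")["price"]
    # transform broadcasts each group's quantile back onto the group's rows
    low = grouped.transform(lambda s: s.quantile(lower))
    high = grouped.transform(lambda s: s.quantile(upper))
    return df[(df["price"] >= low) & (df["price"] <= high)]

# Toy data (not real listings): one fake $0 ad and one abnormally expensive ad.
listings = pd.DataFrame({
    "room_type": ["Private room"] * 5,
    "price": [0.0, 60.0, 70.0, 80.0, 5000.0],
    "last_review": ["2020-01-05", "2020-02-10", "2020-03-15", "2020-04-20", "2020-05-25"],
})

# Convert the date column (assumed name) to pandas datetime.
listings["last_review"] = pd.to_datetime(listings["last_review"])

trimmed = trim_price_outliers(listings)
print(trimmed)
```

Because the quantiles are computed per room type, an expensive entire home is not compared against the price range of shared rooms.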

3. Features engineering

To further enhance the feature set, some of the columns need to be parsed. For example, the ‘host_verifications’ and ‘amenities’ columns can be parsed into an effective source of information. As a sample, here are the amenities and host verifications columns containing the raw scraped values:

After preprocessing with the get_unique_features and get_list_as_features functions, we get the following results:

Amenities parsed as features
Host verifications parsed as features
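The idea behind this parsing step can be illustrated with the sketch below. It is not the notebook's get_unique_features or get_list_as_features; the helpers `parse_list_column` and `list_column_to_features` are hypothetical, and the code assumes the raw values are strings that look like Python lists, such as "['email', 'phone']".

```python
import ast
import pandas as pd

def parse_list_column(series):
    """Turn strings such as "['email', 'phone']" into Python lists."""
    return series.apply(ast.literal_eval)

def list_column_to_features(df, column):
    """Add one 0/1 indicator column per distinct value found in a list-valued column."""
    parsed = parse_list_column(df[column])
    unique_values = sorted({item for items in parsed for item in items})
    out = df.copy()
    for value in unique_values:
        out[value] = parsed.apply(lambda items, v=value: int(v in items))
    return out

# Toy rows (assumed format, not real listings).
hosts = pd.DataFrame({
    "host_verifications": [
        "['email', 'phone', 'facebook']",
        "['email', 'government_id']",
    ]
})
features = list_column_to_features(hosts, "host_verifications")
print(features.drop(columns="host_verifications"))
```

If the raw column uses a different format (for example braces instead of brackets), only `parse_list_column` would need to change.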

Besides host verifications, the data contains other features that describe the host. One such field is ‘host_since’, the date when the host joined the platform. By calculating the number of days the host has been on the platform, we can enhance the feature set. We will see during evaluation whether this affects the model at all.

Host presence in days on the platform
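A minimal sketch of this calculation is shown below. The reference date (for example the date the data was scraped), the output column name `host_days` and the helper `add_host_days` are assumptions for illustration.

```python
import pandas as pd

def add_host_days(df, reference_date):
    """Add the number of days between host_since and reference_date."""
    out = df.copy()
    host_since = pd.to_datetime(out["host_since"])
    out["host_days"] = (pd.Timestamp(reference_date) - host_since).dt.days
    return out

# Toy rows (not real hosts).
hosts = pd.DataFrame({"host_since": ["2019-01-01", "2020-06-30"]})
hosts = add_host_days(hosts, "2020-12-31")
print(hosts)
```

Using a fixed reference date rather than today's date keeps the feature reproducible when the notebook is rerun later.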

4. Statistical analysis

The following is the frequency distribution of listings throughout 2020:

The figure shows the average price of listings throughout the year, together with the weekly average of the number of ads posted on the platform.

Now let’s see if there exists a correlation between the number of listings published in a time frame and their average price:

Calculating the average price and the total number of listings per week in the dataset
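One way to compute these weekly aggregates and their correlation is sketched below. The column names `date` and `price`, the helper `weekly_summary` and the toy values are assumptions; the notebook may group on a different date field.

```python
import pandas as pd

def weekly_summary(df, date_col="date", price_col="price"):
    """Return the mean price and listing count for each calendar week."""
    weekly = (
        df.set_index(date_col)[price_col]
        .resample("W")
        .agg(["mean", "count"])
        .rename(columns={"mean": "avg_price", "count": "n_listings"})
    )
    # Drop weeks without any listings so they do not distort the correlation.
    return weekly[weekly["n_listings"] > 0]

# Toy data covering three consecutive weeks (not real listings).
listings = pd.DataFrame({
    "date": pd.to_datetime([
        "2020-01-06", "2020-01-07",
        "2020-01-13", "2020-01-14", "2020-01-15",
        "2020-01-20", "2020-01-21", "2020-01-22", "2020-01-23",
    ]),
    "price": [100, 110, 110, 120, 130, 130, 140, 150, 160],
})

weekly = weekly_summary(listings)
# Pearson correlation between weekly listing count and weekly average price.
correlation = weekly["n_listings"].corr(weekly["avg_price"])
print(weekly)
print(correlation)
```

A correlation close to 1 indicates that the two series move together; it does not by itself show that more listings cause higher prices.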

The figure shows the correlation between the number of listings and the average price. As the number of listings rises, the average price rises with it.

Since Boston is a big city, let us next visualise how its different neighbourhoods affect the listing price:

The figure shows the average listing price of entire homes/apartments, private rooms, shared rooms and hotel rooms across the neighbourhoods. On average, the highest asking prices are for entire homes/apartments, and many neighbourhoods have no listings for shared or hotel rooms.
Like the chart of average prices by room type, the chart above shows the average deviation of listing prices across neighbourhoods. Listing prices fluctuate the most near West Roxbury, while Hyde Park has the lowest average difference among its listings.
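The per-neighbourhood figures behind such charts could be computed along these lines. The sketch uses the standard deviation as the measure of spread, which is an assumption since the article does not name the exact statistic; the helper `price_by_neighbourhood`, the column names and the toy values are also hypothetical.

```python
import pandas as pd

def price_by_neighbourhood(df):
    """Mean and standard deviation of price per neighbourhood and room type."""
    return (
        df.groupby(["neighbourhood", "room_type"])["price"]
        .agg(avg_price="mean", price_std="std")
        .reset_index()
    )

# Toy data (not real listings).
listings = pd.DataFrame({
    "neighbourhood": ["West Roxbury", "West Roxbury", "Hyde Park", "Hyde Park"],
    "room_type": ["Entire home/apt"] * 4,
    "price": [100.0, 300.0, 90.0, 110.0],
})

result = price_by_neighbourhood(listings)
print(result)
```

Combinations with no listings, such as shared rooms in many neighbourhoods, simply do not appear in the result.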

5. Model training

After preprocessing, feature engineering and encoding the data into the respective formats, we split the data into training and validation sets, as in any other regression problem.

For this project, we use XGBoost as our model. Please note that the purpose of this experiment is not to achieve the highest accuracy, but to build a pipeline and develop critical thinking about the problem. The model can be swapped for another one or run with different parameters:

6. Model evaluation

Model performance on training set (Actual=Blue, Predicted=Orange)
Model performance on validation set (Actual=Blue, Predicted=Orange)
Feature importance (see the full picture in the GitHub notebook)

Features such as facebook, jumio and government_id, which we derived earlier, can clearly be seen contributing to the model.

Note: The ‘amenities’ feature set that we parsed earlier was not used in this model; however, you can try adding it too.

7. Future model improvements

Sentiment analysis

The Airbnb data also includes reviews of the listings on the platform. With this project I have included methods to perform sentiment analysis on them, which can be added to improve model performance. I use spaCy’s TextBlob for this purpose, and it can easily be replaced.

spaCy also makes mistakes: in the second record, it assigns a negative polarity because of the word ‘base’. Overall, though, the performance is acceptable.

Thank you for reading the article. Feel free to use the repo provided for your experiments.

Analytics Vidhya

Analytics Vidhya is a community of Analytics and Data…
