This article aims to lay a foundation for analysing Airbnb listing data. It demonstrates how to turn scraped data into features that help a model predict a listing’s price.
Airbnb, Inc. is an American vacation rental online marketplace company. Airbnb maintains and hosts a marketplace, accessible to consumers on its website or via an app.
Here is the reference to the GitHub repository, which provides the Python notebook and step-by-step documentation for setting up the project:
1. Data acquisition
The platform hosts up-to-date data on Airbnb listings. Data from any location can be used for this project; this analysis was performed on Boston’s data. Data from multiple locations can also be used and merged with pandas.
Please follow the step-by-step guide in the GitHub repository for detailed information on how to download the data for this project.
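As a minimal sketch of merging several locations with pandas: the two small frames below are invented stand-ins for downloaded listings files, and the second city and the `city` column are assumptions made for illustration only.

```python
import pandas as pd

# Toy stand-ins for two cities' listings; in practice each frame would
# come from pd.read_csv on a downloaded file (paths are not shown here).
boston = pd.DataFrame({"id": [1, 2], "price": ["$120.00", "$85.00"]})
seattle = pd.DataFrame({"id": [3], "price": ["$150.00"]})  # hypothetical second city

# Tag each frame with its city so the origin survives the merge,
# then stack the rows into a single frame.
frames = {"Boston": boston, "Seattle": seattle}
listings = pd.concat(
    [df.assign(city=city) for city, df in frames.items()],
    ignore_index=True,
)
print(listings)
```

Keeping a `city` column makes it possible to filter back to a single location later, or to use the location itself as a feature.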
2. Data preprocessing
To use the data effectively, relevant preprocessing techniques must be applied.
As part of this preprocessing, the following actions were performed:
- Converting dates to the pandas datetime format
- Removing currency symbols from the price and converting it to a continuous float type, which later lets the model predict continuous values
- Removing the per cent symbols from features such as acceptance rate and converting them to integers
- Removing outliers, so that abnormalities in the data do not carry over into our statistics and modelling
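The steps above can be sketched roughly as follows. The column names (`host_since`, `price`, `host_acceptance_rate`) and the sample values are assumptions, and the interquartile-range rule is only one common way to drop outliers; the article does not state which rule was actually used.

```python
import pandas as pd

# Invented sample rows; column names are assumed, not taken from the article.
raw = pd.DataFrame({
    "host_since": ["2015-03-01", "2018-07-15", "2019-11-30", "2020-01-10"],
    "price": ["$120.00", "$1,250.00", "$85.00", "$95.00"],
    "host_acceptance_rate": ["95%", "100%", None, "80%"],
})


def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    # Dates: strings -> pandas datetime
    out["host_since"] = pd.to_datetime(out["host_since"])
    # Price: strip "$" and thousands separators, then cast to float
    out["price"] = out["price"].str.replace(r"[$,]", "", regex=True).astype(float)
    # Percentages: strip "%", keep missing values as <NA> in a nullable integer column
    out["host_acceptance_rate"] = pd.to_numeric(
        out["host_acceptance_rate"].str.rstrip("%")
    ).astype("Int64")
    return out


def remove_price_outliers(df: pd.DataFrame, k: float = 1.5) -> pd.DataFrame:
    # Assumed rule: keep prices within k * IQR of the quartiles
    q1, q3 = df["price"].quantile([0.25, 0.75])
    iqr = q3 - q1
    return df[df["price"].between(q1 - k * iqr, q3 + k * iqr)]


clean = remove_price_outliers(preprocess(raw))
print(clean)
```

The nullable `Int64` type is used here so that hosts without an acceptance rate stay missing instead of forcing the whole column to float.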
3. Features engineering
To further enhance the feature set, some of the columns need to be parsed. For example, the ‘host_verifications’ and ‘amenities’ columns can be processed and parsed to become an effective source of information. As a sample, here are the amenities and host verifications columns containing the default scraped values:
After the relevant preprocessing with the get_unique_features and get_list_as_features functions, we obtain the following results:
Additionally, like the host verifications feature, the data contains other features that describe the host further. One such field is ‘host_since’, the date when the ad poster joined the platform. By calculating the number of days the host has been on the platform, we can enhance the features. We will see during evaluation whether this affects the model at all.
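The bodies of these two helper functions are not reproduced in this text, so the following is only a guess at what they might look like. It assumes each cell holds a Python-style list literal such as "['email', 'phone']"; the names match the article, but the implementations and the `parse_list` helper are hypothetical.

```python
import ast

import pandas as pd


def parse_list(value):
    """Hypothetical helper: turn "['email', 'phone']" into a Python list."""
    if not isinstance(value, str):
        return []  # missing values become an empty list
    try:
        parsed = ast.literal_eval(value)
    except (ValueError, SyntaxError):
        return []  # unparseable cells are treated as empty
    if not isinstance(parsed, (list, tuple)):
        return []
    return [str(item).strip() for item in parsed]


def get_unique_features(series):
    """Collect every distinct item that appears in a list-like column."""
    unique = set()
    for value in series:
        unique.update(parse_list(value))
    return sorted(unique)


def get_list_as_features(df, column):
    """Replace a list-like column with one 0/1 column per distinct item."""
    parsed = [parse_list(value) for value in df[column]]
    out = df.drop(columns=[column])
    for feature in get_unique_features(df[column]):
        out[feature] = [int(feature in items) for items in parsed]
    return out


sample = pd.DataFrame({
    "id": [1, 2, 3],
    "host_verifications": [
        "['email', 'phone', 'facebook']",
        "['email', 'jumio', 'government_id']",
        None,
    ],
})
features = get_list_as_features(sample, "host_verifications")
print(features)
```

Depending on the snapshot, the amenities column may use a different layout (for example a brace-delimited set instead of a list literal), in which case `parse_list` would need to be adapted.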
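A minimal sketch of that calculation, assuming the scrape date is available in a `last_scraped` column (any fixed reference date would work equally well); `host_days_active` is a name chosen here for illustration.

```python
import pandas as pd

# Invented sample; `last_scraped` as the reference date is an assumption.
listings = pd.DataFrame({
    "host_since": pd.to_datetime(["2015-03-01", "2019-12-31"]),
    "last_scraped": pd.to_datetime(["2020-01-01", "2020-01-01"]),
})

# Subtracting two datetime columns gives a timedelta; .dt.days extracts whole days.
listings["host_days_active"] = (
    listings["last_scraped"] - listings["host_since"]
).dt.days
print(listings)
```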
4. Statistical analysis
The following figure shows the distribution of listing frequency throughout 2020:
The figure shows the average price of listings throughout the year, together with a calculated weekly average of the number of ads posted on the platform.
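One way to produce such weekly figures is to resample by week. The sketch below uses invented records with assumed `date` and `price` columns; the exact aggregation behind the article's figure is not shown in this text.

```python
import pandas as pd

# Invented records: one row per listing observation.
records = pd.DataFrame({
    "date": pd.to_datetime(
        ["2020-01-06", "2020-01-07", "2020-01-08", "2020-01-13", "2020-01-15"]
    ),
    "price": [100.0, 120.0, 110.0, 90.0, 130.0],
})

# "W" buckets rows into weeks ending on Sunday; per week we take the
# number of rows and the mean price.
weekly = (
    records.set_index("date")
    .resample("W")["price"]
    .agg(["count", "mean"])
    .rename(columns={"count": "n_listings", "mean": "avg_price"})
)
print(weekly)
```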
Now let’s see whether there is a correlation between the number of listings published in a time frame and their average price:
The figure shows the correlation between the number of listings and the average price. As the number of listings rises, the average price of the listings rises with it.
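To put a number on this relationship, a Pearson correlation coefficient can be computed from the weekly aggregates. The values below are made up purely to show the call and are not the article's data.

```python
import pandas as pd

# Made-up weekly aggregates, for illustration only.
weekly = pd.DataFrame({
    "n_listings": [10, 14, 18, 25, 30],
    "avg_price": [95.0, 101.0, 108.0, 121.0, 127.0],
})

# Series.corr uses the Pearson coefficient by default (range -1 to 1).
r = weekly["n_listings"].corr(weekly["avg_price"])
print(f"Pearson r = {r:.3f}")
```

A coefficient close to 1 indicates a strong linear association; it does not by itself show that one quantity causes the other.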
Moving on, since Boston is a big city, let us visualise the impact of its different neighbourhoods on the listing price:
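Before plotting, the per-neighbourhood numbers can be summarised with a groupby. This is a small sketch on invented prices, and the column name `neighbourhood_cleansed` is an assumption about the listings file.

```python
import pandas as pd

# Invented prices; the column name is assumed.
listings = pd.DataFrame({
    "neighbourhood_cleansed": [
        "Back Bay", "Back Bay", "Dorchester", "Dorchester", "Dorchester", "Allston",
    ],
    "price": [250.0, 300.0, 80.0, 100.0, 90.0, 120.0],
})

# Median price and listing count per neighbourhood, most expensive first.
by_area = (
    listings.groupby("neighbourhood_cleansed")["price"]
    .agg(median_price="median", n_listings="count")
    .sort_values("median_price", ascending=False)
)
print(by_area)
```

The median is used here because it is less sensitive than the mean to a few very expensive listings.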
5. Model training
After preprocessing, feature engineering and encoding the data into the respective formats, we split the data into training and validation sets, as in any other regression problem.
For this project, we use XGBoost as our model. Please note that the purpose of this experiment is not to achieve the highest accuracy, but to build a pipeline and develop critical thinking about the problem. The model can be swapped for another, or its parameters changed:
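A minimal sketch of such a split using pandas alone. The 80/20 ratio, the seed and the function name are assumptions; scikit-learn's `train_test_split` is a common alternative.

```python
import pandas as pd


def train_validation_split(df, target, val_fraction=0.2, seed=42):
    """Hypothetical helper: random split; assumes the index of df is unique."""
    val = df.sample(frac=val_fraction, random_state=seed)
    train = df.drop(index=val.index)
    return (
        train.drop(columns=[target]), val.drop(columns=[target]),
        train[target], val[target],
    )


# Invented data: 10 listings with two features and a price target.
data = pd.DataFrame({
    "bedrooms": [1, 2, 1, 3, 2, 1, 4, 2, 3, 1],
    "host_days_active": [100, 900, 50, 2000, 365, 30, 1500, 700, 1200, 10],
    "price": [80.0, 150.0, 75.0, 260.0, 140.0, 70.0, 320.0, 155.0, 240.0, 65.0],
})
X_train, X_val, y_train, y_val = train_validation_split(data, target="price")
print(len(X_train), len(X_val))
```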
6. Model evaluation
Features we derived earlier, such as facebook, jumio and government_id, can clearly be seen contributing to the model.
Note: The ‘amenities’ feature set that we parsed earlier was not used in this model; however, you can try adding it too.
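A minimal sketch of how the error metrics and the feature ranking might be computed. The importance values below are placeholders and not results from the article's model; with a fitted XGBoost regressor they would typically come from its `feature_importances_` attribute. Both function names are hypothetical.

```python
import numpy as np
import pandas as pd


def regression_report(y_true, y_pred):
    """Root mean squared error and mean absolute error."""
    err = np.asarray(y_pred, dtype=float) - np.asarray(y_true, dtype=float)
    return {
        "rmse": float(np.sqrt(np.mean(err ** 2))),
        "mae": float(np.mean(np.abs(err))),
    }


def rank_importances(names, importances, top=10):
    """Sort features by importance, largest first."""
    return pd.Series(importances, index=names).sort_values(ascending=False).head(top)


report = regression_report([100, 150, 200], [110, 140, 200])
# Placeholder importances, NOT values produced by the article's model.
ranking = rank_importances(["facebook", "jumio", "government_id"], [0.05, 0.02, 0.10])
print(report)
print(ranking)
```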
7. Future model improvements
The Airbnb data also includes the reviews of listings on the platform, and this project includes methods for performing sentiment analysis on them. The resulting sentiment can be added as a further feature to try to improve model performance. I use spaCy’s TextBlob for this purpose, which can easily be replaced.
Thank you for reading the article. Feel free to use the provided repo for your experiments.
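To illustrate how review sentiment could become a listing-level feature, the sketch below averages a polarity score per listing. The scorer is a deliberately tiny word-list placeholder, not the TextBlob-based method from the repository, and the `listing_id` and `comments` column names are assumptions.

```python
import pandas as pd

# Tiny placeholder lexicon, for illustration only.
POSITIVE = {"great", "clean", "lovely"}
NEGATIVE = {"dirty", "noisy", "rude"}


def toy_polarity(text):
    """Placeholder scorer returning a value in [-1, 1]; swap in a real one."""
    words = [w.strip(".,!?").lower() for w in str(text).split()]
    hits = [1 for w in words if w in POSITIVE] + [-1 for w in words if w in NEGATIVE]
    return sum(hits) / len(hits) if hits else 0.0


def listing_sentiment(reviews, scorer=toy_polarity):
    """Average review polarity per listing, ready to join onto the listings."""
    scored = reviews.assign(polarity=reviews["comments"].apply(scorer))
    return scored.groupby("listing_id")["polarity"].mean().rename("avg_review_polarity")


reviews = pd.DataFrame({
    "listing_id": [1, 1, 2],
    "comments": [
        "Great place, very clean!",
        "Noisy street but lovely host.",
        "Dirty and rude.",
    ],
})
sentiment = listing_sentiment(reviews)
print(sentiment)
```

Since the scorer is passed in as a function, a library-based one can be substituted, for example `lambda text: TextBlob(text).sentiment.polarity` when the textblob package is installed.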