Welcome to my first post on data science!
In this article, I’m going to analyse the Boston Airbnb listings, available on Kaggle at the following link.
To guide this analysis in a concrete way, the following three questions will be answered:
- Which are the most expensive neighbourhoods in Boston?
- Is it possible to create clusters of the Boston Airbnb listings?
- What are the factors that influence the price of a Boston Airbnb?
The analysis begins with a data understanding section, followed by a data preparation section, and then answers each of the questions in turn.
This article shares the questions and insights about the data in a non-technical way. For more technical details, you can consult the GitHub repository here.
The analysis is based on the listings database, which contains 95 columns and 3585 rows. Each row corresponds to an accommodation published on Airbnb in the city of Boston.
Several columns are not relevant to the questions. Therefore, as a first step, I selected only the relevant information and performed some basic data cleaning, such as removing unwanted symbols ($, %), to make the information more readable.
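As a rough illustration of that cleaning step, here is a minimal pandas sketch. The toy rows, the column names (price, host_response_rate) and the helper strip_symbols are assumptions made for the example, not the exact code in the repository.

```python
import pandas as pd

# Toy stand-in for the raw listings file (values and column names are assumed)
raw = pd.DataFrame({
    "price": ["$1,250.00", "$85.00", "$120.00"],
    "host_response_rate": ["100%", "87%", None],
})

def strip_symbols(series):
    """Remove '$', ',' and '%' from a string column and convert it to float."""
    cleaned = series.str.replace(r"[$,%]", "", regex=True)
    return pd.to_numeric(cleaned, errors="coerce")

listings = raw.copy()
listings["price"] = strip_symbols(listings["price"])
listings["host_response_rate"] = strip_symbols(listings["host_response_rate"])
```

Missing entries stay missing after the conversion, so they can still be counted and handled later.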
Now, we can continue with the data exploration process.
The variable of interest: price per accommodate
An important variable to analyse is the price, or variations of it. Price by itself has a lot of variance and many outliers. Since there is enough information, I am going to look at the price per accommodate variable, whose histograms are shown in figure 1.
I do this because a variable such as price per accommodate or price per square metre seems more intuitive than the price alone: the raw price covers a very wide variety of properties, and dividing by capacity is a simple way to narrow that variety down.
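To make the variable concrete, the sketch below computes it on a few invented rows. I am assuming here that it is simply the price divided by the number of guests a listing accommodates, and that the columns are called price and accommodates.

```python
import pandas as pd

# Hypothetical cleaned listings: numeric price and guest capacity
listings = pd.DataFrame({
    "price": [250.0, 85.0, 120.0, 900.0],
    "accommodates": [5, 2, 3, 2],
})

# Price divided by the number of guests the listing accommodates
listings["price_per_accommodate"] = listings["price"] / listings["accommodates"]

# Same data restricted to an upper limit of 300, useful for a zoomed histogram
zoomed = listings[listings["price_per_accommodate"] <= 300]
```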
The histograms in figure 1 both show the price per accommodate variable. On the left, we have the whole range, and on the right, the same data with an upper limit of 300. From this first figure, it is possible to conclude that the price has a right-skewed distribution and that we are in the presence of outliers.
Besides the price per accommodate, there are 15 other numerical features, whose histograms can be seen in figure 2.
Another important point that we must pay attention to is the number of null values in each of the numerical features. From figure 3, it is possible to see that in most cases, the percentage of nulls in each column is low, except in square_feet, where more than 98% of the data points are null.
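A percentage of nulls per column like this can be obtained in one line with pandas. The sketch below uses a small invented table, so the numbers are not those of the real dataset.

```python
import numpy as np
import pandas as pd

# Invented rows; only the column names mimic the listings data
listings = pd.DataFrame({
    "bathrooms": [1.0, 2.0, np.nan, 1.0],
    "square_feet": [np.nan, np.nan, np.nan, 800.0],
    "beds": [1.0, 2.0, 1.0, 3.0],
})

# Share of missing values per column, as a percentage, largest first
null_pct = listings.isnull().mean().mul(100).sort_values(ascending=False)
```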
The categorical features present in this dataset are the following 8:
- neighbourhood — 25 categories
- property_type — 13 categories
- room_type — 3 categories
- bed_type — 5 categories
- cancellation_policy — 4 categories
- host_response_type — 4 categories
- host_is_superhost — boolean
- host_identity_verified — boolean
The null values in the categorical variables are close to zero in almost every variable, except for host_response_type, where they represent less than 15% as can be seen in figure 4.
Another important point to consider in this exploratory analysis is the number of data points in each category. Figure 5 shows the number of observations in each category of the neighbourhood feature.
This is important because the robustness of the results will depend on the amount and variety of the data available.
Similar charts for the other 7 features can be found in the GitHub repository mentioned above.
Finally, there are 43 amenities, which correspond to different features of the properties. In figure 6 it is possible to see the proportion of properties that have each amenity.
After data exploration, it is easy to understand the challenges we must address in order to prepare the data for analysis. The three main problems that we are going to tackle are null values, handling of categorical variables and outliers.
To handle null values I applied the following measures:
- Remove square_feet, because more than 95% of its values are null.
- Fill in the null values with the median in the numerical features.
- Fill in the null values with the mode in the categorical features.
When dealing with null values, the most traditional approaches are filling the missing values with either the mean, the median or the mode. For the numerical variables, I chose the median because the variables, in this case, are not close to a bell-shaped distribution, and the median is a better measure of the centre than the mean in these cases. For the categorical variables, the mode seems a more representative value for this dataset.
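The three measures above can be sketched as follows. The rows are invented, and splitting the columns by dtype is my assumption about how numerical and categorical features are told apart.

```python
import numpy as np
import pandas as pd

# Invented rows with missing values in both kinds of column
listings = pd.DataFrame({
    "bathrooms": [1.0, np.nan, 1.0, 4.0],
    "square_feet": [np.nan, np.nan, np.nan, 800.0],
    "bed_type": ["Real Bed", None, "Real Bed", "Futon"],
})

# Drop the column that is almost entirely null
listings = listings.drop(columns=["square_feet"])

numeric_cols = listings.select_dtypes(include="number").columns
categorical_cols = listings.select_dtypes(exclude="number").columns

# Median for numerical features (robust to skew), mode for categorical ones
listings[numeric_cols] = listings[numeric_cols].fillna(listings[numeric_cols].median())
for col in categorical_cols:
    listings[col] = listings[col].fillna(listings[col].mode().iloc[0])
```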
To handle the categorical variables, I used one-hot encoding, since answering questions 2 and 3 requires machine learning algorithms that cannot work with raw categorical values.
The other common option for encoding categorical variables is integer encoding, but it introduces a distortion: it imposes an order on the categories and gives some of them more weight than others, which does not reflect reality in this case. This is why I decided to use dummy variables for all the categorical features mentioned above.
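For readers who have not seen dummy variables before, here is a small sketch using pandas get_dummies on an invented table. The real analysis encodes all eight categorical features, while this example encodes only room_type.

```python
import pandas as pd

# Invented rows; room_type has three categories as in the listings data
listings = pd.DataFrame({
    "room_type": ["Entire home/apt", "Private room", "Shared room", "Private room"],
    "beds": [2, 1, 1, 1],
})

# One indicator column per category; numeric columns pass through unchanged
encoded = pd.get_dummies(listings, columns=["room_type"], dtype=int)
```

Each category becomes its own 0/1 column, so no category is treated as larger or smaller than another.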
Finally, I need to deal with outliers. Removing them is a difficult task: we don’t want anomalies to distort the analysis, but we don’t want to clean the dataset so much that we end up with an overfitted model either.
For now, I will use Isolation Forest to detect and remove outliers. Broadly speaking, this algorithm is an ensemble method, similar to a random forest, that explicitly isolates anomalies instead of profiling normal behaviour, which is the more traditional approach.
In practice, I applied the sklearn implementation, detecting and removing 180 outliers, which represent about 5% of the data.
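A minimal sketch of this step with sklearn’s IsolationForest is shown below. The data is synthetic, and the contamination value of 0.05 is an assumption chosen to match the roughly 5% removed, not a setting confirmed by the original analysis.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
# Synthetic stand-in: 200 ordinary rows plus 5 extreme ones
X = np.vstack([rng.normal(0, 1, size=(200, 3)), rng.normal(15, 1, size=(5, 3))])

# contamination is the expected share of outliers; 0.05 is an assumed value
forest = IsolationForest(contamination=0.05, random_state=0)
labels = forest.fit_predict(X)  # 1 = inlier, -1 = outlier

# Keep only the rows labelled as inliers
X_clean = X[labels == 1]
n_removed = int((labels == -1).sum())
```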
Which are the most expensive neighbourhoods in Boston?
For this first question, you don’t need a complicated analysis to get the first insights. It can be answered with figure 7, which shows the price per neighbourhood, ordered by the median.
It is possible to see that there are outliers in all neighbourhoods, but also a clear trend, with most of the data points concentrated near the medians and a ranking of the different neighbourhoods, from the cheapest, West Roxbury, to the most expensive, Leather District.
It is often true that a picture is worth a thousand words, and for a first approach, figure 7 is a good, quick and simple answer to this question. To go deeper, one option would be to run an ANOVA test to get a more concrete quantitative result, but I’m going to leave the analysis here for this post.
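The ordering behind a chart like this can be sketched with a groupby. The prices below are invented purely for illustration and are not taken from the dataset, and the column name price_per_accommodate is an assumption.

```python
import pandas as pd

# Invented prices; only the neighbourhood names are real places in Boston
listings = pd.DataFrame({
    "neighbourhood": ["West Roxbury", "West Roxbury", "Leather District",
                      "Leather District", "Back Bay", "Back Bay"],
    "price_per_accommodate": [20.0, 30.0, 110.0, 90.0, 60.0, 70.0],
})

# Median price per neighbourhood, from cheapest to most expensive
ranking = (
    listings.groupby("neighbourhood")["price_per_accommodate"]
    .median()
    .sort_values()
)
```

The sorted index can then be passed as the category order of a boxplot.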
Is it possible to create clusters of the Boston Airbnb listings?
The spirit of this question is to find out whether it is possible to separate the properties listed on Airbnb into groups, or clusters. It may be the case that there are no significant differences between the properties, but if there are, it could be a useful tool to recommend actions to the owners to improve their ads or see if it is possible to match a cluster of properties with a cluster of end customers, among other things.
To answer this question I decided to use k-means, an unsupervised algorithm that aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean.
The first step was to normalize the numerical variables, to avoid distortions and facilitate the learning process.
The second step was to find the optimal value of k, which is a hyperparameter of the model. To find k I used the elbow method shown in figure 8.
The method consists of plotting the explained variation as a function of the number of clusters and picking the elbow of the curve as the number of clusters to use. In this case, I used k = 3.
After running k-means, we have 3 clusters: c0 with 1297 data points, c1 with 1163, and c2 with 945. So far, so good, but the interpretation of the different clusters is not straightforward. We have 116 different variables, and in a great number of them, the three clusters do not behave as differently as I would have expected.
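The two steps can be sketched as follows on synthetic data. I am assuming StandardScaler for the normalization and using the k-means inertia (within-cluster sum of squares) as the quantity plotted against k. Both are common choices, but they may differ from what the repository does.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-in with three well-separated groups
X = np.vstack([rng.normal(center, 0.5, size=(50, 2)) for center in (0, 5, 10)])

# Put every feature on the same scale before computing distances
X_scaled = StandardScaler().fit_transform(X)

# Inertia = within-cluster sum of squares; it typically drops as k grows
inertias = {}
for k in range(1, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_scaled)
    inertias[k] = model.inertia_
```

Plotting inertias against k and looking for the point where the curve flattens gives the elbow. On this toy data the drop stops being dramatic after k = 3.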
In an effort to interpret the results I created boxplots for the numerical variables, divided by cluster, like the one in figure 9.
As we can see in figure 9, the prices per accommodate do not change much between one cluster and another. Something similar happens with the other numeric variables, whose boxplots can be found in the GitHub repository.
To compare the categorical variables, I created a score. This score corresponds to the number of data points that belong to a certain category and cluster, divided by the total number of observations in the data set.
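As a sketch of how such a score can be computed, the example below uses pandas crosstab on invented cluster labels. The column names are assumptions.

```python
import pandas as pd

# Invented cluster labels and property types
listings = pd.DataFrame({
    "cluster": [0, 0, 1, 1, 2, 2, 2, 0],
    "property_type": ["House", "House", "Apartment", "House",
                      "Apartment", "Apartment", "Condominium", "Apartment"],
})

# Count of listings per (category, cluster) pair divided by the total number of rows
score = pd.crosstab(listings["property_type"], listings["cluster"]) / len(listings)
```

Because every cell is divided by the same total, all the scores in the table add up to 1.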
In figure 10, it is possible to see some differences between groups. One of the clusters has more houses than the other two, and the same phenomenon occurs with private rooms, but in general, there are no major differences between clusters.
Now, let’s take a look at the amenities. I used the same score as for the categorical variables. Figure 11 shows 15 of the amenities. Here it is possible to see some differences. For example, cluster 0 offers 24-hour check-in in a higher proportion than the other two clusters, but in general, all three clusters tend to offer the same amenities, with no clear differences between them.
For the purposes of this post, we will leave the analysis here, but my conclusion is that this question has not been answered and that there is much more to explore.
To say that it is not possible to separate the Airbnb listings into clusters is a hasty conclusion. As possible next steps, I would explore different clustering techniques and perform feature selection in order to have more interpretable results.
What are the factors that influence the price of a Boston Airbnb?
There are many approaches that could be used to answer this question. I chose to use Random Forest and the interpretation of feature importance, as it is the technique with which I am most familiar.
This approach can be divided into two main parts. First, we need to create a predictive model for the price per accommodate utilizing a random forest algorithm, and then interpret the feature importance of that model.
It is important to bear in mind that if the predictive model is weak, the interpretations of the feature importance will be weak too. That is why, as in any machine learning exercise, we must measure the predictive capability of the model, as a signal that we are capturing the real distribution behind the data.
In the data preparation section, we already did the pre-processing needed to create the predictive model. Next, we need to divide the data into the response vector, which in this case is the price per accommodate, and the explanatory matrix, which consists of the remaining 115 features. Then we split the data into train and test sets, train the algorithm on the train set, and finally predict the response of the test set and evaluate.
As an evaluation metric, I used RMSE, a traditional metric in regression tasks like this. The RMSE is the square root of the mean of the squared differences between predicted and observed values, that is, the quadratic mean of these differences. As a general rule, the smaller the value the better, bearing in mind that an RMSE of 0 probably means the model is heavily overfitted.
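The workflow just described can be sketched with sklearn as follows. The data is synthetic, and the 70/30 split and the number of trees are assumed values, not necessarily the ones used in the repository.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Synthetic stand-in for the explanatory matrix X and the response vector y
X = rng.normal(size=(300, 5))
y = 50 + 10 * X[:, 0] + rng.normal(scale=2, size=300)

# Hold out part of the data so the model is judged on rows it has not seen
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# Predictions on both sets, ready to be scored with an error metric
pred_train = model.predict(X_train)
pred_test = model.predict(X_test)
```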
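Written out with NumPy, the metric looks like this. The function name rmse and the three example values are mine, chosen only to show the arithmetic.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Square root of the mean squared difference between observed and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Errors of 3, -4 and 0 give squared errors 9, 16 and 0, so RMSE = sqrt(25 / 3)
example = rmse([50, 40, 100], [47, 44, 100])
```

Since the errors are squared before averaging, a few large misses weigh much more than many small ones, which is worth keeping in mind when the data contains outliers.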
In the first model performed, I obtained RMSE results of 62.53 in the test set and 23.03 in the train set, which seemed very large to me. To take a closer look, I made the graph in figure 12, where the real response is in blue and the predictions of the model in orange.
Here we can clearly see that the model is not good at predicting high prices, but we can also conclude that the cleaning of outliers done previously seems to have not been enough.
For the second iteration of the model, I used an extremely simple approach: I applied an upper bound of 200 to the price per accommodate. This model has an RMSE of 24.46 on the test set and 16.62 on the train set.
In figure 13 we can analyse the results and clearly see that the model predicts neither low nor high prices well.
There are many things we could do to improve this model, such as tuning hyperparameters, improving the data pre-processing by being more careful with the outliers, performing feature selection, or using other algorithms that may be a better fit for this dataset.
For now, we are going to assume that this second model is good enough to answer this question and move on to the interpretation of the feature importances.
The random forest used here is the sklearn implementation, where the attribute feature_importances_ corresponds to the impurity-based feature importance. For now, we only need to know that the higher the value, the more important the feature. The importance of a feature is computed as the normalized total reduction of the criterion brought by that feature, and it is also known as the Gini importance.
Figure 14 shows the 20 features with the highest feature importance, and therefore we can conclude that these 20 features are the most influential variables for the price.
Note that several neighbourhoods appear among these most influential variables. Since we used dummy variables, each neighbourhood appears as a separate feature in the dataset, but taken together, this indicates that the neighbourhood feature is important in determining the price.
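To illustrate how the importances can be read, and how dummy columns can be added back together, here is a sketch on synthetic data. The features, the invented price relationship and the grouping by the neighbourhood_ prefix are all assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 400
# Synthetic features: one numeric, one pure-noise column, one dummy-encoded category
hood = pd.Series(rng.choice(["A", "B", "C"], size=n))
X = pd.concat(
    [
        pd.DataFrame({"bedrooms": rng.integers(1, 5, size=n),
                      "noise": rng.normal(size=n)}),
        pd.get_dummies(hood, prefix="neighbourhood", dtype=int),
    ],
    axis=1,
)
# Invented relationship: price driven by bedrooms and by being in neighbourhood A
y = 30 * X["bedrooms"] + 20 * X["neighbourhood_A"] + rng.normal(size=n)

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# Impurity-based importances, one per column, summing to 1
importances = pd.Series(model.feature_importances_, index=X.columns)
top_features = importances.sort_values(ascending=False)

# Sum the dummy columns that came from the same original feature
by_feature = importances.groupby(
    lambda name: "neighbourhood" if name.startswith("neighbourhood_") else name
).sum()
```

Summing the dummy columns gives a single number for the original neighbourhood feature, which is easier to compare with the numeric features than a long list of separate dummies.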
As a first approach to a data set, this exercise was very interesting. As we saw in each of the questions, much more can be done and in my short experience in data science, I see that it is always possible to improve the analysis and go further.
There is a natural trade-off between time invested and quality of results, and it must always be borne in mind that sometimes done is better than perfect. Many times a simple plot can be enough to reach the objective and answer a question. Simple is better, but always remember to be transparent, highlight all the issues you can detect, and be honest about the real significance of the results.
Practice is always important, which is why I encourage you to redo this exercise, asking questions of your own interest. It’s a fun way to structure work and advance on the path of becoming a data scientist. Good luck!