How does location affect the price of Airbnb in Prague?

Published in Analytics Vidhya, May 4, 2020

The city of Prague went from obscurity around the turn of the century to being the 5th most visited European city. It is not surprising that the Airbnb market in Prague is booming: hosts are competing for every booking and tourists are looking for hidden gems among the listings.

In this analysis I was most interested in how different features of an Airbnb listing, especially its location, affect the price.

What data is available to us?

Thanks to the Inside Airbnb project we are able to use data about listings freshly scraped straight from Airbnb. This includes information about:

property — number of bedrooms, bathrooms, amenities, property type, text description, etc.
hosts — number of other listings, superhost status, hosting experience
property location — neighbourhood info, latitude and longitude
availability
summary of reviews

The only usable geographical feature in our data that could have an effect on price is the name of the neighbourhood. Since we have the latitude and longitude values for each listing, we could determine how far it is from the city centre. A naive approach would be to just compute the distance “as the crow flies”. But we could do better…
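For reference, the naive straight-line distance can be sketched with the haversine formula. This is only an illustration, not code from the repository; the Old Town Square coordinates are approximate and the function name is my own.

```python
import math

# Approximate coordinates of the Old Town Square (an assumption for illustration).
OLD_TOWN_SQUARE = (50.0875, 14.4213)

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle ("as the crow flies") distance between two points, in km."""
    r = 6371.0  # mean Earth radius in km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def distance_to_centre_km(lat, lon):
    """Straight-line distance from a listing to the assumed centre point."""
    return haversine_km(lat, lon, *OLD_TOWN_SQUARE)
```

Applied to every listing's latitude and longitude, this gives a baseline feature that ignores how well a place is actually connected to the centre.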

Assuming that travel time to the city centre by public transport is the most desirable geographical quality of a listing, we can enhance our data set with a “transit time to city centre” feature obtained from Google via their Google Maps Route API. The Old Town Square was chosen as our centre location and all times were obtained for a weekday at 9:00 AM.
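A rough sketch of how such a lookup might be structured is below. It assumes the Google Maps Directions web service with `mode=transit`, which may not be the exact endpoint used in this analysis. The helper names are hypothetical, the API key is a placeholder, and the request is only built here, not sent.

```python
from datetime import datetime, timezone
from urllib.parse import urlencode

# Directions web service endpoint (assumed; the analysis may have used another endpoint).
DIRECTIONS_URL = "https://maps.googleapis.com/maps/api/directions/json"

def build_transit_request(lat, lon, destination, departure, api_key):
    """Build a transit directions URL from a listing to the chosen centre."""
    params = {
        "origin": f"{lat},{lon}",
        "destination": destination,
        "mode": "transit",
        # The service expects departure time as seconds since the Unix epoch.
        "departure_time": int(departure.timestamp()),
        "key": api_key,
    }
    return f"{DIRECTIONS_URL}?{urlencode(params)}"

def transit_minutes(response):
    """Extract travel time in minutes from a parsed JSON response, or None."""
    if response.get("status") != "OK" or not response.get("routes"):
        return None
    legs = response["routes"][0]["legs"]
    # Each leg reports its duration in seconds under duration.value.
    return sum(leg["duration"]["value"] for leg in legs) / 60
```

Listings for which no route is returned come back as `None`, so they can be imputed or dropped later together with the other missing values.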

Data visualisations

Thanks to Airbnb’s geospatial data we are able to visualise the distribution of listings in Prague.

The colour saturation of a borough represents the share of listings within the Prague market

Kernel density estimation plot for all the listings

We can see that the central district (Prague 1) has by far the biggest share of the listings, almost 30%. The surrounding boroughs are still reasonably popular, especially Prague 2 and 3. The outskirts, on the other hand, are heavily underrepresented, with only occasional clusters of listings.

Visualising public transit times

Each dot is an individual listing and the hue represents the time to city centre

From the plot above we can see that travel times don’t increase uniformly with distance from the centre; instead, they follow line patterns radiating from the city centre. These lines in fact coincide with the Prague metro system.

The colour saturation of a borough represents the mean public transit time to the city centre for listings within them

This plot gives us an idea of which boroughs might be deceiving us with their location. For example, Dolní Chabry and Ďáblice are geographically fairly close to the city centre but lack a connection to the Prague transit system. The opposite could be said about the Satalice and Běchovice neighbourhoods, which are connected directly to the city centre by a regional train service.

Influence of location on listing’s price

The approach taken in this analysis is to train a machine learning model predicting price per day on available features of a listing and then examine their influence on the model’s decisions.

Feature engineering included removing irrelevant and highly correlated variables from the data set, one-hot encoding categorical variables and imputing missing values (detailed information is in the associated GitHub repo).
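The steps above might look roughly like the following pandas sketch. The function name, the 0.9 correlation threshold, median imputation and the order of the steps are all my assumptions for illustration; the repository holds the actual pipeline.

```python
import numpy as np
import pandas as pd

def prepare_features(df, corr_threshold=0.9):
    """One-hot encode, impute and drop highly correlated columns (illustrative)."""
    # One-hot encode categorical (object/category) columns.
    df = pd.get_dummies(df, dtype=float)
    # Impute missing values with the column median (an assumed strategy).
    df = df.fillna(df.median())
    # Look only at the upper triangle so each pair of columns is checked once.
    corr = df.corr().abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    # Drop the later column of every pair whose correlation exceeds the threshold.
    to_drop = [col for col in upper.columns if (upper[col] > corr_threshold).any()]
    return df.drop(columns=to_drop)
```

Dropping one column of each highly correlated pair also matters later, because correlated features can distort the model's built-in feature importance scores.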

The machine learning technique chosen was regression using the popular XGBoost algorithm. The trained model includes information about feature importance based on the average weight, information gain or coverage of its decision trees, but these metrics are sometimes misleading, especially if we didn’t sufficiently eliminate correlated features.

Explanations using Shapley values

To interpret the influence of our features we are going to turn to a concept from cooperative game theory. By making predictions with various subsets of our features and seeing how each subset changes the predicted price, we compute the average contribution of each feature to the output: its Shapley value. Find more information about this technique and the library used in the SHAP repository.
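To make the idea concrete, here is a minimal sketch that computes exact Shapley values by enumerating every feature subset. It is not how the SHAP library works internally (SHAP uses much faster algorithms for tree models); the toy price model, its coefficients and the feature values are made up, and "missing" features are simply replaced by baseline values, which is a simplification.

```python
from itertools import combinations
from math import factorial

def shapley_values(predict, instance, baseline):
    """Exact Shapley values of `instance` relative to `baseline` (brute force)."""
    names = list(instance)
    n = len(names)

    def value(subset):
        # Features in the subset take the listing's values; the rest stay at baseline.
        x = {k: (instance[k] if k in subset else baseline[k]) for k in names}
        return predict(x)

    phi = {}
    for name in names:
        others = [k for k in names if k != name]
        total = 0.0
        for size in range(n):
            # Shapley weight for coalitions of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                total += weight * (value(s | {name}) - value(s))
        phi[name] = total
    return phi

# Hypothetical toy price model in CZK, invented purely for illustration.
def toy_price(x):
    return 800 + 300 * x["accommodates"] - 15 * x["minutes_to_centre"]

listing = {"accommodates": 4, "minutes_to_centre": 10}
average = {"accommodates": 2, "minutes_to_centre": 30}
print(shapley_values(toy_price, listing, average))
```

For this toy listing the contributions add up to the difference between its predicted price and the prediction for the baseline listing, which is the property that makes Shapley values convenient for explaining individual predictions.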

SHAP summary plot for our predictive model, the output is the impact on price in Czech crowns (CZK).

We can see that the most important feature for our model’s decisions is how many guests the property accommodates. The second most important predictor is the travel time to the centre. As we expected, it has an inverse relationship with the price: the less time it takes to travel from the property to the city centre, the higher the price. It also has a much greater influence than other seemingly important factors, e.g. the review scores. There is an interesting phenomenon where the number of reviews has a negative influence on the price. This seems counter-intuitive, but it could simply be because cheaper properties accommodate fewer people and are rented much more often, which leads to a higher review count.

SHAP partial dependency plot for “time to city centre”. We can see a clear decrease in price as the travel time increases. The hue represents the number of people a given property can accommodate.

Conclusion

We looked at various ways to visualise geospatial data and concluded that the location of a property has, not surprisingly, a high impact on its price. I also wanted to show the value of constructing new features from external sources, in our case the Google Maps API. With this analysis we are only scratching the surface, and there is massive potential for providing insight into the Airbnb market. Feel free to download up-to-date data from Inside Airbnb and play around with it.

All code associated with this post is in my GitHub repository: https://github.com/jakubkocvara/prague-airbnb.