How to Choose Accommodation on your next Airbnb Getaway

An Analysis of Istanbul using Airbnb Data

Daniel Grewal
Analytics Vidhya
15 min read · Jun 18, 2020


Hanging umbrellas on the streets of Kadikoy
Kadikoy, Istanbul

Introduction

For thousands of years, Istanbul has been considered the city at the centre of the world. Here, East meets West. Greek, Byzantine, Roman and Ottoman influences have blended to create one of the most culturally diverse populations anywhere on the planet, so it is not hard to see why this city is one of the most visited places on earth. This post offers some insights to help you choose your ideal accommodation in this diverse and timeless city. We will be looking at the following:

  • Which side of the city has cheaper accommodation?
  • Which month of the year has the best deals on accommodation?
  • What are customers saying in accommodation reviews?

To help answer these questions, the data was filtered on the following conditions:

  • Only listings where the minimum night stay was less than five days were considered. This was so that more focus could be on tourists as opposed to long term let occupants.

Only the following property types were considered:

  • Apartments (including serviced apartments)
  • Houses
  • Hotels (including Aparthotels and Boutique hotels)
  • Townhouses
  • Bed and Breakfasts

These property types made up over 92% of all listings, with apartments making up over 70%.
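The filtering described above could be sketched in pandas roughly as follows. This is a minimal illustration rather than the code from the analysis: the column names `minimum_nights` and `property_type`, the exact property type labels and the sample rows are all assumptions.

```python
import pandas as pd

# Hypothetical sample of listings; column names and labels are assumed.
listings = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "property_type": ["Apartment", "Villa", "Boutique hotel", "House", "Apartment"],
    "minimum_nights": [2, 1, 3, 30, 4],
})

# Property types kept for the analysis, following the list above.
KEPT_TYPES = {
    "Apartment", "Serviced apartment", "House", "Hotel", "Aparthotel",
    "Boutique hotel", "Townhouse", "Bed and breakfast",
}

def filter_listings(df: pd.DataFrame) -> pd.DataFrame:
    """Keep short-stay listings (minimum stay under five nights) of the kept types."""
    short_stay = df["minimum_nights"] < 5
    kept_type = df["property_type"].isin(KEPT_TYPES)
    return df[short_stay & kept_type].reset_index(drop=True)

filtered = filter_listings(listings)
print(filtered)
```

In this toy sample the villa is dropped because of its property type and the house because of its 30-night minimum stay.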

The Airbnb data used for the analysis split Istanbul into 39 neighbourhoods. The data was sourced from http://insideairbnb.com/, a well-maintained website of Airbnb data for cities across the world. It should be noted that this website is not affiliated with Airbnb in any way and is instead maintained by an open-source community.

Before stepping into the analysis, it’s worth pointing out that all of the prices displayed throughout the post are in Turkish Lira (TL). As of 13th June 2020, 1TL = 0.12GBP = 0.15USD. I have also provided links at the bottom of this article to resources that I found useful when completing the analysis, as well as a link to my GitHub repository where all the code related to the analysis is stored.

Which Side of the City has Cheaper Accommodation?

The listings in the dataset were spread across 39 different neighbourhoods, with the majority of those neighbourhoods being located on the European side of the Bosphorus. The opposite side of the Bosphorus, to the East, is known more commonly as the Asian side of the city. Some neighbourhoods are also located further north along the Bosphorus and out towards the Black Sea.

  • There were nearly four times as many listings on the European side as there were on the Asian side.
  • The most popular area of the city for Airbnb listings was the Beyoglu neighbourhood. This neighbourhood is located on the European side, across the Golden Horn from the main centre of Istanbul, and it is where famous landmarks such as Taksim Square and the Galata Tower can be found.
  • The visualisation on the left-hand side below shows the count of the different types of Airbnb listings, colour-coded by the side of the city they fell on.
  • The visualisation on the right-hand side was used to identify whether or not there were any listings that could be considered outliers based on their average price per night.
  • Apartments were the most common listing type on both sides of the city.
  • The diamond-shaped points were considered outliers, as the average price per night of these listings was much higher than that of similar property types on the same side of the city.
  • A number of apartments were considered outliers, with an average price per night well into the tens of thousands of lira.
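To make the idea of an "outlier" point concrete, here is a small sketch. It assumes the plot was a standard box plot, where points further than 1.5 times the interquartile range (IQR) beyond the quartiles are drawn individually; the article does not state the rule that was used, and the data and column names below are made up.

```python
import pandas as pd

# Hypothetical average prices per night (TL); values are illustrative only.
prices = pd.DataFrame({
    "property_type": ["Apartment"] * 6 + ["House"] * 5,
    "avg_ppn": [200.0, 220.0, 250.0, 260.0, 300.0, 15000.0,
                400.0, 420.0, 450.0, 480.0, 500.0],
})

# Quartiles per property type, broadcast back onto every row.
grouped = prices.groupby("property_type")["avg_ppn"]
q1 = grouped.transform(lambda s: s.quantile(0.25))
q3 = grouped.transform(lambda s: s.quantile(0.75))
iqr = q3 - q1

# Assumed rule: anything beyond 1.5 * IQR from the quartiles is an outlier.
prices["is_outlier"] = (prices["avg_ppn"] < q1 - 1.5 * iqr) | (
    prices["avg_ppn"] > q3 + 1.5 * iqr
)

print(prices[prices["is_outlier"]])
```

Computing the bounds within each property type mirrors the point above that a listing is judged against similar properties rather than against the whole dataset.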

When investigating the “outlier” properties identified during the visual analysis shown above, it was found that many of these listings’ price per night (ppn) varied by large amounts across the twelve months of data.

  • Standard deviation was used to measure how much a listing’s ppn typically varied from the average (mean) ppn of that listing.
  • Across all property listings, the average standard deviation of ppn throughout the twelve months of data was 43TL (about £5.16 at the time of this analysis). This meant that, on average, the ppn of a property would typically be within 43TL of its 12-month average ppn.
  • Some listings’ ppn had a standard deviation of over 1000TL (£120) across the twelve months of data.
  • There were two listings whose standard deviation of ppn was well over 10,000TL. Further inspection suggested that inaccurate prices had been listed. This could have been a data entry error by the host of the property or it could have been an error in the collection of the data.
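The per-listing variability check above might look something like this in pandas. It is only a sketch: the table layout, the column names (`listing_id`, `month`, `ppn`) and the 1000TL threshold used to flag suspect listings are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical monthly prices per night (TL) for two listings.
calendar = pd.DataFrame({
    "listing_id": [1, 1, 1, 2, 2, 2],
    "month": [1, 2, 3, 1, 2, 3],
    "ppn": [100.0, 110.0, 120.0, 300.0, 300.0, 15000.0],
})

# Mean and standard deviation of ppn for each listing.
# pandas uses the sample standard deviation (ddof=1) by default.
per_listing = calendar.groupby("listing_id")["ppn"].agg(["mean", "std"])

# Flag listings whose price swings look implausibly large (assumed threshold).
suspect = per_listing[per_listing["std"] > 1000]

print(per_listing)
print(suspect)
```

A flagged listing is not necessarily wrong, but it is a sensible candidate for the kind of manual inspection described above.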

The visualisation above shows the average ppn for the different property types. With the exception of houses, property types were on average found to be at least 100TL more expensive on the European side of the city than on the Asian side. The average ppn of Boutique Hotels on the Asian side was found to be nearly half that of those on the European side.

The visualisation below shows the average ppn for properties by their type, region and the room type that was being advertised. The most obvious detail here is that the average ppn for Bed and breakfast listings (with Entire room/apt as the room type) on the European side of the city far exceeds other comparable prices in the plot. This is likely because only five listings matched this description and the ppn of one of them was relatively high, which pulls up the average of such a small group.
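The effect of one expensive listing on a small group can be illustrated with a quick grouped summary. The figures below are invented for the example, and the column names are assumptions; the point is simply that reporting the count and median next to the mean makes this kind of distortion easy to spot.

```python
import pandas as pd

# Hypothetical listings: five European B&Bs (one very expensive) and three Asian apartments.
listings = pd.DataFrame({
    "region": ["European"] * 5 + ["Asian"] * 3,
    "property_type": ["Bed and breakfast"] * 5 + ["Apartment"] * 3,
    "room_type": ["Entire room/apt"] * 5 + ["Private room"] * 3,
    "avg_ppn": [300.0, 320.0, 350.0, 380.0, 5000.0, 150.0, 180.0, 210.0],
})

# Count, mean and median ppn for each region / property type / room type group.
summary = (
    listings.groupby(["region", "property_type", "room_type"])["avg_ppn"]
    .agg(["count", "mean", "median"])
    .reset_index()
)

print(summary)
```

In this toy data the single 5000TL listing lifts the group mean well above the median, which stays close to the other four prices.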

Visually, it appeared that property listings on the Asian side of the city were, on average, cheaper than those on the European side. However, a statistical analysis of the correlation between average ppn and region found that there was little to no correlation between these variables in the data.

  • The point-biserial correlation was used as the correlation measure. This is an appropriate statistic for measuring the correlation between a continuous variable and a dichotomous (two-category) variable. It returns two values: an r coefficient between -1 and 1 that measures the strength and direction of the association (squaring it gives the proportion of variance in one variable that can be explained by the other), and a p-value that reports the significance of the test statistic (in this case, the r coefficient). The p-value is the probability of seeing a correlation at least this strong if there were really no association. A p-value below 0.05 is conventionally treated as statistically significant; larger p-values mean that we cannot rule out chance as the explanation.
  • The correlation between average ppn and the side of the city that listings fell on was 0.06 (with a p-value less than 0.05).
  • Squaring this gives an r² of roughly 0.004, meaning that less than 1% of the variability in average ppn could be explained by whether a listing was on the European side or the Asian side of the city. The result was statistically significant, but the association was very weak.

However, when comparing the relationship between the average ppn in the neighbourhoods of Beyoglu (European) and Kadikoy (Asian), the correlation was -0.15 (with a p-value less than 0.05). Beyoglu and Kadikoy were chosen for comparison as they contained the most property listings for the side of the city that they fell on.

  • Squaring this gives an r² of roughly 0.02, meaning that about 2% of the variation in the average ppn could be explained by whether a listing was located in Beyoglu or Kadikoy.
  • For the correlation analysis, Beyoglu was coded as 0 and Kadikoy as 1. The negative value therefore indicated that listings with a high average ppn were associated more with the Beyoglu neighbourhood.
  • This meant that the remaining 98% or so of variation in the average price per night was left to other factors, such as the number of rooms, room type, review scores etc.
  • Compare this to less than 1% of the variation in average ppn that could be explained by whether a listing was on the Asian or the European side of the city.
  • The latter point could be explained by the fact that there were many more neighbourhoods on the European side of the city than on the Asian side. Many of the European neighbourhoods were located further away from the centre of the city, where the average ppn for listings was more comparable with those on the Asian side.
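For readers who want to try this kind of test themselves, a point-biserial correlation can be computed with `scipy.stats.pointbiserialr`. The sketch below uses a tiny made-up sample with the same coding as above (Beyoglu = 0, Kadikoy = 1); the prices are illustrative and will not reproduce the figures from the analysis. Squaring r gives the share of variance in price that the two-way split accounts for.

```python
import numpy as np
from scipy import stats

# Hypothetical data: 0 = Beyoglu, 1 = Kadikoy, with average ppn in TL.
neighbourhood = np.array([0, 0, 0, 0, 1, 1, 1, 1])
avg_ppn = np.array([300.0, 350.0, 400.0, 450.0, 250.0, 300.0, 350.0, 400.0])

# Point-biserial correlation between the binary code and the price.
r, p_value = stats.pointbiserialr(neighbourhood, avg_ppn)

# Proportion of the variance in avg_ppn explained by the binary split.
r_squared = r ** 2

print(f"r = {r:.3f}, p-value = {p_value:.3f}, r squared = {r_squared:.3f}")
```

A negative r here means that higher prices go with the group coded 0. The point-biserial coefficient is numerically the same as a Pearson correlation computed with one variable coded as 0 and 1.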

When taking into account the different property types and room types, accommodation on the Asian side of the city has comparably better prices than those on the European side. Of course many of the tourist attractions are located on the European side of the city and this means that neighbourhoods within a relatively close proximity to these, contain listings with higher ppn than what could be found in almost any other neighbourhood of Istanbul. Therefore many tourists may prefer to pay higher prices to be located closer to these attractions.

However, with public transport in Istanbul being so cheap and with the ferry crossings on the Bosphorus carrying passengers frequently between the two sides, it could be argued that the Asian side of the city is where tourists will find better deals on accommodation. Here they will also likely experience more of the culture that Istanbul has to offer, not to mention the chance to experience trips across the Bosphorus that offer some of the best views to be had of the ancient city that rests on its seven hills.

The visualisation below shows the average ppn across two popular neighbourhoods from both sides of the Bosphorus.

  • The biggest difference to note here was in the prices of Boutique hotels. There were many more of these property types listed in Beyoglu, hence the larger spread of average ppn.
  • The average ppn of many of the property and room types was comparable between the two neighbourhoods.
  • However, Beyoglu had property and room types that were much more expensive than those found in Kadikoy, although a number of these were considered outliers and so were relatively few in number.

Which Month of the Year has the Best Deals on Accommodation?

So, we have found that the Asian side of Istanbul has the better-priced accommodation (when discounting European neighbourhoods located far out from the city centre), but when are the best deals to be had across the city?

The heatmap below shows the average listing ppn across the year for each neighbourhood. Lighter colours indicate higher average prices. At first glance, it seemed that listings in Fatih and Kartal were, on average, more expensive than those in other neighbourhoods. This was not surprising given that Fatih is where many of the most famous landmarks in Istanbul can be found and Kartal is a coastal neighbourhood on the Asian side that has been heavily re-developed over recent years, with excellent transport into the city centre.

The heatmap above has been broken down into two separate visualisations, one for European neighbourhoods and one for Asian neighbourhoods.

The heatmap below represents average ppn across neighbourhoods from the Asian region.

  • Kartal seemed to have the highest average ppn by some margin.
  • Sile is a coastal neighbourhood that can be found on the Black Sea, north of Istanbul. This neighbourhood is popular with Istanbul residents for summer holidays. It was surprising to see that this neighbourhood had lower ppn during the summer months when it would be at the height of the tourist season.
  • Overall, it seemed that many neighbourhoods had listings with lower ppn during the summer months.

The heatmap below shows the neighbourhoods for the European region.

  • There are many more neighbourhoods on this side of the Bosphorus that had higher average ppn than on the Asian side.
  • Similar to the Asian neighbourhoods, prices seemed to decrease during the summer months.

These visualisations could indicate that if looking for the best deal on accommodation, the summer months may be the best choice. However, it should also be noted that temperatures in Istanbul during these months can make exploring the city extremely uncomfortable. Could this be the reason why prices tend to decrease during these months? Further analysis could be done by analysing the number of bookings at different times of the year to see whether or not tourist numbers drop during the summer season.
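The neighbourhood-by-month averages behind heatmaps like these could be built with a pandas pivot table, roughly as sketched here. The column names, dates and prices are assumptions made for the example; the resulting table could then be passed to a plotting function such as `seaborn.heatmap`.

```python
import pandas as pd

# Hypothetical nightly prices (TL) with the neighbourhood of each listing.
calendar = pd.DataFrame({
    "neighbourhood": ["Beyoglu", "Beyoglu", "Beyoglu", "Kadikoy", "Kadikoy", "Kadikoy"],
    "date": pd.to_datetime([
        "2020-01-15", "2020-01-20", "2020-07-10",
        "2020-01-05", "2020-07-01", "2020-07-20",
    ]),
    "ppn": [400.0, 500.0, 350.0, 250.0, 200.0, 220.0],
})

# Month number (1-12) extracted from each date.
calendar["month"] = calendar["date"].dt.month

# Rows are neighbourhoods, columns are months, cells are the mean ppn.
monthly = calendar.pivot_table(
    index="neighbourhood", columns="month", values="ppn", aggfunc="mean"
)

# Month with the lowest average ppn for each neighbourhood.
cheapest_month = monthly.idxmin(axis=1)

print(monthly)
print(cheapest_month)
```

Reading across a row shows how one neighbourhood's prices move through the year, which is the comparison the heatmaps make visually.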

Analysis was done to derive insight into whether or not seasonal trends could be observed by property and room type over the twelve months in the data.

  • The visualisation below on the right-hand side shows that hotel room type prices increased month on month between January and July before decreasing into the autumn.
  • However, according to the visualisation on the left, hotel property type prices decreased from May onwards.
  • This was because, as we have seen in the visualisations from the previous section, some hosts listed rooms at hotels as private rooms, shared rooms or even entire homes. These other room types were also used in listings for other property types. This would have affected the results of the plots below.

There was no definitive answer as to when is the best time to visit Istanbul. It seemed to depend on what type of accommodation would be preferred by the guest. There seemed to be overall price decreases across the summer months within a number of neighbourhoods. But analysis of prices based on property and room types seemed to offer conflicting insights. Further analysis to look at the number of bookings across the year may have offered more insight into answering the question of when is the best time to visit Istanbul.

What are Customers saying in Accommodation Reviews?

When deciding where to stay on holiday, what other people have said in their reviews of the accommodation on your shortlist can determine whether or not you click that reserve button. Sometimes you may not have the time to sit and read review after review. Instead, you may just want a summarised version of a whole bunch of reviews for a particular accommodation.

Several methods were considered and tested when looking at how best to identify the main themes in the review comments.

  • Topic modelling techniques such as Non-negative Matrix Factorization and Latent Dirichlet Allocation were tested but these were not able to identify distinct themes in the reviews.
  • There were also a number of different languages that the reviews were written in, and so this had to be taken into account when thinking about how to model the data.

In the end, with a little help and inspiration from other great Medium posts (see links below), a process involving multiple stages of K-means clustering and Word2Vec modelling was chosen. The approach can be summarised in the following steps:

  1. Clean review comments using the standard approach (convert to lowercase, remove stopwords and punctuation etc.)
  2. Identify frequently occurring bigrams.
  3. Develop and apply a Word2Vec model to the cleaned review comments. We can call this the multi-language Word2Vec model.
  4. Apply K-means clustering to the resultant word vectors to distinguish the English comments from the French, German and Turkish comments.
  5. Apply a “restricted” version of the earlier Word2Vec model so that it only includes words from the English vocabulary. Let’s call this the English Word2Vec model.
  6. Apply a new K-means clustering model to the resultant word vectors to identify the different themes that were observed in the corpus of review comments.
  7. Apply TF-IDF (Term Frequency-Inverse Document Frequency) to the English comments that were identified using the multi-language Word2Vec model, and extract the key terms within each of the reviews.
  8. Finally, map the key terms that were extracted from the reviews to the cluster (topic) that they were classified into from step 6 and voila! You have a theme view of each review.

The result of the steps above was that twelve themes were identified throughout the English language review comments. This part of the analysis was subjective, so if somebody else were to perform it, they would likely come up with a different number of topics depending on the clustering parameters. However, from the twelve clusters that were applied to the data, I identified topics that were centred on charming neighbourhoods, helpful hosts, food and culture, transport, essentials that were included in the stay, the negatives and more.
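To show how clustering turns word vectors into themes, here is a toy K-means example with scikit-learn. The two-dimensional vectors are hand-made stand-ins for real Word2Vec vectors (which would typically have a hundred or more dimensions), and two clusters are used instead of twelve to keep the example readable.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hand-made stand-ins for Word2Vec word vectors (illustrative values only).
words = ["host", "helpful", "friendly", "metro", "ferry", "tram"]
vectors = np.array([
    [0.9, 0.1], [1.0, 0.2], [0.8, 0.0],   # host-related words
    [0.1, 0.9], [0.0, 1.0], [0.2, 0.8],   # transport-related words
])

# Group the word vectors into two clusters; random_state fixes the result.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(vectors)

# Collect the words that landed in each cluster.
clusters = {}
for word, label in zip(words, kmeans.labels_):
    clusters.setdefault(int(label), []).append(word)

for label, members in sorted(clusters.items()):
    print(label, members)
```

The algorithm only returns numbered clusters. Giving each one a name such as "helpful host" or "transport" is the subjective step, done by reading the words that fall into it.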

Examples of comments and the associated themes can be found below:

As you can see, the first comment is completely negative hence why it was the only theme that was identified. The second example shows that several themes were found in the associated comment. I’m not sure if the comment contained anything that could be identified as an “authentic, charming, beautiful neighbourhood”, especially with the loud music being reported. However I did see themes of good location, helpful host, accommodation essentials as well as things to be aware of in the review commentary. I’ll leave you to be the judge of whether or not the model did an overall good job in identifying the key themes from that comment.

We can also get a sense of how well the Word2Vec models performed on the review commentary. When looking for the most similar terms to ‘disgusting’, it came back with (in descending order of similarity):

  • dirty
  • dust
  • filthy
  • hairs
  • mould
  • stained
  • dirt
  • stains
  • toilet_seat
  • dusty

And for the most similar terms to ‘culture’, we get:

  • history
  • turkish_culture
  • learn
  • discussions
  • local_culture
  • dubai
  • knowledge
  • insight
  • tulum
  • knows
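Rankings like the two lists above usually come from comparing word vectors by cosine similarity, which is how Word2Vec libraries such as gensim rank nearest neighbours. The sketch below reproduces that idea with NumPy on a few made-up three-dimensional vectors; the `embeddings` dictionary and the `most_similar` helper are hypothetical and are not taken from the analysis.

```python
import numpy as np

# Made-up word vectors; real Word2Vec vectors would be learned from the reviews.
embeddings = {
    "disgusting": np.array([0.9, 0.1, 0.0]),
    "dirty": np.array([0.8, 0.2, 0.1]),
    "filthy": np.array([0.85, 0.05, 0.0]),
    "culture": np.array([0.0, 0.9, 0.3]),
    "history": np.array([0.1, 0.8, 0.4]),
}

def most_similar(term: str, top_n: int = 3) -> list:
    """Rank the other words by cosine similarity to the given term."""
    query = embeddings[term]
    scores = []
    for other, vec in embeddings.items():
        if other == term:
            continue
        cosine = float(np.dot(query, vec) / (np.linalg.norm(query) * np.linalg.norm(vec)))
        scores.append((other, cosine))
    # Highest similarity first.
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:top_n]

print(most_similar("disgusting"))
print(most_similar("culture"))
```

Because the ranking depends only on how words are used in the training text, unexpected neighbours can appear when a word happens to share contexts with unrelated terms in the reviews.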

Notice that variations of the same word were returned. To improve the results, I could look at stemming the words during the cleaning process and seeing how much of a difference it makes to the language models. I’ll leave that for another time.

When reviewing the results of the model, there were instances where the model incorrectly identified negative sentiment in the comments. Similarly, as we have seen above, there were instances where the model assigned topics to comments that weren’t really evidenced in the text. There are changes (such as stemming words and creating a custom stopword list) that can be made during the modelling process to help correct for these, so any future iterations would need to improve this aspect of the modelling.

Overall, the models did a pretty good job of identifying key themes throughout the reviews. The analysis could have been developed further to identify which themes appear most often together. We could have even explored the themes within the descriptions of accommodations as written by the host responsible for the listing and then compared these to what the guest had written in their reviews. We could also group the review topics for each listing and provide a summarised description of the property as told by past guests. In a practical setting, a user would then be able to filter listings by the themes from previous reviews that matter to them when choosing accommodation using Airbnb.

Conclusion

The Airbnb dataset used for this analysis presented opportunities to answer plenty of intriguing business questions that any data analyst or data scientist would have much fun with. I only picked a few to go through so that I could show the reader how diverse datasets could be used to reveal insight and even help identify new product features for a website or a web app, such as the ability to identify preferred Airbnb accommodation based on key themes that were identified in previous guest reviews.

If you’re thinking of visiting Istanbul in the near future, then I hope that this analysis has provided some insight that will help you decide when and where to stay during your trip. And if you’re not thinking of visiting Istanbul, then you should. It’s a great place to go!

Articles & Resources that were Helpful in Completing the Analysis

In no particular order:
