In my spare time I manage a portfolio of rental properties in Cape Town and host them on www.airbnb.com, and I would like to use data to assist with marketing them. This project aims to uncover the factors that influence search results on the Airbnb site. SEO has received much attention with the rise of search engines, and this project asks the equivalent question for Airbnb: what drives its search results? Hopefully we can work out what an Airbnb host can do to achieve a better rank. As a host you’d like your property to appear as close to the top as possible, as this leads to more bookings and more money.
The data used is scraped from the Airbnb site itself, specifically in Cape Town where the properties I host are located. What follows is an analysis of the short-term rental market in Cape Town on www.airbnb.com, specifically for rental properties that can accommodate 6 or more people (as this is what the properties I manage offer).
1. Use automated software to download/scrape data from www.airbnb.com
2. Organise the data into a database
3. Analyse what aspects of hosting correlate with listings ranking higher in search results
4. Identify what a host can do to rank higher in the search results
Getting an initial data set
Airbnb’s database is not freely available and Airbnb does not provide an API that allows easy access to it. We could get the data by visiting each page and copying the information manually, but this would take a very long time if we want a sample large enough to analyse.
Instead of manually going to every page to get the data we can make use of automated software to do some web scraping. This will automatically download data from their site to later process into a database. To get the initial data for the listings I set up an Extractor on import.io to scan the search page from www.airbnb.com and scrape the following data (highlighted in the picture):
· Page number of the search results
· Listing name
· Listing price
· Link to actual listing
· Number of reviews
· Number of beds
Note that before scraping this information I had set the search criteria to only include listings that could accommodate 6+ people. I also set the map to include pockets of greater Cape Town to limit the number of search results. What import.io allows you to do is create your own API and then feed other URLs into the API that will grab the same data for you from those pages.
Each search results page contains only 18 properties, but the bottom of the page shows that there are 17 pages of results. The import.io software allows you to add multiple URLs to the request when running the Extractor. All we need to do is replicate the URL for page 1 and feed in the URLs for pages 2–17. I did this by manipulating the string in Excel to create the URLs for all 17 pages of results.
Below is an example of a URL for page 1 (highlighted in bold):
You’ll notice that the URL contains the filters we require for the search: 6 adults as well as the latitude and longitude of the search area on the map.
To get the other 16 URLs I did some string manipulation in Excel that replaced the “1” with the numbers 2–17:
The above formula searches through the URL, takes everything to the left of “page=”, adds the page number (in column C) and then appends the rest of the URL. We can then Autofill down 17 rows and we have our 17 URLs to feed into import.io.
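The same substitution could also be done in a few lines of Python. This is only a sketch of the idea, not the Excel formula I used: the example URL and its parameters other than `page` are made up for illustration, and the function name `page_urls` is my own.

```python
import re

def page_urls(first_page_url, last_page):
    """Build one URL per results page by swapping the number after 'page='."""
    urls = []
    for page in range(1, last_page + 1):
        # Only touch a query parameter that is literally named "page".
        url = re.sub(r"(?<=[?&])page=\d+", f"page={page}", first_page_url)
        urls.append(url)
    return urls

# Hypothetical URL: the real one also carries the map co-ordinates.
EXAMPLE_URL = "https://www.airbnb.com/s/Cape-Town?adults=6&page=1&zoom=12"

urls = page_urls(EXAMPLE_URL, 17)
for url in urls[:3]:
    print(url)
```

The resulting list can then be pasted into import.io in the same way as the Autofilled Excel column.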
To get more in-depth data, a more intensive way of scraping was needed. Some data isn’t obviously available on Airbnb’s site and is hidden away in the HTML and JSON of each listing page. This time I used Python to run a web spider that scraped each listing’s data from the same points on the map. To do this I set up a web scraping spider using the Python library Scrapy.
The method I used was adapted code from an incredibly helpful resource by Luca Verginer, http://www.verginer.eu/blog/web-scraping-airbnb/. The key data I extracted from each listing can be seen within the code to parse the listing data:
The data parsed from the JSON array within the listing page provides much deeper insight into the listing and more pieces of data to analyse.
Again, we had to run the spider across multiple areas to get all the listings within the suburbs of greater Cape Town. The code is adapted to add a suffix to the URL to only get listings for 6 guests, as well as that area on the map (the GPS co-ordinates). This code loops over all the pages that hold the results of the search:
Put it all together
In a few minutes, I had several CSV files, each with around 300 listings, from the import.io extractors. To join the files into one master file with all the listings I ran a VBA script that combines multiple sheets into one sheet.
We now have a master file with around 3000 listings in greater Cape Town that can sleep 6 or more people. Since we moved the map to capture certain areas, the searches overlapped and the same listings appeared in multiple search results. Excel has a handy tool to remove duplicates, which left us with around 2000 listings.
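For anyone who prefers code to Excel, the de-duplication step could look roughly like the sketch below. It assumes the link to the listing is what identifies a duplicate; the column names (`name`, `link`) and the sample rows are invented for the example.

```python
import csv
import io

def read_rows(csv_text):
    """Parse CSV text into a list of dictionaries keyed by column name."""
    return list(csv.DictReader(io.StringIO(csv_text)))

def dedupe(rows, key="link"):
    """Keep the first row seen for each value of `key`, preserving order."""
    seen = set()
    unique = []
    for row in rows:
        if row[key] not in seen:
            seen.add(row[key])
            unique.append(row)
    return unique

# Two made-up extracts from overlapping map areas.
area_a = "name,link\nSea View Villa,/rooms/1\nCity Loft,/rooms/2\n"
area_b = "name,link\nCity Loft,/rooms/2\nMountain House,/rooms/3\n"

all_rows = read_rows(area_a) + read_rows(area_b)
unique_rows = dedupe(all_rows)
print(len(all_rows), "rows before,", len(unique_rows), "rows after")
```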
The more detailed Scrapy data was collected into a set of CSV files that had the same fields but in different orders. The data was loaded into Qlikview, which allowed us to use its SQL functionality to UNION the tables together into one big table, with the fields lined up correctly automatically.
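The same union can be sketched in Python, where reading each file by column name rather than by position takes care of the differing field orders. This is an illustration, not what Qlikview does internally; the field names and values are placeholders.

```python
import csv
import io

def union_csv(csv_texts, fieldnames):
    """Stack several CSV texts into one, writing columns in a fixed order."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    for text in csv_texts:
        # DictReader keys each value by its header, so column order is irrelevant.
        for row in csv.DictReader(io.StringIO(text)):
            writer.writerow({name: row.get(name, "") for name in fieldnames})
    return out.getvalue()

# Same fields, different order (made-up values).
file_a = "id,price\n1,100\n"
file_b = "price,id\n200,2\n"

combined = union_csv([file_a, file_b], ["id", "price"])
print(combined)
```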
One of the fields, Amenities, was a list of all the amenity codes a property had. By separating the list into separate fields in Excel and creating a CrossTable in Qlikview we created a further table of amenities by property, along with their descriptions. The descriptions of the 60 amenities came from trawling through the HTML code in a very manual way. Unfortunate, but at least we only had to do it once!
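As a rough illustration of what the CrossTable step produces, the sketch below turns one amenities list per property into one row per property and amenity. It assumes the codes arrive as a comma-separated string; the codes and descriptions shown are placeholders, not Airbnb's real ones.

```python
def explode_amenities(listings, descriptions):
    """Return one (listing id, amenity code, description) row per amenity."""
    rows = []
    for listing in listings:
        for code in listing["amenities"].split(","):
            code = code.strip()
            if code:  # skip empty entries, e.g. a listing with no amenities
                rows.append((listing["id"], code, descriptions.get(code, "unknown")))
    return rows

# Placeholder codes and descriptions for illustration only.
descriptions = {"1": "TV", "4": "Wireless Internet", "8": "Kitchen"}
listings = [
    {"id": "A", "amenities": "1, 4"},
    {"id": "B", "amenities": "4,8,99"},
    {"id": "C", "amenities": ""},
]

amenity_rows = explode_amenities(listings, descriptions)
for row in amenity_rows:
    print(row)
```

With the data in this long format, counting how many listings on each results page offer a given amenity becomes a simple group-and-count.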
The last 2 pieces of data were a list of Cape Town’s suburb names from Wikipedia and a file containing a list of first names and whether each is a male or female name. The names file came from a German site, and the data in the available zip file was all I needed to classify the genders of the Airbnb hosts.
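The gender classification is essentially a lookup on the host's first name. A minimal sketch, assuming the names file has been loaded into a dictionary; the names below are examples and the function name is my own:

```python
def classify_host(host_name, gender_by_name):
    """Look up the first word of the host's name; 'unknown' if it is not listed."""
    words = host_name.split()
    if not words:
        return "unknown"
    return gender_by_name.get(words[0].lower(), "unknown")

# Tiny stand-in for the first-names file.
gender_by_name = {"anna": "female", "pieter": "male"}

hosts = ["Anna", "Pieter Smith", "Cape Stays Ltd", ""]
genders = [classify_host(name, gender_by_name) for name in hosts]
print(list(zip(hosts, genders)))
```

Hosts that are businesses or couples will mostly fall into "unknown", which is worth keeping in mind when reading the male-to-female comparison.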
We now had the 5 tables of data:
1. Top level “io Data” with the listing name, number of beds/guests and the results page it came from
2. Deeper “Scrapy Data” specific to the host and the property including all the property’s amenities
3. A table of amenities by property which we could use to further analyse the data
4. Suburb Data to check whether listings that mention a suburb in their name fared better
5. Gender Data to see whether male or female hosts fared better
Describing the results
When looking at the results we are looking for correlation between search result page and the variable being tested. We used the average of each variable per page, and a good way to interpret the table below is to say: “The average Page 1 listing has a guest satisfaction score of 83.7%”. We will cover the results in more detail later in the report, but perhaps unsurprisingly the factor most strongly correlated with search rank is the Guest Satisfaction score, which is calculated once a guest completes a review for a listing.
To interpret the results, we are looking for correlation (both positive and negative) with result page. As seen in the table below, as page number increases, average guest satisfaction decreases.
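The per-page averages are simple to reproduce in code. The sketch below groups listings by results page and averages one variable; the field names and scores are invented and are not the figures from my data.

```python
from collections import defaultdict
from statistics import mean

def average_by_page(listings, field):
    """Average `field` over all listings that share a results page."""
    values = defaultdict(list)
    for listing in listings:
        values[listing["page"]].append(listing[field])
    return {page: mean(vals) for page, vals in sorted(values.items())}

# Invented sample: three listings across two results pages.
sample = [
    {"page": 1, "satisfaction": 90.0},
    {"page": 1, "satisfaction": 80.0},
    {"page": 2, "satisfaction": 70.0},
]

averages = average_by_page(sample, "satisfaction")
print(averages)
```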
A word about Correlation vs Causation
Some of the results are intuitive and make sense, and some may be surprising. One of the factors highly correlated to page rank is the number of words a listing has in its description. This may well be something Airbnb uses in its ranking algorithm, or it may be that hosts who write wordy, descriptive listings are more conscientious about all aspects of hosting and therefore perform better. There is no way of knowing exactly what goes into the ranking algorithm, but we can give a good indication of which factors tend to go hand in hand with higher-ranking properties. We must be careful not to confuse correlation with causation.
The factors most correlated to page rank
The table below shows the 5 factors that are most correlated to page rank. We can clearly see the trends in the graphs:
Things to note:
· Guest satisfaction score (from guest reviews) is understandably the most correlated factor.
· Price: anecdotally from my experience, Airbnb has been recommending lower and lower prices as suggested prices. Airbnb wants to offer the best deal to its users so lower prices mean a better rank.
· Word count: as described above, this may be a factor that Airbnb values or may be that wordy descriptions are a characteristic of more conscientious hosts who score well elsewhere too.
· Minimum stay length: the shorter a host’s minimum stay requirement, the higher the listing seems to rank. Perhaps shorter stays get more bookings and therefore score better on other factors that influence rank.
· Days since calendar updated: the more active a host is in updating the calendar, the better the rank of the property. Unsurprisingly, Airbnb rewards active hosts.
Things to note:
· Price/Bed: since we have details on the number of beds we can work out price/bed. Again, cheaper listings rank higher.
· Name Length: this field can only be 50 characters long but listings with more words (average 5) seem to rank higher. Again, this may be due to other factors.
· Is Instant Book: Instant Book listings perform better. See below for a more detailed analysis.
· Reviews: having more reviews is correlated with ranking higher.
· Times Saved to Wishlist: one listing on page 5 was removed from this set as an outlier. It had been saved to wish lists over 22 000 times and skewed the results. (It must have made it onto Airbnb’s featured page or onto some other site that gets major traffic.)
Airbnb’s changing stance on Instant Book
Having a listing set to Instant Book (where a host allows potential guests to book without their approval) is correlated with having a higher search result in this data set. This wasn’t always the case… about a year ago, I did some similar research looking for correlation between Instant Book listings and search rank. Below shows how the rank algorithm seems to have changed over the last year:
Clearly there is no real correlation in the 2016 data (0.14 correlation coefficient) but in the 2017 data we can see that listings that have Instant Book enabled tend to appear higher in the search results.
This may be part of Airbnb’s drive to compete with the hotel industry and their stance that hosts should not discriminate against potential guests (by not accepting certain bookings as hosts without Instant Book can do).
From Airbnb’s “Work to fight discrimination and Build Inclusion Report” — Sept 2016:
One Million Instant Book Listings
Instant Book allows certain listings to be booked immediately — without prior host approval
of a specific guest. To achieve these goals, Airbnb will accelerate the use of Instant Book
with a goal of one million listings bookable via Instant Book by January 2017.
More importantly, Instant Book reduces the potential for bias because hosts automatically accept guests who meet these objective custom settings they have put in place. Airbnb has already worked to increase the number of Instant Book listings, which has more than doubled in the past year.
The correlation of all the factors tested
To quickly see which factors are most correlated with search rank we can look at the statistical correlation instead of interpreting the graphs as we did above. Below is a table which shows the correlation coefficients of each factor I tested:
Note: This table shows the absolute correlation coefficient from 0 to 1, with 1 being most correlated. I converted the inversely correlated factors (negative coefficients) to positive values for easier interpretation.
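For readers who want to reproduce this kind of table, here is a minimal sketch of a correlation coefficient between page number and one factor, with the sign dropped as in the note above. It assumes the standard Pearson coefficient, and the numbers are invented rather than taken from my results.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Invented averages: satisfaction falls as the page number rises.
pages = [1, 2, 3, 4, 5]
scores = [90.0, 88.0, 85.0, 84.0, 80.0]

r = pearson(pages, scores)
abs_r = abs(r)  # drop the sign so all factors sit on the same 0 to 1 scale
print(f"r = {r:.2f}, absolute = {abs_r:.2f}")
```

Dropping the sign makes factors easy to rank against each other, at the cost of hiding whether a factor rises or falls with page number, so it helps to read the table alongside the graphs.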
Things to note:
· Being a SuperHost doesn’t seem to make as big a difference as one would think. It ranked 13th most correlated to page rank.
· Airbnb hosting businesses and hosts with multiple listings aren’t correlated to higher search results, nor are listings that are Business Ready.
· The ratio of male-to-female hosts didn’t correlate to search results. This was admittedly a long shot! However, there are almost twice as many female hosts as male hosts in this data set.
· Smoking and pet friendly properties didn’t seem to negatively impact search results.
· Having the suburb name or the base word “view” in the title didn’t correlate with search rank.
· Age of a host’s account (how long they have used Airbnb) didn’t correlate with search rank.
The correlation of amenities to search rank
In the same way that we tested the factors that are correlated to search result rank we can also test the correlation of amenities. Below is a table that ranks the correlation of amenities with search rank:
Things to note:
· Offering breakfast is not correlated to higher search results. It may be a factor for listings that sleep 1 or 2 but not for those accommodating 6 or more.
· Having a TV and having Cable/Satellite TV does not correlate to higher search ranks.
· Business Ready required amenities are more highly correlated to higher search ranks.
· More listings have wireless internet (93%) as an amenity than Internet (57.6%). I can’t explain this. Perhaps some hosts don’t realise that Wi-Fi requires an internet connection in the first place?
Based on this research analysing the correlation of host variables with search page rank, there are several easy things a host can do to get a better search rank:
1. Keep your calendar updated
2. Ask guests to complete a review
3. Lower prices
4. Lower the minimum stay length
5. Enable Instant Book
6. Respond quickly to requests
7. Have the following amenities:
d. Hair Dryer
f. Internet — and list it as internet not only as wireless internet.
The results of this research may be skewed because the data set only includes listings that accommodate 6 or more people. They may also differ from results in other cities/countries. Further research could apply the same methodology to all listings, regardless of how many people a listing can accommodate, as well as to different cities/countries.
About the Author
This research project was completed by Nicholas Child. Any views and/or opinions are strictly his own and do not represent those of Airbnb.com.