Exploring the Data and Building a Proof of Concept

Kai Lu
Aug 27 · 6 min read
Taken on Cancún Trip

First off, I would like to apologize for the lateness of this post. I’ve been putting this project off for a while to focus on work. Since I ended work last week, I wanted to give an update on where I got to in this project. -Kai

The Problem

For those who stumbled upon this post by accident, you may be wondering what problem I am trying to solve. If you have 7 minutes, I recommend checking out my first post of this series:

If not, here’s a quick rundown:

I’ve found that planning an Airbnb trip is a time-consuming process with not much direction after we filter down by constraints such as the number of bedrooms or number of guests we need to accommodate. We often have to rely on pictures and looking through many pages of listings to find the best home for our trip.

My solution is to build a holistic score based on features indicative of a good Airbnb experience and weight them based on user preferences to rank listings from first to last. This way, we can easily choose the top listings and have a much smaller sample to choose from, thus expediting the process.

Initial Concerns

My biggest concern with this problem is that I have no target or source of truth to optimize for. It is not the same as a classic supervised machine learning problem, where the objective is clear (e.g. minimizing residuals for a continuous response).

Here, I would not know the performance of my model as there is no single, best ranking for homes. It is simply impossible to find a source of truth as preferences differ from person to person. My best bet here is to provide an easily interpretable model where the user can simply input their preferences on several categories of metrics and then the listings will be ranked by the weightings of those specifications. The user should be able to quickly understand what the holistic score is composed of.

Cancún as a Case Study

Since I have a data point I can reference in Cancún (from a spring break trip), I did my proof of concept for this location. Here is a brief walkthrough of how I collected and cleaned the data:

Data Collection

One thing I noticed when attempting to scrape Airbnb listings is that each distinct filter will only return ~300 listings, or ~17 pages of listings.

This meant that I would need to run my spider multiple times with different filters to get as close to the full population as possible. Since my spider can take price arguments, I wrote a simple bash script that loops over the price range (20, 990) in steps of 10 and runs the spider in each iteration. I ran this 5 times, with each run averaging around 2 hours, then concatenated all the JSON files and removed the duplicate listings by listing_key. I ended up with 6673 distinct home listings, which I thought was good enough to work with as an initial sample for a proof of concept.
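To make the loop-and-merge idea concrete, here is a minimal Python sketch. It is not the actual bash script or notebook code. The function names are hypothetical, the price windows are only one plausible reading of "steps of 10", and it assumes each scraped file holds a JSON array of listing objects that carry a listing_key field.

```python
import json
from pathlib import Path


def price_windows(low=20, high=990, step=10):
    """Yield (price_min, price_max) pairs that cover low..high in fixed steps."""
    for start in range(low, high, step):
        yield start, min(start + step, high)


def load_records(json_dir):
    """Read every *.json file in json_dir; each file is assumed to be a JSON array."""
    records = []
    for path in sorted(Path(json_dir).glob("*.json")):
        with path.open(encoding="utf-8") as fh:
            records.extend(json.load(fh))
    return records


def dedupe_listings(records, key="listing_key"):
    """Keep the first record seen for each key, preserving the original order."""
    seen = {}
    for record in records:
        seen.setdefault(record[key], record)
    return list(seen.values())


if __name__ == "__main__":
    # Each window would be handed to one spider run as its price arguments.
    for price_min, price_max in list(price_windows())[:3]:
        print(price_min, price_max)
```

Keeping the first record per key is an arbitrary choice. If the same listing changed between runs, keeping the most recent scrape could be more appropriate.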

Data Cleaning

I started off with 37 columns at the listing grain. After removing columns that had low variance, a high share of missing values, or little to say about a good Airbnb experience, I ended up with 21 columns. I also believed it was important to split my variables into filter and quality categories (I made these names up).
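A column-pruning pass along these lines can be sketched as follows. The thresholds and the prune_columns helper are assumptions for illustration, not the values used in the project, and "low variance" is approximated here by counting distinct non-missing values. Judging whether a column is informative about a good stay remains a manual call.

```python
def prune_columns(rows, max_missing=0.5, min_distinct=2):
    """Return the column names worth keeping.

    rows: list of dicts, one per listing; None marks a missing value.
    max_missing: drop a column if more than this share of rows is missing.
    min_distinct: drop a column with fewer distinct non-missing values.
    """
    if not rows:
        return []
    columns = sorted({col for row in rows for col in row})
    kept = []
    for col in columns:
        present = [row[col] for row in rows if row.get(col) is not None]
        missing_share = 1 - len(present) / len(rows)
        # repr() lets unhashable values such as lists be compared too.
        distinct = len({repr(value) for value in present})
        if missing_share <= max_missing and distinct >= min_distinct:
            kept.append(col)
    return kept


if __name__ == "__main__":
    sample = [
        {"price": 100, "city": "Cancún", "notes": None},
        {"price": 150, "city": "Cancún", "notes": None},
        {"price": 120, "city": "Cancún", "notes": "near beach"},
    ]
    print(prune_columns(sample))
```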

Filter variables refer to fields pre-specified by the user to exclude listings that do not meet the trip’s requirements. Some examples of filter variables are number of beds, number of bathrooms and price. This is similar to Airbnb’s current user experience. Here are the filter variables I chose:

  • number of bathrooms
  • number of bedrooms
  • number of beds
  • person capacity
  • room type category
  • price
  • is superhost
  • can instantly book
  • is new listing
  • is fully refundable
  • latitude
  • longitude
Airbnb Filter UI

Quality variables refer to fields that are indicative of a good Airbnb experience. This is where this project shines, since it adds to Airbnb’s current user experience rather than duplicating it. Some examples of quality variables are guest satisfaction, number of reviews and response rate. The idea is to create a holistic score for each listing, with these variables as components that build up the score according to the user’s preferences. Here are the quality variables I chose:

  • guest satisfaction
  • number of listing reviews
  • number of host reviews
  • response rate
  • weekly price factor
  • monthly price factor

As a result, after the user has chosen values for the filter variables, the algorithm will output a list of homes from best to worst based on the holistic score built from the quality variables.

A Simplified Example

I wanted to provide a simple example of how I envision this project to work.

Step 1: Apply the Filter

Suppose I am looking for a home in Cancún for around 150–200 USD for 4 guests, meaning that I also need 4 beds. Let price, number of guests and number of beds be my only filters here. This returns 172 listings that satisfy these constraints.
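As a rough sketch, the filter step could look like this in Python. The field names (price, person_capacity, beds) and the apply_filters helper are hypothetical. Treating guests and beds as minimums rather than exact matches is also an assumption.

```python
def apply_filters(listings, price_min, price_max, guests, beds):
    """Keep listings inside the price band that fit the guests and beds needed."""
    return [
        listing
        for listing in listings
        if price_min <= listing["price"] <= price_max
        and listing["person_capacity"] >= guests
        and listing["beds"] >= beds
    ]


if __name__ == "__main__":
    # Made-up rows for illustration only, not scraped data.
    listings = [
        {"listing_key": "a", "price": 160, "person_capacity": 4, "beds": 4},
        {"listing_key": "b", "price": 250, "person_capacity": 6, "beds": 5},
        {"listing_key": "c", "price": 180, "person_capacity": 2, "beds": 1},
    ]
    matches = apply_filters(listings, 150, 200, guests=4, beds=4)
    print([listing["listing_key"] for listing in matches])
```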

Step 2: Apply the Weighting and Rank Listings

Since we only have 6 quality variables, I thought it would be easy enough to ask the user to rate the importance of each on an ordinal 1–5 scale, with 5 being most important, and then compute each variable’s weighting by dividing its score by the total allocated score.

Now suppose I am only looking to book a 4-day trip, so the weekly price factor and monthly price factor columns are irrelevant. Assume that I treat guest satisfaction, host reviews, listing reviews, and response rate as equally important, giving each of them the same weighting.
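The weighting arithmetic can be sketched in a few lines. The variable names and the normalize_weights helper are hypothetical. Leaving the two price factors out of the ratings entirely is one assumed way to mark them as irrelevant.

```python
def normalize_weights(ratings):
    """Turn importance ratings into weights that sum to 1."""
    total = sum(ratings.values())
    if total <= 0:
        raise ValueError("at least one variable needs a positive rating")
    return {name: score / total for name, score in ratings.items()}


if __name__ == "__main__":
    # Four variables rated equally; the price factors are simply omitted.
    ratings = {
        "guest_satisfaction": 3,
        "host_reviews": 3,
        "listing_reviews": 3,
        "response_rate": 3,
    }
    print(normalize_weights(ratings))
```

With four equal ratings each weight is 3 / 12 = 0.25, and any other equal rating gives the same result, since only the ratios matter.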

I can then compute the percentile rank of each listing on each quality variable and multiply it by that variable’s weighting. After summing up the percentile rank * weighting columns for each listing, I can order the listings from first to last, with the listing that has the highest score being the “holistically best” listing.
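Here is a small sketch of that scoring step, under a few stated assumptions. Percentile rank is taken to be the share of listings with a value less than or equal to the listing's own value, which is only one of several common definitions. Every variable is treated as higher-is-better, which may not hold for the price factors. The field names and helpers are hypothetical.

```python
def percentile_ranks(values):
    """Share of values that are <= each value, so the best listing gets 1.0."""
    n = len(values)
    return [sum(other <= value for other in values) / n for value in values]


def rank_listings(listings, weights, key="listing_key"):
    """Return (key, score) pairs sorted from highest to lowest holistic score."""
    scores = {listing[key]: 0.0 for listing in listings}
    for name, weight in weights.items():
        ranks = percentile_ranks([listing[name] for listing in listings])
        for listing, rank in zip(listings, ranks):
            scores[listing[key]] += weight * rank
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Made-up rows for illustration only, not scraped data.
    listings = [
        {"listing_key": "A", "guest_satisfaction": 100, "response_rate": 100},
        {"listing_key": "B", "guest_satisfaction": 90, "response_rate": 90},
        {"listing_key": "C", "guest_satisfaction": 80, "response_rate": 80},
    ]
    weights = {"guest_satisfaction": 0.5, "response_rate": 0.5}
    for listing_key, score in rank_listings(listings, weights):
        print(listing_key, round(score, 3))
```

Because every component is a percentile between 0 and 1 and the weights sum to 1, the holistic score also stays between 0 and 1, which keeps it easy to interpret.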

The Process

Now we have a ranking of listings from first to last, which concludes our proof of concept. Since I actually booked a listing with these constraints a few months ago, I’m curious how I did...

The listing I stayed in came in 3rd out of 172 listings, which is a bit surprising to me! It turns out that human intuition and deciding based on pictures already do quite well. To be fair, I had a great experience staying in that home and did enjoy the host’s quick responsiveness and the value that was provided. Perhaps the listing ranking algorithm for Airbnb is way better than I thought!

Still, I am having a lot of fun with this project so far. I hope to continue it and connect all my components together when I have more time in the Fall. The main problem I have currently, though, is the quality of the data I am getting, as collecting data through web scraping is not ideal. Maybe Airbnb will release its API to the public one day! :)

Thanks for reading!

The notebooks used for this project live here:

Analytics Vidhya

Analytics Vidhya is a community of Analytics and Data Science professionals. We are building the next-gen data science ecosystem https://www.analyticsvidhya.com

Written by

Kai Lu

Studying Mathematical Economics and Statistics at UPenn | Data Science Intern @Shopify
