The purpose of this post is to give an initial overview of the problem I will be trying to solve in the next weeks as well as the tools and methods I would like to use to do so. I will try and separate my project into several bite-sized posts that are easy to digest. At this moment, the project is not finished and I would love any feedback and advice!
I remember browsing through Airbnb listings in March in preparation for a Spring Break trip with some friends. It was a time-consuming process as it was difficult to pick the “optimal” listing from 17 pages of listings where the average rating was 4.7. Though we knew what price range we expected to pay, it ended up being quite a frustrating process to reach the final decision. If I recall correctly, it went something like this:
- Compile a list of “promising” listings on a Google Doc
- Plan a time and location to meet (which is way more difficult than it seems)
- Argue over which listing was the “best”
- Decide that it is probably better to narrow down the list than to argue
- Repeat the previous two steps
You get the idea. It took us ages to reach an agreement, and we spent much of our time comparing listings picture by picture, which is subjective and at times inconsistent (some listings may have hired professional photographers).
What if we could create a model that, after considering our budget and logistical constraints (e.g. price, # of bedrooms, etc.), ranks listings according to a holistic score composed of key metrics that we all care about?
Now a tool like that would speed up our decision process by quite a bit! At the very least, it would provide us with an already curated and filtered list to have a polite and civil discourse about!
1. Collecting the Data
Before we can build a model, it seems that we may need some data first. So let’s see if we can get our data through an API from Airbnb…
It looked to me that the only way to get our much needed data is through scraping it directly from the listings. Now, this turned out to be a more complicated task than I anticipated.
The solution he proposed was to use a plugin created by the Scrapy team that integrates nicely with Splash. This is what I used for my scrapy spider. Combined with scrapy, I was able to crawl Airbnb listings quickly and in a systematic way.
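For reference, here is roughly what wiring the Splash plugin into a project's settings.py looks like. This is a minimal sketch based on the scrapy-splash README, not a copy of my project's actual settings. The SPLASH_URL value is an assumption: it presumes a Splash instance running locally on port 8050.

```python
# settings.py (fragment): enabling the scrapy-splash plugin.
# Assumption: a Splash instance is reachable at this address.
SPLASH_URL = "http://localhost:8050"

# Middlewares that route requests through Splash and handle cookies.
DOWNLOADER_MIDDLEWARES = {
    "scrapy_splash.SplashCookiesMiddleware": 723,
    "scrapy_splash.SplashMiddleware": 725,
    "scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware": 810,
}

# Saves space by not storing duplicate Splash arguments.
SPIDER_MIDDLEWARES = {
    "scrapy_splash.SplashDeduplicateArgsMiddleware": 100,
}

# Duplicate filter that understands Splash requests.
DUPEFILTER_CLASS = "scrapy_splash.SplashAwareDupeFilter"
```

With these in place, the spider can issue Splash-rendered requests instead of plain ones, which is what makes JavaScript-heavy pages scrapable.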
Note: When web-scraping, it’s best to always obey the robots.txt file and to not hit the website too frequently. You can toggle these in the settings.py file in every Scrapy project.
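As a rough illustration, the polite-crawling options in settings.py might look like the fragment below. The setting names are standard Scrapy settings, but the specific values are assumptions chosen for illustration, not the ones I necessarily used.

```python
# settings.py (fragment): polite-crawling options.
# The numeric values are illustrative assumptions, tune them for your case.

# Respect the rules the site publishes in robots.txt.
ROBOTSTXT_OBEY = True

# Wait this many seconds between requests to the same site.
DOWNLOAD_DELAY = 3

# Limit how many requests hit one domain at the same time.
CONCURRENT_REQUESTS_PER_DOMAIN = 1

# Let Scrapy slow down automatically when the server responds slowly.
AUTOTHROTTLE_ENABLED = True
```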
I will be going in depth on building a scrapy spider to scrape Airbnb in a separate post, which is now here. Note that this is more of an introductory tutorial.
A Gentle Introduction to Using Scrapy to Crawl Airbnb Listings
You can also check out the spider (more complex) I made specifically for my project here:
Spider built with scrapy and ScrapySplash to crawl Airbnb listings - kailu3/airbnb_scraper
It’s important to note that web-scraping is not a reliable way of getting data. Since Airbnb changes the layout/architecture of its site frequently, I cannot guarantee that the spider will always be working but I’ll try my best.
2. Cleaning the Data & Building the Model
After concatenating my data files together from multiple scrapes and removing duplicate listings, I needed to do some data cleaning and preparation before the model building. This process included removing unnecessary, low-variance, high-correlation columns and dealing with missing values. I will go into more detail in my next post.
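To make the cleaning steps a bit more concrete, here is a small sketch in plain Python. Everything in it is a stand-in: the column names (listing_id, price, rating, is_listing), the sample rows, and the choice of median imputation are all assumptions for illustration, and it leaves out the high-correlation check entirely.

```python
from statistics import median, pvariance

def dedupe(rows, key="listing_id"):
    """Keep the first occurrence of each listing across scrapes."""
    seen = set()
    unique = []
    for row in rows:
        if row[key] not in seen:
            seen.add(row[key])
            unique.append(row)
    return unique

def fill_missing_with_median(rows, col):
    """Replace None in one column with the median of the observed values."""
    observed = [row[col] for row in rows if row[col] is not None]
    fill = median(observed)
    return [{**row, col: fill if row[col] is None else row[col]} for row in rows]

def drop_low_variance(rows, columns, threshold=1e-9):
    """Return only the numeric columns whose variance exceeds the threshold."""
    kept = []
    for col in columns:
        values = [row[col] for row in rows if row[col] is not None]
        if len(values) > 1 and pvariance(values) > threshold:
            kept.append(col)
    return kept

# Two hypothetical scrapes that overlap on listing 2.
scrape_a = [
    {"listing_id": 1, "price": 250, "rating": 4.8, "is_listing": 1},
    {"listing_id": 2, "price": 220, "rating": None, "is_listing": 1},
]
scrape_b = [
    {"listing_id": 2, "price": 220, "rating": None, "is_listing": 1},
    {"listing_id": 3, "price": 280, "rating": 4.6, "is_listing": 1},
]

rows = dedupe(scrape_a + scrape_b)                # concatenate, then dedupe
rows = fill_missing_with_median(rows, "rating")   # handle missing values
kept_columns = drop_low_variance(rows, ["price", "rating", "is_listing"])
```

Here is_listing is constant across all rows, so it carries no information and gets dropped, while price and rating survive.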
As for building the model, this is unlike a typical supervised machine learning problem. There is in fact no target variable to predict! Additionally, there is no way for me to measure the performance of this model (since we don’t have feedback from using it).
Even so, the basics of optimization still hold. Like any optimization problem, there are three basic elements:
- Variables: These are the parameters (e.g. price, # of reviews, rating) that I will use to build my holistic score.
- Constraints: These are the boundaries within which my variables/parameters need to stay (e.g. 200 < price < 300, bedrooms ≥ 3)
- Objective Function: This will be the function that needs to be minimized or maximized.
Since performance cannot be measured, I believe it is highly preferable to create an easily interpreted model. Right now, one of my ideas is to adopt a logical structure similar to how we set up maximization and minimization problems in Economics. This structure was heavily used in the Intermediate Microeconomics course I took last Fall. Essentially, we minimize/maximize the objective function under certain constraints, such as Income, to determine the optimal strategy (usually by solving the Lagrangian).
Consider this problem:
Suppose there are two goods, 1 and 2, and denote by xᵢ the quantity of good i = 1, 2 purchased for consumption. The utility function of our agent is U(x₁, x₂) = 16x₁⁴x₂⁸. If pᵢ is the price per unit of good i, this agent’s utility maximization problem is (where I is the agent’s Income):

maximize U(x₁, x₂) = 16x₁⁴x₂⁸ subject to p₁x₁ + p₂x₂ ≤ I
This problem can be solved using the method of Lagrange multipliers to obtain the agent’s optimal buying strategy, the one that maximizes his utility under the budget constraint. Though the math is interesting, I won’t be going over it in this post.
Likewise, for our problem of determining the best listings, we can define some utility function U(x₁, …, xₙ) for the party under multiple constraints.
Here the constraints simply act as filters, just as you can select filters on Airbnb’s website to narrow down the listings to your party’s needs. After creating a list of acceptable listings, we can then rank them in terms of their utility, which is determined by U(x₁, …, xₙ).
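The filter-then-rank idea can be sketched in a few lines of Python. The constraints reuse the example bounds from earlier (200 < price < 300, bedrooms ≥ 3). The listings are made up, and the utility function is a deliberate placeholder that just returns the rating, since the real U is exactly what I still need to design.

```python
def satisfies_constraints(listing):
    """Constraints act as hard filters, like the filters on Airbnb's site."""
    return 200 < listing["price"] < 300 and listing["bedrooms"] >= 3

def placeholder_utility(listing):
    """Stand-in for U(x1, ..., xn); the real function is still undecided."""
    return listing["rating"]

def rank_listings(listings, is_feasible, utility):
    """Drop listings that violate the constraints, then sort by utility."""
    feasible = [listing for listing in listings if is_feasible(listing)]
    return sorted(feasible, key=utility, reverse=True)

# Hypothetical listings for illustration only.
listings = [
    {"name": "A", "price": 250, "bedrooms": 3, "rating": 4.7},
    {"name": "B", "price": 320, "bedrooms": 4, "rating": 4.9},  # too expensive
    {"name": "C", "price": 280, "bedrooms": 3, "rating": 4.9},
    {"name": "D", "price": 210, "bedrooms": 2, "rating": 5.0},  # too few bedrooms
]

ranked = rank_listings(listings, satisfies_constraints, placeholder_utility)
ranked_names = [listing["name"] for listing in ranked]
```

Keeping the filter and the utility as separate functions means the scoring can be swapped out later without touching the constraint logic.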
Now, creating this utility function is the biggest challenge of this project. My initial thought is to create a normalized weighting of metrics/variables that are indicative of a better Airbnb experience. The criteria for this are still in the exploration phase. Some of the variables that easily come to mind include number of reviews and overall rating. The idea is to combine a hand-picked list of variables indicating success (a good experience) into a holistic score that I can rank listings by. This is a task I still need to put some more thought into over the next weeks. Any recommendations or advice are welcome!
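One possible shape for such a score is sketched below: min-max normalize each metric so they share a 0 to 1 scale, then take a weighted sum. This is only one option, not a settled design. The weights (0.4 and 0.6), the field names, and the sample listings are all assumptions made up for the example.

```python
def min_max_normalize(values):
    """Rescale values to the range [0, 1]; constant columns map to 0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def holistic_scores(listings, weights):
    """Weighted sum of normalized metrics, with weights rescaled to sum to 1."""
    total = sum(weights.values())
    normalized = {
        metric: min_max_normalize([listing[metric] for listing in listings])
        for metric in weights
    }
    return [
        sum(weights[m] / total * normalized[m][i] for m in weights)
        for i in range(len(listings))
    ]

# Hypothetical listings and weights for illustration only.
listings = [
    {"name": "A", "num_reviews": 10, "rating": 4.5},
    {"name": "B", "num_reviews": 110, "rating": 4.7},
    {"name": "C", "num_reviews": 60, "rating": 4.9},
]
weights = {"num_reviews": 0.4, "rating": 0.6}

scores = holistic_scores(listings, weights)
ranked_names = [
    listing["name"]
    for _, listing in sorted(
        zip(scores, listings), key=lambda pair: pair[0], reverse=True
    )
]
```

A weighted sum like this stays easy to interpret: each weight says directly how much a metric matters to the group, which fits the goal of an explainable model.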
3. Automating the Process
Automation! After creating the spider and the model to rank the listings, I would like to go from inputting my constraints to getting a ranked list of optimal listings in a single click. I would be interested to look into Flask to see how I can deploy the model on the front-end, and possibly even learn Airflow to manage my workflow. There is just so much to learn! More on this later.
I am super excited to be working on this project for the next few weeks! I plan to release the next two posts on a weekly basis. Currently, my plan looks like this:
(I) How to Scrape Airbnb Listings with Scrapy and Splash → July 8th [Done: July 13th]
(II) Data Cleaning and Model Creation → July 15th [In Progress]
(III) Automating the Whole Process → Hopefully by end of July!
That’s it! Thanks for reading and Happy Canada Day! 🇨🇦