Behind the Scenes: Airbnb Neighborhoods
By Andy Kramolisch & Ben Hughes
Let’s start with some background. Why did you guys focus on Neighborhoods and what is the Neighborhoods project?
Andy: Airbnb Neighborhoods was born out of the Snow White project, and our research showed that location is the number one criterion for Airbnb travelers when choosing a place to stay. Our goal with Neighborhoods is to help every Airbnb traveler figure out where in a city to stay and feel more connected to the local culture. We like to call this local intuition the sixth sense of traveling. Neighborhoods is also the successor to our previous company, NabeWise, which was acquired by Airbnb in 2012. At NabeWise, we provided movers and travelers with a comprehensive neighborhood guide that chronicled 25 US cities. By working with Airbnb’s passionate and international community of hosts and guests, we are able to evolve our product and offer truly global, yet locally nuanced, solutions for enhancing aspirational travel.
How did you guys handle all of the data?
Andy: Neighborhoods itself doesn’t deal with insane amounts of data. Instead we were able to offload the hard work to external services. One such service, Glop (Genome Location Pipeline), regularly associates our data with the neighborhood in which it occurs. It goes something like this:
- Airbnb produces lots of data each day: tracking reservations, new users, new listings, etc.
- Glop is scheduled by Chronos to run.
- Glop churns through all this new data, ignoring anything that isn’t associated with a location (i.e., anything without a latitude and longitude).
- Glop looks up each (latitude, longitude) pair to see if we have neighborhood coverage there.
- If Glop sees that something is in a neighborhood we cover, it will then dump that information to flat files and Memcached.
For example, say you list your place, which is located at (45.43647998034738, 12.333568650219718). The next time Glop runs, it will correctly identify your listing as being in San Marco. Glop looks something like this:
[Figure: diagram of the Glop pipeline]
To capture neighborhood boundaries, we also built a custom browser-based system for creating the neighborhood geometry. Zack Walker, our cartographer, works with this system to map out each city in very fine-grained detail. We’re then able to really play with the geometry and pass it through various filters before importing it into the front-facing app. By the time the front end gets the underlying data, it is relatively small and manageable.
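As a rough illustration of the filter-and-lookup steps described above, here is a minimal Python sketch. It is not Glop itself, which runs as a Clojure/Cascalog job on Hadoop. The names `NEIGHBORHOODS`, `contains`, `locate`, and `run_glop`, the record fields `lat` and `lng`, and the crude rectangle standing in for San Marco are all assumptions made for this example. The lookup uses a plain ray-casting point-in-polygon test, with points written in (latitude, longitude) order.

```python
from typing import Iterable, Optional

Point = tuple[float, float]  # (latitude, longitude)

# Hypothetical boundary data: neighborhood name -> polygon vertices.
# This rectangle is a crude stand-in, not a real neighborhood boundary.
NEIGHBORHOODS: dict[str, list[Point]] = {
    "San Marco": [
        (45.430, 12.328), (45.430, 12.342),
        (45.440, 12.342), (45.440, 12.328),
    ],
}


def contains(polygon: list[Point], point: Point) -> bool:
    """Ray-casting test: count polygon edges crossed by a ray from the point."""
    lat, lng = point
    inside = False
    j = len(polygon) - 1
    for i in range(len(polygon)):
        lat_i, lng_i = polygon[i]
        lat_j, lng_j = polygon[j]
        # Only edges that straddle the point's latitude can be crossed.
        if (lat_i > lat) != (lat_j > lat):
            # Longitude at which this edge crosses the point's latitude.
            cross = lng_i + (lat - lat_i) * (lng_j - lng_i) / (lat_j - lat_i)
            if lng < cross:
                inside = not inside
        j = i
    return inside


def locate(point: Point) -> Optional[str]:
    """Return the covering neighborhood's name, or None if there is no coverage."""
    for name, polygon in NEIGHBORHOODS.items():
        if contains(polygon, point):
            return name
    return None


def run_glop(records: Iterable[dict]) -> dict[str, list[dict]]:
    """Group located records by neighborhood, skipping everything else."""
    by_neighborhood: dict[str, list[dict]] = {}
    for record in records:
        lat, lng = record.get("lat"), record.get("lng")
        if lat is None or lng is None:
            continue  # no location attached: ignore
        name = locate((lat, lng))
        if name is None:
            continue  # outside the neighborhoods we cover
        by_neighborhood.setdefault(name, []).append(record)
    return by_neighborhood
```

In a production setting the containment check would more likely be delegated to a spatial index or to PostGIS than written by hand; the loop above only shows the shape of the logic.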
What was the stack?
Andy: There are actually quite a few components, among which are:
Neighborhoods, the App
Server Side
* Rails 3.2
* PostgreSQL/PostGIS
* Memcached
Client Side
* CoffeeScript
* Sass
* jQuery
* Handlebars
* Backbone
* Underscore
Neighborhoods, the API
Server Side
* Sinatra
* PostgreSQL/PostGIS
The Neighborhood Tool
Server Side
* Rails 3.2
* PostGIS
* nsync for data versioning
Neighborhoods, the Data Pipeline
Server Side (EMR)
* Clojure
* Java
* Hadoop
* Memcached
* Cascalog
What was your biggest challenge building Neighborhoods?
Andy: My biggest challenge was engineering the best way to give the front-facing Rails app (henceforth neighborhoods-core) access to all the data produced by the pipeline. Neighborhoods-core reads that data to personalize pages and produce the community visualization. What we needed was a solution that could look up resources by city or neighborhood. We also wanted our solution to be fast. Very fast. The “resources” we needed to fetch are de-normalized tuples representing a variety of types of data. A single resource tuple could represent a reservation, a listing or even a user.
At first, it seemed we wanted a SQL database, as our data had relations. However, this was ruled out based on the need for mass updates. Next, we looked at an in-house NoSQL solution that we call Dyson. Dyson seemed to give us the flexibility we needed with writes and updates, so we tried it. For reference, Dyson is backed by Amazon’s DynamoDB, a reliable but limited managed NoSQL solution. In essence, if we put the data right into DynamoDB, then Dyson can serve it. This led to the creation of a DynamoDB Cascading tap. Countless timeouts, headaches and late nights later, we had a working solution.
However, there was a problem, namely DynamoDB’s 65KB limit on item size. When you’re storing uncompressed JSON, that’s a pretty easy target to reach. As a band-aid, we engineered a solution involving pages of tuples. To say this solution was sub-optimal is putting it mildly, and the performance was even worse. With launch quickly approaching, brilliant words saved the day: “You don’t need a database, you need a [expletive deleted] cache” 1. So that’s what we did: we traded our database for a cache. Specifically, we switched from Dyson to Memcached. How does this story end? 35ms response times.
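To make the cache approach concrete, here is a minimal Python sketch of looking up de-normalized tuples by city or neighborhood. It is an illustration under stated assumptions, not the team's implementation: `FakeCache` is an in-memory stand-in for a Memcached client, and the key format, the function names `cache_key`, `publish`, and `fetch`, and the tuple fields are all hypothetical.

```python
import json
from collections import defaultdict


class FakeCache:
    """In-memory stand-in for a Memcached client (get/set only)."""

    def __init__(self):
        self._store = {}

    def set(self, key, value):
        self._store[key] = value

    def get(self, key):
        return self._store.get(key)


def cache_key(scope: str, name: str) -> str:
    """Build a key such as 'resources:city:venice' (hypothetical format)."""
    # Memcached keys may not contain whitespace, so normalize the name.
    return f"resources:{scope}:{name.lower().replace(' ', '_')}"


def publish(cache, tuples) -> None:
    """Pipeline side: group tuples by scope and overwrite each key wholesale."""
    grouped = defaultdict(list)
    for t in tuples:
        grouped[cache_key("city", t["city"])].append(t)
        # Assumes neighborhood names are unique; real keys would likely
        # need the city as well to avoid collisions.
        grouped[cache_key("neighborhood", t["neighborhood"])].append(t)
    for key, group in grouped.items():
        cache.set(key, json.dumps(group))  # a mass update is one write per key


def fetch(cache, scope: str, name: str) -> list:
    """App side: a single key lookup, returning an empty list on a miss."""
    raw = cache.get(cache_key(scope, name))
    return json.loads(raw) if raw is not None else []
```

Because each pipeline run rewrites whole keys, the mass updates that ruled out SQL become simple overwrites, and a page needs only one read per city or neighborhood. The trade-off is that a cache gives no durability guarantee, so this pattern relies on the pipeline being able to regenerate the data.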
Ben: My biggest challenge was setting up the neighborhood page layout tools. We needed tools in place to allow our content editors, translators, and photographers to begin work before we were even close to final designs. We also realized pretty early on that we needed to allow considerable flexibility in how pages would be laid out. Additionally, this tool had to be easy to use so that it wouldn’t waste our content editors’ time, as that was the limiting factor in whether we would be able to ship. I ended up creating a drag-and-drop page creation system that could import images from Finder, iPhoto, or any other photo viewing software our people were using. Once imported, images could be edited, reflowed, and captioned on the page. We also ran into a ton of issues because all of our photos are very high resolution and took a while to process. To speed things up, I wrote a high-performance image processing server in Clojure that essentially injected itself as a proxy in front of the Rails image upload endpoint. Unfortunately, certain images came out with fairly bad quality issues that didn’t occur when processing with ImageMagick, so we were never able to fully roll it out.
What did you guys learn?
Ben: It’s important to consider performance from day one. On this project, we kept New Relic Development Mode open in a separate tab at basically all points during development. This allowed us to constantly monitor what our app was actually doing, rather than hoping we had written fast code and then trying to bolt speed on at the last second. We also made our app Akamai-friendly from the start, so static page caching was just a matter of setting the right headers.
You can check out Neighborhoods here: airbnb.com/neighborhoods
1 The brilliant man responsible for this observation is one Davide Cerri.
Check out all of our open source projects over at airbnb.io and follow us on Twitter: @AirbnbEng + @AirbnbData
Originally published at nerds.airbnb.com on April 10, 2013.