Here at Airbnb, as you can probably imagine, we’re big fans of travel. We love thinking about the diversity of experiences our host community offers, and we spend a fair amount of time trying to make sense of the tens of thousands of cities where people are booking trips every night. If Apple has the iPad and iPhone, we have New York and Paris. And Kavajë, Außervillgraten, and Bli Bli. The tricky thing is, most of us haven’t been to Bli Bli. So we try to come up with creative ways to help people find the experience they’re looking for in places we know very little about.
The key to this is our search algorithm, a system that combines dozens of signals to surface the listings guests want.
In the early days, our approach was pretty straightforward. Lacking data or personal experience to guide an estimate of what people would want, we returned what we considered to be the highest quality set of listings within a certain radius from the center of wherever someone searched (as determined by Google).
This was a decent first step, and our community worked with it resiliently. However, for a company based in San Francisco, we didn’t have to look far to realize this wasn’t perfect. A general search for our city would return great listings but they were scattered randomly around town, in a variety of neighborhoods, or even outside of town. This is a problem because the location of a listing is as significant to the experience of a trip as the quality of the listing itself. However, while the quality of a listing is fairly easy to measure, the relevance of the location is dependent upon the user’s query. Searching for San Francisco doesn’t mean you want to stay anywhere in San Francisco, let alone the Bay Area more broadly. Therefore, a great listing in Berkeley shouldn’t come up as the first result for someone looking to stay in San Francisco. Conversely, if a user is specifically looking to stay in the East Bay, their search result page shouldn’t be overwhelmed by San Francisco listings, even if they are some of the highest quality ones in the Bay Area.
So we set out to build a location relevance signal into our search model that would endeavor to return the best listings possible, confined to the location a searcher wants to stay. One heuristic that seems reasonable on the surface is that listings closer to the center of the search area are more relevant to the query. Given that intuition, we introduced an exponential demotion function based upon the distance between the center of the search and the listing location, which we applied on top of the listing’s quality score.
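As a rough illustration of the idea, here is a minimal Python sketch of an exponential distance demotion applied on top of a quality score. The function names, the 5 km decay scale, and the sample listings are all assumptions made for the example; the post does not give the actual form or parameters we used.

```python
import math


def exponential_demotion(distance_km: float, scale_km: float = 5.0) -> float:
    """Multiplier in (0, 1] that decays exponentially with distance.

    scale_km is a hypothetical tuning knob: the distance at which the
    multiplier has dropped to 1/e.
    """
    if distance_km < 0:
        raise ValueError("distance must be non-negative")
    return math.exp(-distance_km / scale_km)


def location_adjusted_score(quality: float, distance_km: float,
                            scale_km: float = 5.0) -> float:
    """Apply the demotion on top of the listing's quality score."""
    return quality * exponential_demotion(distance_km, scale_km)


# Made-up listings: (name, quality score, distance from search center in km).
listings = [
    ("downtown_loft", 0.80, 1.0),
    ("berkeley_house", 0.95, 15.0),
]

# Sort best-first by the demoted score.
ranked = sorted(
    listings,
    key=lambda item: location_adjusted_score(item[1], item[2]),
    reverse=True,
)
```

With these numbers the closer listing ranks first even though the farther one has the higher raw quality score, which is the behavior the heuristic was meant to produce.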
This got us past the issue of random locations, but the signal overemphasized centrality, returning listings predominantly in the city center as opposed to other neighborhoods where people might prefer to stay.
To deal with this, we tried shifting from an exponential to a sigmoid demotion curve. This had the benefit of an inflection point, which we could use to tune the demotion function in a more flexible manner. In an A/B test, we found this to generate a positive lift, but it still wasn’t ideal: every city required individual tweaking to accommodate its size and layout. And listings in the city center still had an advantage, since they were demoted the least.
There are, of course, simple solutions to a problem like this. For example, we could expand the radius for search results and diminish the algorithm’s distance weight relative to weights for other factors. But most locations aren’t symmetrical or axis-aligned, so by widening our radius a search for New York could (gasp) return listings in New Jersey. It quickly became clear that predetermining and hardcoding the perfect logic is too tricky when thinking about every city in the world all at once.
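To make the sigmoid variant concrete, here is a small Python sketch of a demotion curve with a tunable inflection point. The parameter names and values (`inflection_km`, `steepness`, the per-city table) are hypothetical; they only illustrate why each city needed its own tuning.

```python
import math


def sigmoid_demotion(distance_km: float, inflection_km: float = 8.0,
                     steepness: float = 0.5) -> float:
    """Multiplier in (0, 1) that stays near 1 close to the search center
    and falls off around inflection_km, where it equals exactly 0.5."""
    z = steepness * (distance_km - inflection_km)
    if z > 700:  # avoid OverflowError in math.exp for very large distances
        return 0.0
    return 1.0 / (1.0 + math.exp(z))


# Hypothetical per-city inflection points: a sprawling city tolerates
# longer distances than a compact one before listings get demoted.
city_inflection_km = {"compact_city": 3.0, "sprawling_city": 15.0}

demotion_at_10km = {
    city: sigmoid_demotion(10.0, inflection_km=km)
    for city, km in city_inflection_km.items()
}
```

Unlike the exponential curve, listings well inside the inflection distance are barely demoted at all, but the right inflection distance differs from city to city, which is the hand-tuning burden described above.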
So we decided to let our community solve the problem for us. Using a rich dataset comprised of guest and host interactions, we built a model that estimated a conditional probability of booking in a location, given where the person searched. A search for San Francisco would thus skew towards neighborhoods where people who also search for San Francisco typically wind up booking, for example the Mission District or Lower Haight.
This solved the centrality problem and an A/B test again showed positive lift over the previous paradigm.
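For readers who want to see the shape of such a model, here is a minimal Python sketch that estimates the probability of booking in a location given the searched location, using simple counts. The event format and the sample numbers are invented for illustration, and the real model was built on a much richer dataset than raw (searched, booked) pairs.

```python
from collections import Counter, defaultdict


def booking_given_search(events):
    """Estimate P(booked location | searched location) from
    (searched, booked) pairs by relative frequency."""
    counts = defaultdict(Counter)
    for searched, booked in events:
        counts[searched][booked] += 1
    return {
        searched: {
            booked: n / sum(booked_counts.values())
            for booked, n in booked_counts.items()
        }
        for searched, booked_counts in counts.items()
    }


# Made-up history: each pair is (what the guest searched, where they booked).
events = (
    [("San Francisco", "Mission District")] * 3
    + [("San Francisco", "Lower Haight")] * 1
)

probabilities = booking_given_search(events)
```

Here a search for San Francisco would weight the Mission District three times as heavily as Lower Haight, purely because that is where past searchers ended up booking.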
However, it didn’t take long to realize the biases we had introduced. We were pulling every search to where we had the most bookings, creating a gravitational force toward big cities. A search for a smaller location, such as the nearby surf town Pacifica, would return some listings in Pacifica and then many more in San Francisco. But the urban experience San Francisco offers doesn’t match the surf trip most Pacifica searchers are planning. To fix this, we tried normalizing by the number of listings in the search area. In the case of Pacifica, we now returned other small beach towns over SF. Victory!
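One plausible reading of this normalization, sketched in Python below, is to divide bookings in each location by the number of listings there, so that a large market no longer wins just by being large. This is an assumption about the details, not a description of the production logic, and all counts are made up.

```python
def normalized_affinity(bookings_by_location, listings_by_location):
    """Turn raw booking counts into bookings per listing, then rescale
    so the values sum to 1. Locations with no listings are skipped."""
    per_listing = {
        loc: bookings / listings_by_location[loc]
        for loc, bookings in bookings_by_location.items()
        if listings_by_location.get(loc, 0) > 0
    }
    total = sum(per_listing.values())
    if total == 0:
        return {}
    return {loc: value / total for loc, value in per_listing.items()}


# Hypothetical bookings made by guests who searched for Pacifica.
bookings = {"Pacifica": 30, "San Francisco": 200, "Half Moon Bay": 20}
# Hypothetical number of active listings in each location.
listings = {"Pacifica": 40, "San Francisco": 4000, "Half Moon Bay": 50}

affinity = normalized_affinity(bookings, listings)
ranked_locations = sorted(affinity, key=affinity.get, reverse=True)
```

By raw counts San Francisco dominates, but per listing Pacifica and the other small beach town come out ahead, matching the outcome described above.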
At this point we were close to solving the problem, but something still didn’t feel right. In the earlier world of randomly scattered listings, there were a number of serendipitous bookings. The mushroom dome, for example, is a beloved listing for our community, but few people find it by searching for Aptos, CA. Instead, the vast majority of mushroom dome guests discovered it while searching for Santa Cruz. However, by tightening up our search results for Santa Cruz to be great listings in Santa Cruz, we made the mushroom dome vanish. Thus, we decided to layer in another conditional probability encoding the relationship between the city people booked in and the cities they searched to get there.
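The two conditional probabilities can be sketched with the same counting approach, read in opposite directions. The Python below uses invented counts and a hypothetical `conditional` helper; it only shows why a small market can score low on one probability and high on the other, not how the two were combined in the actual ranking.

```python
from collections import Counter


def conditional(pairs):
    """Given (first, second) pairs, estimate P(second | first)
    by relative frequency. Keys of the result are (first, second)."""
    pairs = list(pairs)
    joint = Counter(pairs)
    marginal = Counter(first for first, _ in pairs)
    return {(a, b): n / marginal[a] for (a, b), n in joint.items()}


# Made-up history of (searched city, booked city) pairs.
events = (
    [("Santa Cruz", "Santa Cruz")] * 90
    + [("Santa Cruz", "Aptos")] * 8
    + [("Aptos", "Aptos")] * 2
)

# P(booked | searched): where do Santa Cruz searchers end up booking?
p_booked_given_search = conditional(events)

# P(searched | booked): what did guests who booked in Aptos search for?
p_search_given_booked = conditional((booked, searched)
                                    for searched, booked in events)

aptos_share_of_santa_cruz = p_booked_given_search[("Santa Cruz", "Aptos")]
santa_cruz_share_of_aptos = p_search_given_booked[("Aptos", "Santa Cruz")]
```

In this toy data only about 8% of Santa Cruz searches lead to an Aptos booking, yet 80% of Aptos bookings began as a Santa Cruz search. The second number is the dependency the extra signal is meant to capture.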
The relationship between the two conditional probabilities we used is displayed in the graph to the right. While all of the cities in the graph have a low booking likelihood relative to Santa Cruz itself, they are also mostly small markets and we can give them some credit for depending on Santa Cruz for searches for their bookings. At the same time places like San Jose and Monterey have no clear connection to Santa Cruz, so we can consider them as completely separate markets in search.
It’s important that improvements to the model do not lead to regressions in other parts of the world. In this case, little changed for our bigger markets like San Francisco. But this additional signal brings back the mushroom dome and other remote but iconic properties, facilitating the unique experiences our community is looking for.
The location relevance model that we built during this effort relies completely on data from our users’ behavior. We like this because it allows our community to dynamically inform future guests where they will have great experiences, and allows us to apply the model uniformly to all of the places around the world where our hosts are offering up places to stay.
* * *
Huge thanks to
Originally published at nerds.airbnb.com on May 1, 2013.