Improving Deep Learning for Ranking Stays at Airbnb
Search ranking is at the heart of Airbnb. Data from search logs* indicate it is a feature used by more than 90% of guests to book a place to stay. In ranking, we want the search results (referred to as listings) to be sorted by guest preference, a task for which we train a deep neural network (DNN).
The DNN infers guest preference by looking at past search results and the outcome associated with each listing that was shown. For example, booked listings are considered preferred over unbooked ones. Changes to the DNN are therefore graded by the resulting change in booking volume.
Previously, we’ve focused on how to effectively apply DNNs to this process of learning guest preference. But this process of learning makes a leap: it assumes future guest preferences can be learned from past observations.
In this article, we go beyond the basic DNN building setup and examine this assumption in closer detail. In the process, we describe ways in which the simple learning to rank framework falls short and how we address some of these challenges. Our solution is what we refer to as the A, B, C, D of search ranking:
- Architecture: Can we structure the DNN in a way that allows us to better represent guest preference?
- Bias: Can we eliminate some of the systematic biases that exist in the past data?
- Cold start: Can we correct for the disadvantage new listings face given they lack historical data?
- Diversity of search results: Can we avoid the majority preference in past data from overwhelming the results of the future?
Architecture
The motivation for a better architecture came from the fact that the DNN-inferred guest preference seemed out of touch with the actual observed preference. In particular, guest bookings were skewed towards economically priced listings, and the median price of booked listings was lower than the median price of search results shown. This suggested we could get closer to true guest preference by showing more lower-priced listings, an intuition we referred to as "cheaper is better". However, explicitly applying price-based demotion to the DNN-ranked results led to a drop in bookings.
In response, we discarded the "cheaper is better" intuition, realizing that what we really needed was an architecture to predict the ideal listing for the trip. The architecture, shown in Figure 1, has two towers. The first tower, fed by query and user features, predicts the ideal listing for the trip. The second tower transforms raw listing features into a vector. During training, the towers are trained so that booked listings are pulled closer to the ideal listing, while unbooked listings are pushed away from it. When tested online in a controlled A/B experiment, this architecture increased bookings by +0.6%.
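The training objective described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the production model: the layer sizes, the Euclidean distance, the softplus pairwise loss, and the names `init_tower`, `dense_tower`, and `pairwise_loss` are all hypothetical choices made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_tower(in_dim, hidden_dim, out_dim):
    """Randomly initialised two-layer tower (sizes are illustrative)."""
    return [
        (rng.normal(scale=0.1, size=(in_dim, hidden_dim)), np.zeros(hidden_dim)),
        (rng.normal(scale=0.1, size=(hidden_dim, out_dim)), np.zeros(out_dim)),
    ]

def dense_tower(x, layers):
    """ReLU hidden layer followed by a linear output layer."""
    (w1, b1), (w2, b2) = layers
    return np.maximum(x @ w1 + b1, 0.0) @ w2 + b2

def pairwise_loss(ideal, booked, unbooked):
    """Assumed loss: softplus of the distance gap, averaged over the batch.

    It shrinks as booked listings move closer to the ideal listing than
    unbooked ones, and grows when the order is reversed.
    """
    d_booked = np.linalg.norm(ideal - booked, axis=-1)
    d_unbooked = np.linalg.norm(ideal - unbooked, axis=-1)
    return float(np.logaddexp(0.0, d_booked - d_unbooked).mean())

# Hypothetical feature sizes: 6 query/user features, 8 listing features.
query_user_tower = init_tower(in_dim=6, hidden_dim=16, out_dim=4)
listing_tower = init_tower(in_dim=8, hidden_dim=16, out_dim=4)

query_user = rng.normal(size=(32, 6))     # one row per search
booked_raw = rng.normal(size=(32, 8))     # listing booked in that search
unbooked_raw = rng.normal(size=(32, 8))   # listing shown but not booked

ideal = dense_tower(query_user, query_user_tower)  # "ideal listing" vectors
loss = pairwise_loss(
    ideal,
    dense_tower(booked_raw, listing_tower),
    dense_tower(unbooked_raw, listing_tower),
)
print(f"pairwise loss before training: {loss:.4f}")
```

Minimizing a loss of this shape with gradient descent would pull booked listings toward the ideal-listing vector and push unbooked ones away; at serving time, listings could then be ordered by their distance to that vector.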
Bias
One challenge in inferring guest preference from past bookings is that booking decisions are not solely a function of guest preference. They are also influenced by the position in which the listings are shown in the search results. User attention drops monotonically as we go down the list of results, so higher-ranked listings have a better chance of getting booked solely due to their position. This creates a feedback loop, where listings ranked highly by previous models continue to maintain higher positions in the future, even when they could be misaligned with guest preferences. Figure 2 below shows how the number of clicks a listing receives decays with its position in the ranking, independent of the listing quality. The decay is shown per device platform.
To address this bias, we add position as a feature in the DNN. To avoid overreliance on the position feature, we apply dropout to it: during training, we set the position feature to 0 with probability 0.15. This additional information lets the DNN learn the influence of both the position and the quality of the listing on a user's booking decision. When ranking listings for future users, we set the input position to 0 for every listing, effectively leveling the playing field. Correcting for positional bias led to an increase of +0.7% in bookings in an online A/B test.
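The handling of the position feature at training and serving time can be illustrated as follows. This is a sketch only: the function names are hypothetical, and it assumes logged positions start at 1 so that 0 can act as a "no position" value.

```python
import numpy as np

POSITION_DROPOUT_RATE = 0.15  # from the text: position is zeroed 15% of the time

def apply_position_dropout(positions, rate, rng):
    """Training-time input: zero each logged position with probability `rate`."""
    positions = np.asarray(positions, dtype=np.float64)
    keep = rng.random(positions.shape) >= rate
    return positions * keep

def inference_positions(num_listings):
    """Serving-time input: every candidate listing gets position 0."""
    return np.zeros(num_listings)

rng = np.random.default_rng(0)
logged_positions = np.arange(1, 11)  # assumed 1-indexed positions from search logs
train_input = apply_position_dropout(logged_positions, POSITION_DROPOUT_RATE, rng)
serve_input = inference_positions(len(logged_positions))
print(train_input)
print(serve_input)
```

Because the model sees position 0 for a share of the training examples, the all-zero input used at serving time is not out of distribution, and the remaining features have to carry the signal about listing quality.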
Cold Start
One clear scenario in which we cannot rely on past data is when no past data exists. This is most obvious in the case of new listings on the Airbnb platform. Via offline analysis, we observed there was much room for improvement in ranking new listings, especially relative to their “steady-state” behavior once enough data had been collected.
To address this cold start issue, we developed a more accurate way of estimating the engagement data of a new listing rather than simply using a global default value for all new listings. This method considers similar listings, as measured by geographic location and capacity, and aggregates data from those listings to produce a more accurate estimation of how a new listing would perform. These more accurate predictions for new listing engagement resulted in a +14% increase in bookings for new listings and an increase of +0.4% for overall bookings in a controlled, online A/B test.
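The idea of borrowing engagement data from similar listings can be sketched as below. The text only states that similarity is measured by geographic location and capacity; the 2 km radius, the capacity tolerance, the plain mean, the `bookings_per_view` signal, and all names here are assumptions made for illustration.

```python
import math
from dataclasses import dataclass

@dataclass
class EstablishedListing:
    lat: float
    lng: float
    capacity: int
    bookings_per_view: float  # hypothetical engagement signal

def haversine_km(lat1, lng1, lat2, lng2):
    """Great-circle distance between two points, in kilometres."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lng2 - lng1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(a))

def estimate_engagement(lat, lng, capacity, established,
                        radius_km=2.0, capacity_slack=1, global_default=0.01):
    """Average engagement of nearby listings with similar capacity.

    Falls back to a global default when no similar listing is found.
    """
    similar = [
        other.bookings_per_view
        for other in established
        if abs(other.capacity - capacity) <= capacity_slack
        and haversine_km(lat, lng, other.lat, other.lng) <= radius_km
    ]
    if not similar:
        return global_default
    return sum(similar) / len(similar)

established = [
    EstablishedListing(37.770, -122.420, 2, 0.02),   # nearby, similar capacity
    EstablishedListing(37.775, -122.425, 3, 0.04),   # nearby, similar capacity
    EstablishedListing(37.771, -122.421, 10, 0.90),  # nearby, much larger
    EstablishedListing(40.710, -74.000, 2, 0.50),    # far away
]
estimate = estimate_engagement(37.772, -122.422, 2, established)
print(f"estimated engagement for the new listing: {estimate:.3f}")
```

Only the first two listings qualify as similar in this toy data, so the estimate is their average rather than the single global default that every new listing would otherwise receive.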
Diversity of Search Results
Deep learning enabled us to create a powerful search ranking model that could predict the relevance of any individual listing based on its past performance. However, one angle that was missing was a more holistic view of the results shown to the user. By only considering one listing at a time, we were unable to optimize for important properties of the overall result set, such as diversity. In fact, we observed that many of the top results seemed similar in terms of key attributes, such as price and location, which indicated a lack of diversity. In general, diverse results can contribute to a better user experience by illustrating the wide breadth of available choices rather than redundant items.
Our solution to address diversity involved developing a novel deep learning architecture, which consisted of Recurrent Neural Networks (RNNs), to generate an embedding of the query context using the entire result sequence. This Query Context Embedding is then used to re-rank the input listings in light of the new information about the entire result set. For example, the model could now learn local patterns and uprank a listing when it is one of the only listings available in a popular area for that search request. The architecture for generating this Query Context Embedding is shown in Figure 3 below.
Overall, we found this led to an increase in the diversity of our search results, along with a +0.4% global booking gain in an online A/B test.
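To make the re-ranking idea concrete, here is a small NumPy sketch of encoding a whole result sequence into a context vector and then scoring each listing together with that vector. It is a rough illustration under assumptions: a plain tanh recurrent cell stands in for whatever RNN variant is used in production, the weights are random rather than trained, and the function names and dimensions are hypothetical.

```python
import numpy as np

def query_context_embedding(listing_features, w_in, w_rec, bias):
    """Run a plain tanh RNN over the result sequence; return the last hidden state."""
    hidden = np.zeros(w_rec.shape[0])
    for features in listing_features:  # top-ranked listing first
        hidden = np.tanh(features @ w_in + hidden @ w_rec + bias)
    return hidden

def rerank(listing_features, context, w_hidden, w_out):
    """Score each listing jointly with the shared context embedding.

    The ReLU layer lets the context interact with the listing features,
    so the same listing can score differently in different result sets.
    """
    tiled = np.tile(context, (len(listing_features), 1))
    joint = np.concatenate([listing_features, tiled], axis=1)
    scores = np.maximum(joint @ w_hidden, 0.0) @ w_out
    order = np.argsort(-scores)  # indices of listings, best first
    return order, scores

rng = np.random.default_rng(0)
num_listings, feature_dim, context_dim, hidden_dim = 5, 4, 3, 8

listings = rng.normal(size=(num_listings, feature_dim))  # first-pass ranked results
context = query_context_embedding(
    listings,
    w_in=rng.normal(size=(feature_dim, context_dim)),
    w_rec=rng.normal(size=(context_dim, context_dim)),
    bias=np.zeros(context_dim),
)
order, scores = rerank(
    listings,
    context,
    w_hidden=rng.normal(size=(feature_dim + context_dim, hidden_dim)),
    w_out=rng.normal(size=hidden_dim),
)
print("re-ranked order:", order)
```

The design point is that the context vector summarizes the entire result set, so a second-pass scorer trained on it can account for properties such as how many similar listings are already present.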
The techniques described above enabled us to go beyond the basic deep learning setup, and they continue to serve all searches on Airbnb. That being said, this article touches on just a handful of the considerations that go into how our DNN works. Ultimately, we consider over 200 signals in determining search ranking. As we look into further improvements, a deeper understanding of guest preferences remains our guiding light.
Our papers published at the KDD conference go into greater technical depth:
- Improving Deep Learning for Airbnb Search goes into the details of the neural network architecture, tackling positional bias and cold start. KDD’2020
- Managing Diversity in Airbnb Search is dedicated to the techniques we used to improve diversity in search results. KDD’2020
- Applying Deep Learning to Airbnb Search describes how to effectively apply DNNs to search ranking. KDD’2019
We always welcome ideas from our readers. For those interested in contributing to this work, please check out the open positions on the search team.
*Data collected during first two weeks of Aug 2020.