Expedia Group Technology — Data

Choosing the Right Candidates for Lodging Ranking

How Expedia Group ranks website search results


Authors: Adam Woznica and Meli Sedghi

A holiday maker floats in a large swimming pool. The pool has a central island with palm trees on it.
Photo by Luke Bender on Unsplash

Introduction

In large-scale recommender and ranking systems, the goal of candidate generation is to limit the pool of items to a manageable size for further processing and re-ranking. An effective candidate generation step yields a limited yet relevant selection of items, often under strict latency constraints.

A prominent example of a large-scale recommendation system at Expedia Group™ is lodging ranking for customer searches. The searches happen across different Expedia Group brands such as Expedia, Hotels.com, and Vrbo. Consumers can easily compare offerings between competing sites, which puts strict constraints on ranking quality, latency, and data freshness. Providing the most relevant lodging recommendations, as fast as possible, and accompanied by precise availability and pricing information, maximizes the chance of winning the sale.

In the ranking phase, properties are first scored according to customer utility, prioritizing properties relevant to a specific customer and query. The relevance is determined by implicit customer feedback, such as bookings and clicks. The list is then adjusted to incorporate additional objectives, including monetization, supplier relationships, diversity, and considerations related to cold-start scenarios [1]. Models used in the utility scoring phase are trained over historical shopping data using various deep learning algorithms [2]. These models are based on multiple signals that fall into the following main categories:

  • Search context. These features correspond to property searches (e.g., point of sale, check-in and check-out dates, number of adults, etc.)
  • Static property characteristics. Characteristics that do not depend on a particular query. Examples are user review rating, location, historical price, etc.
  • Dynamic property characteristics. Characteristics that depend on a particular query. Examples are availability, displayed price, promotion flags, etc.
  • Visitor purchase, click, and interaction history.

A more comprehensive description of the various features used by ranking models can be found in [3,4].
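To make these signal groups concrete, here is a minimal, hypothetical sketch of how they might be organized as model inputs. All field names are illustrative, not the production schema.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass
class SearchContext:
    """Query-level signals shared by every property in a search."""
    point_of_sale: str  # e.g., "expedia.com"
    check_in: date
    check_out: date
    num_adults: int

@dataclass
class PropertyFeatures:
    """Per-property signals for one candidate in the search."""
    # Static characteristics: independent of the query.
    review_rating: float
    latitude: float
    longitude: float
    historical_price: float
    # Dynamic characteristics: depend on the query and are filled in
    # at request time (availability, displayed price, promotions).
    is_available: Optional[bool] = None
    displayed_price: Optional[float] = None
    has_promotion: Optional[bool] = None
```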

Unlike other large-scale use cases, where recommendations often need to be generated from extremely large populations of items, the candidate population in lodging ranking at Expedia Group is usually smaller, as it is typically restricted by the search destination. Destinations range in size from small, through medium (10,000+ properties in regions like New York, Los Angeles, and Orlando), to large ones such as states and whole countries. The larger the number of properties, the longer it takes to generate a final recommendation. Below we show the size of the Expedia Group lodging inventory for five sample destinations.

A bar plot of lodging inventory sizes for five sample destinations, from smallest to largest.
Lodging inventory sizes at Expedia Group for five sample destinations, split by conventional lodging and vacation rentals. Note the logarithmic scale on the y-axis.

Even though lodging ranking at Expedia Group does not involve a large item pool, ranking all candidates remains impractical. The primary reason is the availability and pricing step where, given a search query, we identify availability per room type for each property and retrieve the corresponding prices, including discounts. We cannot skip this step because live availability and pricing signals are among the most important for ranking. The live pricing stage is particularly costly for conventional lodging properties because each room type is usually assigned multiple rate plans, each generally restricted to different combinations of search factors such as booking site, stay dates, number of adults, etc. Additional considerations are variable taxes and fees that depend on the jurisdiction of search and supply, personalized discounts, coupons, etc. The combinatorial explosion of these search factors, as well as frequent changes in prices and availability, overtaxes typical optimization strategies. Extensive caching is already implemented, but with the large number of possible pricing combinations, it alone is not sufficient for optimal performance and future growth.

To address these problems, we implemented a two-stage recommendation process in which we prioritize pricing a smaller number of properties in any search region, and then perform a computationally expensive ranking pass on the smaller candidate set. Not only does this approach reduce overall latency, but it also enables ranking with more complex and computationally expensive models. Depending on the availability of historical data, we distinguish two main approaches for item prioritization in the candidate generation stage.

The preferred approach is based on machine learning (ML) scoring models trained on historical shopping data. Depending on the input features used in these models, we distinguish two families (contrasted in the sketch after the list):

  • Static models that rely only on static property characteristics, resulting in property scores that are common across searches and hence amenable to efficient caching.
  • Dynamic models that, in addition to static features, make use of search context and possibly non-real-time, lower-accuracy pricing and availability signals. Such scores cannot be efficiently pre-computed and cached.
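The operational difference between the two families is where scoring can happen. The sketch below is illustrative only (the model calls are hypothetical placeholders): a static score is a pure function of the property, so it can be precomputed or cached, whereas a dynamic score must be evaluated per request.

```python
from functools import lru_cache

def _static_model(property_id: str) -> float:
    # Hypothetical placeholder for a trained static scoring model.
    return (sum(map(ord, property_id)) % 100) / 100.0

@lru_cache(maxsize=100_000)
def static_score(property_id: str) -> float:
    # One score per property, common across searches: safe to
    # precompute offline or cache indefinitely.
    return _static_model(property_id)

def dynamic_score(property_id: str, check_in: str, num_adults: int) -> float:
    # The score varies with the search context, so every
    # (property, query) pair needs a fresh evaluation.
    context_adjustment = 0.01 * num_adults  # hypothetical placeholder
    return _static_model(property_id) + context_adjustment
```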

If historical shopping data is unavailable, the relevance of a given property can be determined heuristically by combining historical performance indicators such as booking requests, conversion rate (CVR) [5], and click-through rate (CTR) [6]. This approach falls into the static category as it relies on neither search context nor dynamic features.

The overall structure of the two-stage ranking approach is depicted in the following diagram. The candidate generation stage can be ML-based or heuristic-based. The static property features store contains signals such as review rating, popularity, historical CVR, and CTR. The dotted lines denote potential dependencies: the ML-based candidate generation step may or may not use search parameters and non-real-time pricing and availability signals.

A diagram of a two stage approach. A candidate generation step is followed by a ranking step.
Two-stage ranking approach.

In the next two sections, we will describe machine learning and heuristic-based approaches in more detail.

Machine learning approach

As previously mentioned, the machine-learning-based approach is preferable when historical shopping data is available. Typically, the data are organized around sets of “search result impressions,” representing the ordered list of properties displayed to consumers after a lodging search on one of the Expedia Group websites. These logs encompass search parameters and property features, both static and dynamic. The data does not usually distinguish between room types; features are assumed to apply to the least expensive room type. Ideally, these logs should be unbiased by existing candidate generation strategies, containing final sort ranks for all properties in a search destination. However, because such data is expensive to obtain, we often rely on shopping data covering a subset of properties in larger destinations, already prioritized by an existing candidate generation strategy. To quantify this bias, we ran an analysis in which we experimented with various candidate generation strategies and candidate set sizes, from small thresholds up to the whole market in sample destinations. We concluded that the selection bias is minimal and can be deemed negligible for our use case.

The ML modeling approach is that of learning-to-rank [7] based on item relevancies determined by final ranks. The relevancy for an item i can be binary, based on a predefined threshold n:

rel_i = 1 if rank_i ≤ n, else 0

Alternatively, we can use smooth relevancies:

rel_i = 1 / rank_i
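Both definitions can be computed directly from the logged final rank of each property. A minimal sketch, assuming 1-indexed ranks (not the production labeling code):

```python
def binary_relevance(final_rank: int, n: int = 300) -> int:
    # rel_i = 1 if the item ranked within the top n, else 0.
    return 1 if final_rank <= n else 0

def smooth_relevance(final_rank: int) -> float:
    # rel_i = 1 / rank_i: relevance decays with rank rather than
    # dropping to zero at a hard cutoff.
    return 1.0 / final_rank
```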

We conducted experiments with diverse learning-to-rank methodologies, spanning from linear models and factorization machines to more complex deep learning architectures. The latter models are currently used in production, with the following overall architecture:

A diagram for deep learning architecture with three separate inputs and one output.
Generic deep neural network architecture for candidate generation.

The overall data loop strategy is outlined in the diagram below. Candidate generation is employed to narrow down the number of properties for large destination searches; the reduced candidate pool undergoes ranking (1), and the resulting final ranks are logged (2). Next, the logs are converted into training data for the candidate generation model (3), where item relevance is determined based on final ranks. Learning-to-rank models are subsequently trained (4) using a loss function that penalizes discrepancies in items’ relevance. These models are then utilized in the candidate generation phase (5).

A diagram for data loop with five boxes: candidate generation, ranking, logs, training data, learning-to-rank modelling.
Data loop for the ML-based candidate generation.
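The post does not specify which loss is used in production; as one common choice for step (4), a listwise softmax cross-entropy (ListNet-style) loss penalizes discrepancies between the relevance distribution and the model’s score distribution. A minimal, illustrative sketch for a single search:

```python
import numpy as np

def listwise_softmax_loss(scores: np.ndarray, relevance: np.ndarray) -> float:
    """ListNet-style loss for one search (illustrative only).

    scores:    model scores for the properties in the search.
    relevance: non-negative relevance labels, at least one positive.
    """
    def softmax(x: np.ndarray) -> np.ndarray:
        z = np.exp(x - x.max())  # shift for numerical stability
        return z / z.sum()

    p_target = relevance / relevance.sum()  # target distribution
    p_model = softmax(scores)               # model distribution
    return float(-(p_target * np.log(p_model + 1e-12)).sum())
```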

The candidate generation models are evaluated using the recall metric, which measures the proportion of relevant items successfully retrieved by a model out of the total number of relevant items:

Recall = TP / (TP + FN)

where

  • True positives (TP) are the items that are both relevant and correctly recommended by the system.
  • False negatives (FN) are the items that are relevant but not recommended by the system.

The above definition hinges on the notion of relevant and recommended items, which for a single search are simply defined as the top n ranked items and the top k pre-selected items, respectively. This is graphically presented in the following diagram.

A schematic representation of true positives, false positives, true negatives and false negatives for a single search.
Relevant and predicted items for a single search.

Different values of the n and k parameters give rise to various recall variants. We defined two variants of recall that we found useful in assessing the performance of our ML models:

  • Recall n=300, k=300, a single metric summarizing predictive performance.
  • Cumulative Recall Curve, Recall n=300, k for k = 50…300, which measures how well the model recovers the top-ranked properties at varying pre-selection depths k (see the sketch below).
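A minimal sketch of both variants, assuming each search is represented by its final ranking and by its pre-selected candidate list (hypothetical helper names, not the production evaluation code):

```python
def recall_at(final_ranking: list, preselected: list, n: int, k: int) -> float:
    # Relevant items: top n of the final ranking; recommended items:
    # top k of the pre-selection. Recall = TP / (TP + FN).
    relevant = set(final_ranking[:n])
    recommended = set(preselected[:k])
    true_positives = len(relevant & recommended)
    return true_positives / len(relevant) if relevant else 0.0

def cumulative_recall_curve(final_ranking: list, preselected: list,
                            n: int = 300, ks=range(50, 301, 50)) -> dict:
    # Fixed n, varying pre-selection depth k, as in the curve above.
    return {k: recall_at(final_ranking, preselected, n, k) for k in ks}
```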

The final recall value is obtained by averaging recall across all searches in the validation data. We evaluated three models of increasing complexity that differ with respect to their input signals:

  • Static model, based only on static property features,
  • Dynamic model, based on static property features and search parameters,
  • Dynamic with pricing model, based on static property features, search parameters, and various non-live pricing and availability features.

The performance in terms of Recall n=300, k=300 for these three models is shown in the following plot. Increased model complexity yields better predictive performance; the most significant gain comes from including search parameters.

Predictive performance for “static”, “dynamic” and “dynamic with pricing” variants. The “dynamic with pricing” variant outperforms the alternatives.
Predictive performance.

As expected, the cumulative recall curves, Recall n=300, k for k = 50…300, increase with the pre-selection depth. More specifically, with the dynamic model with pricing, it is possible to predict on average 97% of the top 50 ranked properties with a pre-selection depth of 300.

Cumulative recall by selection sort rank for “static”, “dynamic” and “dynamic with pricing” variants. The “dynamic with pricing” variant outperforms the alternatives.
Cumulative recall.

Heuristic-based approach

In the absence of historical ranking logs, we adopted an alternative approach: instead of learning the scoring function from ranked lists, we constructed a heuristic performance score that estimates the relevance of a given property from historical performance signals such as the number of bookings, CTR, and CVR. To identify the most important features built from the raw signals, and to decide how each feature should contribute to the final score, we performed an extensive statistical analysis of historical performance data. For example, since a booking event is a much stronger signal than a property details page view, the functional form needs to weight it proportionally more. The heuristic-based scores are static, i.e., they do not rely on search parameters.
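As an illustration only, such a score might combine log-damped counts and rates, with a booking weighted far more heavily than a details-page view; the weights and functional form below are assumptions, not the production formula.

```python
import math

def heuristic_score(bookings: int, detail_views: int,
                    cvr: float, ctr: float) -> float:
    # Illustrative weights: a booking contributes far more than a
    # details-page view, reflecting its stronger intent signal.
    w_book, w_view, w_cvr, w_ctr = 1.0, 0.05, 0.5, 0.2
    return (w_book * math.log1p(bookings)
            + w_view * math.log1p(detail_views)
            + w_cvr * cvr
            + w_ctr * ctr)
```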

It is important to note that lodging ranking introduces inherent biases by consistently choosing to show some items over others (selection bias) and by consistently placing some items higher than others (position bias). For example, if property A is shown more frequently than property B, property A is more likely to have better historical performance. Similarly, exposure at higher ranks results in more engagement. To account for these phenomena, we defined an auxiliary metric, which we call the visibility score, that measures the exposure of a given property and lets us “normalize” the observed performance. This metric acts as our confidence level in the observed historical performance, so we adjust the historical values more when there is not enough data to rely on. For properties with low visibility scores, we are less confident in their historical performance and therefore assign a larger variance to those data points; conversely, properties that have had enough exposure, and hence have high visibility scores, are assigned lower variability.

To achieve this, we took inspiration from the Upper Confidence Bound (UCB) algorithm [8], widely used in multi-armed bandit problems. This method uses uncertainty in action-value estimates to tackle the exploration-exploitation dilemma; instead of using the sample mean values directly, it maximizes their upper-bounded values to drive exploration:

A_t = argmax_a [ Q_t(a) + U_t(a) ]

where, at time step t, A_t is the chosen action, Q_t(a) is the sample mean value of action a, and U_t(a) denotes the upper bound, which is an inverse function of the number of trials of that action so far.

The bounds are derived from confidence intervals, typically obtained via Hoeffding’s inequality. We adapted this approach to our use case: the performance scores can be viewed as observed sample means, and the visibility score can be used to derive confidence intervals around them. As a result, we achieved the goals of assigning higher variability to properties with lower exposure and rewarding those that outperformed their peers within the same exposure segment (a minimal sketch follows).
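In the sketch below, the visibility score plays the role of the trial count in the Hoeffding-style bound; the constant and functional form are illustrative assumptions, not the production formula.

```python
import math

def ucb_adjusted_score(performance: float, visibility: float,
                       c: float = 1.0) -> float:
    """Upper-confidence-style adjustment of a historical performance score.

    performance: observed score (e.g., normalized CVR), treated as a
                 sample mean.
    visibility:  exposure-based proxy for the number of trials; higher
                 visibility yields a tighter bound.
    c:           exploration strength (illustrative value).
    """
    # Hoeffding-style bound: its width shrinks as visibility grows, so
    # low-exposure properties get a larger optimistic adjustment.
    bound = c * math.sqrt(1.0 / max(visibility, 1e-6))
    return performance + bound
```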

The heuristic-based candidate generation, although easier to implement and maintain than the ML-based approach, proved powerful in practice: it significantly improved the overall ranking quality for one of the Expedia Group brands, as measured in an A/B test [9].

Cold start problem

An important consideration in the context of candidate selection is the cold start problem: giving newly added items a fair chance to appear in the ranking set. To tackle this problem, we decided to implement an explicit boosting mechanism in the selection, ensuring that a certain portion of the candidate set is composed of new listings. The main challenge here is that market composition varies considerably across destinations, so it is hard to maintain the same portion everywhere. For example, if a large market contains several sub-markets, each with many new listings, boosting these listings in each sub-market can result in a suboptimal selection for the larger market. Moreover, for static scores, candidate set generation is triggered by a search, but the scores are calculated offline. To overcome these challenges, we implemented two different cold start strategies.

The first strategy amounts to interleaving two selection lists, one of new listings and one of non-new listings, such that the final list contains at least p new elements, where p < k and k is the number of pre-selected items. Established items are prioritized by pre-sort scores, whereas new elements are usually selected randomly (uniformly or proportionally to their pre-sort scores). While conceptually simple, this approach is harder to implement in practice because it requires a two-stage candidate generation process: in the first stage, we independently rank new and established properties; in the second stage, we merge the two lists (a minimal sketch follows). This approach can be used with both static and dynamic candidate generation models.
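A minimal sketch of the interleaving, with new listings sampled uniformly for simplicity (proportional-to-score sampling is the other option mentioned above):

```python
import random

def interleave_candidates(established: list, new: list, k: int, p: int) -> list:
    """Select k candidates containing at least p new listings (p < k).

    established: non-new properties sorted by pre-sort score, best first.
    new:         new listings, sampled uniformly here for illustration.
    """
    n_new = min(p, len(new))
    picked_new = random.sample(new, n_new)
    picked_established = established[:k - n_new]
    # Order within the candidate set is immaterial: the downstream
    # ranking stage re-scores every selected property anyway.
    return picked_established + picked_new
```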

In the second strategy, we utilize H3 indexing, a hierarchical, hexagonal indexing system that covers the Earth’s surface [10]. It provides a very efficient method for indexing and querying spatial data. The grids come in different resolutions, with lower resolutions covering larger areas.

Using the H3 system, within each city-size hexagon we explicitly assign large scores to selected new listings. The new listings can be chosen based on different criteria, such as their age in the system, historical performance, or potentially competitive features. In aggregate, we ensure that a predefined target level is maintained across different geographical levels. This approach can be implemented offline, making it applicable to the static candidate generation strategy and addressing the aforementioned challenges (a hypothetical sketch follows).
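A hypothetical sketch of the per-hexagon boost, assuming the h3-py package (v4 API). The resolution, boost value, and per-cell quota are illustrative choices, not the production settings.

```python
import h3  # pip install h3 (v4 API)

RESOLUTION = 6   # roughly city-scale hexagons; illustrative choice
BOOST = 1.0e6    # large enough to guarantee selection; illustrative
PER_CELL = 5     # new listings boosted per hexagon; illustrative

def boost_new_listings(properties: list) -> list:
    """properties: dicts with 'lat', 'lng', 'score', and 'is_new' keys."""
    by_cell: dict = {}
    for prop in properties:
        cell = h3.latlng_to_cell(prop["lat"], prop["lng"], RESOLUTION)
        by_cell.setdefault(cell, []).append(prop)
    for props in by_cell.values():
        # Selection criteria (age, historical performance, competitive
        # features) are simplified here to plain score order.
        new = sorted((p for p in props if p["is_new"]),
                     key=lambda p: p["score"], reverse=True)
        for prop in new[:PER_CELL]:
            prop["score"] += BOOST
    return properties
```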

Below we show the percentage change in the fraction of new listings for different top-n segments of the returned ranked results when the boost mechanism is applied. Note that the re-ranking algorithm gives special treatment to new listings in the top 10, so the values for this segment are inflated. We can see that the treatment boosts new listings across the different segments of the results, as expected. Note also that market composition differs between markets; hence, in practice, the observed number of boosted new listings is upper-bounded by the number of available new listings in each market.

+----------------------+------------------------------------------+
| Top-n ranking result | % of change in fraction of new listings  |
+----------------------+------------------------------------------+
| 10                   | 103.00                                   |
| 50                   | 5.58                                     |
| 500                  | 7.06                                     |
| 123                  | 10.00                                    |
+----------------------+------------------------------------------+

Conclusions

Candidate generation plays a crucial role in large-scale recommender systems and ranking. In this blog post, we outlined two approaches for candidate generation: the ML approach and the heuristic-based approach. The ML approach is preferable when historical shopping logs are accessible; however, if such logs are unavailable, the heuristic-based approach serves as a good alternative. Finally, we also explored different strategies to tackle the cold start problem during the candidate selection phase. The candidate generation systems presented here are currently in use across various brands at Expedia Group.

[1] The Juggler Model: Balancing Expectations in Lodging Rankings, Tiago Cunha, Medium blog post

[2] https://en.wikipedia.org/wiki/Deep_learning

[3] Presentation of the Expedia Group RecTour Research Dataset, Adam Woznica, Jan Krasnodebski, Proceedings of the Workshop on Recommenders in Tourism co-located with the 15th ACM Conference on Recommender Systems (RecSys 2021)

[4] Personalize Expedia Hotel Searches — ICDM 2013: Learning to rank hotels to maximize purchases, Adam Woznica, Jan Krasnodebski, Kaggle 2013 competition

[5] https://en.wikipedia.org/wiki/Conversion_rate_optimization#Calculation_of_conversion_rate

[6] https://en.wikipedia.org/wiki/Click-through_rate

[7] https://en.wikipedia.org/wiki/Learning_to_rank

[8] Bootstrapping Upper Confidence Bound, Botao Hao, Yasin Abbasi-Yadkori, Zheng Wen, Guang Cheng, NeurIPS 2019

[9] https://en.wikipedia.org/wiki/A/B_testing

[10] H3: Hexagonal hierarchical geospatial indexing system, 2023 Uber Technologies, Inc.
