Measuring the business value of a hotel ranking model
Part 1 of Rocketmiles’ search result ranking series
There are plenty of articles floating around the web talking about how to code machine learning models, but not many discuss how data scientists address business objectives and communicate results to management. This article tries to fill that gap by walking through the business side of Rocketmiles’ hotel ranking algorithms; namely, how we modified a standard ranking model to maximize business value, custom metrics we used to speedily evaluate A/B test performance, and tools that allowed management to interpret a supposedly “black-box” model.
Ranking plays second (third?) fiddle to classification and regression problems, so there are not many resources detailing in depth how e-commerce companies can implement intelligent ranking solutions. By sharing our learnings, we hope we can help others make more informed decisions during their ranking journey.
We organize this post into three sections:
- What factors should a ranking model consider so that it can best add value?
- What business metrics can we use to evaluate model performance?
- How can we make sure the model aligns with common sense?
Let’s dive right in.
What factors should a ranking model consider so that it can best add value?
To determine the added value of a new solution, it’s important to understand the drawbacks of the previous one. Before we moved to a machine learning model, Rocket used a points-based system to rank hotels, a general example of which is displayed below.
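For concreteness, here is a minimal sketch of what such a points-based scorer might look like. The features and weights are purely illustrative, not Rocket’s actual values:

```python
# Hypothetical linear points-based scorer; features and weights are
# illustrative only, not the actual values Rocket used.
WEIGHTS = {"star_rating": 10.0, "review_score": 5.0, "price": -0.02, "rewards_points": 0.5}

def points_score(hotel):
    # A purely linear combination: each unit of a feature is worth the
    # same number of points regardless of context.
    return sum(weight * hotel.get(feature, 0.0) for feature, weight in WEIGHTS.items())

hotels = [
    {"name": "Hotel A", "star_rating": 5, "review_score": 9.0, "price": 400, "rewards_points": 20},
    {"name": "Hotel B", "star_rating": 3, "review_score": 8.5, "price": 90, "rewards_points": 5},
]
# Sort descending: the highest-scoring hotels go to the top.
ranked = sorted(hotels, key=points_score, reverse=True)
```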
We would then sort hotels in descending order of points, where the hotels with the highest points would go to the top. The points-based approach is intuitive, and many e-commerce companies have followed this format before. For example, in the second paragraph of Airbnb’s search paper:
The very first implementation of search ranking was a manually crafted scoring function
However, this solution has some structural drawbacks.
- Inflexible. All products come with tradeoffs, and hotel rooms are no different. Some are too small and others too big; some are too cheap and others too expensive. Though the concept of a Goldilocks zone (where the price is just right) applies, a purely linear scoring equation can’t replicate it: according to the model, a higher price is always better or always worse. Moreover, linearity forces the benefit of going from 4.0 stars to 5.0 stars to be the same as the benefit of going from 1.0 stars to 2.0 stars, so any concept of diminishing returns is absent.
- No personalization. Capsule hotels might be acceptable for the backpacker crashing the night in Singapore, but less enticing to the newlyweds in Malta.
- Arbitrary weightings. This is not a knock on our product owners, as their business sense is several orders of magnitude higher than mine, but human judgment is simply not precise enough to solve for the optimal weights of such a complex and multifaceted problem.
- Does not factor in diversity. Two hotels which look the same will have the same score, and will be ranked closely. But no user wants to scroll past 60 Motel 6’s in a row.
More sophisticated feature engineering and weight tweaking could solve many of these problems. For example, we could add as a feature the absolute distance of the hotel price from the mean hotel price, or interact terms with a boolean flag indicating if the user is a business traveler. But if we want to make these kinds of tweaks, then using a machine to do it for us actually becomes the simpler alternative. Let’s consider the landscape of machine learning methods, and see what they have to offer.
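Before moving on, here is a sketch of what those two engineered features might look like in code. The function and feature names are hypothetical:

```python
def engineered_features(hotel_price, mean_price, is_business_traveler):
    """Two hand-crafted fixes for a linear scorer (illustrative only)."""
    return {
        # Distance from the mean price lets a linear model penalize both
        # unusually cheap and unusually expensive hotels: a Goldilocks zone.
        "abs_price_distance": abs(hotel_price - mean_price),
        # Interaction term: lets the model weigh price differently for
        # business travelers versus leisure travelers.
        "price_x_business": hotel_price * (1.0 if is_business_traveler else 0.0),
    }
```

Each tweak like this adds another weight to hand-tune, which is exactly why handing the job to a learner becomes the simpler option.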
Broadly, there are two common machine learning approaches to ranking:
- Information retrieval: descending from search engines like Google and Yahoo, in which the goal is to return the most appropriate results to a query.
- Recommender systems: descending from product recommendations like Netflix and Amazon, in which the goal is to push the items which most closely match a user’s latent preferences.
Though the dominant models of the two approaches can be quite different in structure, they have the same core principle of matching by relevancy. In information retrieval, the query requires a relevant result; in recommender systems, the user demands a relevant item¹.
There are two problems with relevancy scoring:
- Does not factor in diversity. Just like the points method, results which are similar have similar scores, and will be ranked closer together. But an enjoyable hotel booking experience should present a multitude of attractive options, where results can range from inns to cabanas. Moreover, a smart ranker should be hedging its bets in case its perception of the user is incorrect.
- Does not factor in value. When we decide how to rank, we have to account for two things: how relevant the hotel is to the user, and the impact to our business. This is a salient distinction for Rocket, since we have large variance in the value of selling different properties.
To solve for these drawbacks, we made a simple modification. The relevance score for a hotel is processed² into the probability that the hotel is picked, given that the user picked one hotel from the ranking. For example, if Hyatt Regency had a prediction of 2%, that meant that if the user purchased a hotel from the listing, there would be a 2% probability the purchased hotel was Hyatt Regency.
We then multiply (“rescale”) this probability by a profitability measure so that we are now ranking the hotels by “expected profitability”. This has a nice side effect of shuffling results to make them more diverse, since similar hotels might not be giving us similar margins.
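A minimal sketch of these two steps, a softmax to get pick probabilities and a rescale by the profitability measure described below (function and variable names are my own, not Rocket’s):

```python
import math

def rank_by_expected_profitability(relevance_scores, profits, revenues):
    # Softmax turns raw relevance scores into "the user picks this hotel"
    # probabilities that sum to 1 across the listing.
    exps = [math.exp(s) for s in relevance_scores]
    total = sum(exps)
    pick_probs = [e / total for e in exps]
    # Profitability measure: profit / sqrt(revenue), i.e. the geometric
    # mean of profit and profit margin.
    profitability = [p / math.sqrt(r) for p, r in zip(profits, revenues)]
    # Expected profitability = pick probability * profitability measure;
    # sorting descending on this gives the final ranking.
    return [pp * pf for pp, pf in zip(pick_probs, profitability)]
```

Note that two hotels with near-identical relevance can land far apart if their margins differ, which is the diversity side effect mentioned above.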
What kind of profitability measure is that?
- You, right now, and also my CFO, some time ago
You might notice our profitability metric is profit divided by the square root of revenue. This is a strange choice, since when we think of maximizing profitability, we multiply by profit (duh?).
The problem with raw profit rescaling is that it tends to float the most expensive properties to the top, even if they are not particularly relevant to the user. These hotels tend to have high profitability simply on the basis of being expensive, even if on a percentage basis we might only be making pennies on the dollar. The problem here is that the raw profit is high, but the profit margin is low. What if we had a nice way to combine these two notions of profitability in our rescaling? Actually, that’s what we did!
Profit over the square root of revenue is just the geometric mean of profit and profit margin: profit / √revenue = √(profit² / revenue) = √(profit × (profit / revenue)) = √(profit × margin). Thus, we are not multiplying by profit or by profit margin, but by something in the middle.
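A quick numeric check of the identity, with made-up figures:

```python
import math

profit, revenue = 30.0, 900.0          # made-up figures
margin = profit / revenue              # 30 / 900, about a 3.3% margin
measure = profit / math.sqrt(revenue)  # profit over sqrt(revenue)
# ...equals the geometric mean of profit and profit margin:
assert math.isclose(measure, math.sqrt(profit * margin))
```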
In practice, this profitability measure strikes a balance between the high-end luxury hotels and the dirt-cheap inns and motels, since the latter tend to pass along higher margins to distributors to ensure bookings.
Using a ranking correlation metric called weighted Kendall’s Tau, we can measure how much the rankings for a search request change with respect to each ranker.
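For intuition, here is a pure-Python sketch of a weighted Kendall’s Tau with additive hyperbolic weights (in practice you would reach for a library implementation such as SciPy’s `scipy.stats.weightedtau`; this toy version just shows the core idea that disagreements near the top of the list cost more):

```python
def weighted_kendall_tau(ranks_a, ranks_b):
    # ranks_a[i], ranks_b[i]: item i's rank (0 = top) under each ranker.
    # Hyperbolic weighting: swaps near the top of the list cost more.
    w = lambda r: 1.0 / (1.0 + r)
    num = tot = 0.0
    n = len(ranks_a)
    for i in range(n):
        for j in range(i + 1, n):
            weight = w(ranks_a[i]) + w(ranks_a[j]) + w(ranks_b[i]) + w(ranks_b[j])
            concordant = (ranks_a[i] - ranks_a[j]) * (ranks_b[i] - ranks_b[j]) > 0
            num += weight if concordant else -weight
            tot += weight
    return num / tot  # 1.0 = identical order, -1.0 = fully reversed
```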
Displayed above are histograms showing the ranker’s correlations on test set search requests versus ranking by raw profitability (est_ttm), the points-based ranker (display_rank), and the raw model score (pred). We can see that even after incorporating profitability, we rank quite differently from a direct sort by profit. Comparatively, model scores alone are much closer to the profit-rescaled ranker for most search requests. Thus, in multiple ways this profitability metric balances out the rankings³:
- Balances between user relevancy and firm profitability
- Balances between profit and profit margin
- Balances between high end and low end hotels
- Breaks up long chains of similar hotels
Interestingly, we have not seen any literature discussing this method of reranking by expected profitability, which we consider quite sensible.
What business metrics can we use to evaluate the model’s performance?
Heuristic: average rank of benchmark hotels
So now we have a nice ranking framework, and after some offline testing performed while sitting in our ivory towers, we cooked up some numbers (they’re called “normalized discounted cumulative gain scores⁴”) saying our search result ranking model is great. A savvy data scientist, however, might want to have a more presentable metric for management to see before they let the model out into the wild.
It turns out coming up with good ways to communicate that one ranker is unequivocally better than another is extremely difficult⁵! After pounding my head against the wall for some time, I consulted a friend on this problem.
“Oh, that’s easy,” she said.
“Just search Singapore and see where Marina Bay Sands is ranked by both of them. If it’s not top 3, it’s garbage⁶.”
As it turns out, the model put Marina Bay Sands in the first or second spot (out of hundreds) in every single applicable test set search request save one. I won’t talk about where the points scorer was ranking Marina Bay Sands, but from that moment on my belief in the machine-learned approach was unshakable.
Still, there’s some room for less subjective metrics. When we first deployed the live model, we devised one which converged within a single day called the median positive percentile rank (MPPR)⁷. Not only is MPPR a good way to communicate the effect of a better ranker to business, it also helped us uncover a number of post-deployment bugs⁸.
Main metric: median positive percentile rank
Let’s step through the terms in median positive percentile rank.
- Rank. Not much to say here. If Marina Bay Sands is at the top, it’s rank one.
- Percentile rank. You take the rank and divide it by the total number of hotels in the ranking. So if Marina Bay Sands is first of fifty hotels, it has a percentile rank of 1 / 50, or 2%.
- Positive percentile rank. For a search request which resulted in a purchase, the positive percentile rank is the percentile rank of the purchased hotel.
- Median positive percentile rank. The median positive percentile rank is the median of all the booked search requests.
If the MPPR is 37.5%, that means that half the time, users only need to see the first 37.5% of results before finding what they’d like to purchase. If one ranker has a lower MPPR than the other, then that ranker provides a better shopping experience because the user can browse for a shorter period of time before finding the desired item.
Graphed above are the daily MPPR ratios of the model over the base ranker for a period of our A/B test. We can see that the MPPR of the model is around half that of the points-based scorer, meaning users only have to scroll half as far before finding their product. Along with MPPR, we monitored more general A/B testing metrics; in particular, we wanted to see an increase in “conversion”, the probability a customer makes a booking upon visiting the website. And we did!
How can we make sure the model aligns with common sense?
If at this point we buy the story that this machine learning model knows something about what it’s doing, then we might want to learn what it has to say about what makes a good hotel. We’re in luck, because with SHAP values (if you want to know more about SHAP, just google one of the hundreds of blog posts written on it, or read the original paper), we can interrogate our model on every single choice it makes.
Let’s start by looking at Grand Fiesta Americana Coral Beach, which is very highly ranked. The plot above is a SHAP value force plot, which decomposes the model’s score into contributions (“SHAP values”) of the respective features examined by the model. Red features increase the score proportional to the size of their corresponding bars, implying the model thinks that feature’s value made the result more relevant. Conversely, blue features decrease the score, implying the model thinks they decreased the result’s relevance.
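The key property behind a force plot is additivity: the per-feature SHAP values, plus a base value, sum exactly to the model’s score for that result. A toy illustration with made-up numbers (the feature names mirror those discussed in this section):

```python
# Made-up SHAP decomposition for a single hotel's score (illustrative only).
base_value = 0.0
shap_values = {
    "hotel_cumulative_share": 1.2,   # red bar: high regional share pushes the score up
    "review_count": 0.4,             # red bar: many reviews help
    "user_preferred_price": -0.3,    # blue bar: a budget-minded user drags it down
}
# Additivity: the bars in the force plot sum to the model's output.
score = base_value + sum(shap_values.values())
```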
We can see that the largest red bar corresponds to hotel_cumulative_share = 0.029. The model likes this, because it means that Grand Fiesta takes up about 3% of our bookings in that region, which is very high. Other reasons it likes it include the relatively high number of reviews, and the amount of rewards offered to the user.
Meanwhile, the largest blue bar pushing the score down is user_preferred_price. This feature is an estimate of the user’s preferred standardized price level; because this user is relatively budget-minded, this feature slightly reduced the hotel’s relevance score.
On the other hand, the model dislikes the above hotel, placing it near the bottom of the listings. Despite being a 5-star hotel sharing the same beach as every single other resort in Quintana Roo, this hotel’s low share, low number of reviews, and high standardized price (srq_price_zscore) make it an unappealing offer.
Interrogating the model on individual results is interesting, but an aggregate view can give us a good look at the top-level trends learned by the model. The summary plot below lists several top features in descending order of importance. Every point is a result; its position on the x-axis represents the feature’s SHAP value, while the color represents the feature’s relative magnitude, with red being high and blue being low.
We can see that the most important feature is hotel_cumulative_share, with a large red mass on the right indicating that a high share is good. Next up is previous_user_hotel_interaction, a flag for whether the user had previously browsed the hotel. Though this flag is usually 0, indicating no interaction, its impact when it is not 0 is immense. Unlike share or review count, a lower relative price is almost always regarded as better. In general, hotels with good historical performance, hotels which are closer to the user’s declared destination (if any), and hotels which are relatively cheap are all ranked higher by the model.
I want to expand on the last point, since earlier I stated that a good machine learning model should be able to find a Goldilocks zone of price for different users. Though the model says that relatively cheap hotels are better, it emphasizes that this is less true, and sometimes not true at all, for users who have revealed a preference for higher-end hotels.
Above is a partial dependence plot showing the interaction between the user’s price level (user_preferred_price, on the x-axis) and the hotel’s price (srq_price_zscore, in color). A long streak of red runs roughly along the y = x diagonal; this indicates that expensive hotels have negative SHAP values and are less relevant for budget users, but have positive SHAP values and are more relevant for higher-end users. Conversely, the perpendicular blue streak shows the model is able to adjust its expectations for cheap users. Without any human input, the model is able to match users to hotels based on their specified price levels.
So for those of you who came here looking for insights on hotel quality, take these wisdoms from the model to heart:
- Popular hotels are better.
- Hotels closer to destinations are better.
- Cheap users like cheap hotels.
- Expensive users like expensive hotels.
Wow! Very… uncontroversial.⁹
Looking ahead: model design
By now, you know the model factors in user relevance and firm profitability; you know the model is flexible to user needs; the model itself, however, remains a black box. In the next section of Rocket Travel’s hotel ranking series, we’ll explain why we preferred an information retrieval model called LambdaMART over the standard recommender systems models used for ranking.
¹ The difference between information retrieval and recommender system models is that recommender systems do not require the user to type in anything to “search” the result space. In this sense, recommender systems can be considered “zero-query search”. Under this lens, we consider recommender systems as information retrieval models where user attributes have been substituted as the “query”.
Though the philosophical differences between the two approaches are subtle, the machine learning methods each has spawned are wildly different. I’ll discuss this more in part 2.
² We will discuss how we process relevance scores into “preference probabilities” in the next section, but the processing is just the softmax or exponential operator (they are interchangeable for this use case).
³ There is actually also a decent theoretical justification for using the geometric mean of profit and profit margin. Not to beat a dead horse, but that will have to wait till the next section as well!
⁴ There are many ranking metrics that could be used to evaluate a ranker, NDCG being one of them. In particular, mean reciprocal rank is very similar to MPPR. The problem is that these metrics are hard to understand, and do not necessarily converge quickly during an A/B test.
⁵ Try coming up with a method or metric to prove one ranker is better than another yourself — there are a few other half-decent ones. It’s a nice thought exercise.
⁶ She didn’t actually say “it’s garbage”.
⁷ Median positive percentile rank is actually called the median booked percentile rank (MBPR) within the company, but on the off chance people start using my terminology, I would like MPPR to be applicable to all of e-commerce and not just hotel bookings.
⁸ I casually said “raw profit rescaling prioritizes expensive hotels” and “there’s some room for better metrics”, but most of this stuff I’m skating over was experienced the hard way. We’ll talk about a couple more egregious mishaps in the fourth section: quality assurance and monitoring.
⁹ This is rather unfair to the model, since whenever it did make controversial statements, I went and debugged it.