How We Optimized Hero Images on Hotels.com using Multi-Armed Bandit Algorithms
Introducing the multi-armed bandit (MAB) optimization used for hero images on Hotels.com
In this entry, we introduce the multi-armed bandit (MAB) optimization campaign we recently ran on Hotels.com™. The objective was to find the best possible main image (also called the ‘hero image’) for the properties on our website whilst minimizing the impact of exploring different candidates. We will go through some of the main considerations and outcomes of this project, and we plan to publish further deep dives going forward.
A Key Use Case
Images are typically one of the most important aspects to optimize for many web businesses, as their customers cannot physically interact with a product before purchasing. Often, the best impression of an item users can get is an image. This is especially true for Hotels.com customers: images are key when choosing a property to stay at, a choice driven by a balance between the aesthetically appealing and the practically relevant.
However, the large number of images available on our site makes it untenable to manually choose the best images to show to our customers. For this reason, our team has focused on developing computer vision capabilities.
Over the last three years, the team has worked on computer vision algorithms that score property images according to their aesthetic value. The training data was obtained through Amazon Mechanical Turk campaigns and previous A/B tests. This has proven to be an extremely powerful way to re-rank the images in the gallery and to select a new hero image, leading to two successful tests.
Although supervised approaches generalize well, they are shaped by how their data was generated. Mechanical Turk workers do not provide the same feedback as our customers, and historical data can be subject to sampling bias (due to the business rules that were in place when it was generated). To complement the current approach and alleviate these issues, we opted to use multi-armed bandit (MAB) algorithms to explore new images using our customers’ feedback. More specifically, we were interested in their natural ability to quickly deactivate suboptimal options while focusing traffic on more promising ones.
A Multi-Armed Bandit Approach to Optimize Images
MAB algorithms are a well-known and well-documented approach that has proven to be an excellent optimization tool. In summary, MABs explore a set of options with unknown expected outcomes. The objective is to explore each option enough to detect which one maximizes the expected reward, while limiting the regret (opportunity cost) incurred by exploring suboptimal options. This approach is often compared to A/B testing and does share some commonalities. However, it differs in its ability to change the traffic split over time in order to minimize the cost of exploring (i.e., the regret).
The image optimization use-case is a natural application for this approach. Each property is treated as an independent MAB problem in which each candidate image is an arm. The objective is to find the best image while minimizing the exposure of poor ones. We define the “best image” as the one that maximizes the click-through rate (CTR) of the property on the search page.
The approach can be summarized in 4 key points:
- For each property, we select up to the 10 highest-ranking images from the existing image ranking algorithm, enforcing some diversity in image categories (e.g., room, lobby, exterior). We limit the number of candidates because some properties have a very large number of images, which would require significantly more traffic to converge.
- To balance exploration and exploitation, we use the Thompson Sampling algorithm on the property’s CTR, defined as the click-to-impression ratio. Each time a property image is requested, we use each candidate image’s current click and impression counts to sample a score from a Beta distribution, and we select the image with the highest score.
- We run an independent bandit for each property — resulting in thousands of simultaneous optimizations.
- Once an image has been selected for a user, it remains the same for that property for the rest of the session and across the website. We believe it is important to maintain this consistency to avoid confusing users.
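The Thompson Sampling step described above can be sketched in a few lines of Python. This is a minimal, in-memory illustration; the function names and the `(clicks, impressions)` bookkeeping are assumptions for exposition, not our production code:

```python
import random

def select_image(stats):
    """Pick one image for a property via Thompson Sampling.

    `stats` maps image_id -> (clicks, impressions). For each candidate we
    sample a score from Beta(clicks + 1, misses + 1) — a uniform Beta(1, 1)
    prior on the CTR — and return the image with the highest sampled score.
    """
    best_image, best_score = None, -1.0
    for image_id, (clicks, impressions) in stats.items():
        misses = impressions - clicks
        score = random.betavariate(clicks + 1, misses + 1)
        if score > best_score:
            best_image, best_score = image_id, score
    return best_image

def record_feedback(stats, image_id, clicked):
    """Update an image's counts once user feedback arrives."""
    clicks, impressions = stats[image_id]
    stats[image_id] = (clicks + (1 if clicked else 0), impressions + 1)
```

Because the sampled score is random, an image with few impressions still gets a chance to win, while images with consistently poor CTRs are sampled less and less often.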
An Infrastructure using Streaming Technologies
Implementing a MAB algorithm presents an interesting infrastructure challenge: we needed a fast and resilient system that could ingest millions of user feedback messages per hour. We worked closely with our Machine Learning (ML) engineering team to put together the three main components of a MAB infrastructure. Namely:
- Parameter Store: The central piece of the infrastructure is the database that holds the current hits and misses for each image. This parameter store needs to be queried in real time for each page request so it requires low latency.
- Sampler: This is where the Thompson Sampling happens. The sampler receives a request from the web application with a list of properties, fetches the images’ clicks and impressions, and samples one image per property.
- Streaming Engine: To ingest user feedback from a common tracking queue, we use a streaming engine that filters and de-duplicates events, then sends increment messages for clicks and impressions to the parameter store.
The main advantage of this design is the decoupling between updating the algorithm’s parameters and sampling images, which makes the whole system more resilient.
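As a rough sketch of the streaming engine’s role, the logic per event boils down to filter, de-duplicate, and increment. This is a simplified in-memory stand-in; the event field names and the dict-based store are assumptions, not our actual schema:

```python
def process_event(event, seen_ids, store):
    """Turn one raw tracking event into a parameter-store increment.

    `event` is a dict with (assumed) fields: type, event_id, property_id,
    image_id. `seen_ids` is a set used for de-duplication; `store` maps
    (property_id, image_id) -> (clicks, impressions).
    """
    if event.get("type") not in ("impression", "click"):
        return False  # filter: drop events the bandit doesn't care about
    if event["event_id"] in seen_ids:
        return False  # de-duplicate replayed or double-fired messages
    seen_ids.add(event["event_id"])
    key = (event["property_id"], event["image_id"])
    clicks, impressions = store.get(key, (0, 0))
    if event["type"] == "click":
        clicks += 1
    impressions += 1 if event["type"] == "impression" else 0
    store[key] = (clicks, impressions)  # an atomic increment in a real store
    return True
```

In production the `store` would be the low-latency parameter store and the increments would be atomic operations, but the control flow is the same.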
Training and Testing Multi-Armed Bandits as a Campaign
We decided to run the Image Bandits as an optimization campaign: a time-limited optimization paired with an A/B test. We limited the bandits in time because adding new images to an ongoing bandit is not a trivial task. Running periodic campaigns also enables us to refresh, restart, and improve the approach between iterations.
Another reason to run the bandit algorithm behind an A/B test is that the Key Performance Indicator (KPI) we use to optimize the images (CTR) differs from the KPIs on which we evaluate overall business performance. To monitor the latter, we run a test that holds out a portion of the traffic.
The campaign contains two phases:
- Phase I is the exploration phase. Most of the traffic is exposed to the bandit, while a small holdout is exposed to a static control.
- Phase II is the exploitation and testing phase. We run a classic A/B test to validate the success of the campaign.
Exploring New Images
In Phase I, we use the bandit to explore the pre-selected candidates. As the algorithm has no information in the first week, it samples uniformly at random from the list of candidates. As expected, some suboptimal images are shown to users, leading to weaker overall performance at first. The advantage of the bandit algorithm lies in its ability to quickly discard poor options and redirect traffic to more promising images, organically converging to better options. To validate this hypothesis, we hold out a small percentage of traffic to monitor the impact on the metrics and page KPIs we are optimizing.
The figure below shows the evolution of the metric of interest over time, comparing the bandit against the control. It highlights the bandit converging: it initially performs worse than the control, then improves as time progresses, eventually reaching an optimal solution.
After running the exploration phase for a month, we follow a conservative approach to selecting new images. We only choose a new image if: 1) it has the highest CTR for the property, and 2) it shows a statistically significant improvement over the current featured image. Otherwise, we fall back on the featured image.
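The significance check in step 2 could, for instance, take the form of a two-proportion z-test on the two images’ click and impression counts. This is an illustrative sketch only; the exact test and threshold used in the campaign are not specified here, so the function name and `z_crit` value are assumptions:

```python
import math

def ctr_improvement_significant(clicks_new, imps_new, clicks_old, imps_old,
                                z_crit=1.96):
    """One-sided two-proportion z-test: is the new image's CTR
    significantly higher than the featured image's CTR?"""
    p_new = clicks_new / imps_new
    p_old = clicks_old / imps_old
    # Pooled proportion under the null hypothesis of equal CTRs
    p_pool = (clicks_new + clicks_old) / (imps_new + imps_old)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / imps_new + 1 / imps_old))
    if se == 0:
        return False  # no variability observed; stay conservative
    z = (p_new - p_old) / se
    return z > z_crit
```

With this rule, a challenger with a nominally higher CTR but too few impressions fails the test, and the campaign falls back on the current featured image.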
Impact and Next Steps
The objective of Phase II is to validate that the set of new images selected during the exploration phase improves our business KPIs. It consists of a classic A/B test with an equal split (variant ‘A’ being users exposed to the current control images; ‘B’ being users exposed to the new images). This phase ran for a couple of weeks and the results exceeded our expectations: the bandit led not only to an increase in progression, but also to improvements in our main business KPIs. Additionally, users’ bounce rate dropped and general user engagement increased.
While this first iteration can be considered a success, we highlight some limitations:
- The click-through rate has proven to be a good proxy for overall customer satisfaction. However, it can be biased towards images that are seen as funny or controversial, which attract clicks but often don’t lead to purchases. In other words, this ‘clickbait effect’ leads to some images being chosen without improving our customers’ experience. We plan to tackle this problem in two ways: firstly, by incorporating direct feedback from our customers (both travelers and property owners), so they can inform us about or directly blacklist such images; secondly, by using multi-objective metrics that balance more aspects than click-through rate alone. One option can be found in Drugan et al.
- There is a lack of contextualization, as the optimization happens at a general level. One might want to show different images to different types of customers (e.g., different points of sale and traveling parties). This requires more complex modelling, similar to the seminal approach from Li et al.
- Finally, running the bandit algorithm on properties with low traffic (the long tail) hasn’t shown conclusive results. There is a need for a more general model that learns general preferences about the content and quality of images. This could be explored using deep learning in combination with bandits; Riquelme et al. present some promising approaches.
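As a toy illustration of the multi-objective direction mentioned above, one simple option (in the spirit of the scalarized approaches studied by Drugan et al.) is to combine clicks and a downstream signal such as conversions into a single bandit reward. The weights below are purely hypothetical:

```python
def scalarized_reward(clicked, converted, w_click=0.3, w_convert=0.7):
    """Linearly scalarize two binary outcomes into one reward in [0, 1].

    Weighting conversions above clicks damps the 'clickbait effect':
    an image that earns clicks but no bookings collects only the small
    click component of the reward.
    """
    return (w_click * (1.0 if clicked else 0.0)
            + w_convert * (1.0 if converted else 0.0))
```

The scalarized value can then replace the raw click signal in the bandit’s update, at the cost of having to choose (or learn) the weights.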
This project involved numerous teams across Expedia Group™, and we would like to thank the many people who contributed. We especially want to mention:
Computer Vision/Machine Learning: Vasilis Vryniotis
ML Engineering: Carlos Peña, Gyula Magyar, Bela Tanczos, David Mogyorosi
Hotels.com Product: Humiun Miah, Hannah Wylie, Robert Griggs
Madalina M. Drugan and Ann Nowé, “Designing multi-objective multi-armed bandits algorithms: A study”, The 2013 International Joint Conference on Neural Networks (IJCNN), 2013.
Lihong Li, Wei Chu, John Langford and Robert E. Schapire, “A Contextual-Bandit Approach to Personalized News Article Recommendation”, Proceedings of the 19th International Conference on World Wide Web, pp. 661–670, 2010.
Carlos Riquelme, George Tucker and Jasper Snoek, “Deep Bayesian Bandits Showdown: An Empirical Comparison of Bayesian Deep Networks for Thompson Sampling”, International Conference on Learning Representations, 2018.