A/B testing to improve recommender products
In this blog post, you’ll read about online experimentation and why it’s key to measuring the impact of recommender models on our marketplaces, and you’ll learn about the different types of experiments we run. We’ll also dive into our journey, which started in 2020 with the creation of an infrastructure to support online experimentation, followed by the completion of the first batch of A/B tests. Keep reading to find out more!
The Personalisation team is responsible for providing recommendations as a product to Adevinta’s marketplaces and improving user experience through data and Machine Learning. In 2020, the team created the infrastructure to support online experimentation and was able to complete the first batch of A/B tests. One of the team’s ambitions for 2021 is to streamline experimentation to converge towards a data-driven approach to product improvement and innovation.
Why experimentation in the first place?
Experimentation is essential in any product lifecycle, from improving the integration of our products with new marketplaces to testing improvements such as new recommendation algorithms.
In the personalisation context, there are two generic use cases for using experimentation:
- Testing a new integration: to assess the impact when we first integrate a recommender into a marketplace. We compare the activity and conversion metrics with and without the introduction of a recommendation carousel. We’ve had integrations with several of Adevinta’s marketplaces, like Willhaben, Segundamano.mx and Subito, where this approach was used to measure the impact of recommendations.
- Testing a new improvement: to validate new features and improvements in the product, such as new algorithms, new parameterisations of existing algorithms or optimisations in the infrastructure.
Some examples of A/B tests conducted on our marketplaces
So far the team has successfully used the experimentation setup for the following A/B tests.
Willhaben’s improvement of a new user-based recommender
Willhaben was integrated with the (offline) batch user-based recommender. As an offline algorithm, it didn’t take the user’s latest interactions into account, so it suffered from user cold-start problems caused by a lack of freshness. Last year, the team built a new online recommender based on real-time user profile generation. Before migrating to this new recommender, we conducted an A/B test to measure the improvement in reactiveness to recent user interaction events.
Segundamano.mx’s experiment comparing three different related-items recommenders
Segundamano.mx wanted to assess the performance of the different related-items recommender algorithms provided by the Personalisation team, so an A/B/C test was carried out to compare the following three variants:
- A Collaborative Filtering (CF) approach, based only on behavioural data (user interactions)
- A Content-based (CB) approach, based only on ad content features such as title, description and price
- A Hybrid approach, combining recommendations of both variants following a backfilling strategy where CF is the primary variant and CB the secondary
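The backfilling strategy of the Hybrid variant can be illustrated with a minimal Python sketch. It is not the team's implementation: the function name `backfill`, the ad identifiers and the list size are assumptions made for the example. The idea is that CF results fill the list first, and CB results only top up the slots that CF leaves empty.

```python
from typing import List


def backfill(primary: List[str], secondary: List[str], size: int) -> List[str]:
    """Take up to `size` ads from `primary`, then top up from `secondary`."""
    result: List[str] = []
    seen = set()
    for ad_id in primary + secondary:
        if len(result) >= size:
            break  # the list is full, ignore the remaining candidates
        if ad_id in seen:
            continue  # skip duplicates so an ad never appears twice
        seen.add(ad_id)
        result.append(ad_id)
    return result


# Hypothetical example: CF only finds two related ads, CB fills the rest.
cf_ads = ["ad-1", "ad-2"]
cb_ads = ["ad-2", "ad-7", "ad-9"]
hybrid = backfill(cf_ads, cb_ads, size=4)
print(hybrid)  # ['ad-1', 'ad-2', 'ad-7', 'ad-9']
```

When CF already returns enough ads, the CB results are never used, which is what makes CF the primary variant.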
Suggesting new favourite ads at Subito
On online marketplaces, ads are ephemeral: they disappear once the items are bought or removed by the seller. This is why, even after marking items as favourites, users might need to perform another search if the items in question have been removed.
Subito wanted to validate the hypothesis that their mobile users are interested in receiving ads similar to their favourite ones when those are deleted. In this context, our recommender could ease the search process, so in this experiment we expected to see an increase in Subito’s conversion metric.
The experimentation infrastructure is designed to support experiments in the following ways:
- The marketplace is agnostic of the experimentation setup: the marketplace doesn’t need to do any work to set up an experiment, because everything is done in the recommendations backend.
- Seamless roll-out of the winning configuration: once an experiment has ended, the backend can roll out the winning option without any further work on the marketplace’s side, so the marketplace benefits from the product improvement at no extra effort.
- Support for testing multiple integrations of our products within the same marketplace at the same time: one marketplace can use the same API in multiple integration points that may be treated independently.
Houston is Adevinta’s team in charge of the experimentation platform. Their mission is to provide an easy-to-use, tailored and trusted solution that enables the rest of the teams in the company to make data-driven decisions through experimentation. In particular, they provide the functionality needed for setting up experiments, including user assignment into variants, calculation of key metrics and a visual UI for managing experiments.
In order to support experimentation, two components were added to the Personalisation team’s architecture: an API gateway and an A/B microservice. Previously, requests from marketplaces were handled directly by the APIs. As seen in the image below, the API gateway is now the client-facing API, responsible for handling the marketplace requests, whereas the A/B service is responsible for handling the experiment.
AB Service is based on Spring and integrated with the Houston SDK, which allows it to synchronise both experiment configurations and user allocation with the Houston server. As shown in the flow diagram in Image 2 below, it checks whether there’s an active experiment, calculates the variant assignment and builds the final path.
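To make the first two steps of that flow concrete, here is a minimal Python sketch of checking for an active experiment and assigning a user to a variant. It does not use the Houston SDK and does not reflect how Houston allocates users: the `EXPERIMENTS` dictionary, the experiment name and the hash-based bucketing are all assumptions chosen for illustration. Hashing the experiment and user identifiers together is one common way to keep a user's assignment stable across requests.

```python
import hashlib
from typing import Dict, Optional

# Hypothetical experiment configuration; in the real setup this is
# synchronised from the Houston server rather than hard-coded.
EXPERIMENTS: Dict[str, dict] = {
    "related-items": {"active": True, "variants": ["cf", "cb", "hybrid"]},
    "old-experiment": {"active": False, "variants": ["a", "b"]},
}


def assign_variant(experiment_id: str, user_id: str) -> Optional[str]:
    """Return the user's variant, or None when no active experiment applies."""
    experiment = EXPERIMENTS.get(experiment_id)
    if experiment is None or not experiment["active"]:
        return None
    # Hash experiment and user together so the same user always lands in
    # the same bucket for a given experiment.
    key = f"{experiment_id}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    variants = experiment["variants"]
    return variants[int(digest, 16) % len(variants)]


variant = assign_variant("related-items", "user-42")
```

Calling `assign_variant` twice with the same arguments returns the same variant, while an inactive or unknown experiment yields `None`, which would mean the request is left untouched.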
API Gateway (GW) is a lightweight solution built on Node.js and Express.js that proxies requests to the APIs in one of two ways:
- Directly to the API, if there’s no active experiment that applies to the request
- After modifying the request according to the AB Service response, if there is one
A/B Tests in the Personalisation Team
A/B experiments are based on modifying incoming requests before routing them to the APIs, as seen in Image 1. This modification is effectively selecting which algorithm or recommender should produce the recommendations for the current user, according to the configuration of the experiment.
AB Service allocates the user to one of the variants of the experiment and builds the path that will be proxied by the GW.
AB Service builds the path by:
- Selecting the API and endpoint to redirect to, which allows us to compare different products
- Adding optional parameters that may modify the behaviour of the API
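The two steps above can be sketched in a few lines of Python. This is only an illustration: the `VARIANTS` table, the endpoint paths and the `max_age_hours` parameter are hypothetical and are not taken from the actual AB Service, which is built on Spring.

```python
from typing import Dict
from urllib.parse import urlencode

# Hypothetical variant configuration: each variant names the API endpoint
# to redirect to and the optional parameters to add to the request.
VARIANTS: Dict[str, dict] = {
    "control": {"endpoint": "/user-based/batch", "params": {}},
    "treatment": {"endpoint": "/user-based/online", "params": {"max_age_hours": 24}},
}


def build_path(variant: str, request_params: Dict[str, str]) -> str:
    """Build the path to proxy for the given variant."""
    config = VARIANTS[variant]
    # Parameters defined by the variant are added to those of the request.
    merged = {**request_params, **config["params"]}
    query = urlencode(merged)
    return f"{config['endpoint']}?{query}" if query else config["endpoint"]


print(build_path("treatment", {"user_id": "u42"}))
# /user-based/online?user_id=u42&max_age_hours=24
```

Choosing the endpoint is what allows two different products to be compared, while the extra parameters allow two configurations of the same product to be compared.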
When an experiment ends there’s a winning variant that should be put in production for the marketplace. Because default routing to the winning variant is set directly in the GW, the marketplace doesn’t need to change their request in any way.
Routing in the GW is flexible and accommodates multiple touchpoints or integrations within a single marketplace. In this case, the marketplace’s traffic is routed to the content-based treatment for the deleted-ads touchpoint and to the hybrid treatment for general recommendations.
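As an illustration of per-touchpoint default routing, the following Python sketch keeps one default route per marketplace touchpoint and lets an experiment override it. The real gateway is built on Node.js and Express.js, so this is not its code: the marketplace name, touchpoint names, paths and the `route` function are all assumptions.

```python
from typing import Dict, Optional, Tuple

# Hypothetical default routes, one per (marketplace, touchpoint) pair.
# After an experiment ends, the winning variant becomes the default here.
DEFAULT_ROUTES: Dict[Tuple[str, str], str] = {
    ("example-marketplace", "deleted-ads"): "/content-based/related-items",
    ("example-marketplace", "general"): "/hybrid/related-items",
}


def route(marketplace: str, touchpoint: str, ab_path: Optional[str] = None) -> str:
    """Return the path the gateway should proxy the request to.

    `ab_path` stands for the path built by the AB Service, or None when
    no active experiment applies to this touchpoint.
    """
    if ab_path is not None:
        return ab_path  # an experiment is running: use its rewritten path
    return DEFAULT_ROUTES[(marketplace, touchpoint)]


print(route("example-marketplace", "deleted-ads"))  # /content-based/related-items
print(route("example-marketplace", "general"))      # /hybrid/related-items
```

Because the defaults live in the gateway, the marketplace keeps sending the same request before, during and after an experiment.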
In 2020, we made a big investment in building the infrastructure to run online experiments at scale. It’s been a long journey, with many meetings and fruitful discussions on embracing new technologies. Today we can proudly say this is one of the most robust parts of our infrastructure, and running online experiments is becoming easier and easier.