Enabling 10x More Experiments with Traveloka Experiment Platform

Chong Jie Lim
Traveloka Engineering Blog
5 min read · Jan 8, 2020


By: Irvi Aini (Data Engineer), Cakra Wardhana (Data Engineer), Marcellus Reinaldo Jodihardja (Data Engineer), Adisetyo Panduwirawan (Data Scientist), Novia Listiyani Wirhaspati (Data Scientist) and Chong Jie Lim (Data Scientist)

Split traffic experimentation, more commonly known as A/B Testing, is the “gold standard” for comparing the effectiveness of new features in product development. In The Surprising Power of Online Experiments, Harvard Business Review described the positive impact of experimentation in driving revenue growth at Bing.

What is A/B Testing? Let’s say you have two designs for your product landing page and you want to know which design brings in more conversions. There are a few ways to “test” this. One way is to use the first version for a week, switch to the other version in the second week, and compare the conversion rates using statistical tools. However, the two groups of audiences coming to your landing pages may not be comparable, since the landing pages were shown at different points in time. That’s where split traffic experimentation comes in. When a user arrives at the landing page, we randomly show them one of the designs, record which design was shown, and record whether the user converted. After the pre-defined duration, we compare the two versions and use the better-performing one for our landing page. This split traffic experimental setup, known as A/B Testing, is used by various companies to improve their products and services. Traveloka saw the value of split traffic experimentation and developed its first Experiments API prior to 2018.
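The comparison step above can be sketched with a standard two-proportion z-test. This is a minimal illustration, not Traveloka’s actual analysis code, and the visitor and conversion counts are made up:

```python
# Hypothetical example: comparing conversion rates of two landing-page
# variants with a two-proportion z-test (all numbers are illustrative).
from math import sqrt
from statistics import NormalDist

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for H0: the two conversion rates are equal."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)            # pooled rate under H0
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Variant A: 500 conversions out of 10,000 visitors; Variant B: 580 out of 10,000.
z, p = two_proportion_z_test(500, 10_000, 580, 10_000)
# A small p-value suggests the difference is unlikely to be due to chance alone.
```

With these illustrative numbers, the test would flag Variant B’s higher rate as statistically significant at the usual 0.05 level.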

The First Version

The first version of the Experiments API had a fixed set of predefined filters to scope the target audience for experimentation, such as client interface type, application version, and country of origin. This made setting up a new experiment quick and easy. Additionally, each deployed experiment returned a single parameter to identify the version of the feature to be exposed to the user.

How did the API return the treatment group of a user? Whenever the API was called, it used a caching mechanism to look up any existing assignment for the user, and assigned a random treatment if no existing assignment was available.
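The look-up-or-assign flow can be sketched as follows. This is an assumed simplification (the class name and in-memory dictionary stand in for the real caching layer), not the actual API:

```python
# Minimal sketch of the first API's "look up" approach: check the cache for an
# existing assignment; otherwise assign a random treatment and remember it.
import random

class AssignmentCache:
    def __init__(self, treatments):
        self._treatments = treatments
        self._store = {}  # stands in for the real caching mechanism

    def get_treatment(self, user_id):
        # Reuse any existing assignment so a user always sees the same variant.
        if user_id not in self._store:
            self._store[user_id] = random.choice(self._treatments)
        return self._store[user_id]

cache = AssignmentCache(["control", "treatment"])
first = cache.get_treatment("user-42")
# Repeated calls return the same ("sticky") assignment for the same user.
```

Note the cost this implies at scale: every assignment requires a stateful look-up, which the new platform later removes with deterministic hashing.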

First Version of the Experiments API

With increasingly wide product breadth and the demand for rapid product development, we needed to run 10x more experiments without 10x the effort. Three immediate challenges came into the picture:

  • Scalability.
  • Product context specificity.
  • Treatment parameter hygiene.


Scalability

What happens when we would like to test the impact of a new design for our Flight search results on conversion, and the Payments team also wants to try out a new design on the payments page? Since these experiments were likely to interfere with one another, we had to run them sequentially using the now-decommissioned API. This slowed down experimentation within the organization; the ability to run experiments in parallel without interference is a critical feature for scaling experimentation.

Product Context Specificity

Different products have product-specific context to scope the target audience and having to make changes in the product code to accommodate experiments for the specific audience can cause the code to be less maintainable. Therefore, allowing product teams to pass product-specific context for the API to process is an important feature to have.
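The idea is that product code only forwards context it already has, while eligibility logic lives in the experiment configuration. The sketch below is hypothetical (the function name, context keys, and eligibility rule are all invented for illustration):

```python
# Hypothetical sketch: the product team passes product-specific context, and
# the experiment platform (not product code) decides audience eligibility.
def get_treatment(user_id, context):
    # In the real platform this rule would live in the experiment config;
    # here it is hard-coded purely for illustration.
    if context.get("search_type") == "round_trip" and context.get("pax", 0) >= 2:
        return "new_design"
    return "control"

# Product code simply forwards the context it already knows about.
treatment = get_treatment("user-42", {"search_type": "round_trip", "pax": 2})
```

Because the scoping rule is expressed in configuration rather than product code, changing the target audience does not require touching or redeploying the product itself.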

Treatment Parameter Hygiene

An experiment involving a UI design change might require more than one parameter, each representing a change in a design element such as placement and size. With the now-decommissioned API, product teams had to specify both the placement and the size values in a single parameter. For example, suppose there are two possible placements (top and bottom) and three possible sizes (small, medium, and huge). Since there is only one treatment parameter, one way to implement such factorial experiments is to concatenate the values for each design element, such as top_huge, instead of returning one parameter top for placement and another parameter huge for size. This also causes the product code to be less maintainable. Therefore, allowing more than one parameter in the API output is a good feature to have.
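The contrast between the two styles can be shown in a few lines. The parameter names and values here are illustrative, not the actual API contract:

```python
# Illustrative sketch of the hygiene problem: one concatenated treatment
# parameter versus one parameter per design element.

# Old style: a single parameter encodes both factors of the factorial design...
single_param = "top_huge"
# ...forcing product code to parse it apart (and to know the encoding scheme).
placement, size = single_param.split("_")

# New style: the API returns one clearly named parameter per design element,
# so product code reads each factor directly.
params = {"placement": "top", "size": "huge"}
```

With separate parameters, adding a third design element later means adding a key, not redefining the encoding of every existing treatment value.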

Introducing the New Experimentation Platform

Our new Experimentation Platform, EXP (not to be confused with Microsoft’s ExP), was built in 2018. It uses Facebook’s PlanOut framework, and our Ops Backend references PlanOut4J, a Java-based implementation of PlanOut by Glassdoor. The key advantages of PlanOut over the first API are:

  • Ability to process product-specific context as long as the product team sends the context.
  • Ability to partition the user traffic and assign each partition to an experiment, thereby allowing us to run multiple experiments in parallel.
  • Uses hashing to provide a platform-agnostic and deterministic pseudo-random assignment, thereby removing the need for a “look up”.
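The hashing idea can be sketched as below. This is in the spirit of PlanOut’s deterministic assignment, but the salts, bucket counts, and hash choice here are illustrative assumptions, not Traveloka’s or PlanOut’s exact values:

```python
# Sketch of hash-based deterministic pseudo-random assignment.
import hashlib

def bucket(unit_id: str, salt: str, num_buckets: int) -> int:
    """Map a unit (e.g. a user id) to a stable pseudo-random bucket."""
    digest = hashlib.sha1(f"{salt}.{unit_id}".encode()).hexdigest()
    return int(digest, 16) % num_buckets

# Partition traffic into 100 slots; each experiment owns a disjoint slot range,
# which is what lets experiments run in parallel without interfering.
slot = bucket("user-42", salt="traffic-partition", num_buckets=100)

# Within an experiment, a second hash with a different salt picks the variant,
# so variant assignment is independent of the traffic partition.
variant = ["control", "treatment"][bucket("user-42", salt="exp-a", num_buckets=2)]
```

Because the same inputs always hash to the same bucket, any client can recompute a user’s assignment locally, removing the need for the stateful “look up”.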


There are three key components in the design of the EXP Platform:

  • Ops Frontend: For users to set up the Experiment Configuration through a PlanOut script.
  • Ops Backend: Compiles the PlanOut script and pushes the compiled PlanOut config to the data warehouse. Also provides an API to retrieve the latest compiled PlanOut config for specific experiments.
  • Client Library: Syncs with the data warehouse via the Ops Backend API to retrieve the latest compiled PlanOut config, calculates the treatment group on the client using that config, and sends logs asynchronously. Since every client gets the same compiled PlanOut config from the Ops Backend API, the treatment computed for a user will be consistent as long as the critical user context is the same during the calculation. This workflow avoids the previous approach of assigning the treatment on the server, which required looking up existing treatment assignments; the only communication between the client and the EXP Platform is fetching the latest compiled PlanOut config.

Design of EXP

Other Components in the Pipeline

With EXP in place, the next areas to scale Experimentation in Traveloka are the analytics and statistical components: pre-analysis, monitoring of experiment progress, and post-analysis. The Experimentation Team has built an automated analytics component on top of EXP to monitor running experiments, and a standalone pre-analysis component for our users to determine the required length of an experiment.

The Experimentation Team has also launched a Multi-Armed Bandits service using the EXP design to enable even faster experimentation in Traveloka.

Finally, in collaboration with our front-end infrastructure team, we have also released our internal EXP Client libraries for Android, iOS, and web product engineers.

We will be sharing more about the pre-analysis, automated analysis and multi-armed bandits components in future posts.

If you are interested in being involved in empowering Experimentation in Traveloka, visit Traveloka’s career page to discover your next journey!