Auto Experiment Merge: How we automate experiment integration and de-integration

Agoda Engineering & Design
Aug 11, 2023

“Be a scientist, experiment, and measure” is one of Agoda’s core values, leading to a considerable number of A/B experiments taking place. These experiments provide insights that help us choose between different paths, A or B.

However, this scenario leaves us with a challenge — the code associated with the option not chosen becomes redundant, requiring our software engineers to remove it manually. Given the scale at which we operate, the cumulative time spent on this task can consume a substantial amount of valuable developer time.

The “Auto Experiment Merge” project aimed to solve this issue by minimizing the manual effort involved in the cleanup process, using an Abstract Syntax Tree (AST)-based source refactoring tool together with open-source and in-house tools to build an end-to-end pipeline.

In this article, we will discuss our attempt to automate experiment cleanup. We will begin by identifying the challenges that led to this solution, followed by a detailed walkthrough of the Auto Experiment Merge project, explaining how it simplifies experiment integration and de-integration and makes the best use of our engineers’ time. Finally, we will cover planned improvements to the tool.

An Introduction to A/B Testing

A/B testing is a method for evaluating the impact of changes to our product: users are randomly assigned to two distinct groups, each group is shown different behavior, and their interactions are then examined to pinpoint differences in the target metrics.

At Agoda, we use a custom-built experiment platform and client libraries to conduct A/B testing. A typical experiment implementation might resemble the example provided below. In some instances, a single experiment could extend across multiple repositories.
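
For illustration, here is a minimal Scala sketch of such an implementation. ExperimentClient, Variant, and the hash-based assignment are hypothetical stand-ins for our internal client library, not its real API:

```scala
// Hypothetical stand-ins for the internal experiment platform's client library.
object Variant extends Enumeration { val A, B = Value }

object ExperimentClient {
  // The platform assigns users to a variant at random;
  // a hash-based split stands in for that here.
  def getVariant(experiment: String, userId: String): Variant.Value =
    if (math.abs((experiment + userId).hashCode) % 2 == 0) Variant.B else Variant.A
}

object SearchPage {
  // A typical call site: branch on the assigned variant.
  def filterUi(userId: String): String =
    ExperimentClient.getVariant("EXP-1000", userId) match {
      case Variant.B => "new filter UI"     // behavior under test
      case Variant.A => "default filter UI" // existing behavior
    }
}
```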

The Leftover Work After A/B Testing

After each experiment concludes, the implementation of the losing variant becomes obsolete, requiring developers to perform a cleanup. Cleaning up the leftover code is crucial for minimizing technical debt and maintaining a healthy codebase, but it demands time and effort: the Tech team spends approximately 1,200 dev days on this task each quarter. The key focus of this project is reclaiming the time spent on these duties.

A Decision to Go with an AST-Based Source Editor

During our search for the ideal solution, one of the teams initially tackled the problem with large language models (LLMs), specifically GPT-3.5. The concept involved providing the cleanup instruction as a system prompt and the source code as user input, expecting the model to generate a modified version of the input that would replace the original code. However, this approach was inconsistent between runs, and the output sometimes contained syntactically invalid code.

Therefore, we continued our search and ultimately opted for an approach more closely aligned with the nature of source code: editing Abstract Syntax Trees. Our mission was to remove the experiment code and its tests so that zero dev effort was needed for cleanup.

Discovering Polyglot Piranha

We discovered Polyglot Piranha, an open-source tool designed specifically for refactoring source code: it executes predefined editing rules on the parsed syntax tree.

One of the challenges we encountered with Polyglot Piranha was that it lacked support for Scala, the language we use most. To address this, we investigated how to add support for a new language and found that Piranha relies on Tree-sitter, a parser generator with predefined grammars for many languages, including Scala.

Tree-sitter’s existing grammar enabled us to add Scala compatibility to Piranha quickly. The process was as simple as adding tree-sitter-scala as a dependency and modifying Piranha’s language model (language.rs) to use the package.

Deciding on the Rules

Think of Piranha as a find-and-replace tool that can edit any part of the source code given predefined instructions. To use it for experiment cleanup, we needed definitive and reusable rules. Our objective was to develop a coding pattern that would enable the tool to perform its functions efficiently.

We did this by observing existing experiment implementations in repositories with active contributors. One issue we found is that while we have a shared experiment library, each repository tends to have its own wrapper class that adds repo-specific functionality, such as logging or overriding variants. This is a challenge because we cannot run find-and-replace directly against the class and method names defined in the shared library.

Using a Wrapper Function

After several discussions, we developed the concept of requiring a “wrapper function”. This straightforward function calls the experiment manager and returns true for the B variant and false otherwise, while leaving room for extra code like logging, all of which should be removed entirely once an experiment is complete.

We also used an explicit approach to determine which wrapper function belongs to which experiment, inserting an additional comment in the format // AutoExp:EXP-1000:Wrapper, where EXP-1000 is the experiment name. We chose this comment approach for the first version due to its simplicity and language-agnostic nature. The latter is of great importance: Agoda has adopted multiple programming languages for its services, and a pattern that carries over from one language to another makes the tool easier to adopt.

Developers can call this function anywhere in the source code, such as in an if statement, to decide on the action or user experience associated with each variant.
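
Here is a minimal Scala sketch of the convention, reusing the hypothetical ExperimentClient from the earlier example. The wrapper name and surrounding code are invented for illustration; only the annotation comment follows the format described above:

```scala
object RankingExperiment {
  // The annotation ties this wrapper to its experiment for the cleanup tool.
  // AutoExp:EXP-1000:Wrapper
  def isNewRankingEnabled(userId: String): Boolean = {
    val variant = ExperimentClient.getVariant("EXP-1000", userId)
    println(s"EXP-1000 resolved to $variant for $userId") // repo-specific logging, deleted with the wrapper
    variant == Variant.B
  }

  // The wrapper can be called anywhere, e.g. in an if statement.
  def ranking(userId: String): String =
    if (isNewRankingEnabled(userId)) "new ranking" // B variant
    else "old ranking"                             // A variant
}
```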

When it is time for cleanup, Piranha identifies the wrapper function name, substitutes the associated function calls with “true” or “false” based on the experiment outcome, and applies additional cleanup rules such as boolean and if-statement simplification. Optionally, it removes the experiment constant defined inside an object or as an enum.
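
Assuming the B variant from the sketch above wins, the transformation proceeds roughly like this:

```scala
object AfterCleanup {
  // Before cleanup:
  //   if (isNewRankingEnabled(userId)) "new ranking" else "old ranking"
  //
  // Step 1: Piranha substitutes the wrapper call with its outcome:
  //   if (true) "new ranking" else "old ranking"
  //
  // Step 2: if-statement simplification collapses the dead branch,
  // and the wrapper function itself (with its logging) is deleted:
  def ranking(userId: String): String = "new ranking"
}
```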

The Need for Cleaning Test Cases

Leaving unused test cases behind after the experiment code is removed will cause the test suites to fail due to missing implementations. In other words, if variant B is chosen, all test cases associated with variant A must be removed, since the implementation they test will no longer exist.

Automating the cleanup of these tests in addition to the experiment code would eliminate manual work even further.

Handling Unit Tests and Integration Tests

The challenge in this task lies in determining which test case is for which variant.

A few solutions arose during our discussions. On the one hand, we could detect each test case’s corresponding variant through its mocks, as the test cases of a particular variant have to force the experiment manager to return that variant. On the other hand, a simple annotation using comments would suffice.

While the mock-detection approach seems better since it requires no additional comment lines, covering all the mocking possibilities from the start appeared daunting. We decided to use a comment-based approach similar to the one for the experiment code.

See the following test cases for an example. If the B variant is adopted, the first test case (covering variant A) is removed, and vice versa.
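
Below is a minimal ScalaTest sketch. The variant-suffix annotations (AutoExp:EXP-1000:A / :B) are an assumed format modeled on the wrapper comment, and the test bodies are illustrative only:

```scala
import org.scalatest.funsuite.AnyFunSuite

class RankingSpec extends AnyFunSuite {

  // A testable seam: the variant decision is passed in so each
  // test can force the behavior it covers.
  def ranking(newRankingEnabled: Boolean): String =
    if (newRankingEnabled) "new ranking" else "old ranking"

  // AutoExp:EXP-1000:A  (hypothetical format; removed if B is adopted)
  test("shows the old ranking for the A variant") {
    assert(ranking(newRankingEnabled = false) == "old ranking")
  }

  // AutoExp:EXP-1000:B  (kept if B is adopted)
  test("shows the new ranking for the B variant") {
    assert(ranking(newRankingEnabled = true) == "new ranking")
  }
}
```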

Integrating it as a Pipeline

Running the cleanup script directly proved inconvenient and time-consuming: it meant installing multiple dependencies, identifying the associated repositories, and executing language-specific tools by hand. To address these inconveniences, we decided to develop a streamlined pipeline.

We developed the “Auto Experiment Merge” pipeline as a containerized Python script. This end-to-end pipeline leverages our internal experiment detector API to locate the related Git repositories via the GitLab API, clones them, and carries out the cleanup operations. We also incorporate various language-specific tools based on the presence of their respective configuration files.

For Scala, we integrated scalafix and scalafmt to improve code quality. After completing these tasks, the script submits a merge request to our source control platform and adds relevant tags, such as “ready to review,” once the CI checks turn green. Developers can always make additional changes to the results generated by the pipeline.

The entire pipeline runs within a container, allowing for a single command execution. Users only need to provide the experiment name and chosen variant as input parameters.

Supporting Tools: Linter and Monitoring Dashboard

While we have a functional pipeline, we anticipate two challenges upon its adoption: adherence to pipeline conventions and measuring effectiveness.

To tackle the first challenge, we have developed a linter. The linter’s main purpose is to identify unintentional mistakes during experiment implementation, such as comment syntax issues or improper use of wrapper functions for determining experiment variants. Since this process is language-specific, we began with Scala, the predominant language used at Agoda.
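
As a rough sketch of what such a linter might flag (both cases are hypothetical and reuse the ExperimentClient stand-in from the first example):

```scala
object LinterExamples {
  // Flagged: malformed annotation (spaces instead of colons), so the
  // cleanup tool would never associate this wrapper with EXP-1000.
  // AutoExp EXP-1000 Wrapper
  def isFeatureOn(userId: String): Boolean = userId.nonEmpty

  // Flagged: the variant decision is made inline rather than through an
  // annotated wrapper function, leaving nothing for find-and-replace to target.
  def banner(userId: String): String =
    if (ExperimentClient.getVariant("EXP-2000", userId) == Variant.B) "new banner"
    else "old banner"
}
```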

Measuring Success

We address the other challenge, measuring effectiveness, with an internal dashboard built on Apache Superset, an open-source data visualization platform. Our success metric is the number of MRs generated by the pipeline that get merged, considered alongside change requests and the CI success rate.

Regarding overall adoption, we track the number of experiment implementations adhering to the convention, the number of stale MRs, those pending review, and those awaiting merge. Ideally, the goal is to generate as many MRs as possible that pass CI checks and are approved without change requests, saving developers time.

Future Scope

In the future, we plan to enhance this project through various improvements. First, we are focusing on expanding the pipeline’s capabilities by integrating additional tools, both open-source and internally developed by other teams.

Another aspect entails incorporating new features into the pipeline, such as automatically updating MRs on merge conflicts, notifying the team when the output code fails to compile, and triggering the pipeline automatically upon experiment closure.

Conclusion

Through a series of open-source and in-house developed tools, we have created the “Auto Experiment Merge” pipeline. We anticipate this pipeline will reduce the effort required for experiment integration and de-integration and motivate teams to clean up pending experiments.

Acknowledgments

Special thanks to Kumar Ankit and Thanayut Tiratatri for contributing to this blog.
