Minimizing technical incidents during A/B testing at scale

How Coupang automated code review to detect changes in A logic

Coupang Engineering

Published in Coupang Engineering Blog · 8 min read · Apr 7, 2022

By Ohmer (Liangwei) He

This post is also available in Korean.

If you’ve ever wondered why the Coupang app looks the way it does, there’s a surprisingly simple answer: it’s all about A/B testing. At Coupang, hundreds of A/B tests are run every day, rigorously testing every feature in our application to enhance the customer experience. As an integral part of our decision-making process, A/B tests allow us to approach business issues empirically without having to rely on individual voices and gut feelings.

What exactly is A/B testing? It is an experiment with one control group and one or more experiment groups. The users in the control group have the same default features, or A logic, as all other Coupang users. On the other hand, the users in the experiment groups are given alternative treatments where a new feature, or B logic, is being tested and measured using feature-specific metrics. A/B tests allow us to measure user acceptance of features to optimize the Coupang experience. We use our own in-house experiment platform to customize metrics and conduct A/B tests. To find out more about our experiment platform, check out our previous post.

In this post, we will discuss the issues we faced running A/B testing at scale and how we automated a portion of the code review process before running A/B tests to minimize errors and maximize engineering efficiency.

· Issues
· Technical implementation
–Overview
–AST comparisons
–Operational analysis
· Conclusion

Issues

One of the most important roles we have as engineers at Coupang is minimizing technical interference on app performance and ensuring app stability while running hundreds of simultaneous A/B tests. However, no matter how careful we are, incidents related to A/B tests inevitably occur.

When an incident arises due to a certain A/B test, we immediately roll back the B logic and revert all users to A logic. However, this is easier said than done. We found that 80% of our technical incidents related to mobile applications were due to A/B testing, yet we could not resolve them simply by reverting to A logic. The issue was that A logic and B logic were not properly separated through A/B conditional judgments, which meant that the A logic itself had been changed and turning off B logic did not restore the original behavior.
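To make the failure mode concrete, here is a minimal Python sketch of poorly separated A/B logic. The function names, the `in_b_group` flag, and the discount rule are all hypothetical and chosen only for illustration; they are not taken from the Coupang codebase.

```python
def render_price_leaky(price: int, in_b_group: bool) -> str:
    """B-logic change that leaks into A logic (hypothetical example)."""
    # The discount was written for the experiment, but it sits OUTSIDE the
    # A/B condition, so control-group users are affected as well.
    price = price * 9 // 10
    if in_b_group:
        return f"{price} KRW (sale)"
    return f"{price} KRW"


def render_price_separated(price: int, in_b_group: bool) -> str:
    """Same experiment with every B-logic statement behind the condition."""
    if in_b_group:
        return f"{price * 9 // 10} KRW (sale)"
    # A logic is untouched, so turning the experiment off restores it exactly.
    return f"{price} KRW"
```

In the leaky version, switching every user back to the control group still shows the discounted price, which is the kind of incident that a rollback alone cannot fix.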

Most of these errors were completely avoidable and caused by unintentional mistakes. It could be argued that such simple errors should not have passed code review, but detecting whether each change in the code affects A logic is a time-consuming and difficult task to perform manually. In addition, our engineers are constantly developing and deploying dozens of different versions of features each week for A/B testing, and it simply wasn’t possible to catch every error.

To provide an efficient prevention measure, we designed and implemented an automated system to detect changes in A logic during the code review process before running the A/B test to increase stability and robustness.

Technical implementation

**Figure 1.** A PR triggers the detection solution CI job, which comments on code that may change A logic.

We integrated the A logic change detection system to be triggered with pull requests (PR). When mobile engineers submit a PR, the detection system analyzes the code and dispatches warnings on code lines that can potentially change A logic. Both the engineer and reviewer can view the warning during code review and revise accordingly. By automating the detection process, our integrated detection system saves our engineers time, increases efficiency, and minimizes the risk of human errors.

What’s happening behind our simple detection system? In this section, we will discuss how we built the detection process and what goes on behind the scenes.

Overview

Overall, the detection process can be divided into four steps:

1. Detect code changes at the Git level. The system first compares the original and changed code at the Git level to identify the lines that have changed. Each changed line is classified as an addition, modification, or deletion.
2. Compare code changes through the AST. The changed lines are then compared using abstract syntax trees (ASTs). Changes to AST nodes fall into four operation types: addition, modification, movement, and deletion. We use the AST parser to assign each node one operation type and to map each node in the original tree to its corresponding node in the changed tree.
3. Analyze operational changes. The system then applies predefined rules for each operation type to the ASTs to determine whether a change in a node may trigger a change in the current A logic.
4. Filter the tagged nodes and select which ones to alert users about. Nodes with changes that may affect A logic are tagged. Because operations have complex parent-child relationships, even a node that does not directly change A logic may be tagged if it has a child that does. The system automatically filters such nodes out at a reasonable granularity so that users are alerted only to the nodes that actually change A logic.
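The first step, line-level classification, can be sketched with Python's standard `difflib` module. This is only an illustration under our own assumptions: the real system works on Git diffs, and the helper name `classify_line_changes` and the sample source lines are hypothetical.

```python
import difflib
from typing import List, Tuple

# Map difflib opcode tags onto the three line-level change types.
LABELS = {"insert": "addition", "replace": "modification", "delete": "deletion"}


def classify_line_changes(original: str, changed: str) -> List[Tuple[str, List[str], List[str]]]:
    """Return (change type, old lines, new lines) for every changed region."""
    old_lines = original.splitlines()
    new_lines = changed.splitlines()
    matcher = difflib.SequenceMatcher(a=old_lines, b=new_lines, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue  # unchanged lines need no further analysis
        changes.append((LABELS[tag], old_lines[i1:i2], new_lines[j1:j2]))
    return changes


# Hypothetical before/after versions of a small function body.
ORIGINAL = "init()\nshowBanner()\nlogImpression()\nlegacyTrack()\nshowFooter()"
CHANGED = "init()\nshowNewBanner()\nlogImpression()\nshowFooter()\ntrackClick()"

for change in classify_line_changes(ORIGINAL, CHANGED):
    print(change)
```

Only the regions returned here would need to be handed to the more expensive AST comparison in the second step.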

**Figure 2.** A diagram of the overall change detection process.

AST comparisons

Let’s look at the detection process in more detail by going through an example. Imagine the code under review is represented by the ASTs below. The AST on the left represents the original code, and the AST on the right represents the changed code that triggered the PR. The changed branch file has undergone the following changes:

Movement: Node I was moved. For example, an expression inside a code block was moved outside it.
Modification: Node M was modified. For example, a different method is now called.
Deletion: Node Z was deleted. For example, a line of code was removed.
Addition: Node P was added. For example, a line of code was added.

**Figure 3.** The ASTs that represent the operations of the branch of the A logic (left) and the changed code with B logic (right).

How does our detection system categorize such changes into the four operation types? First, the system maps the relations between the nodes of the original code and that of the changed code using comparison algorithms that check for equality and similarity.

If the original and changed nodes are the same, the node is regarded as unchanged and is not analyzed any further. If the original and changed nodes are equal but the parent node has changed, the node’s operation type is categorized as movement. If they are not equal but are similar, the type is defined as modification. If no mapping exists between an original node and a changed node, the node is classified as a deletion (it appears only in the original AST) or an addition (it appears only in the changed AST).
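The classification rules above can be sketched in a few lines of Python. This is a deliberately simplified illustration, not the comparison algorithm used in production: nodes are identified by their source text, a parent is identified by its text, similarity is measured with `difflib.SequenceMatcher`, and the `0.6` threshold and the sample nodes are assumptions made for the example.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Node:
    text: str               # source text of the node (assumed unique per tree)
    parent: Optional[str]   # source text of the parent node, None for the root


def similarity(a: Node, b: Node) -> float:
    return SequenceMatcher(None, a.text, b.text).ratio()


def classify(original: List[Node], changed: List[Node], threshold: float = 0.6) -> Dict[str, str]:
    """Map each node's text to unchanged/movement/modification/addition/deletion."""
    result: Dict[str, str] = {}
    unmatched_old = list(original)
    unmatched_new: List[Node] = []

    # Pass 1: equality. Equal nodes are unchanged, or moved if the parent differs.
    for new in changed:
        match = next((old for old in unmatched_old if old.text == new.text), None)
        if match is None:
            unmatched_new.append(new)
            continue
        unmatched_old.remove(match)
        result[new.text] = "unchanged" if match.parent == new.parent else "movement"

    # Pass 2: similarity. A sufficiently similar leftover pair is a modification;
    # a new node with no similar counterpart is an addition.
    for new in unmatched_new:
        best = max(unmatched_old, key=lambda old: similarity(old, new), default=None)
        if best is not None and similarity(best, new) >= threshold:
            unmatched_old.remove(best)
            result[new.text] = "modification"
        else:
            result[new.text] = "addition"

    # Whatever remains in the original tree has no mapping: it was deleted.
    for old in unmatched_old:
        result[old.text] = "deletion"
    return result


ORIGINAL_AST = [
    Node("render()", None),
    Node("if (enabled)", "render()"),
    Node("logImpression()", "if (enabled)"),
    Node("showBanner(item)", "render()"),
    Node("legacyTrack()", "render()"),
]
CHANGED_AST = [
    Node("render()", None),
    Node("if (enabled)", "render()"),
    Node("logImpression()", "render()"),          # moved out of the if block
    Node("showBanner(item, theme)", "render()"),  # call signature modified
    Node("applyDiscount(price)", "render()"),     # newly added
]

print(classify(ORIGINAL_AST, CHANGED_AST))
```

A real implementation would compare typed AST nodes rather than strings and would need a more careful tie-breaking strategy when several nodes are equally similar.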

**Figure 4.** The nodes of the original and modified ASTs are first aligned (top), and then their relations mapped using equality and similarity comparison algorithms (bottom).

The ASTs after the algorithmic comparison are shown in Figure 5. The nodes highlighted in yellow represent modification operations, while the nodes in green represent movement operations. Note that even nodes that are not directly changed, such as A and B, are highlighted in yellow because some of their children have undergone modification. Red represents deletion, cyan represents addition, and blue represents no change.

**Figure 5.** The AST of the original node (left) and the modified node (right) marked by operational type.

Once the system has determined which nodes have operational changes, it analyzes those changes to tag the nodes that might change the A logic. This analysis process is detailed in the next section. Below, the nodes with possible changes to A logic have been circled with a dotted line.

**Figure 6.** The circled nodes are those that have undergone changes that may change A logic. However, nodes without any direct changes are also circled due to complex parent-child relationships.

As seen in Figure 6, even nodes that have not been directly modified, such as Node A, are circled because these nodes have children that may change A logic. If we informed users of all the circled nodes from both ASTs, they would be inundated with unimportant and confusing messages. Therefore, we filter the nodes at a reasonable granularity to inform users of only the detected operations that may actually cause a change to A logic, as shown in Figure 7.

**Figure 7.** The final AST where the nodes to inform the users of are filtered out.
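One simple way to express this kind of filtering is to report only the tagged nodes that have no tagged child, since a parent tagged purely because of its children adds no new information. The sketch below assumes exactly that rule; the actual granularity rule used by the detection system is not described here, and the tree shape is hypothetical rather than a reproduction of Figure 6.

```python
from typing import Dict, List, Set


def nodes_to_report(children: Dict[str, List[str]], tagged: Set[str]) -> Set[str]:
    """Keep tagged nodes whose children are all untagged.

    Because tags propagate upward from child to parent, checking direct
    children is enough to find the deepest tagged node on each path.
    """
    return {
        node
        for node in tagged
        if not any(child in tagged for child in children.get(node, []))
    }


# Hypothetical tree: A is the root, M and Z are the nodes that really changed.
CHILDREN = {"A": ["B", "C"], "B": ["M", "Z"], "C": ["I"]}
TAGGED = {"A", "B", "M", "Z"}  # A and B are tagged only because of M and Z

print(sorted(nodes_to_report(CHILDREN, TAGGED)))
```

With this rule, the reviewer sees one comment per changed statement instead of one comment for every enclosing block.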

Operational analysis

In this section, we will detail how the detection system analyzes each node that has been tagged with an operation type other than unchanged to determine whether it changes A logic or not. Let’s look at an example of a node that has the addition operation type.

When parsing an addition node, it is first categorized as including declaration syntax or expression syntax. Declaration syntax includes class declarations, method declarations, attribute declarations, and more. If the new declaration is not called anywhere in the code, it cannot change A logic and is classified as not changing A logic.

In cases where the node has expression syntax, the node is categorized as either a conditional expression or not a conditional expression. If it is a conditional expression that affects the judgement of the new node, it is assumed to relate to B logic, not A logic, and thus safe. Note that this is a base assumption that can be defined differently depending on actual deployment environments. However, if the conditional expression affects judgements of other nodes, the new node is tagged as changing A logic.

If the node has expression syntax but is not a conditional expression, it must be protected by B logic conditional statements. For example, the expression statement must only be executed under “if” or under other conditional statements that are called only during B logic. If the expression is not protected by such a conditional statement, it is tagged as changing A logic.
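The rules for an addition node can be summarized as a small decision function. The following Python sketch encodes the three cases described above; the `AddedNode` fields are hypothetical flags that a real analyzer would have to compute from the AST, and treating a referenced declaration as a possible A logic change is a conservative assumption of this example.

```python
from dataclasses import dataclass
from enum import Enum


class Syntax(Enum):
    DECLARATION = "declaration"
    CONDITIONAL = "conditional expression"
    EXPRESSION = "other expression"


@dataclass
class AddedNode:
    syntax: Syntax
    is_referenced: bool = False           # declaration: is it called anywhere?
    affects_existing_nodes: bool = False  # conditional: does it gate existing code?
    under_b_condition: bool = False       # expression: guarded by a B-logic condition?


def may_change_a_logic(node: AddedNode) -> bool:
    """Return True if the added node should be tagged as possibly changing A logic."""
    if node.syntax is Syntax.DECLARATION:
        # An uncalled declaration cannot change A logic. A called one is flagged
        # here as a conservative assumption.
        return node.is_referenced
    if node.syntax is Syntax.CONDITIONAL:
        # A condition that only gates the new code is assumed to belong to B logic.
        return node.affects_existing_nodes
    # Any other expression must run only under a B-logic conditional.
    return not node.under_b_condition


print(may_change_a_logic(AddedNode(Syntax.EXPRESSION)))                          # unguarded statement
print(may_change_a_logic(AddedNode(Syntax.EXPRESSION, under_b_condition=True)))  # guarded statement
```

Keeping the rules in one function like this makes the base assumption about conditional expressions easy to adjust for a different deployment environment.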

This basic logic is used to judge not only additions but also all other operation types, apart from movement. For a movement operation, we need to determine whether the execution condition and execution path are consistent before and after the movement. Our detection system can even compare the complex logic combinations that occur when multiple execution conditions are combined. For instance, the system can recognize that (a || b) && c and (a && c) || (b && c) are equivalent conditional judgements.
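For a small number of boolean conditions, equivalence of two execution conditions can be checked by brute force over the truth table. The Python sketch below shows this for the example above; it illustrates the idea only, and the detection system may well use a different technique such as symbolic simplification.

```python
from itertools import product
from typing import Callable


def equivalent(f: Callable[..., bool], g: Callable[..., bool], arity: int) -> bool:
    """Compare two boolean conditions on every combination of inputs."""
    return all(
        f(*values) == g(*values)
        for values in product([False, True], repeat=arity)
    )


# Execution condition before and after a hypothetical movement.
before = lambda a, b, c: (a or b) and c
after = lambda a, b, c: (a and c) or (b and c)

# A condition that looks similar but is NOT equivalent (differs when a is
# true and c is false).
different = lambda a, b, c: a or (b and c)

print(equivalent(before, after, 3))      # True
print(equivalent(before, different, 3))  # False
```

The truth table has 2^n rows for n conditions, so this approach is only practical when the number of distinct conditions is small.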

Conclusion

After applying this automated code review tool, we saw a visible increase in condition checks in A/B test code, which reduced code violations per thousand lines from 22.6 to 16.5, a decrease of around 27%. Overall, we also saw a 30% decrease in changed A logic. With the automated code review, our engineers spend less time and effort restoring unstable customer experiences to enjoyable ones. We hope to test this simple yet effective solution on development outside of the Android environment, such as backend and iOS development.

If you’re interested in contributing to mobile application stability at a company that has a fast growing customer base such as Coupang, check out our current job openings.