Practical A/B Testing

Melissa Thorne
Zulily Tech Blog
Apr 6, 2018

A/B testing is essential to how we operate a data-driven business at Zulily. We use it to assess the impact of new features and programs before we roll them out. This blog post focuses on some of the more practical aspects of A/B testing, in four parts: an introduction to A/B testing and how we measure long-term impact; the A/B splitting mechanism; Decima, our in-house A/B test analysis platform; and finally a behind-the-scenes look at Decima's architecture.

A/B Testing Basics

In A/B testing, the classic example is changing the color of a button. Say a button is blue, but a PM comes along with a great idea: What would happen if we make it green instead? The blue button is version A, the current version, the control. The green button is version B, the new version, the test. We want to know: Is the green button as awesome as we think? Is it a better experience for our users? Does it lead to better outcomes for our business? To find out, we run an A/B test. We randomly assign some users to see version A and some to see version B. Then we measure a few key outcome metrics for the users in each group. Finally, we use statistical analysis to compare those metrics between the two groups and determine whether the results are significant.

Statistical significance is a formal way of measuring whether a result is interesting. We know that there is natural variability in our users. Not everyone behaves exactly the same way. So, we want to check if the difference between A and B could just be due to chance. Pretend we ran an A/A test instead. We randomly split the users into two groups, but everyone gets the blue button. There is a range of differences (close to zero) that we could reasonably expect to see. When the results of the A/B test are statistically significant, it means they would be highly unusual to see under an A/A test. In that case, we would conclude that the green button did make a difference.
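
To make this concrete, here is a minimal sketch of that kind of comparison using a Welch two-sample t-test from SciPy. The simulated spend values and group sizes are made up for illustration; this is not our production analysis.

```python
# Minimal sketch of comparing an outcome metric between groups (simulated data).
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
spend_a = rng.gamma(shape=2.0, scale=10.0, size=5000)  # control (version A) spend per user
spend_b = rng.gamma(shape=2.0, scale=10.5, size=5000)  # test (version B) spend per user

# Welch's t-test: is the observed difference in means larger than the natural
# user-to-user variability would explain by chance?
t_stat, p_value = stats.ttest_ind(spend_b, spend_a, equal_var=False)
print(f"mean A = {spend_a.mean():.2f}, mean B = {spend_b.mean():.2f}, p-value = {p_value:.3f}")
```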

Cumulative Outcome Metrics

To shop on Zulily, users have to create an account. Requiring our users to be signed in is great for A/B testing, and for analytics in general. It means we can tie together all of a user’s actions via their account id, even if they switch browsers or devices. This makes it easy to measure long-term behaviors, well beyond a single session. And, since we can measure them, we can A/B test for them.

One of the common outcomes we measure at Zulily is purchasing. A short-term outcome would be: How much did this user spend in the session when they saw the blue or green button? A long-term outcome would be: How much did the user spend during the A/B test? Whenever a user sees the control or test experience, we say they were exposed. A user can be exposed repeatedly over the course of a test. We accumulate outcome metrics from the first exposure through the end of the test. By measuring cumulative outcomes, we can better understand long-term impact and not be distracted by novelty effects.
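
As a rough illustration (not our actual pipeline), the sketch below accumulates spend per exposed user starting at each user's first exposure. The tiny example tables and their column names are hypothetical.

```python
# Sketch: cumulative spend per exposed user, counted from each user's first exposure.
import pandas as pd

exposures = pd.DataFrame({
    "user_id": [1, 1, 2, 3],
    "exposure_time": pd.to_datetime(
        ["2018-03-01", "2018-03-05", "2018-03-02", "2018-03-03"]),
})
orders = pd.DataFrame({
    "user_id": [1, 2, 2],
    "order_time": pd.to_datetime(["2018-02-20", "2018-03-04", "2018-03-10"]),
    "amount": [40.0, 25.0, 30.0],
})

# Each user's first exposure: outcomes only accumulate from this point forward.
first_exposure = (exposures.groupby("user_id")["exposure_time"].min()
                  .rename("first_exposure_time").reset_index())

# Keep only purchases made at or after the user's first exposure.
post = orders.merge(first_exposure, on="user_id")
post = post[post["order_time"] >= post["first_exposure_time"]]

# Cumulative spend per exposed user; exposed users with no qualifying purchases count as zero.
spend_per_exposed = (post.groupby("user_id")["amount"].sum()
                     .reindex(first_exposure["user_id"], fill_value=0.0))
print(spend_per_exposed)  # user 1: 0.0 (pre-exposure order excluded), user 2: 55.0, user 3: 0.0
```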

Lift

Usually, A/B test analysis measures the difference between version B and version A. For an outcome metric x, the difference between test and control is xB − xA. This difference, especially for cumulative outcomes, can increase over time. Consider the example of spend per exposed user. As the A/B test goes on, both groups keep purchasing and accumulating more spend. Version B is a success if the test group’s spend increases faster than the control’s.

Instead of difference, we measure the lift of B over A. Lift scales the difference by the baseline value. For an outcome metric x, the lift of test over control is (xB − xA) / xA * 100%. We have found that lift for cumulative metrics tends to be stable over time.
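
The calculation itself is a one-liner; the dollar amounts below are made up for illustration.

```python
def lift(x_a, x_b):
    """Percent lift of test (B) over control (A): the difference scaled by the baseline."""
    return (x_b - x_a) / x_a * 100.0

print(lift(20.00, 20.50))  # control $20.00 vs test $20.50 per exposed user -> 2.5% lift
```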

Power Analysis

Before starting an A/B test, it is good to ask two questions: What percent of users should get test versus control? and How long will the test need to run? The formal statistical way of answering these questions is a power analysis. First, we need to know the smallest difference (or lift) that would be meaningful to the business. This is called the effect size. Second, we need to know how much the outcome metric typically fluctuates. The power analysis calculates the sample size, the number of users needed to detect this size of effect with statistical significance.

The sample size feeds into two practical decisions. The split is the fraction of users assigned to test versus control, and how unbalanced it is affects the sample size needed. The test duration is however long it takes to expose that many users. Since users can come back and be exposed again, the cumulative number of exposed users grows more slowly as time goes on. Purely mathematically, the more unbalanced the split (the further from 50–50 in either direction), the longer the test. Likewise, the smaller the effect size, the longer the test.
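
Here is a sketch of such a power analysis using statsmodels; the baseline mean, standard deviation, and minimum detectable lift are illustrative numbers, not real Zulily figures.

```python
from statsmodels.stats.power import TTestIndPower

baseline_mean = 20.0   # e.g. typical spend per exposed user over the test window
baseline_sd = 60.0     # typical user-to-user fluctuation of that metric
min_lift = 0.02        # smallest lift worth detecting for the business: 2%

# Standardized effect size: the meaningful difference divided by the standard deviation.
effect_size = baseline_mean * min_lift / baseline_sd

analysis = TTestIndPower()
# ratio = n_test / n_control: a 50-50 split has ratio 1, a 90-10 split has ratio 1/9.
for ratio in (1.0, 1.0 / 9.0):
    n_control = analysis.solve_power(effect_size=effect_size, alpha=0.05,
                                     power=0.8, ratio=ratio)
    print(f"ratio={ratio:.2f}: ~{n_control:,.0f} control users, "
          f"~{n_control * ratio:,.0f} test users")
```

Note how the unbalanced split needs far more total users than the 50-50 split to detect the same effect, which is one reason unbalanced tests run longer.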

Size + Time — Practical Considerations

Often the power analysis doesn’t tell the whole story. For example, at Zulily we have a strong weekly cycle — people shop differently on weekends from weekdays. We always recommend running A/B tests for at least one week, and ideally in multiples of seven days. Of course, if the results look dramatically negative after the first day or two, it is fine to turn off the test early.

The balance of the split affects the length of the test run, but we also consider the level of risk. If we have a big program with lots of moving parts, we might start with 90% control, 10% test. On the flip side, if we want to make sure an important feature keeps providing lift, we might maintain a holdout with 5% control, 95% test. But, if we have a low risk test, such as a small UI change, a split at 50% control, 50% test will mean shorter testing time.

Goals for the A/B Split

There are three key properties that any splitting strategy should have. First, the users should be randomly assigned to treatments. That way, all other characteristics of the users will be approximately the same for each of the treatment groups. The only difference going in is the treatment, so we can conclude that any differences coming out were caused by the treatment. Second, the treatments for each A/B test should be assigned independently from all other A/B tests. That way, we can run many A/B tests simultaneously and not worry about them interfering with each other’s results. Of course, it wouldn’t make sense to apply two conflicting tests to the same feature at the same time. Third, the split should be reproducible. The same user should always be assigned to the same treatment of a test. The treatment shouldn’t vary randomly from page to page or from visit to visit.

Our Strategy

At Zulily, our splitting strategy is to combine the user id with the test name and apply a hash function. The result is a bucket number for that user in that test. We often set up more buckets than treatments. This provides the flexibility to start with a small test group and later increase it by moving some of the buckets from control to test.

Our splitting strategy has all three key properties. First, the hash produces pseudo-random bucketing. Second, by including the test name, the user will get independent buckets for different tests. Third, the bucket is reproducible because the hash function is deterministic.

The hash is very fast to compute, so developers don’t have to worry about the A/B split slowing down their code. To implement a test, at the decision point in the code the developer places a call to our standard test lookup function with the test name and user id. It returns the bucket number and treatment name, so the user can be directed to version A or version B. Behind the scenes, the test lookup function generates a clickstream log with the test name, user id, timestamp, and bucket. We on the Data Science team use the clickstream records to know exactly who was exposed to which test when and which treatment they were assigned.
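
A minimal sketch of this kind of hash-based lookup is below. The function names, bucket count, and treatment mapping are hypothetical; the real lookup function also emits the clickstream log record described above.

```python
import hashlib

N_BUCKETS = 100  # more buckets than treatments, so test traffic can be dialed up later

def assign_bucket(user_id, test_name, n_buckets=N_BUCKETS):
    """Deterministic, pseudo-random bucket for this (user, test) pair."""
    key = f"{test_name}:{user_id}".encode("utf-8")
    # md5 is used only for stable, well-mixed bits, not for security; unlike Python's
    # built-in hash(), its output does not change between processes or machines.
    return int(hashlib.md5(key).hexdigest(), 16) % n_buckets

def lookup_treatment(user_id, test_name, test_bucket_cutoff=10):
    """Buckets below the cutoff see version B (test); everyone else sees version A (control)."""
    bucket = assign_bucket(user_id, test_name)
    return bucket, ("test" if bucket < test_bucket_cutoff else "control")

bucket, treatment = lookup_treatment(user_id=123456, test_name="green_button_2018")
# The same user id and test name always hash to the same bucket (reproducible), while a
# different test name gives an effectively independent bucket for the same user.
```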

Audience v. Exposure

There are two main ways to assign users to an A/B test: audience-based and exposure-based. In an audience-based test, before the test launches we create an audience (a group of users who should be in the test) and randomly split them into control and test. Then we measure all of those users’ behavior for the entire test period. This is straightforward but imprecise. Not everyone in the audience will actually be touched by the A/B test. The results are statistically valid, but it will be more difficult to detect an effect due to the extra noise.

Instead, we prefer exposure-based testing. The user is only assigned to a treatment when they reach the feature being tested. The number of exposed users increases as the test runs. The only users in the analysis are those who could have been impacted by the A/B test, so it is easier to detect a lift. In addition, we only measure the cumulative outcomes starting from each user’s first exposure. This further refines the results by excluding anything a user might have done before they had a chance to be influenced by the test.

A Bit of Roman Mythology

The ancient Romans had a concept of the Three Fates: three women who control each mortal’s thread of life. First, Nona spins the thread, then Decima measures it, and finally Morta cuts it when the life is over. We named our A/B test analysis system Decima because it measures all of the live tests at Zulily.

Decima UI

The Decima UI is the face of the system to internal users. These include PMs, analysts, developers, or anyone interested in the results of an A/B test. It has two main sections: the navigation and information panel and the results panel. Figure XX shows a screenshot of Decima displaying a demo A/B test.

Navigation + Information

The navigation and information panel is on the left. A/B tests are organized by Namespace or area of the business. Within a namespace, the Experiment drop-down lists the names of all live tests. The Platform drills down to just exposures and outcomes that occurred on that platform or group of platforms (all-apps, mobile-web, etc). The Segmentation drills down to users in a particular segment (new vs existing, US vs international, etc).

The date information shows the analysis_start_date and analysis_end_date. The results are for exposures and outcomes that occurred in this date range, inclusive. The n_days shows the length of the date range. The analysis_run_date shows the timestamp when the results were computed. For live tests, the end date is always yesterday and the run date is always this morning.

Results

The main panel displays the results for each outcome metric. We analyze whether the lift is zero or statistically significantly different from zero. If a lift is significant and positive, it is colored green. If it is significant and negative, it is colored orange. If it is flat, it is left gray. The plot shows the estimated lift and its 95% confidence interval. It is easy to see whether or not the confidence interval contains zero.

The table shows the average (or proportion for a binary outcome), standard deviation, and sample size for each treatment group. Based on the statistical analysis, it shows the estimated lift, confidence interval bounds, and p-value for comparing each test group to the control.
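
For intuition, here is a rough sketch of how a lift estimate and 95% confidence interval can be computed, using a first-order (delta method) approximation for a ratio of means on simulated data. It is a simplified stand-in, not Decima's actual computation.

```python
import numpy as np

def lift_with_ci(control, test, z=1.96):
    m_a, m_b = control.mean(), test.mean()
    v_a, v_b = control.var(ddof=1), test.var(ddof=1)
    n_a, n_b = len(control), len(test)

    lift = (m_b - m_a) / m_a
    # Delta-method variance of the ratio m_b / m_a for independent groups.
    var_ratio = v_b / (n_b * m_a**2) + (m_b**2 * v_a) / (n_a * m_a**4)
    half = z * np.sqrt(var_ratio)
    return 100 * lift, 100 * (lift - half), 100 * (lift + half)

rng = np.random.default_rng(0)
control = rng.gamma(2.0, 10.0, size=20000)  # simulated spend per exposed, version A
test = rng.gamma(2.0, 10.3, size=20000)     # simulated spend per exposed, version B
est, lo, hi = lift_with_ci(control, test)
print(f"lift = {est:.2f}%, 95% CI = [{lo:.2f}%, {hi:.2f}%]")  # flat if the CI contains zero
```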

Common Metrics

We use a variety of outcome metrics depending on the goal of the new feature being tested. Our core metrics include purchasing and visiting behaviors. Specifically, spend per exposed approximates the impact of the test to our top-line. For each exposed user, we measure the cumulative spend (possibly zero) between their first exposure date and the analysis end date. Then we average this across all users for each treatment group. Spend per exposed can be broken down into two components: chance of purchase and spend per purchaser. Sometimes a test might cause more users to purchase but spend lower amounts, or vice versa. Spend per exposed combines the two to capture the overall impact. Revisit rate measures the impact of the test to repeat engagement. For each exposed user, we count the number of days they came back after their first exposure date. We have found that visit frequency is a strong predictor of future behaviors, months down the road.
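
A tiny numeric sketch of the decomposition, with made-up spend values:

```python
# spend per exposed = chance of purchase * spend per purchaser
import numpy as np

spend = np.array([0.0, 0.0, 30.0, 0.0, 55.0, 20.0])  # cumulative spend per exposed user

spend_per_exposed = spend.mean()               # 17.5
chance_of_purchase = (spend > 0).mean()        # 0.5
spend_per_purchaser = spend[spend > 0].mean()  # 35.0

assert np.isclose(spend_per_exposed, chance_of_purchase * spend_per_purchaser)
```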

Three Modules of Decima

Decima is composed of three main modules, each named after a famous contributor to the field that corresponds to its role. Codd invented the relational database model, so the codd module assembles the user-level dataset from our data warehouse. Gauss laid much of the foundation of statistics (the Gaussian, or Normal, distribution is named after him), so the gauss module performs the statistical analysis. Tufte is considered a pioneer in data visualization, so the tufte module displays the results in the Decima UI. Decima runs in Google Compute Engine (GCE), with a separate Docker container for each module.

Codd

The codd module is in charge of assembling the dataset. It is written in Python. It uses recursive formatting to compose the query out of parameterized query components, filling values for the dates, test name, etc. Then it submits the query to the data warehouse in Google BigQuery and exports the resulting dataset to Google Cloud Storage (GCS).
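
The sketch below shows, in rough outline, what a codd-style step might look like with the google-cloud-bigquery client: fill in a parameterized query piece, materialize the result, and export it to GCS. The template text, table names, and bucket path are all hypothetical.

```python
from google.cloud import bigquery

EXPOSURE_PIECE = """
    SELECT user_id,
           MIN(event_time)      AS first_exposure_time,
           ANY_VALUE(treatment) AS treatment
    FROM `my-project.clickstream.ab_exposures`
    WHERE test_name = '{test_name}'
      AND DATE(event_time) BETWEEN '{start_date}' AND '{end_date}'
    GROUP BY user_id
"""

params = {"test_name": "green_button_2018",
          "start_date": "2018-03-01", "end_date": "2018-03-28"}
sql = EXPOSURE_PIECE.format(**params)

client = bigquery.Client()
job_config = bigquery.QueryJobConfig(destination="my-project.decima.green_button_exposed")
client.query(sql, job_config=job_config).result()  # run the query, write to a table

# Export the materialized dataset to Cloud Storage for the gauss module to pick up.
client.extract_table("my-project.decima.green_button_exposed",
                     "gs://my-decima-bucket/green_button_2018/dataset-*.csv").result()
```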

Gauss

The gauss module takes care of the statistical analysis. It is written in R. It imports the dataset produced by codd from GCS into a data.table. It loops through the outcome metrics and performs the statistical test for lift for each one using speedglm. It also loops through platforms and segmentations to generate results for the drill downs. Finally, it gathers all the results and writes them out to a file in GCS.

Tufte

The tufte module serves the result visualizations. It is also written in R. It imports the results file produced by gauss from GCS. It creates the tables and plots for each metric in the test using ggplot2. It displays them in an interactive UI using shiny. The UI is hosted in GCE and can be accessed by anyone at Zulily.

Decima Meta

The fourth module of Decima is decima-meta. It doesn’t contain any software, just queries and configuration files. The queries are broken down into reusable pieces. For example, the exposure query and outcome metrics query can be mixed and matched. Each query piece has parameters for frequently changed values, such as dates or test ids. The configuration files are written in JSON and there is one per A/B test. They specify all the query pieces and parameters for codd, as well as the outcome metrics for gauss. The idea is: running an A/B test analysis should be as easy as adding a configuration file for it.
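
To give a feel for the idea, here is a hypothetical config in that spirit; every field name and query piece name below is invented for illustration and does not reflect Decima's actual schema.

```json
{
  "test_name": "green_button_2018",
  "namespace": "site",
  "analysis_start_date": "2018-03-01",
  "exposure_query": "exposures_by_test_name",
  "outcome_queries": ["cumulative_spend", "revisit_days"],
  "metrics": [
    {"name": "spend_per_exposed", "type": "continuous"},
    {"name": "chance_of_purchase", "type": "binary"},
    {"name": "revisit_rate", "type": "continuous"}
  ]
}
```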

Julie Michelman is a Data Scientist at zulily. She designs and analyzes A/B tests, utilizing Decima, the in-house A/B test analysis tool she helped build. She also builds machine learning models that are used across the business, including marketing, merchandising, and the recommender system. Julie holds a Master’s in Statistics from the University of Washington.


Originally published at https://zulily-tech.com on April 6, 2018.
