Integrating Power Analysis into Experimentation Platforms for Reliable A/B Testing

Michael Ramm
Published in disney-streaming · Jan 7, 2022

To measure the impact of new product changes, teams typically conduct A/B experiments comparing the new features against a control. Power analysis is an important part of test design, used to determine a sufficient sample size for the experiment. In this post, we present why it’s valuable to build a power analysis tool directly into the experimentation platform and explain how to do it. This work aligns with one of the tenets presented in our earlier post: to let teams deploy and understand results more quickly, develop full-service capabilities that limit the reliance on analytical and engineering resources.

Intro to power analysis

In test design, power analysis determines the sample size needed to reliably detect differences of anticipated effect sizes. If we measure a positive outcome in an experiment, we would like to be confident that the positive result is a true positive (the effect is truly there) and not a false positive (no underlying effect is present). In experimentation, the cost of a false discovery can be high if, for example, the discovery affects the direction of product innovation leading the team astray. We also want to avoid false negatives — cases where the underlying effects are there and we fail to detect them, leading to missed opportunities.

Power analysis is performed prior to test launch. Typically the inputs are the metric summary statistics, the anticipated effect size, the significance level and the power. The output is the sample size needed to reliably detect the effect.
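As a concrete illustration, here is a minimal power analysis sketch in Python using the statsmodels library. The metric statistics and effect size below are made-up numbers, not values from our platform.

from statsmodels.stats.power import TTestIndPower

# Illustrative inputs (hypothetical, not real platform values).
baseline_mean = 2.0    # metric mean, e.g. hours streamed per week
baseline_std = 1.5     # metric standard deviation
mde_relative = 0.02    # anticipated effect size: a 2% relative lift
alpha = 0.05           # significance level
power = 0.8            # desired power (1 - false negative rate)

# Convert the relative effect into a standardized effect size (Cohen's d).
effect_size = baseline_mean * mde_relative / baseline_std

# Output: the sample size needed per group to reliably detect the effect.
n_per_group = TTestIndPower().solve_power(
    effect_size=effect_size, alpha=alpha, power=power, alternative="two-sided"
)
print(f"Required sample size per group: {n_per_group:,.0f}")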

Why integrate power analysis into an experimentation platform?

Analysts can perform power analysis with ad hoc resources like online calculators and one-line functions in R and Python scientific libraries. What, then, is the advantage of building the calculator directly into the experimentation platform? There are two related motivations, one for product managers and the other for data analysts.

For product managers, having power analysis within the tool simplifies the experimentation workflow. It reminds product managers that they should consistently run power analyses prior to launch. It eliminates the need for a data analyst to perform the power analysis. Also, it raises the important question of anticipated effect sizes, focusing the attention on evaluating the opportunity size of the proposed treatment and prioritizing ideas with high impact.

For data analysts, the built-in power analysis tool automates a manual task. While the power calculation itself is straightforward, the analyst still has to estimate the number of users who will be assigned to the experiment in a particular test window, as well as the metric statistics for the expected audience. This can be challenging: the analyst may need to query the internal experimentation platform logs to estimate the amount of traffic for a particular set of assignment criteria (the rules that determine which sessions are eligible for assignment, such as device type or location), and then manually join to metric tables to calculate the statistics. This “straightforward” task may take hours for each test! With built-in power analysis, the platform provides the results directly, eliminating the manual steps.

How did we implement it at Disney Streaming?

Our implementation of in-tool power analysis has two parts:

  1. Predict the number of assignments during the course of the test.
  2. Calculate the sample size needed to achieve the desired power for a selected metric and effect size.

Comparing these outputs will tell us if the test is sufficiently powered.

Traffic Prediction

Traffic prediction estimates how many users will enroll in the A/B test during the upcoming period. The rate of assignment is not constant from day to day: our platform assigns a user at the beginning of their session, and the user remains enrolled for the duration of the experiment. Highly engaged users, who are more likely to have a session, therefore cause the assignment rate to be higher early in the assignment period. One possible implementation of traffic prediction is to model the distribution of user behavior and the time-dependent assignment rates. For our first implementation we chose a simpler but powerful approach: assume that future tests will behave similarly to past ones. We make the prediction using historical information from prior user sessions.

To accomplish this, we generate a sampled dataset of all user sessions on an ongoing basis, keeping track of session start times, user identifiers (e.g. user_id), and all of the other available user context (device, geo, etc.). Then to make a prediction, we perform a dynamic query of the sampled dataset counting the unique number of users assigned in the previous period. Counting uniques is important here as the same user may have multiple sessions in the sampled period. Equally important for accurate counting is generating the sample using a consistent hash of the user id. The figure below illustrates this process.
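As an illustration of the idea (not our production implementation), the consistent-hash sampling and unique-user count could be sketched in Python as follows; the column names, sampling rate, and example criteria are hypothetical.

import hashlib
import pandas as pd

SAMPLE_RATE = 0.01  # keep roughly 1% of users in the sampled dataset

def in_sample(user_id: str, rate: float = SAMPLE_RATE) -> bool:
    # A consistent hash of user_id keeps either all or none of a user's sessions.
    bucket = int(hashlib.md5(user_id.encode()).hexdigest(), 16) % 10_000
    return bucket < rate * 10_000

def predict_assignments(sampled_sessions: pd.DataFrame, criteria: dict) -> int:
    # Count unique sampled users whose sessions satisfy the assignment criteria,
    # then scale back up by the sampling rate.
    mask = pd.Series(True, index=sampled_sessions.index)
    for column, value in criteria.items():          # e.g. {"country": "US"}
        mask &= sampled_sessions[column] == value
    unique_users = sampled_sessions.loc[mask, "user_id"].nunique()
    return int(unique_users / SAMPLE_RATE)

# Hypothetical usage:
# sampled = sessions[sessions["user_id"].map(in_sample)]
# predicted_traffic = predict_assignments(sampled, {"country": "US"})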

Traffic prediction. The session service logs are sampled using a hash of the user_id to reduce the size of the dataset. These logs are then used to estimate the amount of traffic. In this example, the platform is calculating how many users will be assigned to a test within the US. The UI dynamically queries the sampled data to count the unique number of users that satisfy this condition and returns the answer to the tool for display.

This approach has the added advantage of supporting any assignment criteria: while performing the calculation, we apply the desired criteria to the sampled dataset and obtain the corresponding prediction. Had we instead modeled user behavior directly, we could have faced a dimensional explosion, because the cardinality of all possible assignment criteria is very high.

We keep the size of the sampled dataset small enough to ensure a fast query time. This way, the experimentation platform remains responsive to queries while the user is interacting with the tool.

We worked with the design team to present the results in a way that’s accessible to a broad audience. For the traffic prediction, we chose a funnel chart (the specific numbers shown are made up), which displays the entire addressable user base, how many users are expected to be enrolled in the test due to limited sampling, and how many of those users satisfy the assignment criteria.

Power Calculation

For experimentation analysis, all metric definitions are stored in a version-controlled repository to enable reproducible and reliable analysis, and we use the same definitions for power analysis. Here we once again start with the sampled session data. Any past session could have been the point of test assignment if it satisfied the specified assignment criteria, so we calculate the desired metrics for all combinations of users and enrollment points. From these metrics we compute means and standard deviations, which are then fed into a standard power analysis function.

Power calculation. Starting with the sampled session logs, we compute metric values assuming that each session could have been the point of user assignment to the test. The results are materialized in a table that is dynamically queried by the UI. In the example above, we are computing metric statistics for US users. For the averages and standard deviations we use the metric value associated with the first session that satisfied the assignment conditions.
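The following Python sketch illustrates the metric-statistics step under the same assumptions; it is not the production pipeline (which runs as a batch job, described below), and the table and column names (user_id, session_id, session_start, metric_value) are hypothetical.

import pandas as pd

def metric_stats(sampled_sessions: pd.DataFrame,
                 metrics: pd.DataFrame,
                 criteria: dict) -> tuple:
    # Keep only sessions that satisfy the assignment criteria.
    mask = pd.Series(True, index=sampled_sessions.index)
    for column, value in criteria.items():          # e.g. {"country": "US"}
        mask &= sampled_sessions[column] == value
    eligible = sampled_sessions.loc[mask]

    # The first qualifying session per user is the hypothetical assignment point.
    first_sessions = (
        eligible.sort_values("session_start")
                .drop_duplicates("user_id", keep="first")
    )

    # Join to pre-computed per-session metric values and summarize.
    joined = first_sessions.merge(metrics, on=["user_id", "session_id"])
    return joined["metric_value"].mean(), joined["metric_value"].std()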

We scheduled a daily batch process to pre-compute the metric values for all sampled sessions, while the UI query itself executes dynamically. This way, the platform can calculate summary statistics for the exact group of users selected by the assignment criteria.

We visualize the results in a widget where the platform user selects the metric they are aiming to optimize in the experiment. For that metric, the UI presents a table of hypothetical effect sizes (shown as “MDE”, for Minimum Detectable Effect) and the number of samples required to reliably detect each of them. To make the output more actionable, for each effect size we also indicate the experiment sampling rate needed to reach the required sample size. Finally, we present the relationship between MDE and sample size as a line chart.
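Putting the pieces together, a simplified version of such an MDE table could be generated as in the sketch below; the metric statistics, predicted traffic, and MDE grid are illustrative placeholders for the outputs of the previous steps.

from statsmodels.stats.power import TTestIndPower

metric_mean, metric_std = 2.0, 1.5   # hypothetical output of the metric-stats step
predicted_traffic = 1_000_000        # hypothetical output of the traffic prediction
alpha, power = 0.05, 0.8

print("MDE      n per group   required sampling rate")
for mde in (0.005, 0.01, 0.02, 0.05):                    # relative MDEs
    d = metric_mean * mde / metric_std                   # standardized effect size
    n = TTestIndPower().solve_power(effect_size=d, alpha=alpha, power=power)
    # Two groups (control + treatment); cap the required rate at 100%.
    rate = min(1.0, 2 * n / predicted_traffic)
    print(f"{mde:>5.1%}   {n:>11,.0f}   {rate:>10.0%}")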

Summary

In this post we introduced the concept of power analysis and described how it can be built directly into an experimentation platform to make this important part of test design broadly accessible to our users. We aim to continually iterate on the functionality and designs based on what we learn about its usage and internal user feedback. One iteration on the roadmap is a more advanced view of the power calculator for data analysts.

Please stay tuned for updates on this and other features enabling reliable experiment analysis. If you’re interested in learning more about our work, please don’t hesitate to reach out!

Acknowledgements

Developing the experimentation platform is a large cross-functional effort — thanks to everyone who contributed to this feature and the blog post: Mark Harrison, Diana Jerman, Robin Cox, Stuart Mershon, Anmeen Leong, Cynthia Wu, Kilian Scheltat, Henning Wielenberg, Doug Fertig, and many other members of the Experimentation-X team.
