How Meta scaled regression adjustment to improve power across hundreds of thousands of experiments on our AB testing platform
Author: John Meakin, Saurabh Sangwan
This article covers practical challenges and some solutions to scaling a well-known and well-publicized method in the analysis of A/B tests called “Regression Adjustment For Experimental Data,” also popularly known as CUPED. While there are many articles and papers, too many to list, that cover the math and intuition behind this clever technique, few discuss the practical challenges of scaling it to a large organization of software developers and decision makers.
TL;DR
- There are several straightforward optimizations you can make to save significant computational and latency costs when performing regression adjustment for experimental analysis.
- This article first walks you through how you can do this “not at scale” in plain language and a simple notebook, then highlights some notable optimizations.
- However, as you inevitably scale this capability, new, more complex problems arise, and with them an equally rich set of opportunities.
- Most notably, while more flexibility leads to better precision, enabling that flexibility inevitably erodes some of the optimizations and/or has implications for decision making and governance at a large company, and it comes with complex trade-offs we need to make.
Motivation
A/B testing is the lifeblood of the development process at Meta. Almost all changes to the site must go through some form of A/B test, both to ensure the product is working as intended and to measure the potential impact of such changes. Therefore, the ability to make fast and sensitive measurements of these changes is highly desirable.
Regression adjusted Mean (aka Mean2.0 at Meta) was one of the most powerful enhancements to our existing experimental capabilities, often improving detectability (i.e. shrinking confidence intervals) by more than 30%. This feature continues to pay dividends, allowing the business to move much faster, more safely, and with higher confidence. This is primarily achieved through 1) enabling more concurrent A/B tests, 2) allowing us to test on the smallest possible cohorts of users so as to minimize potential negative impacts, and 3) being able to detect the smallest possible effects for any given test. While this article does not go into detail about the mathematical/theoretical underpinnings, having a basic intuition for why and how “Mean2.0” works is important for understanding both why it’s so powerful and why it’s so complex to scale.
The ability of Mean2.0 to meaningfully shrink confidence intervals essentially comes from the within-user correlation between metrics over time (denoted rho for the rest of this blog). One way to build intuition here is that user-level variation is what causes our confidence intervals to be wide in the first place; explaining away some of this variation in a way not influenced by the experiment (i.e. by using what happened before the experiment to predict user values during the experiment) helps reduce the overall variance of our estimator. This diagram shows how confidence interval widths vary in relation to rho. Metrics that are highly correlated over time, which happen to include many of the metrics we care about most, like Daily Active Users (DAU), Sessions, etc., tend to get a lot of benefit from this technique.
The chart above is derived from the variance of the estimator we use, whose estimated form is sketched in the formula below:
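For reference, the textbook CUPED-style form of this estimator is roughly the following (our notation here; the production estimator may differ in its details):

```latex
\hat{\Delta}_{\text{adj}}
  = \left(\bar{Y}_t - \bar{Y}_c\right)
  - \hat{\theta}\left(\bar{X}_t - \bar{X}_c\right),
\qquad
\hat{\theta} = \frac{\widehat{\mathrm{Cov}}(X, Y)}{\widehat{\mathrm{Var}}(X)},
\qquad
\mathrm{Var}\!\left(\hat{\Delta}_{\text{adj}}\right)
  \approx \mathrm{Var}\!\left(\hat{\Delta}\right)\left(1 - \rho^{2}\right)
```

Here X is the pre-treatment metric, Y is the post-treatment metric, and rho is their within-user correlation; the (1 − rho²) factor is what drives the curve described above.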
In “real life” we see very large, noticeable differences when Mean2.0 is used instead of the regular mean (Mean1.0). Below is a unique snapshot from our readout tool taken while we had a bug: on some days regular mean deltas were computed, but on other days Mean2.0 deltas were computed. This is of course not the only way to show this, but it’s a favorite internally.
Before we get into scaling, however, it is probably useful to show how you can do “Mean2.0” yourself for a single A/B test!
How2-Mean2 (Not At Scale)
On the surface, computing Mean2.0 for a single experiment/metric pair at a single point in time is actually quite straightforward…
For a single experiment and metric pair, you need the following ingredients:
- “Exposures” — A table containing users, their treatment assignments, and the time they first experienced treatment eligibility.
- “Metrics-Pre” — A table containing these same users, or a subset, and their metric values from before the experiment started.
- “Metrics-Post” — A table containing metric values from after the experiment started, typically your experiment’s reading period.
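For concreteness, here is a minimal sketch of what those three inputs might look like as small pandas data frames; the column names and values are made up for illustration and are not taken from the notebook linked below:

```python
import pandas as pd

# "Exposures": users, their treatment assignments, and first exposure time
exposures = pd.DataFrame({
    "user_id": [1, 2, 3, 4],
    "treatment": ["test", "control", "test", "control"],
    "first_exposure_ds": ["2024-03-04"] * 4,
})

# "Metrics-Pre": metric values from before the experiment started (user 3 has none)
metrics_pre = pd.DataFrame({"user_id": [1, 2, 4], "metric_pre": [3.0, 1.5, 4.5]})

# "Metrics-Post": metric values from the experiment's reading period
metrics_post = pd.DataFrame({"user_id": [1, 2, 3, 4], "metric_post": [4.0, 2.0, 0.5, 3.0]})

# Step 1 below: combine them, keeping every exposed user and imputing 0s
df = (
    exposures
    .merge(metrics_pre, on="user_id", how="left")
    .merge(metrics_post, on="user_id", how="left")
    .fillna({"metric_pre": 0.0, "metric_post": 0.0})
)
print(df)  # mean of metric_pre here is (3.0 + 1.5 + 0.0 + 4.5) / 4 = 2.25
```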
And you can take the following steps:
1. Combine the exposures with the pre-treatment and post-treatment data to create a single data frame (as in the sketch above).
2. Process the data:
- First, we mean-center the pre-treatment data (subtracting its mean, 2.25 in the example above)
- We convert the treatment values to binary (0/1) data
- Then we create an interaction term between treatment and our centered pre-treatment data
3. Run the regression: post ~ treatment + centered pre + treatment × centered pre.
4. Pull out the relevant betas and compute the final result and confidence intervals.
The p-value and confidence intervals can be pulled from the regression output for the treatment covariate (beta1). This simple public Jupyter notebook (Mean2.0.ipynb) creates fake data and walks through this procedure.
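As a rough illustration, here is a minimal sketch of the same procedure in Python using numpy, pandas, and statsmodels; the simulated data and column names are invented for this example and are not taken from the notebook:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Simulate an experiment where the pre-treatment metric predicts the post-treatment metric
rng = np.random.default_rng(0)
n = 10_000
pre = rng.gamma(shape=2.0, scale=1.5, size=n)          # pre-treatment metric values
treatment = rng.integers(0, 2, size=n)                  # 0 = control, 1 = test
post = 0.8 * pre + 0.1 * treatment + rng.normal(0, 1.0, size=n)

df = pd.DataFrame({"pre": pre, "treatment": treatment, "post": post})

# Step 2: mean-center the pre-treatment data and build the interaction term
df["pre_centered"] = df["pre"] - df["pre"].mean()
df["interaction"] = df["treatment"] * df["pre_centered"]

# Step 3: run the regression post ~ treatment + pre_centered + treatment:pre_centered
X = sm.add_constant(df[["treatment", "pre_centered", "interaction"]])
model = sm.OLS(df["post"], X).fit()

# Step 4: the treatment coefficient is the adjusted delta; pull its CI and p-value
delta = model.params["treatment"]
ci_low, ci_high = model.conf_int().loc["treatment"]
p_value = model.pvalues["treatment"]
print(f"adjusted delta = {delta:.4f}, 95% CI = [{ci_low:.4f}, {ci_high:.4f}], p = {p_value:.3g}")
```

Because the pre-treatment covariate is mean-centered, the treatment coefficient in the interacted model can be read as the adjusted treatment effect at the average pre-treatment value, which is why the centering step matters.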
While the above approach lets you do this for a single small experiment one time, it’s not going to work very well if you have a lot of data or a lot of metrics and need to produce results reliably on a daily basis, as is common for most companies’ experiment readouts. As your company grows, so will the number of experimenters, experiments, and the amount of data you need to process. You’ll quickly find that regression adjustment in Python or R is unreliable, and eventually it won’t run at all. Below are some basic tips on how to do it better:
Top 5 Tips For Scaling
These five optimizations lead to extraordinary improvements in both the latency of the results (i.e. the time between when all the data lands and when results are available to decision makers) and the total compute cost needed to produce all the results.
1. Move the computation to a mature query engine, and compute only sufficient statistics
- At Meta we use Presto, but Spark or other query engines work too. Presto is developed at Meta, so we have a close collaboration with that team.
- Since the betas in a linear regression can be computed directly from sufficient statistics (counts, sums, sums of squares, and cross-products), we can produce those statistics with Presto aggregate functions; we don’t actually need the data to sit in pandas or in a single in-memory data frame (a rough sketch of this appears after this list).
2. Batch as much as you can
- At Meta, we batch many experiments together with many metrics at the same time
- For most metrics, many different experiments need results computed on the same underlying data. Batching allows us to pass through the metric data only once, rather than reading it separately for each computation.
- However, this can become quite complex for Mean2.0: when we batch many experiments, we also need to ensure that we can access pre-treatment data for each of them (which can be hard if experiments in the same batch started at very different times), therefore…
3. Pre-Aggregate/Pre-Process as much as possible
- To get around the batching complexity we generate many special pre-aggregated data structures. For example, for the pre-treatment data, we create a table with one row per user and a map of date → value pairs that can be used as pre-treatment data (a small sketch of this structure appears after this list). That way the computation stays efficient even when the same user needs different pre-data values depending on which experiment they are in.
- We also pre-aggregate other things, like winsorization thresholds, exposure counts, and more.
4. Process as few rows as possible!
- We use INNER JOIN to combine all the data needed for a result
- In the example shown above, metrics are combined with exposures in a way that keeps all users even if they don’t have metric values, imputing 0s. However, we can use the law of total covariance and some other basic statistical principles to reconstruct overall means, variances, and covariances from only the users present in both the exposure and metric data, knowing that anyone missing from the join simply has all-zero values. Note this is one of many reasons we pre-aggregate exposure counts.
- This means that when we combine exposures with pre- and post-treatment data, we can use inner joins, plus a separate, cheaper, one-time job for total counts, which improves performance considerably (see the sketch after this list).
5. Invest in scheduling & observability
- To get this all working really well you need a good system to schedule all your ingestion and processing pipelines.
- You also need good monitoring on these tasks to make sure that failures are caught and addressed as soon as possible
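To make tips 1 and 4 more concrete, below is a rough Python sketch of the idea. In production this would be Presto SQL aggregations over batched data and the exact estimator differs; the function names, column names, and the simplified two-group, pooled-theta formulation here are illustrative assumptions, not Meta’s implementation:

```python
import numpy as np
import pandas as pd

def sufficient_stats(joined: pd.DataFrame) -> pd.DataFrame:
    """Per-group sums over the INNER JOIN of exposures with pre/post metric data.
    These sums (plus counts) are all the computation needs -- the raw rows can be
    discarded, which is what lets the heavy lifting happen in the query engine."""
    tmp = joined.assign(
        xx=joined["pre"] ** 2,
        yy=joined["post"] ** 2,
        xy=joined["pre"] * joined["post"],
    )
    return tmp.groupby("treatment")[["pre", "post", "xx", "yy", "xy"]].sum()

def group_moments(sums: pd.Series, n_exposed: int):
    """Reconstruct mean/variance/covariance over ALL exposed users in a group.
    Users missing from the inner join have all-zero metrics, so they leave the sums
    unchanged; only the denominator changes to the pre-aggregated exposure count."""
    mean_x, mean_y = sums["pre"] / n_exposed, sums["post"] / n_exposed
    var_x = sums["xx"] / n_exposed - mean_x ** 2
    var_y = sums["yy"] / n_exposed - mean_y ** 2
    cov_xy = sums["xy"] / n_exposed - mean_x * mean_y
    return mean_x, mean_y, var_x, var_y, cov_xy

def adjusted_delta(stats: pd.DataFrame, exposure_counts: dict):
    """Regression-adjusted (CUPED-style) test-vs-control delta and ~95% CI,
    computed purely from sufficient statistics and total exposure counts."""
    n_c, n_t = exposure_counts["control"], exposure_counts["test"]
    mx_c, my_c, vx_c, vy_c, cxy_c = group_moments(stats.loc["control"], n_c)
    mx_t, my_t, vx_t, vy_t, cxy_t = group_moments(stats.loc["test"], n_t)

    # theta = Cov(X, Y) / Var(X), pooled within groups so it isn't driven by the treatment effect
    theta = (n_c * cxy_c + n_t * cxy_t) / (n_c * vx_c + n_t * vx_t)

    delta = (my_t - my_c) - theta * (mx_t - mx_c)
    # Variance of (Y - theta * X) per group, then the usual two-sample standard error
    var_adj_c = vy_c - 2 * theta * cxy_c + theta ** 2 * vx_c
    var_adj_t = vy_t - 2 * theta * cxy_t + theta ** 2 * vx_t
    se = np.sqrt(var_adj_c / n_c + var_adj_t / n_t)
    return delta, (delta - 1.96 * se, delta + 1.96 * se)

# joined = exposures.merge(pre, on="user_id").merge(post, on="user_id")   # INNER joins only
# delta, ci = adjusted_delta(sufficient_stats(joined), {"control": 1_000_000, "test": 1_000_000})
```

This pooled-theta form is a simplification of the fully interacted regression shown earlier, but it illustrates how far you can get with nothing more than sums, cross-products, and pre-aggregated exposure counts.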
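And to illustrate the kind of pre-aggregated structure described in tip 3, here is a minimal sketch assuming a per-user map of date → metric value; the layout, names, and window length are hypothetical:

```python
from datetime import date, timedelta

# Hypothetical pre-aggregated row: one user with a date -> metric-value map
# covering a shared trailing history window (built once, reused by every experiment).
pre_data = {
    "user_123": {
        date(2024, 3, 1): 2.0,
        date(2024, 3, 2): 0.0,
        date(2024, 3, 3): 5.0,
        # ... one entry per day in the shared history window
    },
}

def pretreatment_value(user_id: str, exposure_date: date, window_days: int = 7) -> float:
    """Sum a user's metric over the `window_days` days strictly before their first
    exposure, so each experiment in a batch can slice its own pre-window out of the
    same shared structure."""
    history = pre_data.get(user_id, {})
    start = exposure_date - timedelta(days=window_days)
    return sum(v for d, v in history.items() if start <= d < exposure_date)

# Example: an experiment whose user was first exposed on 2024-03-04
print(pretreatment_value("user_123", date(2024, 3, 4)))  # -> 7.0
```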
Although these five simple things enable incredible scale, it’s clear that many of the optimizations also place constraints on the computation engine that limit its flexibility. These additional challenges have some solutions, but those also come with complex trade-offs.
Practical Additional Challenges
As experimenters become accustomed to the incredible benefits of Mean2.0, they also begin to see the potential value of expanding the technique. In particular, they realize there are a lot of factors you might want to change or tune in our scheduled system that could lead to additional improvements in detectability (narrower confidence intervals). Annoyingly, although many of these changes are simply extensions of Mean2.0, most work best when tuned to specific metrics, and even specific cohorts of users, both of which vary widely across the large variety of A/B tests we run. Some top examples include:
- Changing the pre-treatment metric date range: Changing the time range of pre-treatment data can improve rho (the pre-post correlation), but the best choice also depends on how long the experiment has been running and on the cohort of users in it. For example, for some tests, if we want a good read in the first week, using the week before the start as Mean2.0 pre-data can lead to very good detectability gains. But using that same seven days of pre-data for the same experiment read two weeks later can be much worse than using a longer pre-data time range. Even this isn’t universally true and depends on things like the retention of the cohort of users in the test.
- Changing the pre-treatment metric selection: Similarly, there may be combinations of metrics, or additional metrics entirely, that can narrow CIs more than the individual metric alone. This is because the more of the post-treatment metric you can explain, the better, and additional metrics can often help. However, how much this matters also varies across use cases.
- Weighting the pre-treatment metrics: Weighting a pre-treatment covariate, for example by the date delta between a user’s first exposure time and the end of the post-treatment metric time range, can improve rho as well. The intuition here is that given a fixed pre-treatment data window but a variable post-treatment one (e.g. when an experiment has a gradual ramp up: 1v1, 10v10, 25v25, 50v50), a user’s post-treatment metric values are better predicted if we account for this duration by weighting the pre-treatment data (a rough sketch of adding such covariates follows this list).
- Different design methods entirely: This blog won’t go too deep here because it strays far from “Mean2.0,” but clearly there are many things being done here at Meta, including cluster experimentation, bipartite analysis, and much more, each requiring far more sophisticated computation.
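As a rough illustration of the second and third points above (and not of Meta’s actual implementation), extending the single-experiment regression is largely a matter of adding more, or transformed, pre-treatment covariates. The sketch below assumes hypothetical column names such as pre_dau_7d and a pre-computed duration-weighted column:

```python
import pandas as pd
import statsmodels.api as sm

def adjusted_regression(df: pd.DataFrame, pre_cols: list[str]):
    """Regression-adjusted delta with several pre-treatment covariates.
    `df` needs a binary `treatment` column, a `post` outcome column, and the
    (already computed) pre-treatment columns named in `pre_cols` -- for example
    the raw pre-period metric, an extra correlated metric, and a duration-weighted
    version of the pre-period metric."""
    X = pd.DataFrame({"treatment": df["treatment"]})
    for col in pre_cols:
        centered = df[col] - df[col].mean()                  # mean-center each covariate
        X[f"{col}_c"] = centered
        X[f"treatment_x_{col}_c"] = df["treatment"] * centered
    return sm.OLS(df["post"], sm.add_constant(X)).fit()

# e.g. results = adjusted_regression(df, ["pre_dau_7d", "pre_sessions_7d", "pre_dau_7d_weighted"])
# results.params["treatment"] is the adjusted delta; results.conf_int().loc["treatment"] its CI.
```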
Ultimately, the flexibility above comes with its own unique challenges. Building flexibility into the optimized batched/scheduled system carries increased development costs as well as increased compute costs, particularly if we want to scale the ability to customize across the many business cases we serve.
One of our solutions to this flexibility problem is an ad-hoc system that runs in parallel to the scheduled system and allows users to run interactive queries for single experiments on demand, along with APIs that allow for even further flexibility. Note that the ad-hoc approach still uses most of the optimizations mentioned above (it runs on Presto, leverages pre-aggregation when possible, leverages special data structures, and processes as few rows as possible), but of course it cannot rely on batching or scheduling.
Because this ad-hoc system runs for a single experiment at a time, it’s easier to add flexibility, and thus improve sensitivity/detectability for experimenters. However, one of the biggest challenges we face is trading off the pros and cons of these two parallel systems, both in terms of scalability vs. detectability and in terms of more complex trade-offs around traceability, repeatability of results, and governance. Additionally, with both systems running in parallel, the batched/scheduled results and the ad-hoc results can diverge for the same metric and time range for subtle scaling-related reasons that are not transparent to experimenters, causing confusion about which results to use. The table below maps out some of the additional pros and cons to highlight the complexity of the trade-offs we need to make.
Closing Thoughts
Currently we are working towards understanding and quantifying these trade-offs better, in parallel with improving result precision in a number of ways. Of course, each of the topics in this blog could be an entire post of its own, and we’re looking forward to potentially writing some of those. However, in the meantime, there is lots to be done in the A/B testing space to enable the business to move much faster, more safely, and with higher confidence!