Understanding how CUPED in GrowthBook Reduces Experiment Runtimes at the Los Angeles Times

Nov 3, 2023

Luke Sonnet — Data Science @GrowthBook
Jane Carlen — Data Science @Los Angeles Times
Julianna Harwood — Data Science @Los Angeles Times

Variance reduction via CUPED (short for Controlled Experiment Using Pre-Experiment Data) can help your experiments reach statistical significance with a fraction of the number of users you would otherwise need, resulting in dramatically reduced experiment runtimes. In this blog post, GrowthBook, an open core experimentation platform, and the Los Angeles Times data science team demonstrate how CUPED can reduce needed sample sizes by up to 85%.

While the ins and outs of CUPED as well as its potential impact have been covered before (Deng et al. 2013, Lin 2013, Netflix 2016, Booking.com 2018, Microsoft 2022), in this blog post we focus on two important factors that can dictate how well CUPED can work for you:

  • how the type of metric influences gains from variance reduction
  • how you can optimize CUPED in GrowthBook by selecting the amount of historical data to use for variance reduction

How much time can CUPED save me anyways?

Throughout this blog post, we will use data from an experiment run in production by the Los Angeles Times’ data science team. The experiment tested different feeds for articles in a “Recommended for You” module, like the one shown below:

Figure 1: An example of the Recommended For You module, the target of the experiment

This experiment ran for about a month, and only registered users were eligible. While the experimenters at the LA Times had several goals in mind, in this blog post we’ll focus on three metrics:

  1. page_view: how many times a particular page has been viewed by a user
  2. page_click: how many times a user has clicked something on the above page
  3. page_ctr: a custom user-level click-through-rate measurement that is min(page_click / page_view, 1). Users with 0 page_view are also given a value of 0.
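As a concrete reference, here is a minimal Python sketch of the page_ctr definition above; the data frame and its values are made up for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical per-user aggregates; the values are made up for illustration
users = pd.DataFrame({
    "page_view": [10, 0, 3, 250],
    "page_click": [4, 0, 5, 40],
})

# page_ctr = min(page_click / page_view, 1); users with 0 page_view get 0
users["page_ctr"] = np.where(
    users["page_view"] > 0,
    np.minimum(users["page_click"] / users["page_view"], 1.0),
    0.0,
)
```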

Using this experiment data, we can calculate each metric’s average and variance (with and without CUPED), and compute the number of users we would need to estimate a 2% uplift (relative change) in all three metrics with 80% power. The following table shows our results with and without CUPED:

Table 1: Necessary sample size with and without CUPED to detect a 2% uplift at 80% power in the Los Angeles Times experiment case study

These results show that with the default CUPED implementation in GrowthBook, the necessary number of users to detect a 2% uplift would be as little as 15% of what would be needed without CUPED! Not using CUPED for the page_view and page_click metrics would mean you need to collect over 6x as many users to detect this effect reliably.
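If you want to run this kind of calculation on your own metrics, the standard two-sample power formula is easy to sketch in Python. The means and variances below are placeholders, not the LA Times figures:

```python
from scipy.stats import norm

def required_n_per_arm(mean, variance, rel_uplift, alpha=0.05, power=0.80):
    """Approximate per-arm sample size for a two-sided z-test of a relative uplift."""
    delta = mean * rel_uplift                      # absolute effect to detect
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)  # ~2.80 for alpha=5%, power=80%
    return 2 * variance * z**2 / delta**2

# Placeholder moments, not the experiment's actual values
n_plain = required_n_per_arm(mean=4.0, variance=900.0, rel_uplift=0.02)

# CUPED shrinks the effective variance by (1 - rho^2), where rho is the
# pre/post correlation, so the needed sample size shrinks by the same factor
rho = 0.9
n_cuped = required_n_per_arm(mean=4.0, variance=900.0 * (1 - rho**2), rel_uplift=0.02)
print(n_plain, n_cuped, n_cuped / n_plain)  # the ratio is 1 - rho^2 = 0.19
```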

An alternative way to view the impact of CUPED is to look at the minimum detectable effect at 80% power for the 93k user sample collected in the experiment:

Table 2: Minimum detectable effects with and without CUPED for the given 93k-user sample in the Los Angeles Times experiment case study

With CUPED, the experiment can reliably detect effects roughly half the size of those detectable without CUPED.
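The same formula inverts to give the minimum detectable effect for a fixed sample size; again a sketch with placeholder inputs rather than the actual experiment moments:

```python
import math
from scipy.stats import norm

def relative_mde(mean, variance, n_per_arm, alpha=0.05, power=0.80):
    """Smallest relative uplift detectable at the given power with a two-sided test."""
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    delta = z * math.sqrt(2 * variance / n_per_arm)  # absolute MDE
    return delta / mean                              # expressed as a relative uplift

# Cutting the variance to a quarter of its original value (as a strong CUPED
# adjustment can) halves the MDE, since the MDE scales with sqrt(variance)
print(relative_mde(mean=4.0, variance=900.0, n_per_arm=46_500))
print(relative_mde(mean=4.0, variance=225.0, n_per_arm=46_500))
```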

We can also see the power of CUPED in action on this experiment in the GrowthBook UI. The following screenshot shows the results for two versions of the page_click metric, one with CUPED and one without. The confidence interval is less than half as wide with CUPED, and the overall effect of the experiment becomes much clearer!

Figure 2: The impact of CUPED on the page_click metric as shown in the GrowthBook UI

For which metrics is CUPED best?

CUPED has the biggest impact when there is a lot of variation among users in the behavior you are measuring, but the behavior of individual users is consistent over time.

The above results show that the necessary sample size to detect a 2% lift in page_ctr (89k) is considerably smaller than the necessary sample size for a similar lift in the other metrics (over 300k). Furthermore, the relative reduction in necessary sample size from CUPED is much smaller for the page_ctr metric.

This difference comes down to how page_ctr is constructed and the high variance in the page_view and page_click metrics. The page_ctr metric ranges from 0 to 1, while the other metrics range from 0 into the thousands. The following figures show the distributions of these metrics in this experimental sample:

Figure 3: The distribution of user values of page_view and page_click in the experiment sample
Figure 4: The distribution of user values of page_ctr in the experiment sample

The long tail of the page_view and page_click metrics means two things:

  1. A 2% increase is a relatively small effect with respect to the (large) variance of these metrics. Detecting small effects is harder and requires more users. This is why the target of 2% uplift requires more users for page_view and page_click without CUPED.
  2. If the users out in the tail of this metric have similar behavior in the pre-experiment and during-experiment periods, then CUPED has an easier time reducing the large variance of the page_view and page_click metrics.

Therefore, CUPED can be particularly effective for metrics with long tails where that long-tail behavior is largely predictable. In fact, in this case, the correlation of LA Times users’ page_view metric between the pre-exposure and post-exposure periods is over 0.9!
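If you have user-level data in your warehouse, this auto-correlation is easy to check directly. A minimal sketch, with hypothetical column names and made-up values:

```python
import pandas as pd

# Hypothetical per-user totals for the pre- and post-exposure periods
users = pd.DataFrame({
    "page_view_pre":  [120, 3, 0, 45, 800, 12],
    "page_view_post": [110, 5, 1, 40, 760, 9],
})

# Pearson correlation between pre- and post-exposure behavior; values near 1
# mean CUPED can remove most of the in-experiment variance
rho = users["page_view_pre"].corr(users["page_view_post"])
print(f"pre/post correlation: {rho:.2f}")
```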

That predictability is consistent with past analysis at the LA Times. Highly engaged LA Times readers tend to stay highly engaged, while less engaged readers tend to stay less engaged. Occasionally, events and breaking news can drive up engagement for everyone, but that increase is usually proportional to readers’ baseline engagement, and past behavior remains strongly predictive of future behavior.

On the other hand, the page_ctr metric is truncated. This reduces the natural variance across users and also limits the correlation between the pre- and post-exposure values. As a result, CUPED is less able to explain away in-experiment variation using pre-experiment behavior. Nonetheless, even with these limitations, CUPED reduces the necessary sample size to 73% of what is needed without CUPED.

How can I optimize CUPED performance?

At its core, a standard CUPED implementation reduces variance by measuring the outcome metric itself during the pre-experiment period. In this setup, CUPED performs best when the metric is correlated with itself over time (known as “auto-correlation”), as described above.
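Concretely, the adjustment subtracts off the portion of the in-experiment metric that pre-experiment behavior predicts. Below is a minimal sketch of the textbook CUPED adjustment (not GrowthBook’s internal implementation):

```python
import numpy as np

def cuped_adjust(y, x):
    """Textbook CUPED: y_adj = y - theta * (x - mean(x)), where y is the
    in-experiment metric, x is the same metric in the pre-exposure window,
    and theta = cov(x, y) / var(x)."""
    theta = np.cov(x, y)[0, 1] / np.var(x, ddof=1)
    return y - theta * (x - np.mean(x))

# Toy demonstration: the stronger the pre/post correlation, the larger the drop
rng = np.random.default_rng(0)
x = rng.exponential(scale=10, size=10_000)       # long-tailed pre-exposure metric
y = 0.9 * x + rng.normal(scale=3, size=10_000)   # correlated in-experiment metric
print(np.var(y), np.var(cuped_adjust(y, x)))     # variance falls sharply
```

Because the adjustment only recenters the metric by a pre-experiment quantity that is unaffected by treatment, it leaves the effect estimate unbiased while shrinking its variance.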

There are ways to try to fine-tune your metrics to be more stable over time, but one of the easiest ways to optimize this auto-correlation is to select an appropriate lookback window.

In the following example, we use the 14 days (GrowthBook’s default) before each user’s exposure as the pre-exposure period; the metric measured over that window serves as the main covariate when reducing variance via CUPED.

Figure 5: The date ranges used in CUPED adjustment in GrowthBook
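Under the hood, that covariate has to be computed per user, relative to each user’s own exposure time. A rough pandas sketch, with all table and column names hypothetical:

```python
import pandas as pd

def pre_exposure_covariate(events, exposures, lookback_days=14):
    """Sum each user's metric over the lookback window before their exposure.
    `events` has columns (user_id, timestamp, value); `exposures` has
    (user_id, first_exposure). All names here are illustrative."""
    df = events.merge(exposures, on="user_id")
    start = df["first_exposure"] - pd.Timedelta(days=lookback_days)
    in_window = (df["timestamp"] >= start) & (df["timestamp"] < df["first_exposure"])
    return df.loc[in_window].groupby("user_id")["value"].sum()
```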

This setting is configurable in GrowthBook. At first it may be tempting to think “more data is better!” but this isn’t always the case. There are several pros and cons to increasing this window:

A longer lookback window…

✅ can ensure that metrics are more stable for a user, and more likely to be correlated with future behavior; this can lead to better performing variance reduction

✅ ensures that you get at least some data, in the case that users only visit your site every couple of weeks or so

❌ might begin to incorporate stale, noisy data; for example, a user’s behavior 3 months ago may not be indicative of their current behavior

❌ requires querying more data, which can slow analysis runtimes and increase query costs

If runtime or query costs are manageable, or not of the utmost priority, then you can simply analyze the auto-correlation in your metric over time and pick the lookback window with the highest correlation.
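Here is a self-contained sketch of that scan on synthetic data; it uses the fact that, with a single covariate, CUPED shrinks the variance (and therefore the needed sample size) by roughly a factor of (1 - ρ²):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data: one row per user, one column per pre-exposure day;
# in practice these values come from your warehouse
rng = np.random.default_rng(1)
n_users, n_days = 5_000, 120
daily = pd.DataFrame(rng.poisson(lam=2.0, size=(n_users, n_days)))

# A fake in-experiment metric driven by each user's most recent two weeks
recent = daily.iloc[:, -14:].sum(axis=1)
post = pd.Series(rng.poisson(lam=1.0 + 0.5 * recent.to_numpy()), index=daily.index)

for days in [7, 14, 30, 60, 90]:
    pre = daily.iloc[:, -days:].sum(axis=1)  # total over the candidate lookback
    rho = pre.corr(post)
    print(f"{days:>3}-day lookback: rho={rho:.2f}, needed n ~ {1 - rho**2:.0%} of unadjusted")
```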

Selecting the right lookback window for the LA Times experiment

Using the experiment we analyzed above, we varied the lookback window for the three metrics and plotted below the percentage of the unadjusted sample size needed with each window. As you can see, for the page_view and page_click metrics, looking back 30 days produces the best variance reduction. Using more days, besides requiring more data to compute, actually hurts CUPED’s variance reduction, as it begins to incorporate noisy data that no longer represents current user behavior.

Figure 6: The impact of lookback days on the needed sample size with CUPED when compared to not using CUPED. For example, a value of 35% would imply that with X days as your lookback window, CUPED can reduce your needed sample size to 35% of the original.

On the other hand, for the page_ctr metric it seems that more data is better. The calculation of the click-through rate is more complex and perhaps it takes longer for this ratio to smooth out over a user’s history.

In GrowthBook, you are free to set the lookback window to a different value for each metric. For page_click and page_view, you can set the lookback window to the optimal 30-day value. For page_ctr, you could go to 90 days to maximize variance reduction, but that is a lot of data to roll up, and you may be better off choosing a value like 50 or 60 days, where you get most of the benefit while scanning much less data.

As has been well established to this point, CUPED can be incredibly valuable for accelerating your experimentation program. In this blog post, we demonstrated in an applied setting how its impact is largest for metrics that naturally have high variance but are relatively stable over time at the user level. We also showed why you should consider explicitly evaluating the correlation between your pre- and post-exposure windows to set the optimal number of days to include in your variance reduction method. More is not always better, and at some point there are diminishing returns to including additional historical data.

Further reading

If you’re interested in more of the technical details of CUPED or past discussions of its impact, check out the aforementioned write-ups:
Deng et al. 2013, Lin 2013, Netflix 2016, Booking.com 2018, Microsoft 2022

You can also read more about GrowthBook’s implementation of CUPED here: https://docs.growthbook.io/statistics/cuped

Lastly, feel free to check out GrowthBook, an open core experimentation platform: https://www.growthbook.io/

Disclosure: As users of the free, self-hosted version of GrowthBook, the Los Angeles Times data science team was interested in measuring the potential impact of CUPED, a premium feature of GrowthBook’s paid plans. GrowthBook worked with the Los Angeles Times data science team to analyze this experiment using CUPED as a way for both organizations to explore the method in practice.
