Speed Up Your AB Tests With Shared Control Groups & Optimal Allocation

Paul Schaffer · Published in Engineering@Noom
8 min read · Jul 24, 2020

Our Product and Growth teams at Noom run many, many AB tests (sometimes dozens a day), so they are always looking for ways to improve experiment efficiency. Recently they asked the Data Science team to investigate ways to assign users to experiments more efficiently. After chewing on this question for a bit, we produced a couple of elegant formulas that maximize experiment efficiency, along with a bit of algebra and calculus to prove they're optimal¹.

(By the way, we also have been optimizing the technical framework we use to conduct experiments. For more on that, check out this post by Patrick Lee!)

When running two simultaneous experiments, we might assign each user to one of four groups:

CA1: control arm for Experiment 1
EA1: experiment arm for Experiment 1
CA2: control arm for Experiment 2
EA2: experiment arm for Experiment 2

Then, when the experiments conclude, EA1 is compared with CA1, and EA2 is compared with CA2.

Though this setup is easy to implement, it’s often inefficient. The users in CA1 and CA2 are likely receiving the same treatment, so keeping the CAs separate needlessly decreases the sample size available to each experiment. In these cases, if CA1 and CA2 are blended together into a shared control group, and EA1 and EA2 are each compared to the shared control group, the results, favorable or unfavorable, become more robust.

For two concurrent experiments, if all arms are equally sized, a shared control group enables each experiment to use 2/3 of the experiment universe, as opposed to 1/2 apiece with separate control groups.

If we take this path, the number of groups to allocate users to is now three: EA1, EA2, and (shared) CA. One big remaining question, though, is how many users ought to be allocated to each group. A generally well-known rule of thumb is that for a single AB test, you can determine significance fastest with a 50/50 test-control split. Perhaps we should follow that even-allocation heuristic and allocate a third of users to each of the three groups? Spoiler: This is not the best approach.

The key intuition is that in a shared control group setup, the CA sample size appears in every experiment’s statistical test, whereas each EA sample size appears in only one statistical test. If each EA “donates” a relatively small fraction of its users to the CA, each test benefits overall from having a significantly larger CA and only a slightly smaller EA.
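To make that intuition concrete, here's a quick back-of-the-envelope sketch. The numbers are purely illustrative (not from a real Noom experiment), and it peeks ahead at the 1/n_CA + 1/n_EA term that drives each experiment's standard error, derived in The Math section below. Smaller is better.

```python
# Illustrative numbers only: a universe of 30,000 users and two concurrent experiments.
N = 30_000

def se_term(n_c, n_e):
    """The 1/n_c + 1/n_e piece of one experiment's standard error (smaller is better)."""
    return 1 / n_c + 1 / n_e

even_split = se_term(10_000, 10_000)     # a third of the universe to each of CA, EA1, EA2
ca_heavy_split = se_term(12_000, 9_000)  # 40% to the shared CA, 30% to each EA

print(even_split)      # 0.00020
print(ca_heavy_split)  # ~0.000194 -- the CA-heavy split is better for both experiments
```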

If you wisely allocate users, you can choose between two possible benefits:

1) Run more experiments without sacrificing statistical strength, or
2) Run the same number of experiments as you would have otherwise, but with higher statistical strength.

If you’re curious whether the following optimization would make a meaningful difference in the experiments you’re running, you can make a copy of the Google spreadsheet here, enter different values in the blue cells, and compare the resulting p-values.

The Math

For the sake of simplicity², let's assume that every test we run is a difference of proportions Z test:

Z = (p1 - p2) / sqrt( p̂ (1 - p̂) (1/n1 + 1/n2) )

where p1 and p2 are the observed proportions in the two arms, n1 and n2 are the arm sizes, and p̂ is the pooled proportion.

When we change the EA-CA allocation, we can assume that the observed proportions will remain effectively constant, so the term in the Z test to optimize is in the denominator³:

1/n1 + 1/n2

where n1 and n2 are the EA and CA sizes for the experiment in question.

Making this term smaller increases the magnitude of the Z-score whenever there's a nontrivial difference between p1 and p2. One way to make the term smaller is simply to increase the sample size of both arms, but that's nothing new: everyone knows that a bigger experiment universe helps achieve statistical significance when a real difference exists. The interesting challenge is to increase confidence while keeping the universe size constant.
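As a concrete reference point, here's a minimal Python sketch of the pooled two-proportion Z test described above (illustrative only, not Noom's experimentation code; the function name and sample counts are made up):

```python
from math import sqrt

def two_proportion_z(successes_e, n_e, successes_c, n_c):
    """Pooled two-proportion Z statistic: experiment arm vs. control arm."""
    p_e = successes_e / n_e  # observed proportion, experiment arm
    p_c = successes_c / n_c  # observed proportion, control arm
    p_pooled = (successes_e + successes_c) / (n_e + n_c)
    se = sqrt(p_pooled * (1 - p_pooled) * (1 / n_e + 1 / n_c))
    return (p_e - p_c) / se

# Same observed proportions (11% vs. 10%); bigger arms shrink 1/n_e + 1/n_c and grow |Z|.
print(two_proportion_z(110, 1_000, 100, 1_000))       # ~0.73
print(two_proportion_z(1_100, 10_000, 1_000, 10_000)) # ~2.31
```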

If we let:
N = universe size
m = the number of simultaneous experiments we’re running
n = arm size (with subscripts: c = CA, e1 = EA1, etc.)

Then:

N = n_c + n_e1 + n_e2 + … + n_em

If each experiment arm has the same size, we can rewrite this as:

N = n_c + m · n_e

Solving for EA size,

n_e = (N - n_c) / m

Now that we have EA size in terms of CA size (and constants N and m), we can plug CA size and the above expression into the part of the Z test we are hoping to minimize:

1/n_c + 1/n_e = 1/n_c + m / (N - n_c)

Since we’re hoping to minimize this fraction, that means we’re trying to maximize its denominator:

Calculus tells us that the way to find the CA size to maximize this expression is to take its derivative, set it equal to zero, then solve for CA size.

After excruciating algebraic manipulation (or quickly consulting Wolfram Alpha) and taking the positive root of the solution, you get:

n_c = N / (sqrt(m) + 1)

Finally, to make this expression optimally useful, divide both sides by N to solve for (CA size)/N, the fraction of the experiment universe that should be assigned to the control group:

n_c / N = 1 / (sqrt(m) + 1)

A pretty elegant result! Setting m = 1, we see a familiar recommendation: When you’re running only one experiment at a given time, the optimal percentage to allocate to the control group is 50%.

If we return to the case that opened this post, we were planning to run two simultaneous experiments with a shared control group and wondering whether allocating a third of the universe to each arm would be optimal. Plugging m = 2 into the above, we see the recommendation is, instead, for the control arm to be assigned 1/(sqrt(2) + 1) ≈ 41.4% of the universe, and for each experiment to be assigned half of the remaining 58.6%, so 29.3% apiece.
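If you'd like to compute these splits for other values of m, here's a small helper. It's a sketch that simply applies the formula above and splits the non-control remainder evenly across the experiment arms:

```python
from math import sqrt

def shared_control_allocation(m):
    """Optimal universe shares for m concurrent experiments with a shared control group.

    Returns (control_share, per_experiment_share); control + m * per_experiment = 1.
    """
    control = 1 / (sqrt(m) + 1)
    per_experiment = (1 - control) / m  # remaining universe split evenly across the EAs
    return control, per_experiment

print(shared_control_allocation(1))  # (0.5, 0.5)
print(shared_control_allocation(2))  # (~0.414, ~0.293)
```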

Practical Consequences

Okay, so 41.4% of the universe should be allocated to the control group instead of 33.3%… that doesn't sound like a colossal change. How much does it actually help significance? The answer for m = 2 is a pretty small amount, but this approach really starts to pay dividends as the number of concurrent experiments increases.

Recall the quantity we were hoping to minimize:

1/n_c + 1/n_e

We determined the optimal CA size:

n_c = N / (sqrt(m) + 1)

If we plug this into the formula we determined early on for EA size,

n_e = (N - n_c) / m

The ideal EA size is then given by:

n_e = (N - N/(sqrt(m) + 1)) / m

After a bit of algebraic manipulation, this can be written as:

n_e = N / (sqrt(m) · (sqrt(m) + 1)) = N / (m + sqrt(m))

Now let's plug these expressions for EA and CA size into the quantity we're trying to minimize, which I'll call X for brevity:

X(optimal) = 1/n_c + 1/n_e = (sqrt(m) + 1)/N + (m + sqrt(m))/N

After minor rearrangement:

X(optimal) = (sqrt(m) + 1)² / N

What about our original idea of allocating the same number of users to the CA and each EA? What X value would that produce?

If we let n = the size of the CA and each EA, then n = N/(m + 1) and:

X(even) = 1/n + 1/n = 2(m + 1) / N

When we look at the ratio of X(optimal) to X(even), the N's drop out and we're left with:

X(optimal) / X(even) = (sqrt(m) + 1)² / (2(m + 1))

With m = 1, as we’d expect, X(optimal) = X(even) because both methods recommend a 50/50 split when you’re only running one experiment, but what happens as m grows?

For large values of m, optimal allocation gives an X term that's half the size of the X term from even allocation: (sqrt(m) + 1)²/(2(m + 1)) approaches 1/2 as m grows. Practically speaking, X captures how much an experiment's sample sizes limit our confidence in an observed difference of proportions. Since a smaller X means we can have greater confidence, this is good news for the optimal allocation formula.
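Here's a quick sketch to watch the ratio approach 1/2; it contains nothing beyond the formula above:

```python
from math import sqrt

def x_ratio(m):
    """X(optimal) / X(even) for m concurrent experiments sharing one control group."""
    return (sqrt(m) + 1) ** 2 / (2 * (m + 1))

for m in (1, 2, 5, 10, 100, 10_000):
    print(m, round(x_ratio(m), 3))
# 1 -> 1.0, 2 -> 0.971, 5 -> 0.873, 10 -> 0.787, 100 -> 0.599, 10000 -> 0.51
```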

Does this mean that for large m, the Z-scores are twice as large with optimal allocation? Not quite. Since the X term sits under a square root in the difference of proportions test, its impact on the Z-score is square-rooted:

Z(optimal) / Z(even) = sqrt( X(even) / X(optimal) ) → sqrt(2) ≈ 1.41 as m grows large

For an optimization that costs nothing, though, that’s still pretty good!

To be fair, most organizations are likely running fewer than infinity concurrent shared-control-group experiments. In realistic cases, how much does optimal allocation help the Z-score and, hence, the p-value?

The table below shows the ratio of Z(optimal) to Z(even) as a function of the number of concurrent experiments, m. To make the impact of the Z-score ratio more concrete, the third column answers the hypothetical: “If optimal allocation gives a Z-score of 1.96, and therefore, a two-tailed p-value of 0.05, what is the p-value with even allocation (assuming the observed proportions are the same)?”

m            Z(optimal)/Z(even)    p-value with even allocation
1            1.000                 0.050
2            1.015                 0.053
3            1.035                 0.058
5            1.070                 0.067
10           1.127                 0.082
∞ (limit)    1.414                 0.166
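If you want to reproduce these numbers, or extend them to other values of m, here's a short sketch that uses only the ratio derived above and the standard normal distribution (no experimentation framework assumed):

```python
from math import sqrt
from statistics import NormalDist

def z_ratio(m):
    """Z(optimal) / Z(even) = sqrt(X(even) / X(optimal)) for m concurrent experiments."""
    return sqrt(2 * (m + 1)) / (sqrt(m) + 1)

def even_allocation_p_value(m, z_optimal=1.96):
    """Two-tailed p-value under even allocation, assuming optimal allocation hits Z = 1.96."""
    z_even = z_optimal / z_ratio(m)
    return 2 * (1 - NormalDist().cdf(z_even))

for m in (1, 2, 3, 5, 10):
    print(m, round(z_ratio(m), 3), round(even_allocation_p_value(m), 3))
```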

Shameless Plug

If you’re interested in experimenting with us, we’re hiring!

Footnotes

¹ We are decidedly not the first to come up with this. This 2014 paper by Simon Bate and Natasha A. Karp comes to the same conclusion (in particular the attached “S1 Derivations” doc). It also does the extra work of verifying that power analysis recommends this approach.

² A very similar expression also appears in the difference of means t-test, though the variances of the EA and CA distributions appear in the numerators:

t = (x̄1 - x̄2) / sqrt( s1²/n1 + s2²/n2 )

If the control arm’s variance is greater than or equal to the experiment arm’s variance, though, the general optimization recommended by this post should be valid for the difference of means t-test as well (though the specific optimal percentages will differ).

³ Technically, when the ratio n1:n2 changes, the pooled proportion p̂ changes, too. However, for most cases, when we need a test of significance in the first place, p1 and p2 will be sufficiently similar that we can treat p̂ here as a constant.
