A Framework for Determining How Much Data You Need for an Experiment

Part 2 of our A/B Testing Series

Manuel Andere
art/work -behind the scenes at patreon
10 min read · Jun 28, 2017


Part one of our A/B testing series provided a clear framework to design randomized experiments. There, we established that a correct experiment is designed by clearly defining:

  • Observation units: the unit that we randomly assign to a treatment, and for which we measure a response metric (usually a user, but could also be a web request, a device-browser combination, a machine, etc.)
  • Treatments: the set of experiences that the observation units can potentially have during the experiment
  • Response metrics: the measurement that we use to represent the state of each observation unit after they experience a treatment

We also discussed that the ultimate goal of experimentation is to identify whether applying a treatment to an observation unit causes the response metric to change. The most common A/B testing procedure to achieve this goal is:

  • randomly split a sample of the eligible population into treatment and control groups
  • apply the treatment to the observation units in the treatment group
  • use a statistical procedure to identify significant (non-random) differences in the response metric between the treatment and control groups

One of the most important factors to consider at the design stage of an experiment is the amount of data required to decide confidently whether the groups differ. It’s intuitive that more data leads to better conclusions. The trade-off is also clear: more data means more money or more time. However, it’s not easy to determine a specific sample size without diving into a statistics book and dealing with some non-trivial math. This post aims to bridge that gap so that experimenters can perform formal sample-size calculations without going deep into the mathematical details.

First, we’ll walk through a series of preparatory steps that are essential to determine a sample size:

  1. Understand the distribution of the response metric
  2. Determine a treatment effect type and minimum effect to detect
  3. Pick a target confidence and power
  4. Select the statistical procedure to detect differences in the response metric

Next, we’ll discuss a simulation procedure we use at Patreon to determine the exact amount of data that we need for any given experiment. Finally, we’ll illustrate the whole methodology with a real-world example.

A walk through the preparatory steps

Understand the distribution of the response metric

The first thing you should do once you’ve determined the main components of the experiment (observation unit, treatment and response metric), is understand the distribution of the response metric for the eligible population. Often, it is possible to do this with historical data, but if empirical data is not available, you can assume an appropriate probability distribution based on your domain knowledge. Besides forcing you to think about the range of values you expect to observe in the response metric, this allows you to calculate baseline summary metrics for the distribution (average and variance are the most relevant ones).
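
For example, with a historical pull in hand, a few lines of pandas are enough to get those baseline summary metrics. This is only a sketch: the column name and values below are made up for illustration.

```python
import pandas as pd

# Hypothetical historical pull: one row per observation unit, with the
# response metric measured over a comparable past period.
historical = pd.DataFrame({"posts_last_month": [0, 1, 2, 3, 4, 4, 4, 5, 7, 12]})

metric = historical["posts_last_month"]
print(f"n        = {len(metric)}")
print(f"mean     = {metric.mean():.2f}")
print(f"variance = {metric.var():.2f}")
print(f"skewness = {metric.skew():.2f}")

# A histogram (requires matplotlib) is the quickest way to see the shape.
metric.hist(bins=10)
```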

Some common distributions for the response metric are:

  • Binary (e.g., Bernoulli): whether a user signs up or not at the end of a sign-up funnel.
  • Count (e.g., Poisson): how many pledges a patron makes in a given month.
  • Skewed continuous distribution (e.g., Gamma): the amount of income that a new creator will make during their sixth month at Patreon.
  • Symmetric continuous distribution (e.g., Normal): the net change in a Patreon creator’s pledge base from one month to the next.
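
If no historical data is available, you can draw a synthetic sample from whichever distribution matches your domain knowledge and use it the same way. A quick sketch with NumPy, using made-up parameters for each of the cases above:

```python
import numpy as np

rng = np.random.default_rng(seed=0)
n = 10_000  # size of the synthetic population; all parameters below are illustrative

signups = rng.binomial(n=1, p=0.08, size=n)            # binary: did the user sign up?
pledges = rng.poisson(lam=2.5, size=n)                 # count: pledges per patron-month
income = rng.gamma(shape=2.0, scale=150.0, size=n)     # skewed continuous: monthly income
net_change = rng.normal(loc=0.0, scale=20.0, size=n)   # symmetric: net pledge-base change

for name, draws in [("signups", signups), ("pledges", pledges),
                    ("income", income), ("net_change", net_change)]:
    print(f"{name:>10}: mean = {draws.mean():7.2f}, variance = {draws.var():9.2f}")
```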

Determine a treatment effect type and minimum effect to detect

Next, you should think about what type of effect your treatment will have on the response metric. The most common assumption is that the treatment will cause the response metric to increase (or decrease) by a constant value for every observation unit. Because you’ve already established some baseline values during the first step, you should be able to quantify the size of the effect concretely. For example, if you are evaluating the amount of income that a new creator will make during their sixth month at Patreon, you could say that the treatment will cause them to make an additional $15 (on top of what they would make without the treatment). Or, if you know that average earnings in the sixth month are $1,000, you could assume that the treatment will cause a 1% boost, which would be equivalent to $10 extra for everyone who gets the treatment.
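
In code, the two effect types above amount to either adding a constant or multiplying by a factor. The earnings values below are made up for illustration:

```python
import numpy as np

# Illustrative sixth-month earnings for a few treated creators
baseline = np.array([800.0, 950.0, 1000.0, 1250.0])

# Constant additive effect: every treated unit earns an extra $15
with_additive_effect = baseline + 15.0

# Relative effect: a 1% boost, i.e. $10 on top of a $1,000 average
with_relative_effect = baseline * 1.01

print(with_additive_effect)
print(with_relative_effect)
```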

Once the effect type is settled, think about the minimum effect size you are interested in detecting. Keep in mind that not all effect sizes are relevant in every domain, regardless of how statistically significant they are. If it’s hard to determine whether a specific effect in your response metric is relevant from a business perspective, you should link it, using at least a back-of-the-envelope calculation, to one of your core company metrics where you can make that assessment. For example, you might want to understand how a user sign-up translates to income.
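
For instance, a back-of-the-envelope translation from a sign-up lift to income could look like the sketch below. Every number is hypothetical; plug in your own.

```python
# Hypothetical back-of-the-envelope inputs -- replace with your own numbers.
monthly_funnel_visitors = 50_000   # users entering the sign-up funnel each month
minimum_lift = 0.002               # smallest absolute lift in sign-up rate worth detecting
avg_income_per_signup = 12.0       # dollars an incremental sign-up is worth, on average

extra_signups_per_month = monthly_funnel_visitors * minimum_lift
extra_income_per_month = extra_signups_per_month * avg_income_per_signup
print(f"{extra_signups_per_month:.0f} extra sign-ups ≈ ${extra_income_per_month:,.0f} per month")
```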

Pick a target confidence and power

Because the ultimate goal of running an experiment is to conclude whether the treatment caused an effect, one way to assess the quality and robustness of your experimental procedure is in terms of two probabilities:

  1. Confidence: The probability that you conclude the treatment did not have an effect when it in fact did not.
  2. Power: The probability that you conclude the treatment had an effect when there in fact was an effect at least as big as the minimum relevant effect determined in the previous section.

The higher the confidence and power, the better your experimental procedure.
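
If it helps to connect these definitions to standard hypothesis-testing notation, with the null hypothesis H₀ being “the treatment has no effect,” they correspond to the usual α and β error rates:

```latex
\[
  \text{confidence} = P(\text{do not reject } H_0 \mid H_0 \text{ is true}) = 1 - \alpha
\]
\[
  \text{power} = P(\text{reject } H_0 \mid \text{true effect} \ge \text{minimum relevant effect}) = 1 - \beta
\]
```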

To determine a target confidence and power, think about the consequences of each kind of mistake. For example, if you’re testing a product or feature that you expect to have a positive impact on the response metric, low power might lead you to conclude that the feature didn’t have the minimum effect you were aiming for, even when it did, causing you to roll back the feature and lose its potential gains. Low confidence, on the other hand, could make you believe that there is an effect when in fact there isn’t, causing you to roll out and maintain a feature that doesn’t actually deliver the effect you observed in the experiment. At Patreon, we aim to run experiments that have 90% confidence and 90% power, but we adjust these values on a case-by-case basis, depending on the feature’s importance.

Select the statistical procedure to detect differences in the response metric

Lastly, you should define the statistical procedure that you will use to assess whether the groups differ after running the experiment. Different tests are better at detecting certain types of effects. For example, a t-test for a difference in averages is usually good if you expect the treatment to impact all observation units similarly, but under different assumptions there might be better statistical methods. In a future blog post, we will cover a few statistical techniques that can be used under different scenarios, so keep an eye out for it!
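
As a concrete example, here is how a difference-in-means comparison could be run with SciPy’s Welch t-test on synthetic post-experiment data. The group sizes, distributions, and 90% confidence level are illustrative choices, not a prescription.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=1)

# Synthetic post-experiment data: the response metric for each observation unit.
control = rng.poisson(lam=4.0, size=2_500)
treatment = rng.poisson(lam=4.0, size=2_500) + rng.binomial(n=1, p=0.25, size=2_500)

# Welch's t-test for a difference in means (does not assume equal variances)
t_stat, p_value = stats.ttest_ind(treatment, control, equal_var=False)

alpha = 0.10  # 1 - confidence
print(f"t = {t_stat:.2f}, p = {p_value:.4f}, significant at 90% confidence: {p_value < alpha}")
```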

The POWER of simulation

You’ll know you’re ready to determine the exact amount of data (or sample size) required for your experiment when you have:

  • a historical sample of the response metric for your eligible population (or some theoretical distribution)
  • clarity on what type of effect your treatment will cause and the effect size that you want to detect
  • target values for your confidence and power
  • a statistical approach to compare the groups

All these elements are inputs required to determine the optimal sample size. Their impact on sample size, under the assumption that all other factors are kept constant, is intuitive; each of the following would require more data:

  • More variance in the distribution of the response metric (we need more data to detect the signal when there’s a lot of noise)
  • Smaller effect size (we need more data to detect more subtle effects)
  • More power or confidence (we need more data to make fewer mistakes)
  • Worse statistical approach (we need more data if we use a statistical approach that is not suited to analyze the type of effect that the treatment causes)

However, it’s not trivial to quantify precisely how much each element will impact the required sample size in any given experiment. For this reason, at Patreon, we use a simulation framework to avoid difficult math.

The framework is designed with power as the dependent variable, and all other factors (including sample size) are treated as inputs. However, we can easily reverse-engineer it to make conclusions about required sample sizes. Our framework works in the following way:

  • draw a simulated sample of the candidate size, with replacement, from the historical data (or assumed distribution) of the response metric
  • randomly split that simulated sample into treatment and control groups
  • apply the assumed treatment effect to the observation units in the treatment group
  • run the chosen statistical test at the chosen confidence level and record whether it detects a difference
  • repeat the steps above many times; the proportion of runs in which the test detects a difference is the estimated power for that sample size
  • increase the sample size and repeat until the estimated power reaches the target
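
A minimal sketch of those steps in Python, assuming the historical sample is a NumPy array, the statistical procedure is a Welch t-test, and the assumed treatment effect is passed in as a function. The names and synthetic usage data below are our illustrative choices, not the exact Patreon implementation.

```python
import numpy as np
from scipy import stats

def estimate_power(historical, sample_size, apply_effect,
                   alpha=0.10, n_sims=10_000, seed=0):
    """Estimate power for one candidate sample size by simulation.

    historical   : 1-D array with the response metric for the eligible population
    sample_size  : total number of observation units in the simulated experiment
    apply_effect : function(treated_values, rng) -> values with the assumed effect applied
    alpha        : significance level, i.e. 1 - confidence
    """
    rng = np.random.default_rng(seed)
    detections = 0
    for _ in range(n_sims):
        # 1. Draw a simulated experiment population from the historical sample
        sample = rng.choice(historical, size=sample_size, replace=True).astype(float)
        # 2. Split it 50/50 into treatment and control
        half = sample_size // 2
        treatment, control = sample[:half], sample[half:]
        # 3. Apply the assumed treatment effect to the treatment group only
        treatment = apply_effect(treatment, rng)
        # 4. Run the chosen test and record whether it detects a difference
        _, p_value = stats.ttest_ind(treatment, control, equal_var=False)
        detections += p_value < alpha
    return detections / n_sims

# Illustrative usage: a synthetic historical sample and a constant "+1" effect.
historical = np.random.default_rng(1).poisson(lam=4.0, size=20_000)
print(estimate_power(historical, sample_size=1_000,
                     apply_effect=lambda treated, rng: treated + 1.0,
                     n_sims=1_000))
```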

The great thing about this approach is that it is flexible and modular enough that we can vary any of the inputs and understand how our experiments react to the change. Additionally, the framework ultimately allows us to:

  • facilitate discussions with product managers about what type and size of effect our features need to have for A/B testing or experimentation to be the right tool to assess their success
  • avoid running and analyzing experiments for which we don’t have enough data to reach meaningful conclusions (this saves a lot of resources)
  • be more confident about our positive or negative experimental results, because we trust that we correctly determined how much data to collect

A real-world example (with fake numbers)

In this section, we’ll illustrate how the above procedure (preparatory steps and simulations) was used in a real-world example to determine a sample size for a Patreon experiment.

Consider the following example of a feature that we released a few months ago. The feature allows creators to schedule their posts ahead of time. The numbers in the example are solely illustrative.

Scheduled posts feature snapshot

We began with the three core definitions for our experiment:

  • the observation unit was every Patreon creator active when we released the feature to a random subset of creators
  • there were two treatments: having access to the new feature (active treatment) or not having access (control)
  • the main response metric was the number of posts that the creator made during the one-month experiment

Then, we proceeded with the preparatory stage:

Historical sample: we pulled data for the number of posts that creators made during the month prior to the experiment. Let’s say this sample had 20,000 creators and it produced a right-skewed histogram with a mean of four posts per month and a standard deviation of one post.

Type and size of effect: We assumed that even though we were going to release the feature to a certain number of participating creators and let them know that they could use it, not all of them would actually be impacted by it. Therefore, the way we modeled the effect on the response metric during each simulation was by assuming that only 25% of the creators that had access to the feature would post more than they otherwise would have, and the rest would not be impacted by the feature. We additionally thought it was reasonable to assume that this 25% of creators would increase their observed monthly post count (the response metric) by one post.

Confidence and power: For this feature we used 90% confidence and 90% power.

Statistical procedure: We used a t-test to detect a difference in means.

Finally, we started simulating. The first simulation was done assuming a sample of 5,000 creators.

The steps for a single simulation were:

  • Sample 5,000 creators with replacement from the 20,000 creators in the historical data.
  • Randomly select 50% of those 5,000 creators to be in the active treatment group.
  • Randomly select 25% of the 2,500 creators selected in the previous step and add one post to their observed historical response metric (the remaining 75% keep their original observed response metric).
  • Run a t-test, with a 90% confidence level, comparing the 2,500 creators in the treatment group with the 2,500 in the control group, and record whether we detect the difference.

We repeated the simulation 10,000 times to determine the proportion of cases where we correctly identified the effect. This proportion is an estimate of the power of the procedure under the established assumptions. Because a sample size of 5,000 creators did not achieve the desired 90% power, we increased the sample size several times in increments of 1,000 and repeated the procedure until we reached our target 90% power with a sample size of 15,000 creators.
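
For completeness, here is what the whole sweep could look like in code. Because the real creator data isn’t available, the historical sample below is a crude synthetic stand-in with roughly the stated shape (right-skewed, mean ≈ 4 posts, standard deviation ≈ 1 post), so the sample size at which the sweep crosses 90% power won’t necessarily match the illustrative 15,000 above.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=42)

# Synthetic stand-in for the historical pull: 20,000 creators with a
# right-skewed posts-per-month distribution (mean ~4, standard deviation ~1).
historical = rng.poisson(lam=1.0, size=20_000) + 3

def simulate_once(sample_size):
    """One simulated experiment; True if the 90%-confidence t-test detects the effect."""
    sample = rng.choice(historical, size=sample_size, replace=True).astype(float)
    half = sample_size // 2
    treatment, control = sample[:half], sample[half:]
    # Assumed effect: 25% of treated creators post one extra time; the rest are unaffected.
    responders = rng.random(half) < 0.25
    treatment = treatment + responders
    _, p_value = stats.ttest_ind(treatment, control, equal_var=False)
    return p_value < 0.10

def estimated_power(sample_size, n_sims=10_000):
    return np.mean([simulate_once(sample_size) for _ in range(n_sims)])

# Increase the sample size until the estimated power reaches the 90% target.
for n in range(5_000, 20_001, 1_000):
    power = estimated_power(n, n_sims=2_000)  # fewer simulations here to keep the sketch quick
    print(f"sample size {n:>6}: estimated power ≈ {power:.2f}")
    if power >= 0.90:
        break
```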

From this procedure, we were able to conclude that if we ran this experiment with two groups of 7,500 units each, and the feature actually caused 25% or more of the creators in the treatment group to increase their monthly number of posts by at least one post, then a 90%-confidence t-test would detect the effect with at least 90% probability.

Conclusion

As stated throughout the post, the factors that come into play when determining the sample size for an experiment are the distribution of the response metric, the type of treatment effect and its size, the target power and confidence, and the statistical comparison procedure. Though these factors’ directional impact on required sample size is intuitive, it is not easy to quantify their precise impact for every experiment. Simulation makes this task much easier.

Our hope is that after reading this post, when someone asks, “How much data do we need for this test? How long do we need to run our experiment? Are we running experiments with sufficient power?” you won’t need to make up a hand-wavy and unintelligible explanation, or spend hours reading statistics literature. You can just send them to read this post while you run the necessary simulations to answer their question in an accurate way!
