A Practical Guide to Setting A/B Test Duration in a Bayesian Context

Learn 7 practical considerations for deciding on A/B test duration when you're using Bayesian statistics, and how to apply them with GrowthBook.

Xavier Gumara Rigol
Oda Product & Tech
8 min read · Aug 29, 2023



When planning an A/B test, one of the most common questions we hear at Oda is: “How long should we run this experiment for?” The decision regarding test duration is not always straightforward and carries significant importance for stakeholders.

Fixing the time an experiment needs to run up front is a concept from Frequentist statistics, not from Bayesian statistics, which is the approach we use to analyze our experiments. The Frequentist approach offers a clear methodology for deciding on test duration, and some of our teams apply it in the Bayesian world, which isn't correct.

We’ve noticed there’s a lack of information on how to decide on A/B test duration within the Bayesian framework. With that in mind, we aim to bridge that gap in this article by presenting the principles we have agreed upon at Oda.

Our goal is to help other experimenters who are new to the Bayesian approach and give them practical insights on setting test duration in a Bayesian context.

The guidelines we’ll present in the next sections relate to 1) experiment duration; 2) what to do while the experiment is running; and 3) when to stop an experiment. In short:

  1. Don’t decide on experiment length up front;
  2. Reflect on the problem at hand;
  3. Choose a risk threshold for the metrics;
  4. Choose a minimum duration;
  5. Consider whether you’re constrained by a maximum duration;
  6. Decide what to do during the first week;
  7. Decide when to stop the experiment.

Now let’s go through all of them in detail.

7 considerations to decide on A/B test duration if you are using Bayesian statistics

#1 Don’t decide on experiment length up front

The first guideline when using the Bayesian approach is that experiment length shouldn’t be decided up front. This might sound scary, but it’s actually an amazing feature of the Bayesian approach and definitely something you need to get used to. Basically, don’t be tempted to use frequentist calculators to determine the experiment’s length before you start the experiment.

#2 Reflect on the problem at hand

Before the experiment begins, you should reflect on the problem at hand, the major metrics you’re attempting to influence, and how known user behavior affects these metrics. A good reasoning process might be:

How frequent are conversions for the metrics under study?
The frequency of conversions plays a role in determining how quickly we can reach reliable conclusions using the Bayesian approach. If a metric has frequent conversions, we'll observe a sufficient number of conversion events within a relatively short period of time. Data accumulates faster, which brings the risk down to acceptable levels sooner.

On the other hand, if the key metric has infrequent conversions, it takes longer to observe a meaningful number of conversion events. As a result, we need to collect data over a longer period to allow for more conversions, providing a more accurate estimate of the conversion rate and reducing the uncertainty associated with it.

In Oda’s case, experiments that try to impact the metric “orders per user,” for example, run for longer than experiments that want to influence the “products added to cart” metric. This is because there’s a difference in conversion frequency for those two metrics.
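To make this concrete, here's a minimal sketch in Python of why frequent metrics stabilize sooner. The traffic and conversion numbers are made up for illustration (they're not Oda data), and it models each metric as a simple conversion rate with a Beta(1, 1) prior: after the same week of traffic, the posterior of a frequent metric is already narrow relative to its mean, while a rare metric's posterior is still wide.

```python
# Illustrative only: assumed weekly traffic and conversion counts, Beta(1, 1) priors.
from scipy.stats import beta

def relative_ci_width(conversions, users, level=0.95):
    """Width of the 95% credible interval of the conversion-rate posterior,
    relative to the posterior mean."""
    posterior = beta(1 + conversions, 1 + users - conversions)
    lo, hi = posterior.interval(level)
    return (hi - lo) / posterior.mean()

week_users = 20_000  # hypothetical weekly traffic per variant
print(relative_ci_width(8_000, week_users))  # frequent metric (~40% rate): ~0.03
print(relative_ci_width(600, week_users))    # infrequent metric (~3% rate): ~0.16
```

In this toy example the infrequent metric carries roughly five times more relative uncertainty after the same week, which is why experiments targeting it need more cycles before the risk becomes acceptable.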

Will there be a novelty effect?
Some experiments will offer a slightly different user experience that can have a transient positive or negative effect on the user; other experiments will not.

If we test a new iteration of the algorithm we use to recommend products to our users, the user doesn’t need to get used to a new interface and there won’t be an impactful novelty effect. On the other hand, if we test a new way of displaying categories, there’s going to be a novelty effect as the user needs to get used to the new way of classifying products.

If you're making a significant change to the user experience and want to account for novelty effects, it's a good idea to run the experiment for longer so the transient effect has time to wear off.

What's the effect you're trying to detect versus the current mean and standard deviation of the metrics you're trying to impact?
If you target metrics with high variation (a standard deviation that is large relative to the mean), you'll need to run the experiment for longer. Using historical data to compare the baseline metric's variation with the effect you're trying to detect is useful for estimating how long the experiment will need.
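A quick way to do that comparison is sketched below with synthetic data (none of these numbers are Oda's): put the uplift you hope to detect next to the metric's standard deviation. The smaller that ratio, the longer the experiment will need to run.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in for historical "orders per user" values over recent weeks.
orders_per_user = rng.poisson(lam=1.2, size=50_000)

baseline_mean = orders_per_user.mean()
baseline_std = orders_per_user.std()
target_lift = 0.03 * baseline_mean  # hoping for a 3% relative uplift

# A small effect relative to the metric's noise means the posteriors for
# control and treatment will overlap for a long time before they separate.
print(f"mean={baseline_mean:.2f}  std={baseline_std:.2f}  "
      f"effect/std ratio={target_lift / baseline_std:.3f}")
```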

What about the impact on revenue, profitability and retention?
We encourage teams to add guardrail metrics on revenue, profitability, and short-term retention to experiments if these aren't the metrics they're originally trying to impact.

The time you have available to run the experiment (see #5 on deciding on a maximum duration) will determine how precise the results are for these metrics, which in general need longer test durations.

#3 Choose a risk threshold for the metrics

The Bayesian approach is all about uncertainties and probabilities, and the idea is to make the best decision we can under the uncertainty the data shows and the risk we’re willing to accept. Sometimes you might conclude earlier than you thought because you are certain early on. Other times you’ll run it for longer and realize you need to accept some risk.

One good practice is to choose a desired error tolerance (risk) you’re willing to accept for the metrics and just run the experiment until the expected loss is below the specified tolerance, as explained in Running a Bayesian A/B test by Chris Stucchio.

This lets us be more active participants in the decision making around the experiment, looking at overall uplifts, probabilities, and risks and then making a business call, rather than sitting back and taking the results at the end of the experiment as you would in a Frequentist context.
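As a rough sketch of what this looks like in practice (this is not GrowthBook's engine, just the textbook calculation for a binomial metric with Beta(1, 1) priors and made-up counts), the expected loss of shipping the treatment can be estimated by sampling both posteriors:

```python
import numpy as np

rng = np.random.default_rng(42)

def risk_of_shipping_b(conv_a, n_a, conv_b, n_b, samples=200_000):
    """Expected loss (in absolute conversion rate) of choosing B over A,
    plus the probability that B beats A."""
    post_a = rng.beta(1 + conv_a, 1 + n_a - conv_a, samples)
    post_b = rng.beta(1 + conv_b, 1 + n_b - conv_b, samples)
    expected_loss_b = np.maximum(post_a - post_b, 0).mean()
    prob_b_wins = (post_b > post_a).mean()
    return expected_loss_b, prob_b_wins

loss, p_win = risk_of_shipping_b(conv_a=400, n_a=5_000, conv_b=430, n_b=5_000)
print(f"P(B beats A) = {p_win:.1%}, expected loss of shipping B = {loss:.5f}")
# Keep the experiment running until this expected loss drops below the
# risk threshold you chose for the metric.
```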

#4 Choose a minimum duration

Due to the cyclical nature of our business (weekly shopping), we recommend choosing a minimum duration that captures a full cycle.

We don’t run experiments for shorter than a week, and we always run experiments for full cycles to rule out any cycle effect.
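In code, the rule is as simple as it sounds. The sketch below assumes a weekly cycle, as in our case, and just rounds any planned runtime up to whole cycles, never below one full cycle:

```python
import math

def minimum_duration_days(planned_days: int, cycle_days: int = 7) -> int:
    """Round a planned runtime up to full business cycles (weekly for us),
    never running for less than one complete cycle."""
    cycles = max(1, math.ceil(planned_days / cycle_days))
    return cycles * cycle_days

print(minimum_duration_days(4))   # -> 7: always at least one full week
print(minimum_duration_days(10))  # -> 14: two full weekly cycles
```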


#5 Consider whether you’re constrained by a maximum duration

Thinking about having a maximum duration is a healthy exercise we don’t do often enough.

Consider whether you have an external deadline for communicating results when planning your experiment. If so, set that as the experiment’s maximum duration. To draw conclusions, compare your previously defined acceptable risk for the metrics you are analyzing to the one provided by the experiment on the given end date.

Another option would be to set a time when you want to make a decision, or iterate on the treatment because you want to do something else sequentially (rather than waiting for another week). Learning from a short iteration, performing some small tweaks, and starting a new iteration is a very good practice in lean product development.

You also need to be aware that running the experiment for a shorter time carries bigger risks. If you have time and can afford the cost, you can always add another cycle (see #4): it won't harm your results, and you'll be more certain as metrics stabilize and risk is reduced.

#6 Decide what to do during the first week

If you’ve followed the previous points and you have an idea for a minimum and maximum duration of the experiment, it’s time to launch.

You should define a minimum number of conversions per metric before showing any results to stakeholders. Also, if you see very negative results or data quality issues early on, it might make sense to just kill the experiment right away.
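A hypothetical version of that first-week gate is sketched below; the minimum-conversion number and the sample-ratio tolerance are placeholders, not values we prescribe, and the sample-ratio check is just one example of a data quality issue worth catching early:

```python
def first_week_gate(conversions_by_metric, users_a, users_b,
                    min_conversions=500, srm_tolerance=0.02):
    """Decide whether results are ready to be shown to stakeholders."""
    # Data-quality check: a 50/50 experiment should split close to 50/50.
    share_b = users_b / (users_a + users_b)
    if abs(share_b - 0.5) > srm_tolerance:
        return "investigate: possible sample ratio mismatch"
    # Hide results until every metric has enough conversions to be meaningful.
    too_few = [m for m, c in conversions_by_metric.items() if c < min_conversions]
    if too_few:
        return f"hold results: too few conversions for {too_few}"
    return "ok to start reviewing results"

print(first_week_gate({"orders_per_user": 320, "added_to_cart": 4_100},
                      users_a=9_950, users_b=10_050))
```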

#7 Decide when to stop the experiment

To quote Evan Miller, “With Bayesian experiment design, you can stop your experiment at any time and make perfectly valid inferences.” (From How Not To Run an A/B Test.)

This statement needs nuance, though, because it doesn't mean the Bayesian approach is immune to the peeking problem, as explained by David Robinson in Is Bayesian A/B Testing Immune to Peeking? Not Exactly.

The general rule of thumb for when to stop an experiment using the Bayesian approach is to wait until the risk of choosing the treatment is below your acceptance threshold. If that doesn’t happen, and you’re constrained by a maximum duration, then it’s time to stop the experiment and collect its findings.
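The rule of thumb above can be summarized in a short sketch (hypothetical numbers; in practice the risk value would come from your experimentation tool):

```python
def should_stop(risk_of_winner, risk_threshold, days_run, max_days=None):
    """Stop when the risk of shipping the leading variant is acceptable,
    or when the maximum duration forces a decision."""
    if risk_of_winner <= risk_threshold:
        return "stop: risk is below the accepted threshold, ship the winner"
    if max_days is not None and days_run >= max_days:
        return "stop: maximum duration reached, decide with the risk you have"
    return "keep running, ideally for another full cycle"

print(should_stop(risk_of_winner=0.0008, risk_threshold=0.0025, days_run=14))
```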

Using GrowthBook to set A/B test duration

We use GrowthBook and its Bayesian engine to run and monitor A/B tests at Oda. There are several GrowthBook features that have helped us define test duration (or better, know when to stop an experiment) that we think are worth sharing:

1. Every metric page shows historical data that helps monitor baselines and their variation (mean versus standard deviation). We use this to check whether the metric has high variance.

2. We've standardized and grouped company topline metrics under the same tag. When planning an experiment, you can add all metrics under a tag; this helps make sure all experiments monitor the same metrics that matter most to the business.

3. Acceptance risk thresholds can be modified per metric in GrowthBook's advanced metric options. In the beginning we used GrowthBook's defaults, but we now try to be more proactive about reflecting on acceptable thresholds per metric, using them in experiments, and communicating them consistently.

4. GrowthBook allows you to define the minimum number of conversions per metric before results are shown (also configurable in the advanced metric settings). The minimum defaults to 150, but we're working on finding realistic (and much higher) numbers for each of our metrics depending on their nature.
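To keep these choices visible outside the tool, we find it useful to write them down per metric. The snippet below is only an illustration of that habit; the field names and numbers are invented for this example and are not GrowthBook's configuration schema:

```python
# Illustrative per-metric settings: acceptable risk for calling a winner and
# the minimum conversions before results are shown at all.
METRIC_SETTINGS = {
    "orders_per_user": {"acceptable_risk": 0.0025, "min_conversions": 1_000},
    "added_to_cart":   {"acceptable_risk": 0.0050, "min_conversions": 3_000},
}

print(METRIC_SETTINGS["orders_per_user"]["acceptable_risk"])
```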

Final words and next steps

The seven principles explained here are a sensible approach to get to the right test duration that also reduces the peeking problem in a Bayesian context. In the end, it’s all about being comfortable with the amount of risk you want to accept per metric.

In the future, we believe a good meta-analysis of previous experiments will help define guidelines per metric for acceptable risk, minimum conversions before showing results, and so on. Our next step is to consolidate this meta-analysis so we can learn from previous experiment data and use it to accelerate decisions when planning an experiment.

The guidelines we’ve shared here have helped us begin standardizing practices across the organization about when to stop an A/B test in a Bayesian context, and we hope they can be useful to others on the same journey. We encourage you to comment below and share your learnings and perspectives on Bayesian A/B testing, so we can learn from each other and foster a collaborative community.

