Before you start an A/B test

Shraddha Gupta
Jan 26, 2022



Opportunity sizing

An A/B test is often proposed by a product manager or a product analyst. A product manager or the business may propose one when there is a new feature, or when there is an obvious feature gap. Alternatively, a product analyst can analyze retrospective data, combine it with learnings from past A/B tests, and come up with ideas worth testing.

For example, personalization modules are very common these days on ecommerce websites, and they are designed differently on different sites. The variations could be in the item-selection strategy (items shown based on a visitor's past purchase history, their browsing history, etc.), the way visitors engage with the module, or the upfront information available, such as a delivery pincode check or an estimated delivery date. The design of such modules is typically improved over time in a test-and-learn fashion, where new changes are tested before a full rollout.

So, what is the first step in an A/B test?

The first step is creating a test plan. Once an A/B test has been proposed and all stakeholders have agreed to run it, a test plan is created. A test plan is an agreed-upon document between the analyst and the product managers. It covers:

  1. The objective of the A/B test
  2. The hypothesis about what change it is expected to bring
  3. A list of metrics for measurement, including the primary metric, financial metrics, and other operational/deep-dive metrics
  4. The launch decision, i.e., the conditions under which the change will or will not be launched
  5. Finally, estimates of how long the test will run, what sample size we need, and the minimum detectable effect (MDE) we expect the test to pick up

Before running a Power Analysis

The next step is running a power analysis to either arrive at a sample size (when the MDE is known) or arrive at an MDE (when the sample size is known). Before running a power analysis, we decide on a few things: the significance level (called alpha, usually between 5% and 10%), the power (1 minus beta, typically 80%, i.e., a beta of 20%), the treatment flag, etc. Alpha is the probability of a type I error, and beta the probability of a type II error.

A quick diversion to type I vs. type II error

A type I error means we reject the null hypothesis when it is actually true; a type II error means we fail to reject the null hypothesis when it is false. In an A/B test, we want to pick up the changes that matter to us, but at the same time we don't want to be fooled by noise. Therefore, we always try to minimize the type I error rate (the probability of getting our results purely by chance) and maximize the power (the probability of detecting a change that actually exists).

Next, we decide on the treatment flag. For example, if we are running a test on cart pages, we would not want our population to be everyone coming to the website in general, because that would heavily dilute the results. Rather, we would filter for only those visitors who visit the cart page, in both the test and control experiences.
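As a minimal sketch of this filtering, assuming a hypothetical visit-level log (the file name and column names here are illustrative, not from the article):

```python
import pandas as pd

# Hypothetical visit-level log: visitor_id, page, variant ('test'/'control'), converted (0/1)
visits = pd.read_csv("visits.csv")

# Treatment flag: keep only visitors who actually reached the cart page,
# so both arms are measured on the same eligible population.
cart_visitors = visits.loc[visits["page"] == "cart", "visitor_id"].unique()
treated = visits[visits["visitor_id"].isin(cart_visitors)]

# Conversion rate per arm, computed over the treated population only
print(treated.groupby("variant")["converted"].mean())
```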

Power Analysis

Using historical data for the treated population (usually a recent time frame, unless it is a seasonal period), we can calculate a few things: the conversion rate, the standard deviation of the conversion rate, the per-day visitor count, etc. Using these metrics, we run a power analysis. It gives us a sense of how long to run the test, what sample size needs to be collected, and what size of change the test would be able to pick up with each incremental day.
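One way to run this in Python is with statsmodels' two-proportion power functions; the baseline rate, MDE, and traffic numbers below are placeholders, not figures from any real test:

```python
from statsmodels.stats.proportion import proportion_effectsize
from statsmodels.stats.power import NormalIndPower

baseline = 0.05        # historical conversion rate of the treated population (placeholder)
mde = 0.002            # absolute lift we want to detect, i.e., 0.2 percentage points
effect = proportion_effectsize(baseline + mde, baseline)  # Cohen's h

# Solve for the per-arm sample size at alpha = 5% and power = 80%
n_per_arm = NormalIndPower().solve_power(
    effect_size=effect, alpha=0.05, power=0.80, ratio=1.0, alternative="two-sided"
)

daily_visitors = 20_000  # treated visitors per day across both arms (placeholder)
days_needed = 2 * n_per_arm / daily_visitors
print(f"~{n_per_arm:,.0f} visitors per arm, ~{days_needed:.1f} days at current traffic")
```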

When the power analysis is done on historical data for a selected time frame, we already know the sample size and the number of days, and from that we can see how the MDE changes with each incremental day. Some companies already know what an 'acceptable' MDE is, and that information can be used to arrive at a sample size. We can use the MDE to get the sample size or vice versa, but we cannot calculate both at the same time.

There are many online calculators available to get the sample size (for example, https://www.abtasty.com/sample-size-calculator/). To get MDE estimates from a power analysis, there are built-in functions available in R and Python.

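The reverse calculation, solving for the detectable effect at a given sample size, can be sketched the same way; the baseline and traffic numbers are again illustrative:

```python
import numpy as np
from statsmodels.stats.power import NormalIndPower

baseline = 0.05          # historical conversion rate (placeholder)
daily_visitors = 20_000  # treated visitors per day across both arms (placeholder)

for day in (7, 14, 21, 28):
    n_per_arm = day * daily_visitors / 2
    # Solve for the detectable effect size (Cohen's h) at this sample size
    h = NormalIndPower().solve_power(
        nobs1=n_per_arm, alpha=0.05, power=0.80, ratio=1.0, alternative="two-sided"
    )
    # Invert Cohen's h back to an absolute lift: h = 2*arcsin(sqrt(p2)) - 2*arcsin(sqrt(p1))
    p2 = np.sin(np.arcsin(np.sqrt(baseline)) + h / 2) ** 2
    print(f"day {day}: detectable absolute lift ~{p2 - baseline:.4f}")
```

Printing this for a few candidate durations shows how the MDE shrinks as the test collects more samples each day.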

How long to run the test?

Typically, it is advised to run an A/B test for at least 2 weeks, to control for variation between weekdays and weekends, seasonal effects, etc., or until the required sample size is reached. We may sometimes run the test even longer to collect more samples (in case significance has not been reached), but keep in mind that the longer you run the test, the more your results are prone to returning-visitor bias (when one experience has a higher percentage of returning visitors, biasing the samples).
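A tiny sketch of that rule of thumb, reusing the illustrative numbers from the power analysis above:

```python
import math

n_per_arm = 95_000       # from the power analysis sketch above (illustrative)
daily_visitors = 20_000  # treated visitors per day across both arms (illustrative)

# Run until the required sample size is reached, but never less than two weeks
days = max(14, math.ceil(2 * n_per_arm / daily_visitors))
print(f"plan to run for at least {days} days")
```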

In parallel, the analyst can initiate conversations with the engineering/instrumentation team to get the appropriate tags in place for the agreed-upon metrics.

Once we arrive at the estimates and all the tags are in place, we are ready to start the test!

Closing thoughts

Thanks for reading through my learnings! Kindly let me know if anything is unclear, and leave your questions in the comments section. In my next few posts, I plan to cover how to analyze and conclude A/B test results.

Source: Udacity's course on A/B Testing
