Conquering A/B Testing (3): How long should an A/B test run for? What should be my sample size?

Jamie Lee
Published in Hackle Blog
6 min read · Apr 1, 2022

This article is the third post in Hackle’s ‘Conquering A/B Testing’ series. The series addresses the common questions people have across the entire process of A/B testing: test design, preparation, result interpretation, and final decision-making. To check out the first post in the series, “When should I conduct A/B testing?”, click here.

Hackle’s Conquering A/B Testing covers the following topics:

1. When should I conduct A/B testing?

2. How should I set my metrics for an A/B test?

3. How long should an A/B test run for? What should be my sample size?

4. How should I set the user identifiers?

5. Can we deduce a causal relationship from an A/B test?

6. When is the right time to stop an A/B test?

7. How can I reach my conclusions with ambiguous A/B testing results?

8. What should I do if I want to restart an A/B test?

9. How should I conduct review processes within my organization?

***

In this post, we will cover one of the most common questions we get from clients who are looking to get started with A/B testing: determining the minimum required sample size and duration for an A/B test. How long should my A/B test run for? What should my sample size be?

The simple answer to this question is: ‘There is a recommended standard, but the numbers are unique to each service and company.’

This post provides practical guidelines, with examples from the Hackle dashboard, not only for those interested in A/B testing but also for data scientists who are new to experimentation.

Required Minimum A/B Testing Sample Size: T-Tests and Statistical Tests

What if your sample size is too small? The t-test is a popular technique for assessing statistical significance when the sample is small, because the t-distribution accounts for the extra uncertainty that comes with estimating variability from only a few observations. Hence, in statistics, significance can be verified through a t-test even in A/B tests with samples of fewer than 30 users. In practice, however, A/B tests on web and mobile applications are generally run with far more than 30 users per group, so there is usually no need to worry that the sample is too small for the test itself.
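As a rough illustration (the numbers below are made up, and this is not how any particular dashboard computes its results), here is how a two-sample t-test can be run in Python with SciPy:

```python
# Illustrative only: Welch's two-sample t-test on made-up per-user metric values.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Hypothetical metric values for a control group (A) and a test group (B),
# with deliberately small samples.
group_a = rng.normal(loc=10.0, scale=3.0, size=25)
group_b = rng.normal(loc=11.5, scale=3.0, size=25)

# equal_var=False gives Welch's t-test, which does not assume equal variances.
result = stats.ttest_ind(group_a, group_b, equal_var=False)
print(f"t = {result.statistic:.3f}, p-value = {result.pvalue:.4f}")
```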

However, it is important to note that the smaller the sample size, the larger the difference in the metric between the test groups has to be before the result can be confirmed as statistically significant.
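The flip side of that trade-off can be made concrete with a standard power calculation. The sketch below is illustrative only (it assumes a hypothetical 10% baseline conversion rate, a 5% significance level, and 80% power, and uses statsmodels): the smaller the lift you want to detect, the more users you need per group.

```python
# Illustrative only: required sample size per group for a conversion-rate A/B test,
# at a 5% significance level and 80% power, for several minimum detectable effects.
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline = 0.10  # hypothetical baseline conversion rate of the control group
power_analysis = NormalIndPower()

for lift in (0.01, 0.02, 0.05):  # absolute improvements we want to be able to detect
    effect = proportion_effectsize(baseline + lift, baseline)
    n_per_group = power_analysis.solve_power(
        effect_size=effect, alpha=0.05, power=0.8, alternative="two-sided"
    )
    print(f"to detect a +{lift:.0%} absolute lift: ~{int(round(n_per_group))} users per group")
```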

Required Minimum A/B Test Duration: Time to Complete a Conversion Cycle

When setting the duration of an A/B test, you usually take into account the time needed to accumulate a sample large enough to reach a significant result, as well as the time needed to average out external factors that only occur at certain points in time (for example, day-of-week effects).

What if your sample size is too small? For services with relatively few users, the experiment period can stretch to 3–4 weeks. And what if users take, on average, about a week after being exposed to the new change before making their final purchase decision? In that case you can set the test duration to 2 weeks so that there is enough time for conversions to occur. Conversely, if you expect a large movement in the metric, the period may be shortened.

As such, there is no single right answer for the duration of an A/B test, but it is generally recommended to run A/B tests for at least 1 week as the minimum requirement.
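Putting those considerations together, a back-of-the-envelope duration estimate can be sketched as follows. The traffic, sample-size, and conversion-cycle numbers here are hypothetical placeholders, so substitute your own service’s figures.

```python
# Illustrative only: a rough minimum-duration estimate for planning purposes.
import math

# Hypothetical planning inputs -- replace with your own service's numbers.
required_users_per_group = 7500   # e.g. from a power calculation like the one above
number_of_groups = 2              # control (A) + one test group (B)
eligible_users_per_day = 2000     # daily traffic entering the experiment
conversion_cycle_days = 7         # average lead time from exposure to purchase

days_for_sample = math.ceil(
    required_users_per_group * number_of_groups / eligible_users_per_day
)

# Run at least one full week, at least one conversion cycle,
# and long enough to collect the required sample.
recommended_duration = max(7, conversion_cycle_days, days_for_sample)
print(f"Recommended minimum duration: {recommended_duration} days")
```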

There is no such thing as a “Failed A/B Test”

Consider this situation:

You are conducting your A/B test based on your predetermined sample size and test duration. However, as you continue to run your A/B test and your sample size increases, you realize that your accumulated results still do not show any statistical significance.

Could this be called a failed A/B test? The answer is: ‘not necessarily’.

Going back to when the A/B test was initially planned, the purpose of the test was to measure users’ reactions to a new change. The fact that there was no statistically significant difference in the metrics of the two groups means that users did not particularly prefer one option over the other, and such a result is itself a lesson learned.

The next step to take is to segment the results into specific user groups.

Although there was no statistically significant difference when the two test groups were viewed as a whole, when the users of each test group are divided into specific segments (e.g., by membership level, region, or age) and analyzed, you may find that a particular user group behaves differently and that a statistically significant preference shows up only within that segment of users.

On top of this, such segmentation analysis can help you discover bugs or usability errors that only affect users on a specific platform (e.g., Android).
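As a minimal sketch of what such a segment-level breakdown can look like (the data here is randomly generated and purely hypothetical, with an effect injected only for Android users), you can group the exposure log by segment and test each segment separately:

```python
# Illustrative only: per-segment comparison of conversion rates between two groups.
import numpy as np
import pandas as pd
from statsmodels.stats.proportion import proportions_ztest

rng = np.random.default_rng(7)

# Hypothetical exposure log: one row per user with group, platform segment, and outcome.
n = 4000
df = pd.DataFrame({
    "group":   rng.choice(["A", "B"], size=n),
    "segment": rng.choice(["Android", "iOS"], size=n),
})
# Hypothetical behaviour: the change helps Android users but not iOS users.
base_rate = 0.10
lift = np.where((df["group"] == "B") & (df["segment"] == "Android"), 0.04, 0.0)
df["converted"] = rng.random(n) < (base_rate + lift)

for segment, part in df.groupby("segment"):
    counts = part.groupby("group")["converted"].agg(["sum", "count"])
    _, p_value = proportions_ztest(
        count=counts["sum"].to_numpy(), nobs=counts["count"].to_numpy()
    )
    print(f"{segment}: p-value for A vs B conversion = {p_value:.4f}")
```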

(Source: Hackle dashboard)

In the end, regardless of whether you reached statistical significance, every A/B test result conveys a certain message. Based on the lessons learned from a previous A/B test, you will be able to design a better A/B test in the future. This trial-and-error process of running various A/B tests is part of an iteration loop that lets you build your own agile development practice and better understand your users in the long run.

A/B Test Results on the Hackle Dashboard

The Hackle dashboard provides the following calculation results and p-values for the metrics (e.g., membership subscription rate) of each test group in an A/B test.

(Source: Hackle dashboard)

From the above screenshot, we can see that test group C showed an improvement of 21.59% compared to the control group (test group A). However, the p-value was about 0.2702, which does not fall within the range generally interpreted as statistically significant (0.05 or less). (The p-value is one of the important statistical concepts to be aware of during A/B testing, and will be covered in a separate post on the Hackle blog.)

In this case, you can refer to the Bayesian probability, which directly estimates the probability that the metric of one test group is higher than that of another. The Bayesian probability is also a useful reference because it can often support a decision earlier than waiting for the p-value to drop to 0.05 or below.

(Source: Hackle dashboard)

Looking at the image above, the Bayesian probability that test group C is better than the control group (test group A) is 87%. From this information, you can decide to end the A/B test with test group C as the winning group, after considering the calculation results from various viewpoints. (Of course, if important metrics other than the membership subscription rate deteriorated in test group C, another test group may be selected as the winner.)
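To make the idea concrete: for a conversion-rate metric, a “probability that C beats A” of this kind can be estimated with a simple Beta-posterior simulation. The sketch below uses made-up counts and a uniform prior; it is a generic illustration, not a description of how the Hackle dashboard computes its figure.

```python
# Illustrative only: estimating P(conversion rate of test group C > control group A)
# with Beta posteriors and a uniform Beta(1, 1) prior.
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical counts: (conversions, exposed users) for each group.
conv_a, n_a = 210, 2000   # control group A
conv_c, n_c = 255, 2000   # test group C

samples_a = rng.beta(1 + conv_a, 1 + n_a - conv_a, size=200_000)
samples_c = rng.beta(1 + conv_c, 1 + n_c - conv_c, size=200_000)

prob_c_beats_a = (samples_c > samples_a).mean()
print(f"P(C beats A) is approximately {prob_c_beats_a:.1%}")
```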

In summary, A/B testing is a tool that allows you to understand your users better. Rather than spending too much time worrying about the minimum sample size and A/B test duration, get started on your first A/B test and accumulate data and lessons about your users’ behavior. Ending an A/B test once you have learned enough, and building the next iteration on that learning, is a shortcut for your organization to grow faster and move forward.

Check out Hackle at www.hackle.io to start creating your own A/B tests and release the right features to maximize your customers’ experience.
