How to run an A/B Test?

Jay Arora · Geek Culture · Jun 7, 2021 · 7 min read
Source: https://www.optimizely.com/optimization-glossary/ab-testing/

At Bing, a small headline change an employee proposed was deemed a low priority and shelved for months until one engineer decided to do a quick online controlled experiment — an A/B test — to try it out. The test showed that the change increased revenue by an astonishing 12%. It ended up being the best revenue-generating idea Bing ever had, worth $100 million. That experience illustrates why it’s critical to adopt an “experiment with everything” approach, say Kohavi, the head of the Analysis & Experimentation team at Microsoft, and Thomke, a Harvard Business School professor¹.

Let’s understand how A/B testing works. An A/B test is a way to compare two versions of something to figure out which one performs better. A/B tests can be used in a variety of scenarios, such as:

1. Which version of the webpage leads to higher conversions?

2. Which marketing email results in more user engagement?

3. Is the drug effective in treating a certain illness?

And many more!

Good candidates for an A/B test are scenarios where each user is impacted individually. It is hard to measure the impact of the test when there are network effects among users; for example, an A/B test on users of a dating app is problematic because their behaviour involves matching and interacting with other users.

Running a successful A/B test requires careful selection of a response variable that will help quantify the impact of the change being tested. The response variable should ideally be a KPI or another related metric that can be easily measured.

Let’s consider an example of an app². Below are two samples of data for the app. The first one is the demographic data, which lists all the app users. The second shows the users who see the paywall; its purchase column indicates whether the user makes a purchase. Not all app users see a paywall, since they might be using only the basic functionality or not using the app enough.

Demographic Data²
Paywall Data²

Merging the two datasets will give us a list of all users who see the paywall. The purchase column indicates whether they make a purchase.

The app team is considering two different messaging options for a consumer paywall to understand which option generates more revenue.

Current paywall: “We hope you are enjoying using our app. Consider becoming a preferred member to access all features.”

Proposed paywall: “To access all features of the app, become a preferred member.”

Before they run an A/B test, they need to decide on the KPI for the test. The primary goal of running the A/B test is to increase revenue. The overall revenue can vary significantly based on the type of service, seasonality and different price points. When the revenue varies significantly or depends on multiple factors, it is advisable to pick a metric that is more granular than overall revenue and yet a good proxy for revenue. In the app example, we will consider ‘paywall view to purchase conversion rate’ as the metric to evaluate our A/B test. There are other metrics that can be considered for the test. It is important to pick a metric which is granular and relatively stable over a period.

In the app data, the baseline conversion rate is 3.47%.

Daily purchases = 3,181.8

Daily paywall views = 91,731.9

Conversion rate = Daily purchases / Daily paywall views ≈ 0.0347
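These steps can be sketched in pandas. This is only an illustration: the file names and the column names (uid, date, purchase) are assumptions based on the course description², not the article’s actual data.

```python
import pandas as pd

# File and column names ('uid', 'date', 'purchase') are assumptions,
# not taken from the article's actual data.
demographics = pd.read_csv('user_demographics.csv')
paywall_views = pd.read_csv('purchase_data.csv')

# Inner join keeps only users who have seen the paywall.
purchase_data = demographics.merge(paywall_views, how='inner', on='uid')

# Average daily purchases and daily paywall views over the observation window.
daily = purchase_data.groupby('date').agg(
    purchases=('purchase', 'sum'),        # purchases made that day
    paywall_views=('purchase', 'count'),  # paywall views that day
)
daily_purchases = daily['purchases'].mean()
daily_paywall_views = daily['paywall_views'].mean()

print(daily_purchases / daily_paywall_views)  # ~0.0347 on the article's numbers
```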

Now that we know the baseline conversion rate, we need to decide on the test sensitivity, i.e., the amount of increase in the conversion rate that would be considered meaningful. The chosen value of sensitivity depends on the business context, the type of proposed change and the historical variation of the chosen KPI. For the app, a good knowledge of historical changes in daily purchases and daily paywall views helps determine the sensitivity. We can calculate the increase in daily purchases implied by different sensitivity values, as sketched below.
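For instance, a quick sketch of that calculation using the baseline figures above:

```python
daily_paywall_views = 91731.9
baseline_rate = 0.0347

# Extra purchases per day implied by each candidate sensitivity (relative lift).
for sensitivity in (0.05, 0.10, 0.20):
    lifted_rate = baseline_rate * (1 + sensitivity)
    extra = daily_paywall_views * (lifted_rate - baseline_rate)
    print(f"{sensitivity:.0%} lift -> ~{extra:.0f} extra daily purchases")
```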

A small change on an app such as a change in paywall messaging is likely to give us a smaller lift compared to launching a new offer with a promotion. For the purpose of this exercise, we will choose a sensitivity value of 10%.

Once we have chosen the sensitivity value, we need to calculate the sample size required for the control and test groups. The sample size depends on several parameters: the confidence level, the statistical power and the sensitivity of the test. The confidence level is the probability of not rejecting the null hypothesis when it is true; commonly used values for the confidence level are 0.90 and 0.95. Statistical power is the probability of finding a statistically significant result when the null hypothesis is false; we will use a statistical power of 0.80. Explaining confidence level and statistical power in detail is beyond the scope of this post.

We can use a Python function to calculate the sample size; the functions are shown below.

Sample size function in python²
Power function in python²
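The article embeds these functions as images from the course², which are not reproduced here. A minimal sketch of the same idea, a normal-approximation power calculation for the difference of two proportions, with the sample size found by stepping n up until the target power is reached, might look like the following (the function names, step size and stopping rule are my own choices, not necessarily the course’s):

```python
from scipy import stats

def get_power(n, p1, p2, cl):
    """Approximate power of a two-sided z-test comparing two proportions,
    with n observations per group and confidence level cl."""
    alpha = 1 - cl
    z_crit = stats.norm.ppf(1 - alpha / 2)
    diff = abs(p2 - p1)
    pooled = (p1 + p2) / 2
    sd_null = (2 * pooled * (1 - pooled)) ** 0.5      # SE factor under H0 (pooled)
    sd_alt = (p1 * (1 - p1) + p2 * (1 - p2)) ** 0.5   # SE factor under H1 (unpooled)
    return stats.norm.cdf((diff * n ** 0.5 - z_crit * sd_null) / sd_alt)

def get_sample_size(power, p1, p2, cl, max_n=1_000_000, step=100):
    """Smallest per-group sample size (searched in increments of `step`)
    that reaches the requested power."""
    n = step
    while n <= max_n:
        if get_power(n, p1, p2, cl) >= power:
            return n
        n += step
    raise ValueError("Required sample size exceeds max_n")

baseline = 0.0347
lifted = baseline * 1.10   # 10% sensitivity
print(get_sample_size(0.80, baseline, lifted, 0.95))
```

With these inputs the search lands in the same neighbourhood as the figure reported below; the exact number depends on rounding and on the search step.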

With a confidence level of 0.95, a statistical power of 0.80, a baseline conversion rate of 3.47% and a lifted conversion rate of 3.81%, we get a sample size of 45,788. For some experiments it will not be possible to obtain such a sample size. There are ways to reduce the sample size required for the test.

1. Choose a unit of observation with lower variability. In this example, if we had chosen revenue instead of conversion rate, our sample size would have been even bigger.

2. Exclude users irrelevant to the process/change. In this case, we are excluding users who have never seen a paywall. Including them would have introduced more variability and required a bigger sample size for the test.

3. Choose lower values of confidence level and statistical power. Doing so reduces the required sample size, as the sketch below shows.
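To illustrate the third point, reusing the get_sample_size sketch above (same caveats apply):

```python
baseline = 0.0347
lifted = baseline * 1.10

print(get_sample_size(0.80, baseline, lifted, 0.95))  # cl=0.95, power=0.80 (as above)
print(get_sample_size(0.80, baseline, lifted, 0.90))  # lower confidence level -> smaller n
print(get_sample_size(0.70, baseline, lifted, 0.95))  # lower power -> smaller n
```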

As the next step, we need to pick control and test groups that are comparable in size and randomly assigned. The randomness can be checked by comparing user characteristics, such as demographics and behaviour, across the two groups. When the chosen groups are not random, it is not possible to attribute changes in the KPI to the change being tested. Once we have checked the control and test groups for randomness, we can run the A/B test.
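A sketch of a simple random assignment and sanity check, continuing from the merged purchase_data frame above (the group labels and the gender column are assumptions for illustration):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=42)

# Randomly assign each eligible user to control ('C') or test ('V').
users = purchase_data['uid'].unique()
assignment = pd.DataFrame({'uid': users,
                           'group': rng.choice(['C', 'V'], size=len(users))})
df = purchase_data.merge(assignment, on='uid')

# Sanity checks: comparable group sizes and a similar demographic mix.
print(df.groupby('group')['uid'].nunique())
print(df.groupby(['group', 'gender'])['uid'].nunique())  # 'gender' column assumed
```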

After running the A/B test, let’s have a look at the conversion rate for the two groups to compare the results.

Source: DataCamp. V refers to the test group. Conv is the conversion rate.

The conversion rate for the test group is higher, but we want to check whether the observed difference is statistically significant. To do so, we start with the null hypothesis, i.e., that there is no difference between the control and the test group. To decide whether to reject the null hypothesis, we calculate the p-value.

Without going into too much detail, if the p-value is less than 0.05 we can reject the null hypothesis and conclude that the two groups are different. Calculating the p-value for this A/B test gives us a value of 4.25e-10. Therefore, we can reject the null hypothesis and conclude that the observed lift in the conversion rate for the test group compared to the control group is statistically significant. The p-value can be calculated using the function below.

p-value function in python²
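The course’s function is again shown as an image²; a sketch of the same idea, a two-sided z-test on the difference between the two conversion rates using unpooled variances, might be:

```python
from scipy import stats

def get_pvalue(con_conv, test_conv, con_size, test_size):
    """Two-sided p-value for the difference between two conversion rates
    (normal approximation with unpooled variances)."""
    diff = test_conv - con_conv
    se = (con_conv * (1 - con_conv) / con_size
          + test_conv * (1 - test_conv) / test_size) ** 0.5
    # Probability of a difference at least this extreme if there is no true difference.
    return 2 * stats.norm.sf(abs(diff) / se)

# Illustrative call only; the real inputs are the observed group sizes and
# conversion rates from the experiment (not reproduced in the article text).
# p = get_pvalue(con_conv, test_conv, con_size, test_size)
```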

The A/B test results can be summarised as below.

A/B test results

We can also consider a few plots to visualise the results. Plots that can be useful include:

1. A plot showing the conversion rates for the two groups.

2. A plot showing the estimated lift along with its confidence interval.
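A minimal matplotlib sketch of both plots; the conversion rates and group sizes below are placeholders based on the baseline and lifted rates, not the article’s actual test results:

```python
import matplotlib.pyplot as plt
from scipy import stats

con_conv, test_conv = 0.0347, 0.0381   # placeholder conversion rates
con_size = test_size = 45788           # placeholder group sizes

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Plot 1: conversion rate per group.
ax1.bar(['Control', 'Test'], [con_conv, test_conv])
ax1.set_ylabel('Paywall view to purchase conversion rate')

# Plot 2: estimated lift with a 95% confidence interval.
lift = test_conv - con_conv
se = (con_conv * (1 - con_conv) / con_size
      + test_conv * (1 - test_conv) / test_size) ** 0.5
ci = stats.norm.ppf(0.975) * se
ax2.errorbar([0], [lift], yerr=[ci], fmt='o', capsize=5)
ax2.axhline(0, linestyle='--')
ax2.set_xticks([0])
ax2.set_xticklabels(['Lift (test - control)'])
ax2.set_ylabel('Estimated lift')

plt.tight_layout()
plt.show()
```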

Once we have established that the observed lift is statistically significant, we can go ahead and implement the change across the entire user base. After implementing the change, it is a good practice to monitor the KPIs to ensure that we observe the expected change in KPIs. An unusual change in KPI could indicate a problem in how the change has been implemented for a group of users.

Running an A/B test can be a good way to answer a business question quickly. A great advantage is that if the tested change doesn’t work, only a small number of users are affected and we can quickly revert to the old version.

References

1. https://www.hbs.edu/faculty/Pages/item.aspx?num=53201

2. ‘Customer Analytics and A/B testing in python’ course on DataCamp. https://learn.datacamp.com/courses/customer-analytics-and-ab-testing-in-python

3. https://hbr.org/2017/06/a-refresher-on-ab-testing

4. https://www.optimizely.com/optimization-glossary/ab-testing/
