Maria Paskevich
May 28 · 7 min read

A/B testing is a very popular technique for checking granular changes in a product without mistakenly crediting them with effects that were caused by outside factors. In this series of articles, I will try to give an easy hands-on manual on how to design, run, and evaluate the results of a/b tests, so you are ready to go and get those amazing statistically significant results!

In the first article, we will talk about approaches that are the most suitable for designing experiments and estimating the number of users per group needed to be confident in the results.

Let’s say you have a game and are trying to increase retention rates by adding an additional gameplay mode. So, you spend some time and effort adding this new mode. In the end, you see that 10% more people have started to churn after the first day. Looks like these changes weren’t well received! And now investors are baying for blood. Word on the street is they are starting to wonder whether this is the right time to give the old heave-ho to the CEO. Or to the product manager? Or maybe to the artist (they all disliked this new color on the menu button anyway)? But at the same time, your game has been reviewed by a very popular YouTube blogger. This has helped you get an additional 500k installs in just 2 days. However, the overall quality of this traffic is likely to be much lower than usual, primarily because people watch the review and then try out the game for the heck of it. This is likely not the customer segment that would be highly interested in the genre of your game.

How can you be sure that the changes in retention rates were caused by the changes within the product and don’t have anything to do with the increase in installs driven by the popular blogger? Maybe it would have been even worse if the game didn’t have that new mode.

In this case, the best way to be sure would have been to run an a/b test for the new feature, releasing it for only a limited part of the audience and keeping the second part as a control group.

In this article, we will talk about approaches that are most suitable for this and similar cases.

Experiment design

It’s always better to think about experiment design before starting an a/b test. This includes several considerations:

1. Formulate the null and alternative hypotheses

Let’s go back to the example with the game: in this case, we are releasing the new mode and expecting that it will change users’ behavior. In other words, there are two possible outcomes after releasing the feature: either it affects players’ behavior or it does not.

So, the null and alternative hypotheses, in this case, should be formulated as follows:

H0: the new game mode hasn’t changed anything within the game, so the players’ metrics in both groups (test and control) come from one population with the same distribution.

H1: the new feature has actually changed people’s behavior, so the metrics have either increased or decreased (you may consider a one-sided test if you are sure about the direction of the effect). In this case, you expect the groups to belong to two different populations with different attributes (mean, standard deviation).

The a/b test will aim to reject the null hypothesis at a chosen significance level 𝛼: the null hypothesis is rejected when the p-value computed from the data falls below 𝛼.
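Written formally for day-1 retention, one possible reading of these hypotheses is shown below. The symbols p_test and p_control are hypothetical names for the true day-1 return rates in the two groups; they are introduced here for illustration only.

$$H_0:\; p_{test} = p_{control} \qquad\qquad H_1:\; p_{test} \neq p_{control}$$

For a one-sided test that only looks for an improvement, the alternative would instead read $H_1:\; p_{test} > p_{control}$.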

2. Plan the metrics you are going to check and the possible outcomes of the test

After formulating the null and alternative hypotheses you already more or less know what to expect, but it is always better to think about the exact metrics to be used in the test. This lets you calculate the sample sizes needed to detect a significant difference.

Usually, there are three possible categories of metrics:

  • A simple case with only two possible alternatives (yes/no, churned/returned, etc.)
  • More complicated case with more than two mutually exclusive alternatives
  • The third category covers continuous variables (an average session time, number of sessions, win rates, etc.)

For the first two categories, the results are expressed as percentages, while the third one is usually summarized in means and standard deviations.

You want to know in advance which category your experiment falls under because different types of metrics call for different statistical methods. The first two types usually require bigger sample sizes than the third one.

In our case of the new gameplay mode, we could go for a bunch of metrics: Retention, Average session time, Number of sessions per player, etc.

Let’s say that the main target, in this case, would be to increase both Retention rates on day-1 and average session time. This means we have one metric of type 1 (returned/churned) and one continuous variable (session time).
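As a rough sketch of how these two kinds of metrics are summarized, the snippet below reduces a binary metric to a proportion and a continuous metric to a mean and standard deviation. The function names and the toy data are hypothetical and invented for illustration; real values would come from your own analytics.

```python
from statistics import mean, stdev


def summarize_binary(outcomes):
    """Share of users with a positive outcome (1 = returned, 0 = churned)."""
    return sum(outcomes) / len(outcomes)


def summarize_continuous(values):
    """Mean and sample standard deviation of a continuous metric."""
    return mean(values), stdev(values)


# Hypothetical toy data, not taken from any real game.
returned_day1 = [1, 0, 1, 1, 0, 0, 1, 0]
session_minutes = [6.0, 8.0, 10.0, 7.5, 8.5]

print("day-1 retention:", summarize_binary(returned_day1))
print("session time (mean, std):", summarize_continuous(session_minutes))
```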

3. Estimate the sample sizes needed for the test

Once we have decided on the target metrics, it’s easier to estimate the sample size needed to spot a statistically significant difference.

Different categories of metrics call for different approaches to this problem:

3.1 Confidence interval for a proportion

This approach applies to the first and second types of metrics defined above, where we are looking for a change in a proportion between groups and want to measure it with acceptable precision.

The general formula is as follows:

$$n = \frac{Z_{1-\alpha/2}^{2}\; p\,(1-p)}{\omega^{2}}$$

In this formula, n is the required sample size, p is the hypothesized population proportion, and 𝜔 is half the width of the desired confidence interval. Z_{1-𝛼/2} is the value from the standard normal distribution table for the significance level 𝛼, split evenly between the two tails. 𝛼 itself is the probability of rejecting the null hypothesis when it is true: for example, a significance level of 0.05 indicates a 5% risk of concluding that a difference exists when there is no actual difference.

For the example with Retention on day-1 we can estimate the sample size.

Let’s say we know that current retention is 40% and expect the new game mode to increase it by at least 2 percentage points (our confidence interval, in this case, will be 4 points wide: 2 above and 2 below the estimate). This means p = 0.4 and 𝜔 = 0.02, and with 𝛼 = 0.05 we get Z_{1-𝛼/2} = 1.96:

$$n = \frac{1.96^{2} \cdot 0.4 \cdot (1 - 0.4)}{0.02^{2}} \approx 2305$$

So, in our case there should be at least 2305 users in the test group to be sure that a 2-point difference in day-1 retention is statistically significant.
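The retention estimate can be reproduced with a few lines of Python. This is a minimal sketch: the function name and its parameters are hypothetical, and the Z-value is taken from the standard library’s `statistics.NormalDist` rather than from a printed table.

```python
from math import ceil
from statistics import NormalDist


def sample_size_for_proportion(p, half_width, alpha=0.05):
    """Users needed so the confidence interval for a proportion p
    is no wider than +/- half_width at significance level alpha.

    Implements n = Z^2 * p * (1 - p) / half_width^2, rounded up.
    """
    z = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    return ceil(z ** 2 * p * (1 - p) / half_width ** 2)


# Day-1 retention example: p = 40%, half-width = 2 percentage points.
print(sample_size_for_proportion(p=0.4, half_width=0.02))  # 2305
```

Note how quickly the requirement grows as the interval narrows: halving the half-width quadruples the sample size, because it enters the formula squared.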

3.2 Power for the test of the difference between two sample means

This type of estimation can be used in cases where the target of a test is a continuous variable. In our example, it will be the average session time.

The formula looks like this:

$$n = 2\left(\frac{Z_{1-\alpha/2} + Z_{1-\beta}}{\delta}\right)^{2}$$

Here n is the required sample size per group. The two Z-values depend on the 𝛼-level (just like in the previous example, Z_{1-𝛼/2} = 1.96 with 𝛼 = 5%) and on the statistical power 1 − 𝛃. Power ranges from 0 to 1; as it increases, the probability of wrongly failing to reject the null hypothesis (the so-called type II error) decreases. For a type II error with probability 𝛃, the corresponding statistical power is 1 − 𝛃. We will compute the sample size required for 80% power, so Z_{1-𝛃} will be 0.84.

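𝛔 is the standard deviation of the metric, which enters the calculation through the effect size. It can be estimated from available historical data in the control group with the usual sample formula:

$$\sigma = \sqrt{\frac{1}{N-1}\sum_{i=1}^{N}\left(x_i - \bar{x}\right)^{2}}$$

where the x_i are the N observed values and x̄ is their mean.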

𝛅 is the effect size: the expected difference between the test and control group means divided by the standard deviation (𝛔).

$$\delta = \frac{\left|\mu_{test} - \mu_{control}\right|}{\sigma}$$

So, let’s say, in our example, we know that the average session time is 8 minutes and we expect the new feature to increase it by 1 minute. Also, from historical data, the standard deviation for the average session time is 4 minutes.

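In this case:

$$\delta = \frac{1}{4} = 0.25, \qquad n = 2\left(\frac{1.96 + 0.84}{0.25}\right)^{2} = 250.88$$

Which means we’ll need about 251 users in the test group to be sure that the 1-minute difference in the average session time is statistically significant.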
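The same calculation can be sketched in Python. The function name and parameters are hypothetical, and the code assumes the two-sample form n = 2((Z_{1-𝛼/2} + Z_{1-𝛃}) / 𝛅)², which is consistent with the numbers in this example. Because it uses exact normal quantiles instead of the rounded table values 1.96 and 0.84, it returns 252 rather than the roughly 250 obtained above.

```python
from math import ceil
from statistics import NormalDist


def sample_size_for_means(difference, sigma, alpha=0.05, power=0.8):
    """Users per group needed to detect a given difference in means.

    difference: expected gap between test and control means
    sigma:      standard deviation of the metric (e.g. from historical data)
    """
    effect_size = abs(difference) / sigma          # delta in the text
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)           # about 0.84 for 80% power
    return ceil(2 * ((z_alpha + z_beta) / effect_size) ** 2)


# Session-time example: +1 minute expected, standard deviation of 4 minutes.
print(sample_size_for_means(difference=1, sigma=4))  # 252
```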

As we see, the retention test needs many more users (about 2,300 vs about 250). Since both metrics are measured in the same experiment, the sample size for the test should be taken from the larger estimate.

In the next article, we will talk about the results of a test and statistical techniques suitable for different situations. Stay tuned!

The Startup

Medium's largest active publication, followed by +489K people. Follow to join our community.

Maria Paskevich

Written by

Data analyst, work with mobile games.
