Unlocking A/B Testing — Part 2: evaluation and power analysis

Claudia Minardi
NEXT Engineering

--

If you have followed all the steps in Part 1: set up and execution, you have successfully started running your controlled experiments in the form of A/B tests, and you’re recording all the data you need to draw some interesting conclusions.

Let’s take a look at how to do this!

Evaluation

This step can be intimidating for many people, as there is a lot of statistics involved. Thankfully, many frameworks and programming languages offer out-of-the-box implementations of Student’s t-test, into which you just feed your data and get back the results.
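
For example, SciPy ships one such implementation for Python. The sketch below is purely illustrative: the two lists of per-user clicks are made-up numbers standing in for the data you recorded for each group.

```python
from scipy import stats

# Hypothetical per-user click counts recorded for each group (made-up numbers).
clicks_control = [3, 5, 2, 4, 6, 3, 4, 5, 2, 4]    # population A (blue button)
clicks_treatment = [5, 6, 4, 7, 6, 5, 8, 6, 5, 7]  # population B (red button)

# Student's independent two-sample t-test, straight out of the box.
t_statistic, p_value = stats.ttest_ind(clicks_control, clicks_treatment)
print(f"t = {t_statistic:.2f}, p = {p_value:.4f}")
```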

However, even when you look into the existing solutions, they often have parameters up for tweaking that sound just like a foreign language to you. What the heck is a confidence level?

Let’s start from the first thing you need to know about the t-test: the hypothesis. We know that at the end of the experiment we will have two sets of data: the one we recorded from population A (control), and the one we recorded from population B (treatment). Your hypothesis is that these two populations are different in a statistically significant way, that is to say that there is an observable difference between the two, and it’s not just by chance. If the t-test succeeds, that is, if the populations are statistically different, it means that the change you introduced had an effect.

So, going back to the red/blue button example: does this mean that your new red button is better than the old blue one? Absolutely not! To infer that you need to look at the actual data. If population B clicked on your red button more times than population A clicked on their blue one and Student’s T-test records a statistically significant difference, then congratulations! Your new feature is a success!

Let’s go through a few more obscure terms you’ll need to take into consideration:

  • Confidence level. How confident are you that you are drawing the right conclusion? Nobody can be right all the time, but being allowed to make BIG mistakes is probably something you don’t want either. The higher the confidence level, the harder it is to find a statistically significant difference. Best practice is to keep the confidence level at 95%, which implies that 5% of the time we will incorrectly conclude that there is a difference between A and B when there is none.
  • Standard error. It’s the standard deviation of the sampling distribution of the mean. In simpler words, it is an estimate of how far the sample mean is likely to be from the true population mean. In an ideal world, you want a very low standard error (see the short sketch after this list). Kohavi et al. (2009) propose a few techniques to reduce the standard error, such as using a large sample size (for example, obtained by running the experiment longer), or excluding the users that are not exposed to the variants, in order to filter out the noise.
  • Power, which is the probability of detecting a statistically significant difference between treatment and control when a real difference exists. This, as we will see later, influences the amount of data you need to collect during your experiment. Best practice is to keep the power between 80% and 95%.
  • Sample size, which according to Kohavi et al. (2009) refers to the number of impressions you need to collect in each variant (A and B). Remember, you chose the impression (or experimental unit) during the experimental setup! For example, if your experimental unit is a page view, the sample size will be the number of page views you have collected during the execution of your experiment. Be careful: the sample size is counted for each variant! This means that if you have a total of 1000 page views, split evenly between A and B, your sample size will be 500 per variant.
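
To make the standard error concrete, here is a tiny sketch. The simulated per-page-view OEC values are invented purely for illustration; the point is that a larger sample shrinks the standard error.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical per-page-view OEC values for one variant (invented data).
small_sample = rng.normal(loc=0.10, scale=0.05, size=500)
large_sample = rng.normal(loc=0.10, scale=0.05, size=2000)

def standard_error(samples):
    # Standard error of the mean = sample standard deviation / sqrt(sample size).
    return samples.std(ddof=1) / np.sqrt(len(samples))

# Quadrupling the sample size roughly halves the standard error.
print(standard_error(small_sample), standard_error(large_sample))
```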

When your experiment stops running, you can gather all the data you have recorded, namely all the OECs you picked in the experimental setup. You will have in your hands two sets of data: the one produced by the control group (people exposed to variant A), and the one produced by the treatment group (people exposed to variant B). It’s time to feed these two data sets to the test we have chosen to determine statistical significance.

As mentioned in the first part of this tutorial, we have chosen Student’s independent (unpaired) two-sample t-test. It is suitable for our situation since the two populations are distinct: the treatment and the control group are composed of different individuals, and the two sets are disjoint.

The result of this test will tell you, for the selected OEC, if the difference between the two populations is just the result of chance, or if it actually holds statistical significance. From this you can infer whether your experiment was successful or not!

Power Analysis

Now, there is one more secret left to share. You know how to set up, start, and evaluate your experiment, but when do you stop running it? How much data is enough to reach satisfying conclusions?

Most importantly, let’s imagine you have not reached statistical significance. Was it all in vain, or will simply running the experiment longer give you the answer you were looking for?

While you wait for your experiment to run its course, what you can do to answer all these questions is to perform a power analysis. As we’ve said before, the power is the likelihood of detecting a statistically significant difference: in short, a power analysis tells you exactly what you need to make the magic happen.

Given that we’ve used Student’s t-test for evaluation, our power analysis will be based on the same test. Again, there are many out-of-the-box solutions for power analysis, just choose the one that suits your needs! In our case, a power analysis for the two-sample t-test helps you figure out any of the following:

  • Sample size, as in how many impressions you need to reach the desired statistical significance
  • Significance level, as in the confidence level with which you want to determine statistical significance
  • Power
  • Effect size (or Cohen’s d), as in the amount of change we want to detect: the standardized difference between the control and the treatment means. According to Cohen (1977), 0.2 counts as a small effect, so designing for an effect size of about 0.2 means you are able to detect even small changes with good statistical significance.

The power analysis is a clever trick that can help you on many occasions: significance level, power, and effect size can be set to their default/best-practice values of 0.05, 0.8, and 0.2 respectively, and the power analysis will return the sample size you need to gather!
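
As a sketch of what this looks like in practice, here is one way to do it in Python with statsmodels; the inputs are just the best-practice defaults mentioned above.

```python
from statsmodels.stats.power import TTestIndPower

# Two-sample t-test power analysis: solve for the sample size per variant,
# given the best-practice defaults for effect size, significance level and power.
sample_size = TTestIndPower().solve_power(effect_size=0.2, alpha=0.05, power=0.8)
print(round(sample_size))  # roughly 394 impressions per variant
```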

Alternatively, let’s suppose you already know your sample size — for example, you have a limited amount of time to run your experiment, and you can estimate the number of impressions. You can now use the power analysis to determine what confidence level you’ll be able to reach! Or to estimate the effect size you’ll be able to measure!
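
Continuing the sketch above, the same call can be turned around: fix the per-variant sample size you can afford (a hypothetical 500 impressions here) and solve for the smallest effect size you would be able to detect.

```python
from statsmodels.stats.power import TTestIndPower

# Fix the per-variant sample size and solve for the detectable effect size
# at the usual significance level (0.05) and power (0.8).
detectable_effect = TTestIndPower().solve_power(nobs1=500, alpha=0.05, power=0.8)
print(round(detectable_effect, 2))  # roughly 0.18
```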

In short, it gives you a good estimation of the value of your tests, even before you have the actual data for evaluation.

To wrap it up

Now you have all the pieces of the puzzle that is A/B testing, from set up to execution, to evaluation. There are a lot of different approaches and tests that fit into the best practices of A/B testing, and if you feel like we missed something important, we’d love to hear it!

In the meantime, happy coding!

Photo: paimei01
