The AB Testing Cookbook - Part 2

Ibtesam Ahmed
7 min read · Aug 16, 2023


This article is the second in my series, “The AB Testing Cookbook”, where I plan to give a comprehensive, step-by-step guide to AB testing.

In the first article I talked about the need to AB test and why business stakeholders have to be convinced that AB testing is the only scientific way to make changes to a product.

If you are reading this article, I am guessing you want to know more about it, so without wasting any more time, let’s start with the fundamentals.

Step 1: Formulating a hypothesis

This is the easiest and most natural way to start AB testing. In fact, it is something we do in our daily lives to make decisions. For instance, suppose you leave your dog alone in the house and come back to find your trash can spilled. You formulate two hypotheses: either someone broke into the house and spilled the trash can, or your dog spilled it. Knowing the odds of the two things happening in the real world, you know which hypothesis is more likely.

Moving on to a more serious example, let’s assume you are part of the marketing team in a company, responsible for bringing customers to the app through campaigns. You have been debating with your colleague about the effect of emojis in promotional messages. While you think they are beneficial and make the messages feel more millennial, your colleague thinks they make the messages look spammy. You want to test this out. So you lay out a null hypothesis which says adding emojis won’t change the click-through rate of your messages, and an alternative hypothesis that says it will change it.

This change can go in either direction, it can increase the click-through rate or decrease it, so we will do a two-tailed hypothesis test.
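To make this concrete, here is a minimal sketch of what the two-tailed test could look like in Python once the click data is in. The numbers are entirely made up, and the choice of statistical test is the topic of the next article; here I just use the two-sample proportions z-test from statsmodels as one reasonable option for a rate metric.

```python
from statsmodels.stats.proportion import proportions_ztest

# Hypothetical click data: clicks and sends for control (no emoji) and treatment (emoji)
clicks = [420, 465]
sends = [10000, 10000]

# Two-sided test: H0 says the click-through rates are equal,
# H1 says they differ in either direction.
z_stat, p_value = proportions_ztest(count=clicks, nobs=sends, alternative="two-sided")
print(f"z = {z_stat:.2f}, p-value = {p_value:.3f}")
```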

Step 2: Deciding the randomisation unit and splitting into groups

After you have formed your null and alternative hypotheses, the next step in your journey is forming a control group and a treatment group. In the control group the messages remain as they are today, while in the treatment group the messages now have emojis.

How to form these groups is a question that remains to be answered. Let’s say you take your user base and split it randomly into two groups, doing a 50/50 split. The user here is the randomisation unit, and randomising over it ensures that the two groups are statistically equivalent except for the treatment. You could also do an uneven split, like 95/5, but there are some tradeoffs associated with that, which I’ll explain later.

When you are randomly dividing users, you could use the user id as the randomisation unit because it is stable over time, but then you’ll only be able to run your test on logged-in users. There are other possible randomisation units too, like a session id or a page-view id, which don’t require a log-in but can lead to the same user being assigned to different variants at different times. As you can see, there are tradeoffs here and you should choose what works best for you.
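As a rough illustration of how the assignment itself is often implemented, here is a sketch that hashes a hypothetical user_id together with an experiment name, so the same user always lands in the same group without having to store the assignment anywhere:

```python
import hashlib

def assign_variant(user_id: str, experiment: str = "emoji_test",
                   treatment_share: float = 0.5) -> str:
    """Deterministically map a randomisation unit to a variant.

    Hashing user_id together with the experiment name gives a stable,
    roughly uniform value in [0, 1); units below treatment_share go to treatment.
    """
    digest = hashlib.md5(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) / 16 ** 32  # value in [0, 1)
    return "treatment" if bucket < treatment_share else "control"

print(assign_variant("user_12345"))  # the same user id always gets the same variant
```

Swapping user_id for a session id or a page-view id changes the randomisation unit without changing the mechanics.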

Step 3: Deciding the metric and the time duration for the test

The metric that your treatment is expected to move has to be decided beforehand by all the stakeholders, based on the goal that you want to achieve. In the above example of adding emojis to messages, the goal would be to increase click-through rates, app-open rates and thereby conversion. The primary metrics used in AB tests are also called driver metrics. They are sensitive to change, actionable, easy to measure, and attributable to the treatment. For the above example we cannot use daily active users as the metric, since it can be influenced by many factors besides the campaign’s effectiveness.

Apart from driver metrics, there are also guardrail metrics, which are used to ensure that the new treatment does not harm the business. For example, if you have added a new feature to the home page of your app, the time to load the app would be a guardrail metric: it should not go above a certain threshold or you will start losing users. The customer’s engagement with the feature itself is a driver metric.

The duration of the test should also be decided during the experiment design phase, before running it. Not doing this can lead to false positives and introduce bias into the results; for example, a stakeholder can prematurely stop the test the moment the metrics align with their hypothesis.

The time duration has to be decided keeping in mind the weekend effect, seasonality and the novelty effect. People tend to make more purchases over the weekend, so an experiment should run for at least a full week. The holiday season, especially days like Black Friday, influences user behaviour in ways that cannot be generalised to non-holiday days.

So far I have covered the more theoretical part; the next step is going to be a tad more statistics-heavy.

Step 4: Power analysis

There are two types of errors we need to account for while running AB tests, which in practice means deciding what level of each error is acceptable to us. Let’s talk about the errors first.

Significance level / Type I error: α = P(reject null | null is true)

The Type I error gives you the probability of rejecting the null hypothesis when it is actually true. We want to keep this as low as possible. 0.05 or 5% is the standard, which translates as: 95 out of 100 times we won’t reject the null hypothesis when it’s true.

Type II error: β = 1 − statistical power = P(fail to reject null | null is false)

On the contrary, the Type II error gives you the probability of failing to reject the null hypothesis when the null hypothesis is false. You want the Type II error to be low, which makes the statistical power high. Statistical power is the probability of rejecting the null hypothesis when the alternative hypothesis is true. The standard here is 0.8 or 80%.
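If these definitions feel abstract, a quick simulation makes them tangible. The sketch below, with made-up click-through rates, runs many simulated experiments: when there is truly no effect, the fraction of (incorrect) rejections approximates α, and when there is a real effect, the fraction of missed detections approximates β.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n, n_sims, alpha = 2000, 2000, 0.05
base_ctr, true_lift = 0.05, 0.02   # hypothetical baseline rate and real effect

def rejects_null(p_control, p_treatment):
    """Simulate one experiment and return True if the null is rejected."""
    control = rng.binomial(1, p_control, n)
    treatment = rng.binomial(1, p_treatment, n)
    _, p_value = stats.ttest_ind(control, treatment)
    return p_value < alpha

# Null is true: every rejection is a Type I error (should happen ~5% of the time)
type_1 = np.mean([rejects_null(base_ctr, base_ctr) for _ in range(n_sims)])

# Null is false: every failure to reject is a Type II error; 1 - beta is the power
type_2 = 1 - np.mean([rejects_null(base_ctr, base_ctr + true_lift) for _ in range(n_sims)])

print(f"Type I error ≈ {type_1:.3f}, Type II error ≈ {type_2:.3f}")
```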

You might be wondering why you need to decide α and β before running the test. The answer is that you need them beforehand to calculate the minimum sample size required for the AB test.

The standard formula for the minimum sample size per group, for a two-sided test comparing two groups, is:

n = 2σ²(z_{1-α/2} + z_{1-β})² / δ²

Let’s break down the formula bit by bit:

  • n : the sample size per group that we want to calculate
  • α and β : already discussed above; the formula uses their z-scores, z_{1-α/2} and z_{1-β}
  • σ² : the variance of the metric, estimated from the control group’s historical data. If there is no historical data, you can run an A/A test first to estimate the variance.
  • δ : the practical significance, or minimum detectable effect, agreed upon by the stakeholders

Practical significance is pretty important and is often overlooked in experiment design, so I’ll elaborate on it a bit more. It is the minimum difference between control and treatment that would actually make it worthwhile to scale up the treatment. In other words, it tells you whether the effect is large enough to be meaningful in the real world.

Now, let’s look at the relationships between different variables in the above equation.

  • n and α are inversely related, which means for a smaller α (a greater confidence level) you will need more samples.
  • n and β are also inversely related. For greater power in the test you will need more samples.
  • Variance and n are directly related: if your metric has more variance, you will need more samples.
  • δ and n are also inversely related. To detect a smaller effect with the same level of confidence, you’ll need a larger sample.

While actually running an AB test you won’t have to calculate this by hand; there are many online calculators for getting the sample size (and it only takes a few lines of code, as sketched below). Nonetheless, it’s important to know the relationships between these variables and how they affect each other.
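Here is a minimal sketch that plugs the formula above straight in, with hypothetical numbers for a click-through-rate metric (for a binary metric the variance is simply p(1 − p)); statsmodels also offers solve_power helpers that do the same job.

```python
from scipy import stats

alpha, power = 0.05, 0.80      # the standard choices discussed above
baseline_ctr = 0.05            # hypothetical click-through rate in control
delta = 0.01                   # minimum detectable effect agreed with stakeholders

variance = baseline_ctr * (1 - baseline_ctr)   # variance of a binary metric

z_alpha = stats.norm.ppf(1 - alpha / 2)        # two-sided test
z_beta = stats.norm.ppf(power)

n_per_group = 2 * variance * (z_alpha + z_beta) ** 2 / delta ** 2
print(f"Minimum sample size per group ≈ {n_per_group:.0f}")   # roughly 7,500 here
```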

There’s another important topic: confidence intervals. They quantify the uncertainty around your estimated effect and give a range in which the true effect is likely to lie at a given confidence level. I have written another article on this, so I won’t go too deep into it here.
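As a quick, hypothetical example of what that looks like in practice, here is a sketch of a 95% confidence interval for the difference in click-through rates between treatment and control (same made-up numbers as before):

```python
import numpy as np
from scipy import stats

# Hypothetical results: clicks / sends in control and treatment
p_c, n_c = 420 / 10000, 10000
p_t, n_t = 465 / 10000, 10000

diff = p_t - p_c
se = np.sqrt(p_c * (1 - p_c) / n_c + p_t * (1 - p_t) / n_t)
z = stats.norm.ppf(0.975)   # 95% confidence level

print(f"Difference in CTR: {diff:.4f}, 95% CI: [{diff - z * se:.4f}, {diff + z * se:.4f}]")
```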

Before wrapping up this article, let me answer an essential question that everyone doing an AB test deals with.

Should I keep the split even or uneven?

Ideally, you should keep the split 50–50 because an even split gives the lowest variance of the estimated effect and therefore the highest statistical power. Your test will reach statistical significance much faster with an even split. A 95–5 split makes more sense in a holdout experiment, where you are scaling up a new feature cautiously and monitoring its impact, rather than in an AB test where you are deciding between variants.

[Plot: statistical power vs. experiment runtime for different split ratios; the 50–50 split reaches the target power the fastest.]
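To get a feel for that effect, the sketch below computes power for a few split ratios at a fixed total sample size, using the same normal-approximation logic as the sample-size formula above (the baseline rate and effect are again hypothetical):

```python
import numpy as np
from scipy import stats

total_n, alpha = 20000, 0.05
p, delta = 0.05, 0.01                      # hypothetical baseline rate and true effect
z_alpha = stats.norm.ppf(1 - alpha / 2)

for treatment_share in [0.5, 0.8, 0.95]:
    n_t = total_n * treatment_share
    n_c = total_n - n_t
    # Standard error of the difference in proportions with unequal group sizes
    se = np.sqrt(p * (1 - p) * (1 / n_c + 1 / n_t))
    power = stats.norm.cdf(delta / se - z_alpha)
    print(f"{treatment_share:.0%} treatment / {1 - treatment_share:.0%} control -> power ≈ {power:.2f}")
```

With the same total number of users, the 50–50 split has noticeably more power than the 95–5 split, which is exactly why the uneven split is reserved for holdouts rather than decision-making tests.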

For more details on this topic, please refer to this amazing article.

That is it for this one. In the next article I am going to be talking about the type of statistical test to choose based on the metric that you are tracking. You can find the next article here.

See you there!👻
