AB testing in real life step-by-step

Alejandro Attento
15 min read · Aug 8, 2022

If you are here, you’re interested in AB testing and want to learn how to perform one with all the criteria you need to take into account in real life, without skipping the points that are usually ignored in other examples.

Knowing about AB testing is a valuable skill in almost every industry. We will cover many topics here, and others will be covered in subsequent articles, so feel free to check whether I have already published the next part, or reach out to me if you have any doubts.

Finally, at the end you will find a link to a repository with the code on GitHub.

Introduction

Here we will explain how to perform the analysis of an AB test. Usually, when people speak about AB testing, they discuss which statistical test is the best to define whether an AB test was successful or not. Starting to think about the statistical test when we think about AB testing is an error. Indeed, even thinking in terms of “successful” and “non-successful” AB tests is a little misleading. These are just two of many common errors I have noticed over time.

We will do a complete and detailed analysis of an AB test here, but before that it’s important to reflect a little. Before we write even one line of code, we need to have the right mindset; we need to understand what the target of any AB test is. Let’s start with some basic points:

  • Well-designed AB tests are successful in any case, even if the outcome is not the expected one. What matters are the learnings: a well-designed AB test will allow us to plan the next steps on top of it, again, even if the outcome is not the expected one.
  • AB testing is a tool, not a blocker. I have often heard people complain about the time AB testing takes and how it delays the implementation of changes. The problem is that once a new change is implemented, if something goes wrong we cannot be 100% sure whether it was because of that last change or not. Indeed, even if everything goes well, we will not be able to ensure that it happened because of the new change. Correlation and causation are two different things.
  • In theory, AB testing is mostly about the statistical techniques used to evaluate different variants and accept or reject a hypothesis. In real life, things are a little different. Indeed, I would say that in any data-related field one of the major subjects is data cleaning and preprocessing. This is especially relevant when we think about how real AB tests work. For instance, to apply different treatments to different users of a mobile application or website, we need special tools that allow us to do that, but these tools are not perfect. So, if we don’t prepare the data with that in mind, every further step will drag the error along.
  • Usually, at a company you won’t have all the time you would like to investigate all the different options and ways of doing AB testing. Also, building a pipeline to automate this takes time, and if you now think about adding dashboards so that non-technical users can look into the results, it takes even more time. Because of that, in many cases I would suggest starting with simpler approaches; if we don’t really understand why we are selecting a more complex approach instead of a simpler one, we are doing something wrong. As a base rule, the easiest approach that meets our needs is the right option.

We now have an idea of what AB testing is, but let’s summarize all this into one concept: what is AB testing about? AB testing is about investigation.

We test something to learn from it. We start with an initial hypothesis, but only real experience will tell us what reality is. Indeed, well-designed AB tests don’t only evaluate one thing; they help us understand the subject of study better, allowing us to plan the next steps.

One of the questions you could have is: what is a well-designed AB test? With the previous explanation we got a feeling for what a well-designed AB test is, but this discussion is much deeper and will require its own article, which will surely come.

Statistical tests

If you google statistical tests for AB testing you will find half a dozen in just 10 minutes. Now, is going ahead and selecting any one of them enough? Well, you are running a big risk by doing this. All of us surely saw some of the most basic ones at college or university. If that is your case, hopefully your professor repeated the word “assumptions” many times. You cannot pick any test and apply it to any set of data, at least not if you want to get useful and accurate insights from it.

I will not go into the details of every statistical test, that is out of the scope of this article, but I will give you a general recommendation. Two of the most robust statistical tests I would recommend are:

  • Mann-Whitney U test
  • Kolmogorov-Smirnov test

A quick clarification: when I say robust I mean that they make fewer assumptions about the data than other statistical tests, and also that they are in general easy to use from the process point of view. They evaluate different things (we will come to that) to define whether a change is statistically significant, so in general I would recommend using them together. They are easy to understand; with some basic knowledge of statistics they will not be a challenge for you.

I know that other ways of evaluating AB tests are more interesting, such as the Bayesian approach, but if we don’t understand why and when to go with the Bayesian approach instead of the one recommended here, it’s because we need to start with the basics. Again, as a base rule, the easiest approach that meets our needs is the right option.

The analysis

What are we going to do?

Now, after a quick but really important reflection, let’s talk about the analysis itself. What can you expect from this article? We will walk through the analysis of an AB test, taking into account additional steps which we need to think of when analyzing an AB test but which are usually skipped in most of the examples I have seen.

Also, the dataset we will use is already aggregated, but in some cases you will have more granular data; indeed, it’s pretty common to work with user-level data. So, I will make some comments on things we have to take care of when we work with data that is not pre-aggregated.

Things to take into account with user-level data

In many cases you will handle user-level data, which you might want to aggregate in order to analyze it the way I will do here. But, before doing this, let’s think a little bit about it.

In real cases, especially in technology-oriented companies, AB testing is done with tools which help us select some of our users, split them into variants, and apply different treatments to those variants. Now, how confident are you that the tool you are using works well? Given my experience, I wouldn’t be too confident about this. Challenge your tool; ask yourself:

  • Is it selecting the number of users you expected?
  • Are the users distributed among the variants in the proportion you set up?
  • Did you apply some filters on which users to include in the AB test? Is the tool applying those filters?
  • Do you have one user in more than one variant of the same AB test?
  • Are your users actually getting the treatment you expect them to get?
  • Do we have enough users in our AB test? Are those users representative of the total population?

Some of those points still apply to aggregated data, and we will evaluate them here as well.

Our data

Let’s come back to our case with already aggregated data. The dataset I’m going to use is from Kaggle, here is the link so you can download it (indeed, there are two datasets, one for each variant).

One of the first things we need to do when we work with new data is to understand what information we have in it. Here we are working with non-real data, so we can’t go ahead and ask a teammate, go through documentation, or figure out by ourselves how the data for each column is processed. But I would highly recommend you take this step very seriously. Here, given the context, I will define any additional clarification the dataset or the related documentation doesn’t provide by itself.

Our study case

A company recently introduced a new bidding type, “average bidding”, as an alternative to its existing bidding type, called “maximum bidding”. One of our clients, non_real_bidding_company.com, has decided to test this new feature and wants to conduct an A/B test to understand if the new feature brings more average revenue than maximum bidding.

Let’s first define the hypothesis they want to test. They want to know if the average revenue increased because of this feature. Of course, we will also look into other metrics, but this is the base criterion for this AB test.

Based on this, let’s start!

Hands-on analysis

Let’s first import our libraries, get the data and join it into one single dataframe.

Initialization
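Since the notebook itself lives in the repository linked at the end, here is a minimal sketch of this step. The file names control_group.csv and test_group.csv, the semicolon separator, and the variant column are assumptions on my side, so adjust them to the files you download from Kaggle.

```python
import pandas as pd

# File names and separator are assumptions; adjust them to the files you downloaded from Kaggle.
control = pd.read_csv("control_group.csv", sep=";")
test = pd.read_csv("test_group.csv", sep=";")

# Tag each dataframe with its variant and stack them into one single dataframe.
control["variant"] = "control"
test["variant"] = "test"
df = pd.concat([control, test], ignore_index=True)

df.info()  # the summary already hints at some null values
```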

Quick spoiler: here we can already see some null values.

Preprocessing

Transform date column

After importing the data we will start the preprocessing by transforming the format of the dates, to avoid any future errors. We will also turn this column into the index.
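A minimal sketch of this step, assuming the date column is called Date and is stored in a day-first text format (check the raw file before parsing):

```python
# Parse the dates (day-first is an assumption, check your raw file) and use them as the index.
df["Date"] = pd.to_datetime(df["Date"], dayfirst=True)
df = df.set_index("Date").sort_index()
```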

Verify experiment variants

Checking the experiment variants, we can see that we have only two: the default (or control) variant and the testing variant.
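For example, using the variant column created when loading the data:

```python
# We expect exactly two variants: the control one and the testing one.
print(df["variant"].value_counts())
```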

Define granularity

Now, we need to figure out the granularity level of the data. Here we will see that we indeed have 60 rows, one for each variant and each day. We can figure this out by looking at the shapes, as we do below.
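A quick way to check this, under the same assumptions as before:

```python
# 30 days x 2 variants should give 60 rows in total, 30 per variant.
print(df.shape)
print(df.groupby("variant").size())
print(df.index.nunique(), "distinct dates")
```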

This is important because we could have a lower granularity level, for instance by country, which would mean that we have one row per date, variant, and country. In those cases we can either look at the results at the country level or aggregate the data.

Manage nulls
Nulls are something we have to check for in all cases; indeed, we need to understand why we are getting null values, since in some cases they could indicate an error in the process that produced the data. So, if you see a null value don’t just ignore it: at least ask your team, or yourself, what is happening and try to understand what that null means. Below we can see that we have such a case.

The row for the date 2019-08-05 for the control group has only nulls, so here we have two options:
- Drop this row
- Use some technique to fill those values

Initially, I would recommend not filling these cases if we are not totally sure about what we are doing, as we could be biasing the results. Given the risk we would be running and the low potential gain from filling this value, I will just drop it. In general, I would only fill in values if it’s really necessary.

But here we need to make a call-out: as we are working with time series, if we take out one date for one of the variants we need to consider doing the same for the other variant. If not, we could face problems when applying different statistical methods to evaluate the results. If you have doubts about what to do, I would recommend dropping the date for both variants.
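A minimal sketch of that drop; since the dates are the index, one filter removes the date for both variants at once:

```python
# Drop 2019-08-05 for both variants so the two time series stay aligned.
df = df[df.index != pd.Timestamp("2019-08-05")]
```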

Normalize metrics
Now, here we have another of the most common errors when evaluating AB tests. This error isn’t covered in most of the examples I have seen.
In this experiment we have two different variants; are we sure that the same amount of users have been exposed to each variant? Sometimes this is not the case, and if we are not aware of it we will add noise to our analysis or, in the worst case, totally bias our results, making them worthless.

So, what can we do? Usually one of the metrics in your dataset will tell you how many users have been exposed to the AB test, and a general approach is to normalize all the metrics by this value. That way you make sure that a different amount of users exposed to each treatment will not bias your results.

The description of the data isn’t that clear about which column tells us how many users have been exposed to the AB test. In real life, if that happens, stop everything and speak with every team member responsible for the data to understand clearly what information each column contains. As we can’t do this here, and based on the information we can find on Kaggle about this dataset, I will assume that the column “Reach” defines the amount of users exposed to this AB test, so I will divide the rest of the columns by this value to normalize the metrics.

Also, let’s quickly check whether different amounts of users have indeed been exposed to the different variants.
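A quick sketch of that check, summing the “Reach” column per variant:

```python
# Total users reached per variant; a large gap means the raw metrics are not directly comparable.
print(df.groupby("variant")["Reach"].sum())
```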

As we can see, one of the variants has been exposed to about 1,000,000 more users than the other. So, if we don’t normalize our metrics, our results will be strongly biased.

Let’s normalize the metrics.
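A sketch of the normalization, dividing every numeric metric except “Reach” itself by the “Reach” value of the same row:

```python
# Every numeric column except "Reach" gets divided by the "Reach" value of its own row.
metric_cols = [c for c in df.select_dtypes("number").columns if c != "Reach"]
df[metric_cols] = df[metric_cols].div(df["Reach"], axis=0)
```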

Analysis

Overview
Now that we have the dataset we will use for the analysis, let’s start with a quick overview. We will split the dataset into two: one for the testing variant and another for the default variant.

Using the describe function and the styling feature pandas has (a nice feature I would recommend looking into, see the link here), let’s create a quick summary of our data comparing both variants.

We are calculating the relative difference for each cell we get from the describe function; here the idea is just to get an overview. That way we can figure out what we should expect to see in our further analysis.

Just to note: the values below are relative differences, so 1.00 means 100%.
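A sketch of that summary; the styling in the notebook may differ, here I simply compute the relative difference between the two describe() outputs and add a background gradient:

```python
# Split the (already normalized) data per variant, keeping only numeric columns.
control_df = df[df["variant"] == "control"].select_dtypes("number")
test_df = df[df["variant"] == "test"].select_dtypes("number")

# Relative difference between the two describe() summaries: (test - control) / control.
summary = (test_df.describe() - control_df.describe()) / control_df.describe()

# A background gradient makes the largest gaps stand out (values are fractions, 1.00 = 100%).
summary.style.background_gradient(axis=None)
```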

As we can see, the testing variant seems to be performing much better than the control group. But if we look at the “Reach” column we see again that our suspicion was true: one of the variants has been exposed to many more users than the other.

Analyzing the values below to figure out which are more or less meaningful is an interesting subject, but I don’t want to complicate this post too much, so please feel free to do it on your end and reach out to me with any doubts you have.

Time Series & Boxplot
Let’s look at these metrics over time to see if we can get any insights. We should not forget that we are working with time series, not independent data points, so we might even spot some patterns over time by looking at the line plots.

To do this we will join the data again, but we won’t stack one dataframe on top of the other; we will join them horizontally.

Also, we will calculate the daily difference between the variants for each metric. We could calculate the daily relative difference, which is what I would usually suggest, but if you are working with tiny numbers, as here, this could produce huge percentage values which are harder to analyze. I would recommend avoiding those values, especially if you need to present your work to other people later on.
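A sketch of the horizontal join and the daily differences, plotted with plain pandas plotting (the notebook’s charts may look different):

```python
import matplotlib.pyplot as plt

# Join the two variants side by side on the date index.
side_by_side = control_df.join(test_df, lsuffix="_control", rsuffix="_test")

# Daily absolute difference (test - control) for every normalized metric.
for col in metric_cols:
    side_by_side[f"{col}_diff"] = side_by_side[f"{col}_test"] - side_by_side[f"{col}_control"]

diffs = side_by_side.filter(like="_diff")

# Line plots over time to spot patterns, and a boxplot to spot potential anomalies.
diffs.plot(subplots=True, figsize=(10, 12))
diffs.plot(kind="box", figsize=(14, 4))
plt.show()
```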

Looking at the time series we see that, across all the metrics, two dates seem to be outliers: 2019-08-12 and 2019-08-19. At the very least we should check whether something happened on these dates. Also, looking at the boxplot, it seems like these values are possible anomalies.

Let’s assume for our case that we checked these days and it turned out that some problems happened which caused anomalies, so we will take them out of our dataset. In real applications, take your time to understand these cases; not everything which looks abnormal is indeed an anomaly.

Below you can see the graphs

We will drop the anomalies from the two dataframes we are using to analyze the data.
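A sketch of the drop, using the two dates mentioned above:

```python
# Dates flagged as anomalies after checking what happened on those days.
anomaly_dates = pd.to_datetime(["2019-08-12", "2019-08-19"])

control_df = control_df[~control_df.index.isin(anomaly_dates)]
test_df = test_df[~test_df.index.isin(anomaly_dates)]
```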

Let’s quickly check if it worked.
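For example:

```python
# Neither date should appear in either dataframe anymore.
print(control_df.index.isin(anomaly_dates).any(), test_df.index.isin(anomaly_dates).any())
```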

As we can see, those two dates have been excluded.

We also see another high value on 2019-08-08, but let’s assume that this value is not an anomaly; indeed, not all values above the upper bound of the boxplot are necessarily anomalies.

Anomalies are a topic by themselves; we can’t make general rules and assume that they will be 100% accurate for all cases. But we can get good approximations, which is out of the scope of this article, although it would be interesting to cover.

Pairplot

Now we will plot a pairplot to get a general visual overview of the metrics.

Some points to take into account when working on a pairplot:

  • This is usually a heavy graph. Here our dataset is really small, but if you are working with a bigger dataset that this function can’t handle, consider sampling your dataset.
  • The lower and upper triangles of the graph show the same information, but with the axes inverted, so:
  • One way to reduce the processing when creating this graph is to set corner=True; that way only the lower triangle will be plotted.
  • If you’re plotting both the lower and upper triangles, consider adding something extra to one of the triangles, as I did here with the KDE plot (see the sketch below), to get extra information.
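A sketch of such a pairplot with seaborn, colored by variant and with a KDE added on the lower triangle; remember to sample your data first if it is large:

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Long-format frame with a variant column used as hue.
plot_df = pd.concat([
    control_df[metric_cols].assign(variant="control"),
    test_df[metric_cols].assign(variant="test"),
])

g = sns.pairplot(plot_df, hue="variant")  # corner=True would plot only the lower triangle
g.map_lower(sns.kdeplot, levels=4)        # extra information on one of the triangles
plt.show()
```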

Looking at this graph we can figure out many things, but in general we could say that for most of the metrics we see differences in the distributions.

Statistical test

Now we have a good understanding of our data; we cleaned it and took out the anomalies. The question now is: can we see statistically significant changes between our variants to say that the hypothesis of our AB test has been met? And besides the main hypothesis, do we see changes in other metrics?

There are several tests we could apply, and a lot of them would need you to preprocess the data in some way. To keep it easy, but no less effective, we will apply two non-parametric tests. Why two? Because each test checks for something different; this way we can have more confidence in our results, as we look at the data with two different criteria.

The tests we will apply are:

  • Mann-Whitney U test
  • Kolmogorov-Smirnov test

You can easily find a full explanation of both, and they are easy to understand, but here are some characteristics I would like to highlight:

  • Both are pair-wise comparison techniques.
  • The null hypothesis in both cases is that the two collections of data points we are comparing (testing and control) come from the same distribution, which would mean there is no change.
  • Both are non-parametric tests, they don’t assume any underlying distribution.
  • In both cases the samples are assumed to be independent (this could be problematic, as time series are by definition not independent; even so, it is still a widely used approach for our case).
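A sketch of running both tests per metric with SciPy on the per-variant dataframes built earlier; a change only counts as significant if both p-values are at or below 0.05:

```python
from scipy import stats

# Run both tests for every normalized metric; we only call a change significant
# if both p-values are at or below 0.05.
results = {}
for col in metric_cols:
    mw = stats.mannwhitneyu(control_df[col], test_df[col], alternative="two-sided")
    ks = stats.ks_2samp(control_df[col], test_df[col])
    results[col] = {
        "mann_whitney_p": mw.pvalue,
        "ks_p": ks.pvalue,
        "significant": (mw.pvalue <= 0.05) and (ks.pvalue <= 0.05),
    }

print(pd.DataFrame(results).T)
```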

Final results

As we can see, we are getting statistically significant results for most of the metrics (p-values <= 0.05). In our case we consider the result positive for a metric only if we get significant results with both statistical tests.

The only two metrics which don’t seem to change that much, or at least don’t show a statistically significant change, are the number of times the item has been added to the cart and the number of impressions.

What’s next?

With the explanation above you’re able to perform AB tests in a quick and robust way. From the statistical point of view you could apply other statistical tests which fit this particular case better, but here we wanted to set up a general baseline to start with AB testing. Also, take into account that at a company, implementing an automated way to perform AB tests takes time; an easier method at the start is something I would recommend, and later on you have time to develop something more complex if needed. Start small and build big is usually the best way to go.

Link to the code here

https://github.com/AlejandroAttento/medium/tree/master/ab_testing_in_real_life_step_by_step
