The AB Testing Cookbook - Part 3

Ibtesam Ahmed
Aug 16, 2023

This article is the third in my series, “The AB Testing Cookbook”, where I give a comprehensive step-by-step guide to AB testing.

In the first article we discussed the need for AB testing and why it is necessary for business stakeholders to run AB tests. In the second article, we covered a few fundamentals you need to know before running these tests. A small recap: AB testing is basically hypothesis testing. You form a null and an alternative hypothesis, decide on your metrics, and then, based on the nature of the metrics and the assumptions about your data (normal/non-normal etc.), you choose a hypothesis test. Choosing a test whose assumptions match your data is important; otherwise the results will be invalid.

In this article we will dive deeper into these hypothesis tests and their different kinds, so buckle up!

Mainly, there are two kinds of tests: parametric and non-parametric. Parametric tests assume that the population from which your data is drawn has a normal distribution; they make assumptions about the parameters of the population, hence the name. Non-parametric tests make no such assumptions. Another important distinction between the two is that parametric tests are suitable for continuous data, whereas non-parametric tests can be used for both continuous and categorical data.

The differences and uses will become clearer as we learn about the specific tests that fall under these two categories. So, let’s get started.

Parametric tests

1. T-test

  • The data is assumed to follow a Student’s t-distribution. The t-distribution looks like a normal distribution but with heavier tails, which accounts for the extra uncertainty in small samples (usually fewer than 30 observations). As the sample size gets larger, it looks more and more like the normal distribution.
  • The standard deviation of the population is unknown. If the standard deviation is known and the sample size is large, we would do a z-test, discussed in the next section.
  • The samples are randomly drawn from the population which is a step in AB testing that we discussed in the last article.

Those are the basic assumptions of the t-test. T-tests can be further divided into two categories useful for AB testing.

Independent T-test/ Welch’s t-test

  • This is used when you want to compare the means of two different groups. For example: do students who learn using Method A have a different mean test score than students who learn using Method B?
  • The formula is fairly simple: t = (mean1 − mean2) / √(s1²/n1 + s2²/n2), where mean, s² and n are the sample mean, variance and size of each group.

Don’t get scared by the math; you’ll rarely use this formula yourself, it’s just there to explain what the test computes. All of these tests are implemented in scipy.stats, and scipy.stats.ttest_ind is the function for this one (pass equal_var=False for Welch’s variant, which does not assume equal variances between the groups). The function returns the t-value from the formula above, also called the test statistic, along with its p-value. We compare this p-value with the α decided earlier (usually 0.05): if the p-value is less than α we reject the null hypothesis, otherwise we don’t.
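As a quick sketch, here is how that looks in code. The scores below are made up purely for illustration:

```python
from scipy import stats

# Hypothetical test scores for two independent groups of students
method_a = [85, 78, 92, 70, 88, 75, 80, 91]
method_b = [72, 68, 76, 65, 81, 70, 74, 69]

# equal_var=False gives Welch's t-test, which does not assume equal variances
t_stat, p_value = stats.ttest_ind(method_a, method_b, equal_var=False)

alpha = 0.05
if p_value < alpha:
    print(f"t = {t_stat:.2f}, p = {p_value:.4f}: reject the null hypothesis")
else:
    print(f"t = {t_stat:.2f}, p = {p_value:.4f}: fail to reject the null hypothesis")
```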

Paired T-test/ Dependent t-test

  • As the name suggests, this test is used when each subject has a pair of measurements, to see whether the mean change across these pairs is significantly different from zero.
  • The two samples are paired or dependent because they contain the same subjects. To continue the above example: a teacher has a hypothesis that preparing students for a test using a different learning method will improve their scores. So he conducts one test before changing his method and one test after it. For each student he then compares the pre- and post-test results to see if there’s a significant difference.

This test statistic can be calculated in python using scipy.stats.ttest_rel. A p-value is returned as in the test above, and you can follow the same steps.
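A minimal sketch of the paired version, again with made-up before/after scores:

```python
from scipy import stats

# Hypothetical scores for the same five students before and after the new method
before = [70, 78, 85, 88, 92]
after = [76, 82, 88, 93, 97]

# Paired t-test: works on the per-student differences (after - before)
t_stat, p_value = stats.ttest_rel(after, before)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```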

2. Z-test

  • This test is used to compare the means of two samples when the population variance is known and the sample size is large (>30). You would use it for a similar use case as the independent t-test above.
  • As the name suggests, there is an implicit assumption that the data follows the normal distribution. The formula (source: Wikipedia) mirrors the t-test but uses the known population standard deviations: z = (mean1 − mean2) / √(σ1²/n1 + σ2²/n2).
  • Z-tests are rarely used in practice because the population variance and mean are rarely known. The t-test is a robust alternative: it estimates the variance from the sample, and it converges to the z-test as the sample size grows.
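scipy does not ship a two-sample z-test function, but the formula is simple enough to compute directly from the normal distribution. The summary statistics below are hypothetical:

```python
import math
from scipy.stats import norm

# Hypothetical summary statistics; a z-test assumes the population
# standard deviations (sigma) are known, which is rare in practice
mean_a, sigma_a, n_a = 82.0, 8.0, 100
mean_b, sigma_b, n_b = 79.0, 8.0, 100

# z-statistic per the formula above
z = (mean_a - mean_b) / math.sqrt(sigma_a**2 / n_a + sigma_b**2 / n_b)
p_value = 2 * norm.sf(abs(z))  # two-sided p-value from the normal distribution
print(f"z = {z:.2f}, p = {p_value:.4f}")
```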

3. F test/ Anova

  • When you have more than two groups whose means you want to compare, you can use the F-test. It is an extension of the z-test and t-test and is based on the F-distribution.
  • The variance between the groups and the variance within the groups are used to calculate the F-statistic. It can be formalised as F = (variance between groups) / (variance within groups).
  • If you want to read in detail how to compute the numerator and denominator, you can read here. The test can be easily computed in python using the function scipy.stats.f_oneway, which returns an F-statistic and a p-value.
  • If you are wondering why variance is used to find whether there is a statistical difference in the means, here’s the intuition.
Source: https://statisticsbyjim.com/anova/f-tests-anova/
  • A low F-value indicates there is more within-group variability than between-group variability: the group means cluster together, and the distance between means is small relative to the random error within each group. This indicates that the groups aren’t truly different at the population level.
  • The opposite is true for a high F-value: the group means are more spread out than the variability of the data within each group, so these groups/samples might come from different populations.
  • In the example I have already shared, if teachers wanted to look at the effect of different teaching methods A, B, C, etc. on test scores, they would use the F-test. The F-test we have been talking about so far is called a one-way ANOVA.
  • There is another variant called two-way ANOVA, where the effect of multiple predictors can be examined. Say that, apart from the effect of teaching method on test scores, teachers also wanted to see the effect of gender on test scores; they could do that in a single two-way ANOVA test.
  • A disadvantage of the F-test is that it only tells you whether there is a statistical difference among the groups; it won’t be able to tell you which specific groups differ. For that you need a post-hoc test such as Tukey’s HSD.
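The one-way ANOVA above can be sketched with scipy as follows; the three groups of scores are invented for illustration:

```python
from scipy import stats

# Hypothetical test scores under three teaching methods
method_a = [85, 78, 92, 70, 88]
method_b = [72, 68, 76, 65, 81]
method_c = [90, 87, 94, 89, 91]

# One-way ANOVA: are the three group means plausibly equal?
f_stat, p_value = stats.f_oneway(method_a, method_b, method_c)
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")
```

Note that a small p-value here only says that at least one group mean differs, not which one.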

Those were the major kinds of parametric tests. They are useful when you know your sample follows a normal distribution, but in cases where it doesn’t, non-parametric tests will be helpful.

Non-parametric tests

1. Chi-squared test

  • Here, not only is the predictor a categorical variable (teaching method in the above example), the outcome is also categorical, and the test is used to find whether there is a relationship between the two.
  • For example: which teaching method leads to more students passing the test? Instead of a continuous outcome like test score, we have a categorical one, pass or fail.
  • The null hypothesis would be that there is no difference between the passing rate for the two different teaching methods.
  • The test statistic follows a chi-squared distribution, and the classes in your outcome variable have to be mutually exclusive.
  • The data for the above example can be represented in a contingency table (cells reconstructed from the totals used in the calculation below):

              Pass   Fail   Total
  Method A     900    200    1100
  Method B     700    500    1100
  Total       1600    600    2200

The formula to calculate the chi-square statistic is: χ² = Σ (O − E)² / E, where O is the observed count and E the expected count.

  • The expression (O − E)²/E is calculated for each cell and summed. The observed value for a cell is its actual count, and its expected value is (row total × column total) / total number of observations.
  • Students who went through teaching method A and passed have an observed value of 900 (from the contingency table) and an expected value of (1100 × 1600)/2200 = 800.
  • For a contingency table, the chi-square statistic can be computed in python using scipy.stats.chi2_contingency (scipy.stats.chisquare is the related goodness-of-fit version, which compares a single array of observed frequencies against expected frequencies). It returns a p-value, and I hope by now you know the drill from there.
  • When you have a small sample (a common rule of thumb: any expected cell count below 5), the chi-square approximation doesn’t hold, and Fisher’s exact test can be used instead. There is an implementation for this in python as well (scipy.stats.fisher_exact).

2. Mann Whitney U Test/ Wilcoxon rank sum test

  • This is the non-parametric equivalent of the independent t-test. It does not make any assumptions about normality and works well with small samples.
  • It works on continuous or ordinal data by combining the data of the two samples, sorting it and then ranking each data point. The formula for the U-statistic is pretty simple and you can find it here, but I will explain it with the same example of finding out whether there is a significant difference in the test scores of students taught with two different teaching methods.

These are the marks, and we want to see if there’s a statistically significant difference or not.

Group A: 85, 78, 92, 70, 88

Group B: 95, 68, 76, 89, 81

Let’s go through the process step by step.

Step 1: Rank the Data: Combine the data from both groups and rank it in ascending order. Assign ranks to tied values by taking the average rank (there happen to be no ties in this data).

Combined Data (sorted): 68, 70, 76, 78, 81, 85, 88, 89, 92, 95

Ranks: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10

Step 2: Calculate U Statistic: Calculate the U statistic for each group from its rank sum, using U = R − N(N+1)/2.

For Group A: the ranks are 6, 4, 9, 2, 7, so the sum of ranks R1 = 28. Number of Observations in Group A (N1): 5

U1 (Group A) = 28 − (5 × 6)/2 = 13

For Group B: the ranks are 10, 1, 3, 8, 5, so the sum of ranks R2 = 27. Number of Observations in Group B (N2): 5

U2 (Group B) = 27 − (5 × 6)/2 = 12

(As a check, U1 + U2 should always equal N1 × N2 = 25.)

Step 3: Compare the U statistic with the critical value: Take the smaller of the two statistics, U = min(U1, U2) = 12. Note that, unlike most tests, here you reject the null hypothesis when U is less than or equal to the critical value.

You would use a Mann-Whitney U distribution table to find the critical value; for two groups of 5 at a significance level of 0.05 (two-tailed), it is 2.

Since U = 12 is greater than the critical value of 2, we fail to reject the null hypothesis: this sample does not show a significant difference in test grades between the two teaching methods.

  • The U statistic can be computed using scipy.stats.mannwhitneyu in python. That said, once you have more than ~15 observations in each group, it is advisable to use the t-test even with non-normal data, because it has more statistical power and the central limit theorem ensures the sampling distribution of the mean approaches normality as the sample size increases.
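Running the worked example through scipy (which reports the U statistic of the first sample):

```python
from scipy.stats import mannwhitneyu

group_a = [85, 78, 92, 70, 88]
group_b = [95, 68, 76, 89, 81]

# For small samples scipy uses the exact distribution of U
u_stat, p_value = mannwhitneyu(group_a, group_b, alternative="two-sided")
print(f"U = {u_stat}, p = {p_value:.4f}")
```

A large p-value here means the ranks of the two groups are thoroughly intermixed, consistent with no real difference.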

3. Kruskal Wallis H-test

  • Just as the Mann-Whitney U test substitutes for the independent t-test in the absence of normal data, the Kruskal-Wallis H-test substitutes for ANOVA. Basically, if you want to compare more than two groups without assuming normality, you use this test. It makes the same assumption of ordinal or continuous data as the above test.
  • The ranking works pretty much the same as in Mann-Whitney: you combine all the data and rank it. Calculating the H-statistic is a bit different. With N total observations and k groups, the formula is: H = (12 / (N(N+1))) × Σ nᵢ(Rᵢ − R)², where nᵢ is the size of group i, Rᵢ is the mean rank of group i, and R is the overall mean rank.

I’ll again simplify this with the same example.

Test Grades:

Method X: 85, 78, 92, 70, 88

Method Y: 95, 68, 76, 89, 81

Method Z: 72, 88, 90, 79, 85

Step 1: Rank Data: Combine the data from all teaching methods and rank them in ascending order, assigning ranks to tied values by averaging the ranks they would occupy.

Combined Data (sorted): 68, 70, 72, 76, 78, 79, 81, 85, 85, 88, 88, 89, 90, 92, 95

Ranks: 1, 2, 3, 4, 5, 6, 7, 8.5, 8.5, 10.5, 10.5, 12, 13, 14, 15

Step 2: Calculate the H Statistic: Calculate the sum of ranks for each teaching method and the overall mean rank (R).

For Method X: the ranks are 8.5, 5, 14, 2, 10.5, so the sum of ranks is 40 and the mean rank R1 = 40/5 = 8. Number of Observations in Method X (N1): 5

For Method Y: the ranks are 15, 1, 4, 12, 7, so the sum of ranks is 39 and the mean rank R2 = 7.8. Number of Observations in Method Y (N2): 5

For Method Z: the ranks are 3, 10.5, 13, 6, 8.5, so the sum of ranks is 41 and the mean rank R3 = 8.2. Number of Observations in Method Z (N3): 5

Overall Mean Rank (R): (1 + 2 + … + 15) / 15 = 8

Now, calculate the H statistic using the formula given above.

H = (12 / (15 × 16)) × (5 × (8 − 8)² + 5 × (7.8 − 8)² + 5 × (8.2 − 8)²) = 0.05 × 0.4 = 0.02

Step 3: Compare with Critical Value and make a decision: If possible, determine the critical value from the H distribution for the significance level (usually 0.05). If that distribution is not available, the chi-square distribution with k − 1 degrees of freedom can be used as an approximation; for 3 groups at a significance level of 0.05, the critical value is 5.99. Since the calculated H statistic (0.02) is far below the critical value (5.99), we fail to reject the null hypothesis: these samples do not show a significant difference in test grades among the teaching methods. This is unsurprising, since the three groups’ scores are very similar.

  • When to use this test? When you have little data, or when the median is more important to you than the mean. If you have 3–9 groups with more than 15 observations per group, or 10–12 groups with more than 20 observations per group, you might want to use one-way ANOVA even with non-normal data: the central limit theorem causes the sampling distributions to converge on normality, making ANOVA a suitable choice.
  • Like one-way ANOVA, the Kruskal-Wallis test can tell you that not all of your groups are equal, but it won’t tell you exactly which groups differ.
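Running the same data through scipy (which also applies a small correction for the tied values):

```python
from scipy.stats import kruskal

method_x = [85, 78, 92, 70, 88]
method_y = [95, 68, 76, 89, 81]
method_z = [72, 88, 90, 79, 85]

# Kruskal-Wallis H-test across the three groups; 85 and 88 each appear
# twice, so scipy applies a tie correction to H
h_stat, p_value = kruskal(method_x, method_y, method_z)
print(f"H = {h_stat:.3f}, p = {p_value:.4f}")
```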

These are the most widely used non-parametric tests. Apart from these, there is the Wilcoxon signed-rank test (scipy.stats.wilcoxon), which is the non-parametric substitute for the paired t-test.

Parametric vs Non-parametric tests

It is important to know the tradeoffs between the two types of tests before you start hypothesis testing.

  • It is clear that for continuous data, non-parametric tests are useful when your sample size is quite small (around 30 or fewer samples per group). But as your sample size increases, it’s better to use parametric tests even with non-normal data, because of the central limit theorem.
  • Parametric tests have higher statistical power: if an effect actually exists, a parametric test is more likely to detect it.
  • Non-parametric tests, although more flexible, have one hard requirement that parametric tests do not: to compare medians, the groups should have similar variability (spread).
  • Non-parametric tests are better for problem statements where the median is a better measure of central tendency than the mean.
  • While the parametric tests above require continuous data, non-parametric tests also work for categorical, ordinal and ranked data, and are robust to outliers (read more about this here).

Once you have found out whether a difference is statistically significant or not, what’s next? We’ll find that out in the next article.
