
How to Analyze an A/B/C Test with a Mean-Based Metric

Pararawendy Indarjo
Published in Bukalapak Data
May 19, 2022


Learn and perform ANOVA and Tukey Post-hoc Test

As a data-driven company, Bukalapak relies heavily on AB testing. Thanks to its ability to establish causality, we use AB testing to decide whether to launch new product offerings or iterate on existing ones.

In a previous blog, we discussed how to analyze an AB test. In another, more advanced one, we covered how to analyze experiments with more than two groups; that article is linked below.

However, astute readers will notice that the metrics under consideration in both tutorials are proportion-based, i.e. conversion rates ranging from 0% to 100%.

In this blog, we’ll round out our toolkit by talking about how to analyze an A/B/C test using mean-based metrics from a continuous measurement.

Working Example

Consider an experiment about different promotion strategies. There are three treatments (experiment groups), say:

  1. Control: the existing promo strategy
  2. Variant1: the first challenger promo strategy
  3. Variant2: the second challenger promo strategy

The metric of interest is the average transaction amount. To keep the computation tractable (and this article a digestible tutorial), let’s assume there are only 10 transactions gathered from each experiment group, with the following details.

Figure 1: Experiment data

As we can see, the observed average transaction amounts for control, variant1, and variant2 are IDR 79.4K, 82.2K, and 91K, respectively. As usual in AB test analysis, we want to know whether these observed differences are significant, and ultimately which promo strategy performs best among the three being tested.

One-Way ANOVA

The first step is to test whether there is at least one group with a different mean. More formally, we want to test the following competing hypotheses.

H0: All transaction amount means (averages) are the same

H1: There is at least one transaction amount mean that differs from the rest

We can use ANOVA to accomplish this. In a nutshell, ANOVA compares the means (averages) of different groups by assessing how likely it is that they come from the same distribution (i.e., that the means are actually the same), and it does so by analyzing their variances (hence the name: analysis of variance, or ANOVA).

As a refresher, recall that variance is the measure of how far apart data points are from their mean.

The intuition goes like this. First, we partition the total variance of all data points (treating them as one large group) into two components: the variance of data points around their own group means (within-group variance) and the variance of the group means around the grand mean (between-group variance, treating each group’s mean as an imaginary aggregated data point).

Figure 2: Different cases of total variance partition illustration

If the within-group component is small relative to the between-group component, the distributions of data points from different groups are well separated (Case 1 in Figure 2), and we can be confident that the groups come from populations with different means.

And vice versa: if most of the total variation comes from within-group variation, the groups’ means may well not differ in the first place (Case 2 in Figure 2).
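This partition is an exact identity: the total sum of squares always equals the within-group sum of squares plus the between-group sum of squares. A quick numerical check (with made-up numbers, not the article’s data):

```python
import numpy as np

# Hypothetical data: three groups of observations (made-up numbers,
# not the article's actual transactions).
groups = [
    np.array([72.0, 80.0, 78.0, 85.0]),
    np.array([81.0, 79.0, 90.0, 86.0]),
    np.array([95.0, 88.0, 92.0, 97.0]),
]

all_points = np.concatenate(groups)
grand_mean = all_points.mean()

# Total SS: squared deviations of every point from the grand mean.
ss_total = ((all_points - grand_mean) ** 2).sum()

# Within-group SS: squared deviations of each point from its own group mean.
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

# Between-group SS: squared deviations of group means from the grand mean,
# weighted by group size.
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)

# The partition identity: SS_total = SS_within + SS_between.
print(abs(ss_total - (ss_within + ss_between)) < 1e-9)  # → True
```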

Going technical, we do this by calculating the F statistic: the ratio of two sums of squared errors (SSE), each normalized by its degrees of freedom, namely the between-group SSE and the within-group SSE.

This value is then compared with an appropriate F reference value. If the F statistic is greater than the F reference value, we reject H0. That is, we are confident that there is at least one group whose mean is significantly different from the rest.

Below are the concrete steps of our working example above.

  1. Find the average of all data points. Note that we have 30 data points in total. The value is IDR 84,250.
  2. Calculate the total sum of squared errors (SSE) from all data points. Subtract IDR 84,250 from each data point and square the result, then sum the 30 resulting terms. The value is 2,875,395,000.
  3. Calculate the within-group SSE. Note that we have 3 groups, each with 10 data points. Within a given group, subtract the group mean from each data point and square the result, then sum the 10 resulting terms. Do the same for each group and sum everything. The value is 2,142,649,000. (See Figure 3 for visual calculation details on steps 1–3.)
  4. Derive the between-group SSE; it is the difference between the total SSE and the within-group SSE: 2,875,395,000 − 2,142,649,000 = 732,746,000.
  5. Get the degrees of freedom (DF) for the between-group SSE and the within-group SSE. Let N and k be the total number of data points (30) and the number of groups (3), respectively. Then the between-group DF = k − 1 = 2 and the within-group DF = N − k = 27. (A great explanation of degrees of freedom is available here.)
  6. Normalize (divide) each SSE by its DF to get the between-group MSE = 732,746,000 / 2 = 366,373,000 and the within-group MSE = 2,142,649,000 / 27 = 79,357,370.
  7. Finally, compute the F statistic: the ratio of the between-group MSE to the within-group MSE. F statistic = 366,373,000 / 79,357,370 = 4.61.
  8. Get the appropriate F reference value. It has three parameters: alpha (the significance level, 0.05 as usual), the between-group DF, and the within-group DF. In our case, it is F(0.05, 2, 27). We can use any online calculator to get the value (I use this site). The value is 3.35.
  9. Lastly, compare the F statistic (4.61) with the F reference value (3.35). Because the F statistic is larger, we reject the null hypothesis (H0) and conclude that there is at least one transaction amount mean that differs from the rest. (See Figure 4 for the classical tabulation of steps 4–9.)
Figure 3: Step 1–3 computation details in spreadsheet
Figure 4: The classic ANOVA tabulation (Step 4–9)
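Steps 1–9 can also be sketched in Python. The raw transaction values below are placeholders (the real ones appear only in Figure 1), and scipy is used both to fetch the F reference value and to cross-check the result:

```python
import numpy as np
from scipy import stats

# Placeholder transactions: the article's real 30 values appear only in
# Figure 1, so these numbers are purely illustrative.
control  = np.array([70, 85, 78, 92, 66, 81, 75, 88, 79, 80], dtype=float)
variant1 = np.array([74, 90, 80, 95, 70, 84, 78, 91, 82, 78], dtype=float)
variant2 = np.array([82, 97, 88, 104, 78, 93, 86, 100, 91, 91], dtype=float)
groups = [control, variant1, variant2]

N = sum(len(g) for g in groups)               # step 5: total data points
k = len(groups)                               # step 5: number of groups
all_points = np.concatenate(groups)
grand_mean = all_points.mean()                # step 1

ss_total = ((all_points - grand_mean) ** 2).sum()             # step 2
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)  # step 3
ss_between = ss_total - ss_within                             # step 4

df_between, df_within = k - 1, N - k                          # step 5
ms_between = ss_between / df_between                          # step 6
ms_within = ss_within / df_within
f_stat = ms_between / ms_within                               # step 7

f_crit = stats.f.ppf(1 - 0.05, df_between, df_within)         # step 8
print(f"F = {f_stat:.2f} vs F reference = {f_crit:.2f}")      # step 9

# Cross-check against scipy's built-in one-way ANOVA.
f_scipy, p_value = stats.f_oneway(*groups)
assert abs(f_stat - f_scipy) < 1e-8
```

Note that with 30 points and 3 groups, `f_crit` comes out to the same 3.35 we looked up online.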

Tukey HSD Post-hoc Test

Although we rejected the null hypothesis in ANOVA, our analysis is not yet complete. The problem is that ANOVA does not reveal which group is the clear winner. So far, the rejection only indicates that at least one group has a significantly different mean from the others. Which one is it? We don’t know.

A post-hoc test is required to determine the clear winner. There are numerous tests available, but we will use the most commonly used one, known as the Tukey Honestly Significant Difference (HSD) test.

A Tukey HSD test compares each pair of group means and calculates the corresponding q statistic. What exactly is the q statistic? It is essentially a variant of the t statistic (as used in the familiar t-test) that accounts for Type I error inflation due to multiple testing. Read the “The problem with A/B/C tests” section of my previous tutorial to understand the issue with multiple testing. Therefore, we can think of the Tukey HSD test as a more conservative t-test (i.e., it requires stronger evidence to reject the null hypothesis).

See Figure 5 below for a comparison of critical values from the t distribution and multiple versions of the q distribution (which depend on the total number of means being compared). We can clearly see that the greater the number of means, the higher the critical value, and thus the more conservative the test.

Figure 5: Critical value comparison between t and various q distributions
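For readers who prefer code over tables, the comparison in Figure 5 can be reproduced with scipy’s `studentized_range` distribution (available in scipy ≥ 1.7; this is a tooling assumption, not something the original article uses):

```python
from math import sqrt
from scipy import stats

df_within = 27  # within-group degrees of freedom from our example

# Two-sided 5% critical value of the t distribution.
t_crit = stats.t.ppf(1 - 0.05 / 2, df_within)
print(f"t critical: {t_crit:.3f}")

# Studentized range (q) critical values for k = 2, 3, 4 means:
# the more means, the larger the critical value.
for k in (2, 3, 4):
    q_crit = stats.studentized_range.ppf(1 - 0.05, k, df_within)
    print(f"k={k}: q critical = {q_crit:.3f}")

# Sanity check: for k = 2 the q distribution reduces to sqrt(2) * t.
assert abs(stats.studentized_range.ppf(0.95, 2, df_within) - sqrt(2) * t_crit) < 1e-2
```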

Please find the q statistic formula below before proceeding with the working steps of conducting the Tukey HSD test. It is important to note that n denotes the number of data points in a group (in our case, n = 10).

Figure 6: q statistic between mean A and mean B formula
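Expressed in code, the Figure 6 formula for equal group sizes reads as follows (the helper name is mine; the plugged-in numbers are the control-vs-variant2 values from our working example):

```python
import math

def q_statistic(mean_a, mean_b, ms_within, n):
    """Studentized-range (q) statistic between two group means.

    ms_within: within-group mean square (MSE) from the ANOVA table
    n: number of data points per group (10 in our example)
    """
    return abs(mean_a - mean_b) / math.sqrt(ms_within / n)

# Control (79,400) vs Variant2 (91,040), MSE = 79,357,370, n = 10
print(round(q_statistic(91_040, 79_400, 79_357_370, 10), 2))  # → 4.13
```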

The practical steps are as follows. For each pair of experiment group means, we do:

  1. Calculate the delta between the two groups’ means.
  2. Derive the corresponding q statistic using the formula in Figure 6.
  3. Get the appropriate q reference value. Note that the parameters here differ slightly from ANOVA’s: the q distribution takes alpha (0.05), the number of group means being compared (k = 3), and the within-group DF (27). That is, we look up q(0.05, 3, 27). Using an online q value calculator, we get a value of about 3.51.
  4. Compare the q statistic with the q reference value. As before, a pair of means is significantly different IF its q statistic is greater than the q reference value.

See below for the worked-out tabulation.

Figure 7: Tukey HSD test results

From the results in Figure 7, we learn that:

  1. Control and Variant1 do not differ significantly in transaction amount (q ≈ 0.99).
  2. Control and Variant2 differ significantly (q ≈ 4.13 > 3.51).
  3. Variant1 and Variant2 do not differ significantly at the 5% level (q ≈ 3.14 < 3.51).

Finally, because Variant2 has the highest mean transaction amount (IDR 91,040) and is the only challenger that significantly outperforms Control, we can conclude that it is the most promising promotion strategy.
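For those who prefer code over an online calculator, the pairwise comparisons can be reproduced with scipy’s `studentized_range` distribution (scipy ≥ 1.7; the group means and MSE are the ones from our working example):

```python
import math
from itertools import combinations
from scipy.stats import studentized_range

means = {"control": 79_400, "variant1": 82_200, "variant2": 91_040}
ms_within = 79_357_370          # within-group MSE from the ANOVA step
n, k, df_within = 10, 3, 27     # points per group, groups, within-group DF

se = math.sqrt(ms_within / n)   # denominator of the q statistic

results = {}
for a, b in combinations(means, 2):
    q = abs(means[a] - means[b]) / se
    # Tukey-adjusted p-value from the studentized range distribution
    p = studentized_range.sf(q, k, df_within)
    results[(a, b)] = (q, p)
    print(f"{a} vs {b}: q = {q:.2f}, p = {p:.3f}")
```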

Analysis using Python

We have performed the analysis from scratch above. Can we do it in Python, you might ask?

Surely we can. In fact, thanks to the pingouin package, the analysis becomes (almost too) simple.

The only preparation required is to convert the data format from Figure 1 to a long-formatted dataframe, as shown below.

Figure 8: Long-formatted data for analysis using pingouin package
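For instance, a long-formatted dataframe like the one in Figure 8 can be built with pandas (the amounts below are placeholders, since the real transactions are listed in Figure 1):

```python
import pandas as pd

# Placeholder amounts: the real 30 transactions live in Figure 1.
wide = pd.DataFrame({
    "control":  [70_000, 85_000, 78_000],
    "variant1": [74_000, 90_000, 80_000],
    "variant2": [82_000, 97_000, 88_000],
})

# Wide -> long: one row per transaction, labeled with its group.
df = wide.melt(var_name="group", value_name="amount")
print(df)
```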

After the data is ready (let’s say we named the dataframe df), we can perform ANOVA using the short syntax below.

!pip install pingouin==0.5.1

import pingouin as pg

pg.anova(dv='amount', between='group', data=df, detailed=True)

The below output will appear.

Figure 9: ANOVA results from pingouin

Note that in the output above, group corresponds to between-groups in our previous term. We can focus on p-unc here. Since the value is less than 0.05 (the typical alpha/Type I error), we can conclude that there is at least one group that has a different mean of transaction amount!

It is also simple to perform the Tukey post-hoc test. We continue to use our df and run the one-liner code below.

df.pairwise_tukey(dv='amount', between='group')

The following results will pop out.

Figure 10: Tukey post-hoc results from pingouin

Note that diff here is derived as mean A − mean B (the opposite sign from what we did manually). That said, by looking at the p-tukey values and comparing them with 0.05, we arrive at the same conclusion as before: variant2 is the best promo strategy.

Moreover, a nice thing about the above output is the standard error (se) for each pair, which we can use, alongside diff, to derive confidence intervals. Recall that a confidence interval (CI) takes the form CI = [diff − c × se, diff + c × se], where c is the appropriate critical value (for Tukey-adjusted intervals, the q reference value). (See my other article on the benefits of having confidence intervals.)
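As a sketch of that arithmetic (the diff and se values below are the control-vs-variant2 numbers from the manual steps; whether pingouin’s reported se uses exactly this parameterization is worth checking against its documentation):

```python
from scipy.stats import studentized_range

# Illustrative numbers for one pair (control vs variant2 from the manual
# steps): diff is the mean difference, se the standard error of that diff.
diff, se = 11_640.0, 2_817.0
k, df_within = 3, 27                     # groups, within-group DF

# Tukey-adjusted critical value at the 5% level.
q_crit = studentized_range.ppf(0.95, k, df_within)

ci_low, ci_high = diff - q_crit * se, diff + q_crit * se
print(f"95% CI: ({ci_low:,.0f}, {ci_high:,.0f})")
```

An interval that excludes zero is consistent with the pair being significantly different.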

Closing

Congratulations on making it this far! 🎉

In this article, we discussed the methods required to analyze an A/B/C test with a continuous mean-based metric. We began by learning the concept/intuition behind the methods (ANOVA and Tukey post-hoc test), then performed both tests manually from scratch to get a firm grasp, and finally conducted the analysis in Python using the pingouin library — which makes things rather trivial due to its simplicity.

Finally, thank you for taking the time to read this! I hope this article helps you analyze your next AB test where the metrics are mean-based! ⚡️

