
How to Analyze an A/B/C Test

Pararawendy Indarjo
Published in Bukalapak Data · 9 min read · Mar 31, 2021


What to do when your A/B test contains more than two groups

In a previous post (linked later in this article), I shared that experimentation through A/B testing is one of the most prevalent tools we use at Bukalapak. We run experiments regularly in order to continuously deliver an even better experience to our users.

A common scenario would be as follows:

  1. We have a system version (a search/recommendation algorithm, a certain in-app page UI design, etc., you name it) running in production (the control).
  2. We develop a contender version that we think could improve on the one running in production (the variant).
  3. We perform an A/B test to validate the matter. If the variant obtains a statistically significant, and meaningful, lift over the control, the variant wins and replaces the control. Otherwise, there is nothing to do (the control stays in production).

That said, at Bukalapak we have so many ideas on how to improve our platform. Therefore, we sometimes need to test multiple variant groups simultaneously (e.g., the control group A is challenged by two variant groups, B and C, so we have three groups to test in total). This allows more rapid iteration, which benefits our users since it also means improvements are delivered to them faster.

Technically (data-science-wise) speaking, such tests no longer reside within the realm of the standard A/B testing method. Instead, they belong to a generalized/extended version of it called, as you can guess, A/B/C testing. Generally, the number of letters follows the total number of groups we want to test. For simplicity, let us refer to any test with more than two groups as an A/B/C test.

This blog is about how to analyze such A/B/C tests. If you want a refresher on analyzing a standard A/B test instead, you can find the tutorial in the following post.

The problem with A/B/C tests

Before we proceed further, we need to recall two essential concepts related to hypothesis testing (after all, A/B and A/B/C tests are just a form of hypothesis testing): the P-value and the Type I error.

The first is the P-value. Despite seeming simple, the P-value is actually a slippery concept. The P-value is the probability of obtaining a deviation at least as large as the one in the observed sample, given that the NULL hypothesis is TRUE (the probability is thus computed under the NULL distribution).

For example, a P-value of 0.05 in an A/B test means that if the control and variant groups are truly indifferent (actually the same) in terms of the considered metric over the whole user population, about 5 out of every 100 such A/B tests would show the deviation/lift observed in the sample, or a larger one, simply because of random sampling error.

Next is the Type I error. This is the probability of rejecting a NULL hypothesis that is actually true. In our setting, the Type I error is thus the probability of concluding that the control and variant groups differ in the metric, even though the ground truth is that they perform equally well in the population.

For the rest of the article, we will consider the following A/B/C test sample. Suppose our growth team wants to improve its promo vouchers' redemption rate by tweaking certain parts of the MyVoucher page UI design.

There are three new designs proposed: Design A, Design B, and Design C. They are ready to challenge the status quo, the existing design. We therefore have the following four segments in our A/B/C test:

  1. Existing design
  2. Design A
  3. Design B
  4. Design C

We run the experiment by exposing each targeted user to one of the above designs at random and recording his/her action accordingly: whether or not he/she redeems the voucher. Suppose we have the following results (not real data; the numbers cited are fictitious). Note that we will use an ⍺ of 5%.

Table 1 — Experiment results

Segment                     Targeted users   Redeemed   Redemption rate
Existing design (control)   8,333            1,062      12.7%
Design A                    8,002            825        10.3%
Design B                    8,251            1,289      15.6%
Design C                    8,175            1,228      15.0%

The first thing to do when analyzing A/B tests is to conduct a Chi-square test. Yet, when there are more than two segments in the test, the Chi-square test only takes us to an aggregate-level conclusion. If the test is significant, it only tells us that at least one segment performs significantly differently from the others, but not which segment it is. In light of this, it seems natural to then test each pair of experimental groups individually. But there is a hidden devil in this that must be taken care of!

Consider the previous example; we have the following 6 pairs to test individually:

  • Control vs. Design A
  • Control vs. Design B
  • Control vs. Design C
  • Design A vs. Design B
  • Design A vs. Design C
  • Design B vs. Design C

Now let us find out the probability of obtaining at least one significant result purely by chance (a Type I error) from these 6 tests. Note that the complementary event of obtaining at least one significant result purely by chance is that we do not reject the null hypothesis in any of the 6 tests. Assuming independence, this latter event's probability is (1-⍺)⁶ = (1-0.05)⁶ = 0.735. Our original desired probability is therefore one minus this value, 1-0.735 = 0.265, which is far greater than the ⍺ = 0.05 that we set!
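This inflation is easy to verify in R; a minimal sketch using the ⍺ and the 6 pairs from our example:

# probability of at least one false positive across 6 independent tests,
# each run at significance level alpha
alpha <- 0.05
n_pairs <- choose(4, 2)    # 4 segments yield 6 pairwise comparisons
1 - (1 - alpha)^n_pairs    # ~0.265, far above the intended 0.05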

To say it explicitly: the scenario above, which seems so natural, explodes our probability of committing a Type I error! What can we do about it? One way is to perform a P-value correction.

By the way, the mentioned scenario of testing all the pairs of experiment groups, armed with a certain P-value correction strategy, is also called a posthoc test.

P-Value correction is the solution

There are many approaches to P-value correction. In this article, we will only show two of them.

First approach: controlling Family Wise Error Rate (FWER)

In this approach, we control the probability of having at least one false positive across all the pairwise tests we conduct. Put precisely, we require

FWER = P(at least one false positive among the n comparisons) ≤ ⍺*

One method to achieve this is the Bonferroni correction, whose derivation is essentially the union bound:

FWER = P(FP₁ ∪ FP₂ ∪ … ∪ FPₙ) ≤ P(FP₁) + P(FP₂) + … + P(FPₙ) = n × ⍺

where n is the number of comparisons (pairs) and ⍺ is the level of each individual test. Consequently, to keep the FWER at ⍺*, we either run each individual test at ⍺*/n or, equivalently, multiply each raw P-value by n and compare it against ⍺*:

adjusted P-value = min(n × raw P-value, 1)

Note that ⍺* is no longer the same as our ⍺ in the initial test, i.e., it is not the chance of having a false positive at the individual comparison. Instead, ⍺* is the probability of having at least one false positive among all comparisons conducted.

For example, ⍺* = 5% means out of all comparisons we made, there is a 5% chance to have at least one false-positive conclusion.

Despite being intuitive and simple, the Bonferroni correction is criticized for being too conservative, because the adjusted P-values grow so fast. To get an idea, 10 comparisons lead to 10x greater P-values. This makes rejecting the NULL hypothesis very hard, i.e., it takes a very strong signal (evidence) to do so.

On the flip side, this inflates the Type II error rate: there would be more comparisons we conclude as "non-significant" even though the ground truth is that they are significant.
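For reference, base R's p.adjust() applies this correction directly; a minimal sketch with hypothetical raw P-values (made up purely for illustration):

# Bonferroni: multiply each raw P-value by the number of comparisons (capped at 1)
raw_p <- c(0.004, 0.020, 0.300)           # hypothetical raw P-values
p.adjust(raw_p, method = "bonferroni")    # 0.012 0.060 0.900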

Second approach: controlling False Discovery Rate (FDR)

In this approach, we control the expected proportion of false positives among all the positives (significant results) we get from all comparisons. More formally, we require

FDR = E[number of false positives / number of positives] ≤ ⍺*

One method to achieve this is using the Benjamini-Hochberg correction. This method’s step-by-step is as follows:

  • Get the original (raw) P-values of each segment pair
  • Sort the raw P-values in ascending order
  • For the raw P-value at rank k, multiply it by n/k (capping the result at 1) to obtain its adjusted P-value: adjusted P-value = min(raw P-value × n/k, 1)

Where n is the same as before: the total number of comparisons we make.

Figure 1 - Step-by-step performing Benjamini-Hochberg correction on P-values

Again, note that ⍺* is no longer the same as our ⍺ in the initial test (and it is also different from the one in the FWER approach). The interpretation of ⍺* is now the expected proportion of false positives among all positives.

For example, ⍺* = 5% means that if our comparisons yield 100 positive (significant) conclusions, we expect about 5 of them to be false positives.
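The same base R function covers Benjamini-Hochberg via method = "BH"; a minimal sketch with the same hypothetical P-values, showing the manual n/k computation next to the built-in call:

# Benjamini-Hochberg: multiply the k-th smallest raw P-value by n/k
raw_p <- c(0.004, 0.020, 0.300)    # hypothetical raw P-values
n <- length(raw_p)
k <- rank(raw_p)
pmin(raw_p * n / k, 1)             # 0.012 0.030 0.300
p.adjust(raw_p, method = "BH")     # same result here; p.adjust() additionally
                                   # enforces monotonicity across the ranks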

Analyzing the experiment

Now that we have the weapons to use, let us carry out the analysis of the experiment!

Table 1 — Experiment results (repeated here for convenience)

Segment                     Targeted users   Redeemed   Redemption rate
Existing design (control)   8,333            1,062      12.7%
Design A                    8,002            825        10.3%
Design B                    8,251            1,289      15.6%
Design C                    8,175            1,228      15.0%

First step: Do a Chi-square test on the whole table. You can follow my step-by-step tutorial here.
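A minimal sketch of this first step in R, assuming (as in Table 1) that target holds the total number of targeted users per segment, so the Chi-square test runs on the redeemed vs. not-redeemed counts:

# aggregate-level Chi-square test across all four segments
target   <- c(8333, 8002, 8251, 8175)
redeemed <- c(1062,  825, 1289, 1228)
cont <- rbind(redeemed = redeemed, not_redeemed = target - redeemed)
colnames(cont) <- c("control", "design_a", "design_b", "design_c")
chisq.test(cont)   # a significant result: at least one segment differs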

Second step: Perform a Z-proportion test for each pair of segments. A nice tutorial is available here.
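A minimal sketch for a single pair (control vs. Design B); prop.test() with correct = FALSE gives the same P-value as the two-sided two-proportion Z-test:

# two-proportion test for one pair of segments
prop.test(x = c(1062, 1289), n = c(8333, 8251), correct = FALSE)
# repeat for all 6 pairs to collect the raw P-values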

Third step: Compute adjusted P-values & gather the conclusion. Just do what I demonstrated in Figure 1.
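Once the six raw P-values are collected, a single call produces the adjusted values; the numbers below are the raw P-values reported in the Result section further down:

# adjust the six raw P-values in one call
raw_p <- c(1.286e-06, 1.222e-07, 2.564e-05, 9.910e-24, 2.780e-19, 2.949e-01)
p.adjust(raw_p, method = "BH")   # reproduces the adj_p_value column below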

You can do all three steps manually, or you can call one function that does everything for you. The good news is, I have written an end-to-end (wrapper) function named holistic_abtest in R to perform all the required tests.

First, the function will perform a Chi-square test on the aggregate data level. If this test is significant, the function will continue to perform a posthoc test that consists of testing each pair of experimental groups to report their adjusted P-values and their absolute lift (difference) confidence intervals. The function code is presented below.

For more information regarding the function, including the format of the data, function parameters, and how to read the function outputs, you can find them on my GitHub repository here.
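To make the flow concrete, here is a rough sketch of what such a wrapper could look like. It is a simplified illustration only, not the actual holistic_abtest implementation (which, among other things, also reports the lift confidence intervals); it assumes the same data layout as in the code below, with a "redeemed" column and a "target" column holding the total targeted users per segment.

# simplified sketch of an end-to-end wrapper (illustration only)
holistic_abtest_sketch <- function(data, method = "BH", alpha = 0.05) {
  redeemed <- data[, "redeemed"]
  totals   <- data[, "target"]
  # step 1: aggregate-level Chi-square test
  overall <- chisq.test(rbind(redeemed, totals - redeemed))
  if (overall$p.value >= alpha) return(overall)   # nothing significant overall
  # step 2: two-proportion test for every pair of segments
  pairs <- combn(rownames(data), 2, simplify = FALSE)
  raw_p <- sapply(pairs, function(pr)
    prop.test(redeemed[pr], totals[pr], correct = FALSE)$p.value)
  # step 3: P-value correction on the pairwise results
  data.frame(pair        = sapply(pairs, paste, collapse = " vs "),
             raw_p_value = raw_p,
             adj_p_value = p.adjust(raw_p, method = method))
}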

Alright, it’s time to see the function in action! To this end, we will use Benjamini-Hochberg P-value correction, with ⍺* = 5%.

Code

# prepare data
target = c(8333,8002,8251,8175)
redeemed = c(1062,825,1289,1228)
data = as.table(cbind(redeemed, target))
dimnames(data) = list(segment = c("control","design_a","design_b","design_c"), action = c("redeemed", "target"))

# define the functions (definitions available in the GitHub repository linked above)
# define posthoc_abtest()
# define holistic_abtest()

# use the function
holistic_abtest(data = data, method = "BH", alpha = 0.05)

Result

              pair  raw_p_value adj_p_value lower_ci upper_ci
1 control vs design_a 1.286e-06 1.93e-06 * -0.0342 -0.0144
2 control vs design_b 1.222e-07 2.44e-07 * 0.0180 0.0395
3 control vs design_c 2.564e-05 3.07e-05 * 0.0121 0.0334
4 design_a vs design_b 9.910e-24 0 * 0.0427 0.0635
5 design_a vs design_c 2.780e-19 0 * 0.0367 0.0574
6 design_b vs design_c 2.949e-01 0.294911 NA NA

By looking at the adjusted P-values that are significant, coupled with the corresponding confidence intervals of the lifts, we conclude that Design B and Design C are together the best designs for the MyVoucher page UI, since they performed significantly better than the other segments (the existing design and Design A) in terms of redemption rate. In particular, if we compare Design B with the existing design, we expect an absolute lift in redemption rate between 1.80 and 3.95 percentage points.

Moreover, since the redemption rates of Design B and Design C are statistically indistinguishable (practically the same), it is up to the stakeholders to choose which one will be rolled out to production!

Closing

Thanks and congratulations for reading this far! 👏

In this blog, we learned that there is a hidden devil in analyzing A/B/C tests, particularly when doing the posthoc test: the Type I error rate gets inflated! One way to resolve the issue is by applying a P-value correction.

We also learned that there are two approaches to P-value correction: controlling the FWER or the FDR. Finally, we demonstrated how to analyze a sample A/B/C test using an end-to-end wrapper function I wrote, available on GitHub.

As a final remark, note that the function we used is for proportion-based metrics only. We need to resort to a slightly different testing strategy for continuous metrics.

Finally, I hope this article helps you to carry out your next A/B/C test analysis! ✨
