
Meet the Engine of A/B Testing: Chi-Square Test

Pararawendy Indarjo · Bukalapak Data · Aug 28, 2020

Understand the concept and perform one from scratch

A/B testing is a user experience research methodology for proving causal relationships. The name is shorthand for a simple controlled experiment in which users are randomly served one of two variants: variation A (control) or variation B (Young, 2014). The goal is then to figure out which variant performs better in terms of some predefined metrics.

Here at Bukalapak, A/B tests are highly prevalent. We perform them regularly to optimize almost any aspect of our app. Our use cases range from revamping a certain page, to testing different recommendation algorithms, to tweaking specific user shopping journeys; essentially, to provide the best possible experience for our users.

A/B testing is all the more beneficial because it establishes causality rather than mere correlation. It is therefore perfectly in line with the Bukalapak Data Team's principle of "turning insight into action": we spot associations in the data, verify causality via an A/B test, and then act on the findings to drive growth.

A/B Testing in practice

One example of A/B testing at Bukalapak is one we recently ran on address filling by our Mitra (our agent partners). For some context, we encourage our Mitra to fill in their address, since it allows us to provide them with a better user experience (e.g. notifying them of relevant promos happening close to their location). We divided them into different segments and sent each segment a push notification with different wording.

A/B Testing our notification strategy with our Mitra.

It turns out that the push notification with a community-empowerment message outperformed the one offering rewards (vouchers, cashback) in return for filling in the address. From this A/B test, we could conclude that the content of the message mattered more than a potential carrot (which also saved us some unnecessary reward costs).

Realizing the benefits that A/B tests have to offer, we at Bukalapak developed Splitter: our in-house A/B testing platform that allows us to perform experiments quickly and at scale. All we need to do is set up the variant treatments and define the success metrics of interest. Beyond these two, Splitter handles the rest, from splitting the traffic (to serve users different treatments) to analyzing the experiment results.

But what goes on under the hood of Splitter? What scientific tool actually powers A/B tests like the one mentioned above? From a statistical point of view, an A/B test is just another form of hypothesis testing, in which we resort to a certain statistical testing method to draw our conclusion. As it turns out, the chi-square test is precisely the method we are looking for.

In this blog, we will walk through the theoretical concept of the chi-square test. Next, we will go through a working example, i.e. analyzing an example A/B test from scratch, so that we deeply understand how things work.

Chi-Square Test

The chi-square test (for independence) is a statistical test to evaluate whether two or more categorical variables (each with two or more possible values) are independent or homogeneous, i.e. whether the values are distributed in relatively the same way across the variables.

As the definition suggests, the data on which chi-square tests operate is a typical contingency table like the one below.

Figure 1: Example of a contingency table

In the contingency table above, we have two variables: learning method (visual or auditory) and exam result (pass or fail).

A data set like this is often called an “R×C table,” where R is the number of rows and C is the number of columns. This is a 2×2 table (McDonald, 2014).

Hypotheses to be tested

In the standard form of the chi-square test, the null and alternative hypotheses to be tested are as follows:

  • Null: The variables are independent, meaning the values are distributed in relatively the same way across the variables
  • Alternative: The variables are not independent; how the values are distributed depends on the variable

Test statistic

Given a contingency table, the test statistic of the chi-square test is formulated as follows.

$$\chi^2 = \sum_{i=1}^{R} \sum_{j=1}^{C} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$$

Equation 1: Chi-square test statistic

where $O_{ij}$ denotes the observed count of cell (i, j) and $E_{ij}$ its expected count under the null hypothesis. Moreover, the expected count is computed as

$$E_{ij} = \frac{(\text{total of row } i) \times (\text{total of column } j)}{\text{grand total}}$$

Equation 2: Formula to compute the expected value of cell (i, j)

The test statistic in Equation 1 is known to approximately follow the chi-square distribution with (R − 1) × (C − 1) degrees of freedom (Frost, 2020). For example, a 2×2 contingency table like the one in Figure 1 has (2 − 1) × (2 − 1) = 1 degree of freedom.
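To make Equations 1 and 2 concrete, here is a minimal Python sketch that computes the expected counts and the test statistic by hand. The counts are hypothetical stand-ins for the learning-method example in Figure 1 (the actual figures live in the image above).

```python
import numpy as np

# Hypothetical 2x2 table; rows: visual, auditory; columns: pass, fail
observed = np.array([[60, 40],
                     [45, 55]])

row_totals = observed.sum(axis=1, keepdims=True)  # shape (2, 1)
col_totals = observed.sum(axis=0, keepdims=True)  # shape (1, 2)
grand_total = observed.sum()

# Equation 2: E_ij = (row i total) x (column j total) / grand total
expected = row_totals @ col_totals / grand_total

# Equation 1: sum of (O - E)^2 / E over all cells
chi2_stat = ((observed - expected) ** 2 / expected).sum()
print(chi2_stat)  # ~4.51 for these made-up counts
```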

Figure 2: Probability density functions of several chi-square distributions with k degrees of freedom (Source: Wikipedia)

Comparing the test statistic to the table value

After we compute the test statistic, we compare it with the table value. Precisely, our table value is

$$\chi^2_{k,\,1-\alpha}$$

where k and alpha are the degrees of freedom and our predefined significance level, respectively.

If we find that our test statistic is greater than the table value above, we can confidently reject the null hypothesis. That is, we conclude that, at the given significance level, the variables are not independent: how the values are distributed depends on the variable.
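As a quick illustration, scipy can look up the table value for us, so we do not need a printed chi-square table. Here is a minimal sketch of the decision rule, reusing the hypothetical statistic from the sketch above:

```python
from scipy.stats import chi2

k, alpha = 1, 0.05                       # degrees of freedom, significance level
table_value = chi2.ppf(1 - alpha, df=k)  # ~3.84

chi2_stat = 4.51  # hypothetical test statistic from the earlier sketch
if chi2_stat > table_value:
    print("Reject the null: the variables are not independent.")
else:
    print("Fail to reject the null hypothesis.")
```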

Working example

It’s now time to get our hands dirty. Let’s have a concrete example!

Suppose there is a digital company that wants to improve the redemption rate of its promo vouchers by revamping its current MyVoucher page design. So we have the following two competing designs:

  1. Control: the existing design
  2. Variant: the revamped design

They roll out the experiment by randomly serving each user one of the two designs and recording the user's action accordingly, i.e. whether or not they redeem a voucher.

Suppose we have the following result.

Figure 3: A/B test result (original form)

Note that from the table in Figure 3 we can derive an equivalent table as follows, which might be more familiar to business users.

Figure 4: A/B test result (equivalent form)

We see from Figure 4 that the redemption rate of the revamped design is higher than that of the existing design. Nevertheless, the difference might actually be caused by inherent random noise, i.e. it might not be statistically significant. Using the chi-square test, it is our task to check whether or not the difference is significant.

Hypotheses

In other words, we want to test the two competing hypotheses below. Notice the difference in wording compared with the pair we explained in the concept section; nevertheless, they are equivalent.

  • Null: There is no significant difference in redemption rates obtained by the two designs
  • Alternative: There is a significant difference in redemption rates obtained by the two designs

Before we carry out any computation, and to prevent cheating with the data, we set our alpha to 0.05 (5%) throughout the analysis.

Computing expected cell value

We first compute the expected value for each cell of the contingency table in Figure 3. To this end, we use Equation 2.

For convenience, let’s put these values in a table.

Figure 5: Table of expected values

Computing the test statistic

Next, we compute the test statistic, whose formula is given in Equation 1. This is quite straightforward since we already have the two ingredients, namely the actual result table (Figure 3) and the expected value table (Figure 5), as sketched below.
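Since the actual counts are shown only in the figures, here is a sketch of the full computation with hypothetical counts standing in for Figure 3; they will not reproduce our actual statistic of 67. The manual result is cross-checked against scipy.stats.chi2_contingency.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical counts; rows: control, variant; columns: redeemed, not redeemed
observed = np.array([[340, 4660],
                     [480, 4520]])

# Expected counts via Equation 2
expected = (observed.sum(axis=1, keepdims=True)
            * observed.sum(axis=0, keepdims=True)) / observed.sum()

# Test statistic via Equation 1
chi2_stat = ((observed - expected) ** 2 / expected).sum()

# Cross-check; correction=False gives the plain (uncorrected) Pearson statistic
stat, p_value, dof, _ = chi2_contingency(observed, correction=False)
print(chi2_stat, stat, dof, p_value)  # both statistics agree, dof = 1
```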

Drawing the conclusion

After we have the test statistic value (67), we need to compare it with the table value, that is, the point at which the chi-square distribution with (2 − 1) × (2 − 1) = 1 degree of freedom reaches a cumulative probability of (1 − 0.05) = 0.95. The value is 3.84 (source).

Therefore, our test statistic (67) is greater than the table value (3.84), so we reject the null hypothesis. There is enough evidence to state that there is a significant difference in the redemption rates obtained by the two designs.

Moreover, since the redemption rate of the revamped design is higher than the control's (see Figure 4), we can conclude that the revamped design is the winner of this experiment; that is, the revamped design is better for the MyVoucher page than the control, at the 5% significance level.

Closing remarks

From this article, we understand that an A/B test can be seen as a statistical hypothesis testing problem. We learn the concept of the chi-square test, the statistical tool that powers such A/B tests. Afterwards, we implement the test to analyze a working example of an A/B test from scratch. Hopefully, by doing so, you will have a better understanding of the methodology.

Since this article is not, by any means, intended to be comprehensive reading on either A/B testing or the chi-square test, here are some points to be aware of:

First, regarding the chi-square test: it is only valid when the sample size is relatively large, i.e. > 1000 (McDonald, 2014). If this threshold is not met, the test result might not be reliable. In such cases, one can use Fisher's exact test instead; see the sketch below.
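A minimal sketch of Fisher's exact test via scipy, with hypothetical small-sample counts:

```python
from scipy.stats import fisher_exact

# Hypothetical small-sample 2x2 table, too small for a chi-square test
observed = [[8, 2],
            [1, 5]]
odds_ratio, p_value = fisher_exact(observed)
print(p_value)
```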

Second, there is another type of chi-square test besides the independence/homogeneity test we discussed in this article: the chi-square test for goodness of fit. Briefly speaking, it is used when we want to test whether or not the distribution of a single categorical variable follows a specific (given/assumed) distribution, as sketched below.
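As an illustration, scipy.stats.chisquare performs the goodness-of-fit variant; here we test whether a die is fair, using hypothetical roll counts:

```python
from scipy.stats import chisquare

# Hypothetical observed counts for die faces 1-6 (120 rolls in total)
observed = [18, 22, 16, 25, 19, 20]

# With no expected frequencies given, chisquare assumes a uniform distribution
stat, p_value = chisquare(observed)
print(stat, p_value)  # a large p-value means no evidence the die is unfair
```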

Third, regarding the metrics of A/B testing: we can use metrics that are not based on proportions (such as the redemption rate in this article). Depending on the problem at hand, we might want to evaluate a continuous numeric metric (for instance, average transaction amount) through A/B testing. As a consequence, we need to resort to a different statistical technique to analyze such experiments.

Finally, we can generalize A/B testing to include more than one non-control variant (a.k.a. multivariate testing). Again, this results in a (slightly) different statistical method for drawing conclusions.

Thanks for reading, and happy experimenting!

Resources

Young, S. W. H. (2014). Improving library user experience with A/B testing: Principles and process. Weave: Journal of Library User Experience.

McDonald, J. H. (2014). Handbook of Biological Statistics (3rd ed.). Sparky House Publishing, Baltimore, Maryland.

Frost, J. (2020). Hypothesis Testing. Self-published.

Special thanks

Special thanks go to Jonathan Kurniawan, who helped proofread this article.
