Simpson’s Paradox In A/B Testing

Himanshu Verma
HomeAway Tech Blog
Sep 19, 2018

Results of an A/B test on browser variants don’t seem to add up

Introduction

I work as a software engineer in the Experimentation Platform Team at HomeAway. Our platform analyzes data and generates readouts and statistics for A/B tests which help us optimize the HomeAway website for our customers. A/B test readout accuracy is crucial for business teams and product managers as they drive key decisions.

We generate statistics at the overall aggregate level as well as for individual segments of each test. Recently, we noticed a case where every individual segment in a particular test showed impact in one direction, but the aggregate-level stats pointed in the opposite direction. A result like this is counterintuitive and therefore raised a lot of concerns.

Such scenarios are usually caused by Simpson’s Paradox. In this blog, I will explain the paradox in detail and why it is not always a good idea to make decisions solely based on intuition. We will use an A/B test example throughout the discussion.

Simpson’s Paradox

Simpson’s Paradox is a phenomenon that occurs when we observe a certain directional trend or relationship in all mutually exclusive segments of the data, but the same trend is not observed (or reverses) when we look at the combined dataset. It is commonly observed when analyzing/comparing proportions or averages (e.g., conversion rate and average booking value).

Definition

Consider two samples from a dataset containing N mutually exclusive segments:

First sample:

Each segment Si has Dfi observations, out of which Tfi are successful (success rate Tfi / Dfi).

Second sample:

Each segment Si has Dsi observations, out of which Tsi are successful (success rate Tsi / Dsi).

Simpson’s Paradox appears in the case when:

Tfi / Dfi > Tsi / Dsi for each segment Si (the success rate of the first sample is greater than the success rate of the second sample in every individual segment)

But,

(Tf1 + Tf2 + … + TfN) / (Df1 + Df2 + … + DfN) < (Ts1 + Ts2 + … + TsN) / (Ds1 + Ds2 + … + DsN)

That is, the trend in success rate reverses at the aggregate level.
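To make the definition concrete, here is a minimal Python sketch (an illustration, not part of our platform) that checks whether two samples, broken into the same mutually exclusive segments, exhibit the paradox:

```python
def simpsons_paradox(first, second):
    """Each argument maps segment name -> (successes, observations)."""
    # Per-segment check: the first sample wins every individual segment.
    first_wins_all = all(
        first[s][0] / first[s][1] > second[s][0] / second[s][1]
        for s in first
    )
    # Aggregate check: pool successes and observations across segments.
    tf = sum(t for t, _ in first.values())
    df = sum(d for _, d in first.values())
    ts = sum(t for t, _ in second.values())
    ds = sum(d for _, d in second.values())
    return first_wins_all and (tf / df < ts / ds)
```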

Example

To understand the paradox better, let’s consider the following example:

We run an A/B test which has two variants (A and B) for two different browsers (Chrome and Firefox).

Variant A:

  • Firefox: 80 visitors were exposed to variant A, out of which 70 converted.
  • Chrome: 20 visitors were exposed to variant A, out of which 10 converted.
  • Total visitors who were exposed to variant A (regardless of browser) = 100 (80 + 20)
  • Total visitors who converted on variant A (regardless of browser) = 80 (70 + 10)

Similarly, for variant B:

  • Firefox: 20 visitors were exposed to variant B, out of which 20 converted.
  • Chrome: 80 visitors were exposed to variant B, out of which 50 converted.
  • Total visitors who were exposed to variant B (regardless of browser) = 100 (20 + 80)
  • Total visitors who converted on variant B (regardless of browser) = 70 (20 + 50)

Here is the conversion rate table:

                 Variant A          Variant B
Firefox          70/80 (87.5%)      20/20 (100%)
Chrome           10/20 (50%)        50/80 (62.5%)
Both Browsers    80/100 (80%)       70/100 (70%)

If we compare the conversion rates of variant A and variant B, we can see that the conversion rate of variant B is better than the conversion rate of variant A for Firefox and Chrome browsers individually. But when we do an overall comparison (both browsers), variant A outperforms variant B. How can variant B lose in overall results when it is a winner in all individual categories? This is Simpson’s Paradox.
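The arithmetic is easy to verify; here Python serves as a calculator:

```python
# Each dict maps browser -> (converted, exposed), from the numbers above.
a = {"Firefox": (70, 80), "Chrome": (10, 20)}  # variant A
b = {"Firefox": (20, 20), "Chrome": (50, 80)}  # variant B

for browser in a:
    ra = a[browser][0] / a[browser][1]
    rb = b[browser][0] / b[browser][1]
    print(f"{browser}: A={ra:.1%}, B={rb:.1%}")  # B wins both browsers

overall_a = sum(c for c, _ in a.values()) / sum(n for _, n in a.values())
overall_b = sum(c for c, _ in b.values()) / sum(n for _, n in b.values())
print(f"Overall: A={overall_a:.1%}, B={overall_b:.1%}")  # A=80.0%, B=70.0%
```

Feeding these two dicts into the simpsons_paradox sketch above, with variant B as the first sample, returns True.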

Reasons For Simpson’s Paradox

Many would intuit that variant B should be the overall winner because it performed better than variant A in both segments. It is important to note that this intuition is based on correlation and should not be confused with causation. This is the main reason the paradox is difficult to grasp.

The intuition only holds when segment sizes are comparable across variants (note that I intentionally did not specify the A/B test traffic split in the example above). When we combine the results, the cells with more data dominate the totals for their variant and can flip the direction of the overall result. The Both Browsers row is dominated by the Firefox cell for variant A (87.5% success rate) and by the Chrome cell for variant B (62.5% success rate).
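You can see the weighting explicitly by writing each overall rate as a weighted average of its segment rates, with weights equal to each segment's share of the variant's traffic:

Variant A: 0.8 × 87.5% + 0.2 × 50% = 80%
Variant B: 0.2 × 100% + 0.8 × 62.5% = 70%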

Here is an example where the intuitive correlation holds (note the comparable denominators):
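For instance, with 50 visitors per browser in each variant (illustrative numbers):

```python
# Illustrative numbers: each browser gets 50 visitors in each variant,
# so no single cell can dominate the combined totals.
a = {"Firefox": (44, 50), "Chrome": (25, 50)}  # A: 88% and 50% -> overall 69%
b = {"Firefox": (50, 50), "Chrome": (31, 50)}  # B: 100% and 62% -> overall 81%
# Variant B wins each browser AND wins overall: intuition and totals agree.
```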

Learnings

Our A/B test was configured for a 50–50 traffic split between the variants. After analyzing the raw data, we found that one segment had an issue: the amount of data in the two variants differed massively. It is also worth noting that it was the mismatch between our intuition and the results that prompted us to investigate and uncover the issue.
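A standard way to catch this kind of imbalance early is a sample ratio mismatch (SRM) check: compare the observed per-variant traffic counts against the configured split with a chi-square test. Here is a minimal sketch (the srm_pvalue helper is illustrative, not our production tooling):

```python
from scipy.stats import chisquare

def srm_pvalue(count_a, count_b, expected_ratio=0.5):
    """Low p-value => observed traffic deviates from the configured split."""
    total = count_a + count_b
    expected = [total * expected_ratio, total * (1 - expected_ratio)]
    _, p = chisquare([count_a, count_b], f_exp=expected)
    return p

# Firefox segment from the example above, under a 50-50 split:
print(srm_pvalue(80, 20))  # ~2e-9: this segment's traffic split is broken
```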

The important learning here is to ask questions whenever results look unexpected. Moreover, we should clearly understand the difference between correlation and causation. Statistics gives us accurate results from the data, even when that data has some non-obvious characteristics!
