How Simpson’s Paradox Could Impact A/B Tests
Simpson’s paradox occurs when we observe a certain trend in the aggregate data but not in the underlying segments that comprise the data. In the A/B testing domain, Simpson’s Paradox can occur when the overall mean conversion rate and/or average order value of the experiences tested points to a result different from the mean conversion rates and/or average order values of the underlying segments.
Let me illustrate this with an example from a blog post by Georgi Georgiev, an instructor at CXL. Suppose you run an A/B test between Page A and Page B and see the following results:
Looking at the average conversion rate, it looks like you have a conclusive test with B beating A (assuming the sample size requirements were met and other conditions, such as statistical significance and power, were satisfied). But before you take that victory lap around the office, you see something completely unexpected. When you segment the data by traffic source, you see that A has outperformed B for every traffic source!
What does this mean? How is this even possible? This is a classic example of Simpson’s Paradox.
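To see how this can happen, here is a short sketch with hypothetical numbers (these are illustrative, not the figures from Georgiev’s post). The trick is an unbalanced traffic mix: B happens to receive far more of the high-converting traffic source, so B’s aggregate rate is pulled up even though A converts better within every segment.

```python
# Hypothetical A/B test counts, segmented by traffic source.
# segment -> variant -> (visitors, conversions)
data = {
    "email":   {"A": (100, 10), "B": (900, 81)},  # high-converting source
    "display": {"A": (900, 18), "B": (100, 1)},   # low-converting source
}

def rate(visitors, conversions):
    """Conversion rate for one cell of the table."""
    return conversions / visitors

def overall(variant):
    """Aggregate conversion rate across all segments for a variant."""
    visitors = sum(cells[variant][0] for cells in data.values())
    conversions = sum(cells[variant][1] for cells in data.values())
    return conversions / visitors

# Per-segment rates: A wins in BOTH segments.
for segment, cells in data.items():
    print(f"{segment}: A={rate(*cells['A']):.1%}  B={rate(*cells['B']):.1%}")

# Aggregate rates: B wins overall, because 90% of B's traffic came
# from the high-converting email source (vs. only 10% of A's).
print(f"overall: A={overall('A'):.1%}  B={overall('B'):.1%}")
```

Running this prints A ahead in each segment (10.0% vs 9.0% for email, 2.0% vs 1.0% for display) yet B ahead overall (8.2% vs 2.8%), because the aggregate rate is a traffic-weighted average and the weights differ sharply between variants.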