Causal Inference in A/B Testing: Navigating True Experimental Setups

Nov 16, 2023

You have probably seen this quote in many articles that debate correlation versus causation:

“Correlation does not imply causation.”

Why do we need causal inference? What are the common issues with typical statistical results?

Spurious correlation: We are often tempted to treat correlation (or association) as causation. For example, if there is a high correlation between iPhone sales and deaths caused by falling down stairs, can we conclude that the iPhone is the cause of those deaths?

The answer is no. This peculiar case is called a spurious correlation: the two variables are strongly related, but neither one directly causes the other. Note: no statistical technique reliably detects a spurious correlation on its own; identifying it takes common sense and domain knowledge.

Check out more examples of spurious correlation.

Simpson’s Paradox: Simpson’s paradox is a statistical phenomenon in which a trend visible in several groups reverses or disappears when the groups are combined. For instance, suppose you have two products, Product A and Product B. When you look at sales versus marketing spend for each product separately, the two are positively correlated. But when you compare overall sales versus marketing spend across both products, they are negatively correlated! This surprising flip is Simpson’s paradox, and it reminds us to examine the data at the group level, not only in aggregate, when evaluating marketing strategies. Domain knowledge about which grouping matters is what resolves the paradox.
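To make the flip concrete, here is a minimal sketch with made-up numbers. The spend and sales figures for the two products are hypothetical and chosen only so that each product shows a positive trend while the combined data shows a negative one.

```python
import numpy as np

# Hypothetical weekly (marketing spend, sales) figures for two products.
spend_a = np.array([1.0, 2.0, 3.0, 4.0])
sales_a = np.array([20.0, 21.0, 22.0, 23.0])
spend_b = np.array([6.0, 7.0, 8.0, 9.0])
sales_b = np.array([5.0, 6.0, 7.0, 8.0])


def corr(x, y):
    """Pearson correlation coefficient between two 1-D arrays."""
    return float(np.corrcoef(x, y)[0, 1])


# Within each product, spend and sales move together.
corr_a = corr(spend_a, sales_a)
corr_b = corr(spend_b, sales_b)

# Pooling both products hides the product-level difference in baseline sales.
corr_all = corr(
    np.concatenate([spend_a, spend_b]),
    np.concatenate([sales_a, sales_b]),
)

print(f"Product A: {corr_a:+.2f}, Product B: {corr_b:+.2f}, combined: {corr_all:+.2f}")
```

Product B simply has higher spend and lower baseline sales than Product A, which is enough to turn the pooled correlation negative.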

Confounders: A confounder is a third variable, often unmeasured, that influences both the supposed cause and the supposed effect. For example, in a study finding a correlation between ice cream consumption and sunburns, the confounding variable is temperature: high temperatures cause people to both eat more ice cream and spend more time outdoors under the sun, resulting in more sunburns.
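The ice cream example can be simulated in a few lines. This is only a sketch on synthetic data: the coefficients, noise levels, and sample size are assumptions, and the adjustment shown (correlating the residuals after regressing each variable on temperature) is one simple way to control for a measured confounder.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Synthetic data: temperature drives both variables; neither drives the other.
temperature = rng.normal(25, 5, n)
ice_cream = 2.0 * temperature + rng.normal(0, 5, n)
sunburns = 0.5 * temperature + rng.normal(0, 2, n)


def residuals(y, x):
    """Part of y left over after removing a straight-line fit on x."""
    slope, intercept = np.polyfit(x, y, 1)
    return y - (slope * x + intercept)


# Raw correlation looks strong because of the shared driver.
raw_corr = float(np.corrcoef(ice_cream, sunburns)[0, 1])

# After removing the effect of temperature, the relationship vanishes.
adjusted_corr = float(
    np.corrcoef(residuals(ice_cream, temperature), residuals(sunburns, temperature))[0, 1]
)

print(f"raw: {raw_corr:.2f}, adjusted for temperature: {adjusted_corr:.2f}")
```

This only works when the confounder is measured; with an unmeasured confounder, randomization is what protects us.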

If traditional statistical models suffer from these issues, how are we going to find the cause of a particular event?

There are several causal inference methods available. Today we are going to focus on the randomized controlled experiment (A/B testing), a.k.a. Test and Learn, which is one of the most robust methods for understanding the effect of a particular event (e.g., a promotion).
Formulating a robust process for a Test and Learn (A/B testing) experiment for a given hypothesis is a critical task.

Key Terminologies:

Intervention: A change to a particular event or treatment over a particular period. For instance, increasing the promotion spend for 30 days.
Treatment group: The set of data points on which the intervention is performed. For instance, the random group of customers for whom promotion spend is increased as a new strategy.
Control group: The set of data points on which the intervention is not performed. For instance, a random group of customers similar (in terms of characteristics) to the treatment group, for whom promotion spend is not increased and no new strategy is implemented.
Counterfactual: The “what if?” state in which the intervention had not occurred. The control group is used to estimate this counterfactual outcome.

How to set up the randomized controlled experiment?

Define the hypothesis.
Determine the sample size given the data.
Select the treatment and control groups.
Conduct the experiment on the treatment group.
Analyze the test results.

Let’s take a business case to set up the A/B testing experiment.

The cola company is a popular beverage manufacturer. It is planning to offer a cola refrigerator as a promotion to all the retail stores that sell its cola products. The company strongly believes that this refrigerator promotion will improve its brand visibility and sales. To validate this hypothesis, the company hired a data scientist to run an experiment on a sample of stores and measure the impact.

1. Define the hypothesis:

H0: The refrigerator promotion does not improve sales.

H1: The refrigerator promotion improves sales significantly.

2. Determine the sample size

The formula below estimates the minimum sample size per group needed to achieve statistical significance in the experiment.

n = (Z(1-α/2) + Z(1-β))² * σ² / Δ²

Z(1-α/2): critical value for the chosen confidence level (1-α); a lower α reduces the risk of a Type 1 error.
Z(1-β): critical value for the chosen statistical power (1-β); higher power reduces the risk of a Type 2 error.
σ²: Pooled variance (σ²_treatment_group + σ²_control_group) of the two groups (treatment and control). Because we estimate the sample size before the intervention is performed on the treatment group, we cannot observe σ² for the treatment group, so we approximate the pooled variance as 2 * σ²_control_group.
Δ²: Squared mean difference ((μ1-μ2)²) between the two groups. The issue here is that we don’t know μ for the treatment group, since no intervention has been applied to it yet. Hence we estimate μ for the treatment group with the help of the Minimum Detectable Effect (MDE), a.k.a. minimum effect size. The MDE is the expected lift or change in the mean (in units or as a percentage) when the intervention is performed on the treatment group.

Let’s apply the formula to our business case to find the minimum sample size required for each group. Before calculating the minimum sample size, we need to find the MDE. Since we don’t know the actual treatment effect, we can substitute an expected effect size based on domain knowledge and a margin calculation.

Cost of deploying a refrigerator in one store = $100

Price of one cola product = $10

Variable cost per unit (COGS, CTS, other expenses) = $7

Profit per unit = 10 - 7 = $3

Average sales observed in control group (μ1) = 70 units per week

Break-even lift = cost / profit per unit = 100 / 3 = 33.3 units

To break even on (cover) the treatment cost, cola needs to sell an additional 33.3 units per store.

We assume the spending should break even within 8 weeks. Hence we take the MDE as 33 / 8 ≈ 4.1 units per week.

Break-even lift (%) = 4.1 units / 70 units ≈ 5.8%
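The break-even arithmetic can be wrapped in a small helper so it is easy to rerun with different costs or payback windows. This is just a sketch: the function name and parameter names are my own, and it carries full precision through every step, so its results can land slightly above figures that were rounded by hand along the way.

```python
def break_even_mde(fridge_cost, unit_price, unit_variable_cost,
                   payback_weeks, baseline_weekly_units):
    """Return (total extra units, weekly MDE in units, weekly MDE in %)."""
    unit_profit = unit_price - unit_variable_cost
    # Extra units one store must sell to pay back the refrigerator.
    extra_units_total = fridge_cost / unit_profit
    # Spread the payback evenly over the assumed payback window.
    mde_units = extra_units_total / payback_weeks
    # Express the weekly lift relative to baseline weekly sales.
    mde_pct = 100 * mde_units / baseline_weekly_units
    return extra_units_total, mde_units, mde_pct


# Inputs taken from the business case above.
extra_units_total, mde_units, mde_pct = break_even_mde(
    fridge_cost=100, unit_price=10, unit_variable_cost=7,
    payback_weeks=8, baseline_weekly_units=70,
)
print(f"extra units: {extra_units_total:.1f}, "
      f"MDE: {mde_units:.2f} units/week ({mde_pct:.2f}%)")
```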

Average sales observed in control group(μ1) = 70 units per week.

μ2= Mean sales of control group(μ1)+ MDE = 70+4.1 = 74.1

Δ² = (70 - 74.1)² = 4.1² ≈ 16.8

variance of sales observed in control group(σ²_control_group) = 12

pooled variance=2*σ²_control_group = 12*2 = 24

For a 5% significance level (α), Z(1-α/2) = zscore(0.975) ≈ 1.96
For 80% statistical power (β = 20%), Z(1-β) = zscore(0.80) ≈ 0.84

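from scipy import stats

power_of_test = 0.80        # 1 - β
significance_level = 0.05   # α
# Critical value for a two-sided test at the chosen significance level
z_score_cl = stats.norm.ppf(1 - significance_level / 2)
# Critical value for the chosen statistical power
z_score_power = stats.norm.ppf(power_of_test)
print(round(z_score_cl, 2), round(z_score_power, 3))
# output: 1.96 0.842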

Now we can plug all the values into the sample size formula:

n = (1.96 + 0.84)² * 24 / 16.8 = 7.84 * 24 / 16.8 = 188.16 / 16.8 ≈ 11.2, rounded up to 12

To conduct the experiment, we need at least 12 stores in each group to reach a reasonable conclusion.
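The whole calculation fits in one function. The sketch below follows the formula as used in this article (pooled variance approximated as twice the control variance); the function name and its parameters are my own naming, not a library API.

```python
import math

from scipy import stats


def min_sample_size(control_mean, control_variance, mde, alpha=0.05, power=0.80):
    """Minimum number of units per group, rounded up to a whole unit."""
    z_alpha = stats.norm.ppf(1 - alpha / 2)   # two-sided critical value
    z_power = stats.norm.ppf(power)           # critical value for power
    # Treatment variance is unknown before the test, so assume it equals
    # the control variance: pooled variance = 2 * control variance.
    pooled_variance = 2 * control_variance
    treatment_mean = control_mean + mde       # expected mean under the MDE
    delta_sq = (control_mean - treatment_mean) ** 2
    n = (z_alpha + z_power) ** 2 * pooled_variance / delta_sq
    return math.ceil(n)


# Values from the business case: mean 70, variance 12, MDE 4.1 units.
n_stores = min_sample_size(control_mean=70, control_variance=12, mde=4.1)
print(n_stores)
```

Because the MDE sits in the denominator squared, halving the effect you want to detect roughly quadruples the number of stores you need.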

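Why do we need to calculate the sample size? Fixing it before the experiment starts guards against p-hacking and ensures the test has enough statistical power to detect the effect we care about.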

3. Select Treatment and Control group

A few key principles to follow when selecting the treatment and control samples (stores):

Treatment group:

The samples in the treatment group should be representative of the population.
We can use an independent t-test to compare the stores selected for treatment with the whole population and check whether the treatment group is representative of it.
The treatment samples should be selected randomly to avoid bias. If the population contains distinct groups, use stratified random sampling and make sure the sample has the same group proportions as the population.
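Here is one way these principles might look in code. Everything about the data is hypothetical (300 synthetic stores in three regions with made-up sales levels), and the helper `stratified_sample` is my own illustration. For the representativeness check I compare the selected stores against the remaining stores with Welch's t-test, which is one reasonable reading of the t-test step above.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Hypothetical population: 300 stores across three regions (strata).
region = np.repeat(["urban", "suburban", "rural"], [150, 90, 60])
base_sales = {"urban": 90, "suburban": 70, "rural": 50}
weekly_sales = np.array([rng.normal(base_sales[r], 10) for r in region])


def stratified_sample(strata, sample_size, rng):
    """Randomly pick indices within each stratum, proportional to its size."""
    chosen = []
    for label in np.unique(strata):
        members = np.flatnonzero(strata == label)
        k = round(sample_size * len(members) / len(strata))
        chosen.extend(rng.choice(members, size=k, replace=False))
    return np.array(sorted(chosen))


treatment_idx = stratified_sample(region, sample_size=30, rng=rng)
rest_idx = np.setdiff1d(np.arange(len(region)), treatment_idx)

# A large p-value means no detectable difference in mean sales between
# the selected stores and the remaining stores.
t_stat, p_value = stats.ttest_ind(
    weekly_sales[treatment_idx], weekly_sales[rest_idx], equal_var=False
)
print(f"selected {len(treatment_idx)} stores, p-value = {p_value:.3f}")
```

In practice you would repeat the check on every metric that matters (sales, store size, footfall), not just one.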

Control group:

The samples in the control group should closely resemble the treatment group, i.e., in our case each treatment store should have a similar control store (in terms of store characteristics and sales patterns).
We can use DTW (Dynamic Time Warping) to find stores with similar sales patterns and Euclidean distance to measure similarity in store characteristics. However, there are various approaches to finding similar stores, and the right one depends on the data and the problem.
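As an illustration of the DTW idea, here is a small textbook implementation used to pick the closest control candidate for one treatment store. The sales series and store names are hypothetical, and a production setup would more likely use a dedicated DTW library and combine the result with a distance on store characteristics.

```python
import numpy as np


def dtw_distance(a, b):
    """Classic dynamic-programming DTW distance with absolute-difference cost."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    cost = np.full((len(a) + 1, len(b) + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            step = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three allowed alignments.
            cost[i, j] = step + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return float(cost[len(a), len(b)])


# Hypothetical weekly sales for one treatment store and three candidates.
treatment_sales = [70, 72, 75, 80, 75, 72]
candidates = {
    "store_A": [40, 41, 40, 42, 41, 40],      # different level, flat pattern
    "store_B": [70, 70, 72, 75, 80, 75, 72],  # same pattern, shifted one week
    "store_C": [80, 75, 72, 70, 72, 75],      # same level, opposite pattern
}

distances = {name: dtw_distance(treatment_sales, s) for name, s in candidates.items()}
best_match = min(distances, key=distances.get)
print(distances, best_match)
```

DTW tolerates the one-week shift in store_B, which a plain point-by-point distance would penalize.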

4. Conduct an experiment

The experiment rests on a few assumptions.

The control group and the treatment group would have similar sales patterns in both the pre-treatment and post-treatment periods if there were no intervention.
There is no intervention, change, or treatment in the control group.
No confounding changes are made to the treatment group other than the intervention itself.

Apply the treatment to the test (treatment) stores and wait until the entire post-test period is complete. Determining the post-test period requires business knowledge. In our case, the sales effect of introducing a refrigerator can be observed within a month, so we use 1 month as the post-test period.

5. Post Test Analysis

There are several approaches to analyzing the treatment and control results. Difference-in-Differences (DID) is a well-suited approach for quantifying the treatment effect and validating the hypothesis.

Assumptions of DID model:

Parallel trend: Without the treatment, the treatment group and the control group would show similar changes over time.
Other factors or variables affect both groups in a similar way.
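The parallel trend assumption cannot be proven, but the pre-period data allows a rough sanity check. The sketch below uses hypothetical pre-period weekly sales, indexes each series to 100 at week 0, and compares the fitted weekly slopes. How small a gap counts as "parallel enough" is a judgment call, not a rule from this article.

```python
import numpy as np

# Hypothetical pre-period weekly sales (units) for each group.
control_pre = np.array([68, 69, 70, 70, 71, 72], dtype=float)
treatment_pre = np.array([80, 81, 82, 82, 83, 84], dtype=float)


def indexed_trend(series):
    """Weekly slope of the series after indexing it to 100 at week 0."""
    indexed = 100 * series / series[0]
    weeks = np.arange(len(series))
    slope, _intercept = np.polyfit(weeks, indexed, 1)
    return float(slope)


control_slope = indexed_trend(control_pre)
treatment_slope = indexed_trend(treatment_pre)

# A small gap suggests both groups were growing at a similar relative rate.
slope_gap = abs(control_slope - treatment_slope)
print(f"control: {control_slope:.2f}, treatment: {treatment_slope:.2f}, gap: {slope_gap:.2f}")
```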

The main advantage of DID is that it can correct or reduce bias from pre-existing differences between the groups by accounting for changes over time.

How does DID work?

Difference-in-Differences (DID) Estimation Method:

Calculate the pre vs. post change (%) for the control group.
Calculate the estimated post-period outcome for the treatment group, i.e., what would the outcome have been if the treatment group had behaved like the control group (parallel trend)?

Post-period estimated outcome for treatment (estimated) = (Pre-period outcome for treatment) * (1 + (Pre vs. post change (%) for control) / 100)

Calculate the performance lift.

Treatment vs. control change (%) = 100 * (Post-period observed outcome for treatment (actual) - Post-period estimated outcome for treatment (estimated)) / (Post-period estimated outcome for treatment (estimated))

Probability that the lift is at least the expected business lift:

1 - cdf(expected lift, observed lift, pooled std)

Here cdf(x, mean, std) is the normal cumulative distribution function evaluated at x = expected lift, with mean = observed lift and standard deviation = pooled std.

Apply the DID to our refrigerator case study:

Pre-period sales for control group = 70 units

Post-period sales for control group = 75 units

Control group pre vs. post lift (%) = ((75 - 70) / 70) * 100 = 7.14%

Pre-period sales for treatment group = 82 units

Post-period estimated outcome for treatment (estimated) = 82 units * (1 + (7.14 / 100)) = 87.9 units

Post-period sales for treatment group (actual) = 98 units

Treatment vs. control change (%) = ((98 units - 87.9 units) / 87.9 units) * 100 = 11.5%

The refrigerator promotion produces an 11.5% lift in sales units.

Probability of making a profit:

Expected lift = break-even lift (%) = 5.8%

Pooled std = 3.4

1 - cdf(expected lift, observed lift, pooled std)

1 - cdf(5.8, 11.5, 3.4) ≈ 1 - 0.05 = 0.95

The probability of making a profit on the refrigerator implementation is about 95%.
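The case study can be reproduced with a few lines of Python. This sketch uses the inputs from the case study and treats the lift as normally distributed around the observed value, as the 1 - cdf step implies; the function name is my own. Because it carries full precision through every step, its figures can differ slightly from values rounded by hand at intermediate steps.

```python
from scipy import stats


def did_lift_pct(control_pre, control_post, treatment_pre, treatment_post):
    """Return (counterfactual post-period outcome, lift in %) for treatment."""
    # Relative change in the control group between pre and post periods.
    control_change = (control_post - control_pre) / control_pre
    # What treatment would have sold had it followed the control trend.
    expected_post = treatment_pre * (1 + control_change)
    lift_pct = 100 * (treatment_post - expected_post) / expected_post
    return expected_post, lift_pct


expected_post, lift_pct = did_lift_pct(
    control_pre=70, control_post=75, treatment_pre=82, treatment_post=98
)

break_even_lift_pct = 5.8   # expected business lift from the MDE step
pooled_std = 3.4            # pooled standard deviation given in the case study

# 1 - cdf(expected lift, observed lift, pooled std)
prob_profit = 1 - stats.norm.cdf(break_even_lift_pct, loc=lift_pct, scale=pooled_std)

print(f"counterfactual: {expected_post:.1f} units, "
      f"lift: {lift_pct:.1f}%, P(profit): {prob_profit:.2f}")
```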

Wrapping Up

Mastering the principles of causal inference, particularly through robust methods like A/B testing, is essential for discerning true causation amid statistical complexities. As demonstrated in our business case, a well-designed experiment, from hypothesis formulation to post-test analysis using techniques like Difference-in-Differences, empowers businesses to make informed decisions based on reliable insights. By understanding the nuances of correlation, avoiding spurious relationships, and addressing confounders, we pave the way for more accurate and actionable interpretations of data, ultimately enhancing the effectiveness of strategies and interventions.

Feel free to share your queries and thoughts in the comments.


Written by Jagadeesanmuthuvel