Part 1: Simplifying causal inference to connect stakeholders and data scientists
Bridging the causal inference gap between business and data people
This is the first in a series of posts designed to make causal inference both accessible and actionable for business professionals and data scientists. The overall goal is to bridge the communication gap between these groups, ensuring they are on the same page.
- For data scientists, I provide tools to explain these concepts clearly to non-technical audiences.
- For business professionals and novice data scientists, I offer simplified explanations and practical examples, helping everyone stay aligned.
In this initial post, I’ll cover foundational topics like causal effects, counterfactuals, and biases. Upcoming posts will build on this foundation, exploring key terms and frameworks to deepen your understanding and promote collaboration across teams :)
1. Causal Inference and Causal Effects (also includes lift, downstream impact, treatment and control groups)
Technical explanation: Causal inference aims to determine whether one thing causes another. The causal effect is the difference in outcomes between what happened and what would have happened without the intervention (see "counterfactual"). Other terms convey the same idea, such as lift (the difference in outcomes between treatment and control groups) and downstream impact, which refers to effects that occur later or in related outcomes: an ad might influence not just immediate sales but also brand perception or sales in other geographic areas.
Note: Establishing true causality is challenging and requires careful study design and analysis.
Intuitive explanation: It’s like figuring out if a new marketing campaign actually increased sales, and by how much (causal effect = lift). We usually compare a group that received the campaign (treatment) to a similar group that didn’t (control).
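To make the lift idea concrete, here is a toy simulation (all numbers are hypothetical): we invent weekly sales for a control group of stores and a treatment group whose true campaign effect is +5 units, then estimate the lift as the difference in group averages.

```python
import random

random.seed(42)

# Hypothetical example: weekly sales for two similar groups of stores.
# The treatment group ran the campaign; we build in a true lift of +5 units.
control = [random.gauss(100, 10) for _ in range(1000)]
treatment = [random.gauss(105, 10) for _ in range(1000)]

# Lift = average outcome in treatment minus average outcome in control.
lift = sum(treatment) / len(treatment) - sum(control) / len(control)
print(f"Estimated lift: {lift:.1f} units per store")
```

Because the two groups differ only by the campaign (by construction), the estimated lift lands close to the true +5; with real data, making the groups comparable is the hard part.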
2. Counterfactual and Potential Outcome
Technical explanation: A counterfactual represents what would have happened to an individual if they had not received the treatment, serving as a crucial concept in causal inference. The potential outcome is the result for each possible treatment scenario, but since we can only observe one outcome, we rely on assumptions and methods to estimate the counterfactual.
Note: We can never directly observe counterfactuals, which makes their estimation challenging and, when done poorly, a source of biased conclusions. Still, the simple habit of thinking in counterfactual terms gives us a sense of where our measurement strategy may fall short.
Intuitive explanation: It’s like asking, “What if we hadn’t run that ad campaign?” The counterfactual is the alternate reality where we didn’t, and potential outcomes include both what actually happened and what could have happened under different scenarios. That ad campaign may have had no actual effect; maybe we would have sold more even in the absence of the campaign due to factors such as seasonality.
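A toy potential-outcomes table can make this tangible (a sketch with made-up numbers): we write down both potential outcomes for every customer, something reality never lets us observe, and then see how randomization lets the control group stand in for the treated group's missing counterfactual.

```python
import random

random.seed(0)

# For each customer, *define* both potential outcomes (impossible with real data).
customers = []
for _ in range(1000):
    y0 = random.gauss(50, 5)         # sales without the ad (the counterfactual if treated)
    y1 = y0 + 3                      # sales with the ad (true individual effect = +3)
    treated = random.random() < 0.5  # random assignment
    customers.append((treated, y1 if treated else y0))  # we only ever see one outcome

# Randomization makes the control average a stand-in for the treated group's
# unobserved counterfactual, so the difference recovers the true effect.
treated_avg = sum(y for t, y in customers if t) / sum(1 for t, _ in customers if t)
control_avg = sum(y for t, y in customers if not t) / sum(1 for t, _ in customers if not t)
print(f"Estimated effect: {treated_avg - control_avg:.2f}")
```

The printed estimate hovers around the true +3 precisely because assignment was random; without randomization, the observed control group may be a poor proxy for the counterfactual.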
3. Omitted Variable Bias and Selection Bias
Technical explanation: Omitted Variable Bias occurs when we fail to include in a model variables that affect both the treatment (e.g., receiving or not receiving an intervention) and the outcome, leading to inaccurate estimates of the treatment effect. Selection Bias happens when the participants in a study are systematically different from the target population, leading to unrepresentative or biased results.
Note: These biases are interconnected: if we could observe and adjust for all relevant characteristics of individuals, both omitted variable bias and selection bias would be mitigated.
Intuitive explanation: Suppose a company tests a new marketing strategy and sees a significant increase in sales. However, it doesn’t realize that the study included its most loyal customers, who are naturally more responsive to marketing. This oversight leads to selection bias. Additionally, if the analysis doesn’t account for customer loyalty, the resulting estimate suffers from omitted variable bias.
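The loyalty story can be simulated in a few lines (a hedged sketch, every number hypothetical): loyalty drives both who ends up in the campaign and how much they buy, so the naive treated-vs-untreated comparison is inflated, while comparing within loyalty groups recovers the true effect.

```python
import random

random.seed(1)

rows = []
for _ in range(10_000):
    loyal = random.random() < 0.5
    # Loyal customers are far more likely to be in the campaign...
    treated = random.random() < (0.8 if loyal else 0.2)
    # ...and buy more regardless of it; the true campaign effect is +2.
    sales = 20 + (10 if loyal else 0) + (2 if treated else 0) + random.gauss(0, 1)
    rows.append((loyal, treated, sales))

def avg(xs):
    return sum(xs) / len(xs)

# Naive comparison: ignores loyalty, so it absorbs the loyalty gap.
naive = avg([s for _, t, s in rows if t]) - avg([s for _, t, s in rows if not t])

# Adjusted comparison: compare within each loyalty stratum, then average.
within = []
for group in (True, False):
    tr = [s for l, t, s in rows if l == group and t]
    co = [s for l, t, s in rows if l == group and not t]
    within.append(avg(tr) - avg(co))
adjusted = avg(within)

print(f"naive: {naive:.1f}, adjusted: {adjusted:.1f}")
```

The naive estimate comes out several times larger than the true +2 because loyalty (the omitted variable) is doing most of the work; stratifying on it removes the distortion.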
4. Imperfect Compliance, ITT and LATE
Technical explanation: Imperfect compliance occurs when not all participants in an experiment adhere to their assigned treatment, which complicates the estimation of treatment effects. The compliance rate is the proportion of participants who adhere to the treatment assignment.
Note: Low compliance rates can reduce the effectiveness of randomization. When compliance is low, the analyses must account for this to avoid biased results, often contrasting the Intention-to-Treat effect (ITT) with the Local Average Treatment Effect (LATE), which is the effect on the compliers.
Intuitive explanation: In an experiment where some users of an app are assigned to see a new feature, not everyone assigned to see this feature actually uses it. This scenario represents imperfect compliance. Understanding this helps us differentiate between the effect of simply being shown the feature (ITT) and the effect on those who actually used it (LATE).
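The app-feature example can be sketched as a simulation (assumed numbers throughout): only assigned users can open the feature, but just 60% of them do. Under this one-sided noncompliance, the LATE can be estimated by scaling the ITT up by the compliance rate (the Wald estimator).

```python
import random

random.seed(7)

users = []
for _ in range(20_000):
    assigned = random.random() < 0.5
    used = assigned and random.random() < 0.6  # only 60% of assigned users comply
    # True effect of actually using the feature: +4 engagement points.
    engagement = 10 + (4 if used else 0) + random.gauss(0, 2)
    users.append((assigned, used, engagement))

def avg(xs):
    return sum(xs) / len(xs)

# ITT: effect of being *assigned* the feature, diluted by non-users.
itt = avg([e for a, _, e in users if a]) - avg([e for a, _, e in users if not a])
compliance = avg([1.0 if u else 0.0 for a, u, _ in users if a])
# LATE via the Wald estimator: rescale the ITT by the compliance rate.
late = itt / compliance

print(f"ITT: {itt:.1f}, compliance: {compliance:.0%}, LATE: {late:.1f}")
```

The ITT lands near 2.4 (the true +4 diluted by 60% usage), while the rescaled LATE recovers roughly +4 for the compliers, which is exactly the ITT-vs-LATE contrast described above.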
5. Attrition, Attrition Bias (also includes balanced vs. unbalanced panels)
Technical explanation: Attrition occurs when participants drop out of a study over time. If the dropouts are systematically different from those who remain, the results can be distorted; this is attrition bias.
Note: In that sense, two related terms are “balanced panels”, which have complete data for all participants across all time periods, and “unbalanced panels”, which have missing data points. Depending on the reason for the imbalance, this may lead to biased estimates.
Intuitive explanation: Think of attrition as people leaving a long-term survey. If those who leave are different from those who stay (e.g., unhappy customers), it can distort our understanding. It’s like trying to judge a restaurant’s quality when only satisfied customers leave reviews.
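The restaurant-review intuition can be simulated in a few lines (a toy sketch with hypothetical numbers): when unhappy customers are more likely to stop responding to a survey, the surviving panel looks rosier than the full population.

```python
import random

random.seed(3)

# True satisfaction scores for the whole customer base (roughly a 0-10 scale).
scores = [random.gauss(6, 2) for _ in range(10_000)]

# Attrition: low scorers are more likely to drop out of the panel
# (staying probability rises with the score, capped at 1).
stayed = [s for s in scores if random.random() < min(1.0, s / 8)]

true_avg = sum(scores) / len(scores)
panel_avg = sum(stayed) / len(stayed)
print(f"true average: {true_avg:.1f}, surviving panel: {panel_avg:.1f}")
```

The surviving panel's average sits noticeably above the true one, even though nobody's satisfaction changed, only who stayed in the data. That gap is attrition bias.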