Unlocking the Power of Causal Inference in Machine Learning

A Beginner’s Guide to Uplift Modeling and Average Treatment Effect Estimation.

Curtis Lu
9 min read · Mar 18, 2023
Let’s start the journey of causal inference with machine learning together.
Photo by Ilya Pavlov on Unsplash

Introduction: Prediction vs Causal Inference

Recently, there have been more and more discussions and business applications of causal inference in the field of machine learning. Generally speaking, machine learning is good at prediction problems. For example, the recently popular ChatGPT is essentially an AI that solves a prediction problem: based on the question you ask, it “predicts” the answer you would like to know.

However, many business problems require causal inference, such as determining how much a promotional offer will increase a customer’s purchase amount. These types of problems cannot be solved by simply dumping all the features into a model. They require experimental design and an understanding of the concept of counterfactual inference.

A simple example of the concept of uplift modeling

Here’s an example often used to explain causal inference to beginners:

Imagine you want to advertise a product and figure out which group of customers to target. You have data on past conversions for two groups, A and B. Who should you advertise to?

In general, if both user groups have similar numbers of people and spend similar amounts, it’s better to choose user group A because it has a higher advertising conversion rate.

However, based on the natural conversion rate (the probability of purchasing without advertising), advertising is actually more effective on user group B.
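
For illustration, suppose the (hypothetical) numbers looked like this:

  • Group A: 10% of customers buy after seeing the ad, but 9% would buy anyway without it, an uplift of only 1 percentage point.
  • Group B: 6% buy after seeing the ad, while only 2% would buy without it, an uplift of 4 percentage points.

Group A has the higher conversion rate after advertising, but the ad actually changes far more customers’ minds in group B.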

This means that just by looking at “purchase probability after advertising,” we can’t tell apart the following four types of customers:

Four types of customer segments

  • Persuadable: These are the people who won’t buy without seeing an ad, but will buy when they do see one. They are the audience that our ads most want to target!
  • Sure things: These are the people who will buy regardless of whether or not they see an ad. Therefore, advertising to this group is actually a waste of resources.
  • Lost causes: These are the people who won’t buy regardless of whether or not they see an ad. They can be considered inactive customers; simply advertising to them, without other more effective ways of stimulating them, would also be a waste.
  • Sleeping dogs: These people are a little different. They buy when they haven’t seen an ad, but don’t buy when they do see one. Therefore, advertising to these people would actually have a negative effect.

In customer group A, there might be more “Sure thing” customers who will make a purchase even without advertising. This means advertising is wasted. On the other hand, in customer group B, there might be more “Persuadable” customers who are more likely to be influenced by advertising.

Response model versus Uplift model

In machine learning terms, advertising is an intervention or treatment, and the purchase rate is a response. The approach that mixes the four customer segments together and only looks at the response (i.e. the purchase rate) given the intervention (i.e. ad delivery) is called a response model: it only considers the purchase rate of customers who have already received the ad.

An “uplift model” measures how much a person’s response changes when they are exposed to an intervention compared to when they are not exposed. The difference in response is called the “causal effect” of the intervention. For example, an uplift model can measure how much advertising affects the purchase rate compared to no advertising.
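
In the simplest terms: uplift (the causal effect of the ad) = P(purchase | the customer sees the ad) - P(purchase | the same customer does not see the ad).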

The math above talks about a big idea in causal inference called “counterfactual inference”. We call it “counterfactual” because at any moment, “the same person” has either seen an ad or not seen it, and we can only see one of these. The uplift model uses machine learning and experimental design to try to figure out what would have happened if the person had or hadn’t seen the ad.
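
To make the contrast concrete, here is a minimal sketch in Python on simulated data (the purchase rates, lift, and column names are all hypothetical, purely for illustration): a response model only averages the purchase rate among customers who saw the ad, while the uplift view compares customers who saw the ad against those who did not.

    import numpy as np
    import pandas as pd

    # Hypothetical A/B-test log: one row per customer.
    # "treated" = 1 if the customer was shown the ad, "bought" = 1 if they purchased.
    rng = np.random.default_rng(0)
    n = 10_000
    df = pd.DataFrame({"treated": rng.integers(0, 2, n)})

    # Simulate purchases: a 2% baseline rate plus a 4-point lift when treated.
    df["bought"] = (rng.random(n) < (0.02 + 0.04 * df["treated"])).astype(int)

    # Response-model view: only the purchase rate of customers who saw the ad.
    response_rate = df.loc[df["treated"] == 1, "bought"].mean()

    # Uplift view: purchase rate with the ad minus purchase rate without it.
    uplift = (
        df.loc[df["treated"] == 1, "bought"].mean()
        - df.loc[df["treated"] == 0, "bought"].mean()
    )

    print(f"Response-model purchase rate: {response_rate:.3f}")
    print(f"Estimated uplift (causal effect): {uplift:.3f}")

Because the treatment here is randomly assigned, the simple difference in purchase rates is a reasonable estimate of the causal effect; the later section on bias explains why this breaks down without randomization.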

Real-life usage of uplift modeling

This article only explains the uplift model conceptually; later articles will cover how to apply it. To see how uplift models are used by Line’s team in Taiwan and by Uber, check out the following resources:

  • Uber’s data scientist explains the fundamental concepts and practical uses of “Uplift Modeling”:
  • Line uses uplift models to identify audiences that respond better to ads. (In Chinese)

This section is based on the book Causal Inference for The Brave and True, written by Matheus Facure, a Staff Data Scientist at Nubank in Brazil. The book is easy to understand and full of memes, and I recommend it to everyone who wants to learn about causal inference. This article and the following ones are essentially my notes on chapters from the book:

  • This article: Basic concepts of causal inference
  • The 2nd article: Randomized trials, confidence intervals, and causal graph models
  • The 3rd article: Propensity scores and Doubly Robust Estimation
  • The 4th article: Meta Learners: S-learner, T-learner, X-learner
  • The 5th article: Debiased/Orthogonal Machine Learning, or the R-learner
You’ve learned a lot about uplift modeling. Time to take a break!
Photo by Todd Quackenbush on Unsplash

Understanding Causality

If you’re familiar with regression models, you’ve probably heard the phrase “correlation doesn’t imply causation.” This article explains why, and shows under what conditions correlation can be turned into causation.

Mathematical Notation for Potential Outcomes

The treatment (or intervention) for unit i is represented as follows:
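
Ti = 1 if unit i received the treatment (e.g., saw the ad), and Ti = 0 if it did not.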

Now, let’s talk about the “potential outcome”. It might be a little hard to understand, so take your time. Imagine we have a group of things we want to study. Some of them will get a treatment, and others won’t. We can only see what happens to each thing either with the treatment or without the treatment. But we also want to know what would have happened if the opposite had been true. This is what we call the “potential outcome”. We can’t actually see it because it didn’t happen, but it’s still important to think about.
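
Y1i = the potential outcome of unit i if it receives the treatment (Ti = 1)
Y0i = the potential outcome of unit i if it does not receive the treatment (Ti = 0)

For any single unit, only one of these two outcomes can ever be observed.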

For instance, when you feel bad about having made decision A in the past (Ti=1), you might imagine what would have happened if you had not made that decision, which is the potential outcome Y0i. On the other hand, if you feel bad about not taking advantage of opportunity B (Ti=0), you might imagine what would have happened if you had taken it, which is the potential outcome Y1i.

Causal Effect

Causal effect can be divided into the following types:

Individual Treatment Effect

The individual treatment effect uses the potential-outcome concept to represent the effect of the treatment on a single unit i:
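
Individual treatment effect of unit i = Y1i - Y0i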

From the previous explanation, we know that in reality we can only observe one of these two potential outcomes, so this difference can never be computed directly for an individual. It is shown here solely to illustrate the concept of a causal effect, which rests on the idea of the “counterfactual.”

Average Treatment Effect (ATE)

The average causal effect when considering a group as a whole is called the Average Treatment Effect (ATE):
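
ATE = E[Y1 - Y0]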

Average Treatment Effect on the Treated (ATT)

Another measure is similar to ATE, but focuses only on the units that actually received the treatment:
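
ATT = E[Y1 - Y0|T=1]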

Conditional Average Treatment Effect (CATE)

This type of causal effect refers to the average treatment effect among individuals with similar characteristics, after taking various features into account. For example, if we want to achieve personalized advertising, we want to know which types of people the advertisements are more effective for (i.e. people who have certain characteristics that make them more responsive to the advertisement), so we can make more efficient use of advertising resources. The mathematical formula for CATE is as follows:

When the treatment is a binary variable:
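
CATE(x) = E[Y1 - Y0|X=x]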

When the treatment is a continuous variable:
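
One common way to write this, treating the effect as the derivative of the outcome with respect to the treatment level t, is:

CATE(x) = E[∂Y/∂t|X=x]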

This type of causal effect is closely related to the development of machine learning in the field of causal inference. The purpose of many studies and applications is to use the powerful predictive ability of machine learning to estimate CATE.

Association, Causation, and Bias

Association can be understood as the degree to which Y changes, on average, when T changes. Mathematically, it is represented as follows:
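
Association = E[Y|T=1] - E[Y|T=0]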

Notice that here Y represents only the observed outcome (we observe Y1 when treated and Y0 when not treated), so we can rewrite the expression as follows:
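
E[Y|T=1] - E[Y|T=0] = E[Y1|T=1] - E[Y0|T=0]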

Next, we need a small trick to introduce the concept of counterfactuals: adding and then subtracting E[Y0|T=1]:
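
E[Y|T=1] - E[Y|T=0] = E[Y1|T=1] - E[Y0|T=0] + E[Y0|T=1] - E[Y0|T=1]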

Rearranging the terms, we get:
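
E[Y|T=1] - E[Y|T=0] = E[Y1|T=1] - E[Y0|T=1] + E[Y0|T=1] - E[Y0|T=0]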

Finally, after merging the terms, we get the following:
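
E[Y|T=1] - E[Y|T=0] = E[Y1 - Y0|T=1] + { E[Y0|T=1] - E[Y0|T=0] }

Association = ATT + BIAS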

From the above reasoning, we can see that “association” is actually “causation” plus a “bias” term. Why is it called a bias? First consider E[Y0|T=1]: it is the counterfactual outcome, representing the state of the treated population “if they had not been treated”, while E[Y0|T=0] represents the state of the non-treated population. A difference between these two means that the treated and non-treated populations already differed before any treatment was applied.

As an example, if we observe that cities with larger police forces have higher crime rates, does that mean having more police officers leads to more crime? Setting aside far-fetched explanations such as collusion between the police and criminals, the more likely reason we observe this pattern is that cities with larger police forces already had higher crime rates before the extra police arrived. This pre-existing difference creates bias when we try to make causal claims from observed correlations, and can even lead us to the opposite conclusion.

Conversely, if we know that bias does not exist, that is, E[Y0|T=1] - E[Y0|T=0] = 0, then we can obtain:
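
E[Y|T=1] - E[Y|T=0] = E[Y1 - Y0|T=1] = ATT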

In fact, if there is no bias, the treated and non-treated groups are very similar, and they differ only in whether they were treated (i.e., in T itself). Therefore the causal effect in the two groups will also be very similar, and under this condition we obtain E[Y1 - Y0|T=1] = E[Y1 - Y0|T=0]; in other words:
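
E[Y|T=1] - E[Y|T=0] = E[Y1 - Y0] = ATE

That is, the association we observe equals the average treatment effect: correlation has become causation.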

Conclusion

This article first briefly described the differences between response models and uplift models, and introduced the concept of counterfactuals and why it is related to causal effects. It then briefly introduced the various causal-effect terms and their mathematical notation, followed by an explanation of the relationships between association, causation, and bias, and of the conditions under which an observed association can be interpreted as a causal relationship.

I will continue to organize the relevant contents from the book into Medium articles in the future. You are welcome to follow me and give me some encouragement by clapping for my posts!

If there are any errors or areas for discussion, please feel free to contact me. Here is my LinkedIn:

https://www.linkedin.com/in/pingchienlu/


Curtis Lu

Data Scientist at a commercial bank in Taiwan with a background in political science. Writes about machine learning from a multi-disciplinary perspective.