Causal ML for Decision Making

Ryuta Yoshimatsu
15 min read · Oct 25, 2023


Introduction and Application

Image from Adobe Stock

This article is about causal inference and decision making, and it has two parts. The first part introduces causal inference and explains why it matters for decision making. The second part shows how to apply causal inference to a real project: I’ll present the four key steps of causal inference, an effective framework for structuring your analysis end to end, and walk through them in detail on an actual use case. It’s a long post, so grab a coffee and make yourself comfortable!

What is causal inference and why is it important?

As a machine learning specialist, I’m excited to see predictive modeling become an ever more integral part of what we do. Machine learning finds patterns in our data and helps us turn those patterns into decisions. For example, a fraud detection model learns correlational patterns between the features and the labels and gives us a heads-up whenever it sees a possibly fraudulent transaction. Based on this prediction, we take action: we can block the transaction or report it to someone. The Predictive I/O feature of Databricks learns users’ query patterns and automatically configures Delta Lake tables for better performance. Machine learning is transforming the way we make decisions.

It is a great tool, but we shouldn’t forget that it rests on a fundamental assumption: the data the model sees in training is representative of the data used for testing and the data it will see in production. When this assumption breaks, the model makes mistakes. That’s why it’s important to track the model’s performance and the distribution of the variables after the model has been deployed. Whenever we detect a significant drift in a key variable or metric, we may want to retrain the model and relearn the correlational patterns that exist in the new dataset.

So machine learning relies on this fundamental assumption, and interestingly, the assumption leads to another problem once we combine it with decision making. When we use machine learning for decision making, we implicitly assume that the decisions we make or the actions we take will not break the correlational patterns the model learned, and will not change the distribution of the data. That is not always true, because sometimes we actively intervene in the environment and change the patterns ourselves.

Image from Adobe Stock

For example, think about the task of training a machine learning model that helps us make irrigation decisions. We train a model to predict the soil moisture level of a farm based on current readings and future weather forecasts, using years of data from a real farm. After training, we ask the model: “Here’s the status of my farm right now, and it’s going to be hot for the next couple of days. Should I water my fields?” The model trained on years of data will likely say: “No, the soil moisture level will be high when the temperature goes up, so don’t worry about it.” Intuitively, this doesn’t make sense, because you know that if it’s going to be hot, the soil will dry out. Why did the model say this? What actually happened is that the model learned that, in the past, high temperatures were associated with high soil moisture levels, because the farmers always watered on hot days. That is the pattern the model picked up, and it’s the pattern we break the moment we decide not to irrigate based on the prediction.

Image from Adobe Stock

Another example is churn. You might have come across the famous IBM Telco dataset. It contains strong signals that let you build a model that predicts which customers will churn with pretty high accuracy. It’s a great dataset for practicing machine learning. Now imagine this is not just a hypothetical exercise but an actual situation in your business. You would want to leverage these predictions and take preventive actions to retain your customers: giving a discount, sending an email, making a phone call, and so on. Let’s say you managed to retain some customers. That means you’ve broken the correlational patterns the model learned. Now suppose you found an effective way to convince your customers to stay and managed to retain most of them. Would the next version of the model, trained on these new data points, then stop classifying anyone as at risk (because, in the data, that’s what actually happened)? These questions might sound trivial, but they are fundamental. Some decisions and some actions will directly break the patterns that your machine learning model relies on.

If we look at these two examples, we start to see the type of task that machine learning is not great at: making good decisions when those decisions lead us to intervene in the environment. The interventions break the very patterns our machine learning model learned. Moreover, in order to make a good decision, we need to be able to estimate the effect of the possible actions. So for decision making we clearly need something else. Instead of just predicting the target variable, we need to find the variables that cause the outcome, and we need to be able to estimate how the outcome would change if we changed those variables. This is causal inference.

Simple causal graph, where T is the treatment, Y is the outcome and X is the confounder.

A practical definition of causality is the following. We say that a treatment T (which can also be a decision or an action) causes an outcome Y if and only if changing T leads to a change in Y while everything else is kept constant.

On the left is the real world and on the right is a counterfactual world

Let’s say that in the past we took an action and set T to 1, and we observed the outcome Y(T=1). Now imagine a world in which T was not changed (T is set to 0) while everything else was kept constant. We call this world a counterfactual. If Y(T=0) would have been different from Y(T=1), the causal effect is the magnitude by which the outcome Y changes, i.e. Y(T=1) − Y(T=0). These are the practical definitions of causality and causal effect.
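To make the notation concrete, here is a minimal sketch with simulated data. Only in a simulation can we generate both potential outcomes for every unit, so the quantity Y(T=1) − Y(T=0) is directly computable; the variable names and numbers below are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

x = rng.normal(size=n)                 # some background covariate
y0 = 2.0 * x + rng.normal(size=n)      # outcome if T = 0
y1 = y0 + 5.0                          # outcome if T = 1 (true causal effect is 5)

t = rng.binomial(1, 0.5, size=n)       # the action we actually took
y_observed = np.where(t == 1, y1, y0)  # in reality we only ever see one of the two

ate = (y1 - y0).mean()                 # average of Y(T=1) - Y(T=0) over all units
print(f"True average causal effect: {ate:.2f}")
```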

This definition leads to two fundamental challenges of causal inference. The first challenge comes from the fact that we never actually observe the counterfactual. We get to see what we did, but we don’t get to see what we didn’t do. So we can’t directly measure the causal effect, which is defined as the difference between the real world and the counterfactual. We therefore have to estimate the counterfactual. This means that validation of the method is going to be challenging because there is no ground truth. We will look into how we can still validate our method in the second part.

The second challenge is that, given an observational dataset, there are typically multiple causal structures that fit it. Think of a causal mechanism as a DAG (directed acyclic graph) whose nodes are the features and whose directed edges express the causal relationships. The second challenge says that the data alone may not be enough to find the DAG that captures the true causal structure. We need domain knowledge and assumptions to disambiguate the structure that actually generated the observations. These are the two important challenges that many causal inference solutions are designed to deal with. We will look into them in more detail in the second part.
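A tiny, hedged illustration of the second challenge (with made-up variables): a world where X causes Y and a world where Y causes X can produce exactly the same joint distribution, so correlation alone cannot tell the two stories apart.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

# World A: X causes Y
x_a = rng.normal(size=n)
y_a = 0.8 * x_a + 0.6 * rng.normal(size=n)

# World B: Y causes X (coefficients chosen so the joint distribution matches World A)
y_b = rng.normal(size=n)
x_b = 0.8 * y_b + 0.6 * rng.normal(size=n)

print(np.corrcoef(x_a, y_a)[0, 1])  # ~0.8
print(np.corrcoef(x_b, y_b)[0, 1])  # ~0.8 -- same correlation, opposite causal story
```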

Let me summarize the first part.

Decision making depends on understanding the effects of decisions on the outcome. If we want to optimize how we make decisions, we need to think about causality. The predictions from machine learning are not enough because our decisions and actions can change the associative patterns that machine learning depends on. Causal inference addresses this challenge directly but it does run into new challenges. Firstly, we have to start bringing in external knowledge and augment our observational dataset. Secondly, we need new methods to estimate the counterfactuals and we need new methods to validate those estimates.

That was it for the first part!

In the second part, we’re going to go through the four key steps of causal inference. These steps are the following.

1. Modeling (causal discovery): create a causal graph that encodes our assumptions, augmenting the observational dataset with our domain knowledge.

2. Identification: take the causal graph from the previous step and formulate what we need to estimate and how. Given a causal graph, there can be multiple approaches to isolate the effect of a treatment on an outcome, and each approach requires a different set of variables to control for. This step is about identifying that approach.

3. Estimation: take the approach identified in the previous step and apply statistical techniques to estimate the causal effect of a treatment on an outcome.

4. Refutation: validate our assumptions by running a battery of sensitivity analysis tests to see if we can refute our estimators.

In short, the four key steps are modeling, identification, estimation and refutation.

To implement these four steps, we will use a Python package called DoWhy. DoWhy is an open-source project for causal inference developed by Microsoft Research. It makes it really easy to structure your analysis along the four key steps. It is especially powerful in the first and final steps, by ensuring a transparent declaration of assumptions and by offering many out-of-the-box validation tools to test them. It’s also extensible: you can plug in different graph tools and different packages like sklearn, EconML and CausalML for the effect estimation. It’s one of the most popular causal inference frameworks in the Python ecosystem.
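Here is a minimal, self-contained sketch of what the four-step DoWhy workflow looks like on toy data (the variable names, graph and method choices below are illustrative, not the actual code from the notebooks discussed later):

```python
import numpy as np
import pandas as pd
from dowhy import CausalModel

rng = np.random.default_rng(0)
n = 5_000
x = rng.normal(size=n)                          # confounder
t = (x + rng.normal(size=n) > 0).astype(int)    # treatment influenced by the confounder
y = 2.0 * t + 1.5 * x + rng.normal(size=n)      # outcome (true effect of t is 2)
df = pd.DataFrame({"x": x, "t": t, "y": y})

# Step 1: modeling -- encode assumptions as a graph (DOT string; GML also works)
model = CausalModel(
    data=df, treatment="t", outcome="y",
    graph="digraph {x -> t; x -> y; t -> y;}",
)
# Step 2: identification
estimand = model.identify_effect()
# Step 3: estimation
estimate = model.estimate_effect(estimand, method_name="backdoor.linear_regression")
# Step 4: refutation
refutation = model.refute_estimate(estimand, estimate, method_name="placebo_treatment_refuter")

print(estimate.value)   # should be close to the true effect of 2
print(refutation)
```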

Before jumping in to see how causal inference works in action, let me explain what our use case is about. It is built around a fictitious software company that has been providing several types of promotional offers to its customers. They have been doing so for a year without any control over which customer gets which offer. Of course, these offers carry costs in terms of discounted revenues or service fees that need to be covered. If an incentive works, it can help close the deal, expand the size of the purchase, or increase the velocity of the deal. If it doesn’t, it has no impact on the deal and just ends up eroding the profit margin. It’s therefore important for this software company to estimate the effect of its promotional offers on customers’ purchase behavior, because it wants to maximize the profit margin on each customer. We are going to build a personalized recommendation system for promotional offers using causal inference, following the four key steps.

Introduction

We structured this project into four notebooks and a few auxiliary files and scripts. They are publicly available here. Let’s first look at the introduction notebook. It briefly mentions that the gold standard for establishing causality between a treatment and an outcome is a randomized controlled trial. But this is often not a preferred option because of cost and complexity, and there can also be ethical concerns depending on the context. Therefore, we instead infer the causal effect from an observational dataset.

Snapshot of the dataset

This is the structure of the dataset we collected over a year, during which there was no policy regarding which customer received which incentives. In this use case, “Revenue” is the outcome on which we want to estimate the effect of our treatments. The treatments we are going to assess are “Tech Support”, “Discount” and “New Engagement Strategy”. These are all binary fields indicating whether the customer received the treatment or not. Each treatment has an internal cost associated with it; for example, “Tech Support” costs $100 per licensed PC. The rest of the features capture characteristics of the customers. For example, “PC Count” is the number of PCs used by the customer and “Size” is the customer’s total revenue. See the notebook for more details.

An important thing to note is that this dataset has been simulated, so we actually do have the ground truth and we know the true causal relationships between the variables as well as the true size of the effect of each treatment and each confounder.

Let’s go through the four steps.

Causal Discovery

The first step is to model the problem. This step focuses on discovering the network of influences that exists between the features. We will then enrich the resulting graph with our domain knowledge.

Snapshot of causal-learn

Here, we use a package called causal-learn and the PC algorithm implemented in it to do automatic causal discovery, which gives us a skeleton of the network. I won’t go into the details of the PC algorithm, but it belongs to a group of methods called constraint-based causal discovery. Even this automated step already gives us some interesting insights.
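A hedged sketch of what running PC with causal-learn might look like, assuming df is the pandas DataFrame shown in the dataset snapshot above (the significance level and the independence test are illustrative choices):

```python
import numpy as np
from causallearn.search.ConstraintBased.PC import pc

data = df.to_numpy().astype(float)                 # causal-learn expects a numeric array
cg = pc(data, alpha=0.05, indep_test="fisherz")    # constraint-based discovery

# Visualize the recovered (partially oriented) graph with the original column names
cg.draw_pydot_graph(labels=list(df.columns))
```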

Snapshot of the first causal graph generated by causal-learn

“Discount” seems to have a direct impact on “Revenue”. “Tech Support” seems to have a direct impact on “Revenue” as well, but also a mediated effect through “New Product Adoption”. “New Engagement Strategy” doesn’t seem to influence “Revenue”, but both “Revenue” and “New Engagement Strategy” seem to impact “Planning Summit”. This is called a collider pattern, which can lead to a spurious correlation between “New Engagement Strategy” and “Revenue” if we condition on the collider variable. This is something we need to be aware of when we estimate the effect of the treatment; in practice, we simply make sure to exclude this variable from the analysis to prevent an unwanted bias from sneaking in. We still have lots of edges and directions missing in the graph, so we will add them manually based on what we know and what we assume.
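As a quick aside, here is a tiny, hedged illustration of collider bias with made-up variables (not our actual data): two independent causes become spuriously correlated once we condition on their common effect.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000
engagement = rng.normal(size=n)                     # stand-in for "New Engagement Strategy"
revenue = rng.normal(size=n)                        # independent of engagement by construction
summit = engagement + revenue + rng.normal(size=n)  # collider: caused by both

print(np.corrcoef(engagement, revenue)[0, 1])       # ~0: unconditionally independent

mask = summit > 1.0                                 # condition on the collider
print(np.corrcoef(engagement[mask], revenue[mask])[0, 1])   # clearly negative: spurious
```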

Snapshot of the final causal graph

This is our final causal graph. We will then store this graph as an artifact in MLflow, so that we can later download it and use it in the next step. This completes the first step.
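As a side note, here is a hedged sketch of how the graph could be persisted with MLflow (the DOT string below is a truncated, illustrative stand-in for the full graph in the figure):

```python
import mlflow

causal_graph_dot = """
digraph {
  "Tech Support" -> "New Product Adoption";
  "New Product Adoption" -> "Revenue";
  "Tech Support" -> "Revenue";
  "Discount" -> "Revenue";
}
"""

with mlflow.start_run() as run:
    mlflow.log_text(causal_graph_dot, artifact_file="causal_graph.dot")
    print(f"Graph logged under run {run.info.run_id}")

# Later, e.g. in the identification notebook, the artifact can be downloaded with:
# mlflow.artifacts.download_artifacts(run_id=run.info.run_id, artifact_path="causal_graph.dot")
```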

Identification / Estimation

The second step is to identify the best method to isolate the effect of each treatment, using the graph defined in the previous step. The method determines which features need to be controlled for when estimating the effect.

Snapshot of DoWhy

After we load the graph from MLflow, we instantiate DoWhy’s CausalModel class by passing in the dataset, the graph, the treatment name and the outcome name. We then call the identify_effect method.
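A hedged sketch of this step, assuming df is the dataset from the snapshot above and graph_path points to the DOT artifact downloaded from MLflow:

```python
from dowhy import CausalModel

with open(graph_path) as f:
    causal_graph = f.read()

model = CausalModel(
    data=df,
    graph=causal_graph,
    treatment="Tech Support",
    outcome="Revenue",
)

identified_estimand = model.identify_effect(proceed_when_unidentifiable=True)
print(identified_estimand)   # shows the chosen estimand and the variables to adjust for
```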

Result of identify_effect method of DoWhy

The resulting object tells us which method we should use and which variables we should control for. In this case, DoWhy suggests using the backdoor criterion and controlling for these confounders. Under the hood, DoWhy performs do-calculus on the causal graph, an axiomatic framework that lets us express an interventional quantity as a conditional probability over observed variables. It’s OK not to fully understand this part to work with causal inference; that’s why there are amazing tools like DoWhy, so please don’t shy away!
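For the curious, the backdoor adjustment behind such an estimand has a standard textbook form (shown here for reference, not as output specific to this dataset): P(Y | do(T=t)) = Σ_x P(Y | T=t, X=x) P(X=x), where X denotes the set of confounders being controlled for.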

Snapshot of estimate_effect method of DoWhy using EconML Double ML

After we identify the method, we go ahead and estimate the effect. This is the third step. For this, we use a statistical technique called Double Machine Learning, which is implemented in the EconML package. In a nutshell, it is a two-stage technique that lets you factor out the association between the confounders and the treatment, and between the confounders and the outcome. It’s a clever way of getting an unbiased estimator. The effect modifiers are the features that give us the heterogeneous treatment effect: once our estimator is trained, we can ask it to estimate the effect of a treatment given the values of the effect modifiers. If you dig a little into the equations, it’s fascinating to see how the effect modifiers are incorporated as simple interaction terms with the treatment variable in a linear regression model.
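Continuing from the CausalModel above, a hedged sketch of what calling an EconML Double ML estimator through DoWhy can look like (the nuisance models and parameters below are illustrative choices, not the notebook’s exact configuration):

```python
from sklearn.ensemble import GradientBoostingClassifier, GradientBoostingRegressor

# Effect modifiers, if any, are declared when constructing the CausalModel
# (effect_modifiers=[...]) so that the estimator can learn heterogeneous effects.
estimate = model.estimate_effect(
    identified_estimand,
    method_name="backdoor.econml.dml.LinearDML",
    method_params={
        "init_params": {
            "model_y": GradientBoostingRegressor(),   # predicts the outcome from confounders
            "model_t": GradientBoostingClassifier(),  # predicts the treatment from confounders
            "discrete_treatment": True,
        },
        "fit_params": {},
    },
)
print(estimate.value)   # estimated average effect of the treatment on "Revenue"
```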

You can also use other techniques here for estimating the effect of a treatment. Some important and popular ones include inverse propensity weighting, synthetic control methods and instrumental variable analysis. These are all implemented in EconML and readily available in DoWhy. The best technique depends on your causal structure and your dataset. We can do the same for the other two treatments and here DoWhy actually finds that the effect of “New Engagement Strategy” on the outcome is zero.

Personalized Promotional Offer Recommender

We now have two trained estimators: one for the treatment “Discount” and the other for the treatment “Tech Support”. We drop “New Engagement Strategy” from our analysis because it has no effect on the outcome. With the two treatments in hand, we can craft four different strategies for promotional offers: give nothing, give “Discount”, give “Tech Support”, or give both “Discount” and “Tech Support”. This notebook loads the two estimators from MLflow and creates a composite model. The composite model estimates the effect of all four strategies given the effect modifiers of a given customer, subtracts the cost associated with each strategy, and returns the strategy that maximizes the profit for that customer.
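A hedged sketch of that policy logic, assuming each loaded estimator exposes an effect(X)-style method that returns per-customer uplift (as EconML estimators do); the cost figures and the additivity of the two treatments are illustrative simplifications:

```python
import numpy as np
import pandas as pd

COSTS = {"none": 0.0, "discount": 2_000.0, "tech_support": 100.0}  # placeholder values

def recommend(customers: pd.DataFrame, discount_est, tech_support_est) -> pd.Series:
    """customers holds the effect modifiers; returns the best strategy per customer."""
    uplift_discount = np.ravel(discount_est.effect(customers))      # incremental revenue
    uplift_tech = np.ravel(tech_support_est.effect(customers))
    profit = pd.DataFrame({
        "none": np.zeros(len(customers)),
        "discount": uplift_discount - COSTS["discount"],
        "tech_support": uplift_tech - COSTS["tech_support"],
        "both": uplift_discount + uplift_tech - COSTS["discount"] - COSTS["tech_support"],
    }, index=customers.index)
    return profit.idxmax(axis=1)    # most profitable strategy for each customer
```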

Snapshot of scatter plot

This scatter plot is the result of this exercise. Each customer is represented as a dot (only a subset of customers is shown), color coded by the recommended strategy. You can see how the recommended strategy depends on the effect modifiers.

Snapshot of the average mean marginal profit

If you compare the mean marginal profit per customer between the no policy scenario and the recommended policy scenario, the difference is rather large.

Refutation

Perhaps the most important step in the analysis is to check whether we actually have robust estimators. For this, we run multiple tests. The basic idea of these tests is to gradually inject noise into, or otherwise distort, the dataset and see at which point our estimates are no longer valid. Many tests are implemented in the notebook, but I’m going to focus on only two.

Snapshot of the add_unobserved_common_cause refuter and its result

The first test adds an unobserved confounder into the graph. It tests how sensitive our estimator is to unobserved confounding. This test doesn’t give you a pass-or-fail answer, but it gives you a sense of how robust your assumptions are to even a small confounder that might be missing from your data. The heat map gives you a qualitative understanding of whether the estimator can be trusted. The tool also shows how the estimated effect varies as the strength of the injected hidden confounder is varied. Here, we can see that, at least within this range of confounding, we still have a net positive effect of our treatment on the outcome.
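A hedged sketch of how this refuter might be invoked in DoWhy (the effect-type and strength parameters below are illustrative):

```python
import numpy as np

refute_unobserved = model.refute_estimate(
    identified_estimand,
    estimate,
    method_name="add_unobserved_common_cause",
    confounders_effect_on_treatment="binary_flip",   # how the hidden confounder acts on the treatment
    confounders_effect_on_outcome="linear",          # ... and on the outcome
    effect_strength_on_treatment=np.arange(0.0, 0.05, 0.01),
    effect_strength_on_outcome=np.arange(0.0, 0.05, 0.01),
)
print(refute_unobserved)
```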

Snapshot of the placebo treatment refuter and its result

The second test is called the placebo treatment refuter. In this test, we permute the treatment column of the dataset, essentially breaking the causal relationship between the treatment and the outcome. With this new dataset, we re-estimate the effect of the treatment on the outcome, which should now be zero. We then calculate a p-value using a two-sample t-test. If the placebo effect is statistically indistinguishable from zero, the test fails to refute our original estimate, which supports the hypothesis that the treatment has a real effect on the outcome.
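A hedged sketch of this refuter in DoWhy (the number of simulations is an illustrative choice):

```python
refute_placebo = model.refute_estimate(
    identified_estimand,
    estimate,
    method_name="placebo_treatment_refuter",
    placebo_type="permute",     # shuffle the treatment column to break the causal link
    num_simulations=100,
)
print(refute_placebo)   # the placebo effect should be close to zero
```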

That was it for the second part!

Summary

Let me quickly wrap up this article.

In this blog post, we started off by looking at the problem of combining machine learning with decision making. We then discussed how causal inference can help us with this and also help us estimate the causal effects of treatments on an outcome. We learned that there are two fundamental challenges with causal inference. We then went through the four key steps of causal inference: modeling, identification, estimation and refutation. We saw how these steps can be implemented with DoWhy and applied to an actual use case.

Acknowledgement

I thank the Microsoft Research team for making tremendous contributions to this field. For this blog post, I borrowed a lot of content from this amazing webinar. I thank Luis Moros and Corey Abshire from Databricks and PyWhy Steering Committee for the collaboration on this work. And I thank all the readers for staying with me on this long post!
