Beginner’s guide to causal discovery: The what, the why, and the how

A simple explanation with business applications and an example Python demo

Ganga Meghanath
Data Science at Microsoft
8 min read · Jun 4, 2024


Let’s embark on a detective’s journey to uncover causal graphs from observational data and decipher them with domain expertise. (Image generated by Image Creator from Microsoft Designer on Bing.com).

Welcome to a beginner’s guide to causal discovery! If you’ve been hearing about causality and are wondering, “What’s this all about? How can uncovering causal relations from data help me?” then you’re in the right place!

Let’s outline what we cover in this article:

  1. What is causal discovery? We dive into what is meant by a causal graph and how it can provide valuable insights through a real-world scenario.
  2. How can businesses benefit from causal discovery? Building on what we cover with the basics, we discover why causal discovery is worth your time and how it can be a powerful tool for product and business owners.
  3. Hands on with causal discovery: A Python tutorial. Here we do some coding to create causal graphs and try to make sense of them!

Now let’s dive right in!

What is causal discovery?

Think of causal discovery as playing the role of a detective. It involves examining data and deducing how variables are causally linked to each other. In other words, it’s about identifying the underlying causal model that generates the data. This causal model is often captured in the form of a causal graph!

To better understand this concept, let’s consider a real-world scenario and explore what a causal graph looks like and the valuable insights it can provide.

Causal graph explanation adapted by the author from LUCAS (ethz.ch) with icons from (icons8.com).

In a causal graph, the nodes represent the variables (columns) in our data, and the edges are arrows that show how these variables affect each other. For example, if coughing can cause fatigue, we’d draw an arrow from “Coughing” to “Fatigue.” This is how a causal graph captures the cause-and-effect relationships in our data. Let’s delve into the kind of insights we can get from a causal graph!

1. Identifying causal influences: A causal graph can help pinpoint which variables have a causal impact on a particular outcome. For instance, in the example below, factors like “Anxiety” or “Peer Pressure” could lead to “Smoking,” and both “Smoking” and “Genetics” (a genetic predisposition) can lead to “Lung Cancer.” However, being “born on a specific day” has no causal influence on smoking or any of the other nodes. This is the kind of cause-effect relationship that a causal graph encapsulates.

Identifying causal influences

2. Tracing causal paths: A causal graph can also help trace the path of causality. For example, “Anxiety” could lead to “Smoking,” which in turn could lead to “Lung Cancer,” causing symptoms like “Coughing” and “Fatigue.”

Tracing causal paths
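To make the first two ideas concrete, here is a minimal Python sketch that stores the edges named above as a plain dictionary and queries it. The dictionary layout and the helper names `direct_causes` and `causal_paths` are illustrative choices for this article, not part of any causal discovery library.

```python
# Edges named in the text, stored as cause -> list of direct effects.
EDGES = {
    "Anxiety": ["Smoking"],
    "Peer Pressure": ["Smoking"],
    "Smoking": ["Lung Cancer"],
    "Genetics": ["Lung Cancer"],
    "Lung Cancer": ["Coughing", "Fatigue"],
    "Coughing": ["Fatigue"],
    "Born an Even Day": [],  # isolated node: no arrows in or out
}


def direct_causes(graph, node):
    """Return the variables that have an arrow pointing into `node`."""
    return sorted(cause for cause, effects in graph.items() if node in effects)


def causal_paths(graph, source, target, path=None):
    """Return every directed path from `source` to `target` (graph must be acyclic)."""
    path = (path or []) + [source]
    if source == target:
        return [path]
    paths = []
    for child in graph.get(source, []):
        paths.extend(causal_paths(graph, child, target, path))
    return paths


print(direct_causes(EDGES, "Lung Cancer"))  # ['Genetics', 'Smoking']
for p in causal_paths(EDGES, "Anxiety", "Fatigue"):
    print(" -> ".join(p))
```

Asking for the direct causes of “Born an Even Day,” or for paths from it to any other node, returns an empty list, which matches the observation that it has no causal influence on the rest of the graph.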

3. Uncovering common causes: Causal graphs can reveal confounders, also known as common causes. In our example, “Lung Cancer” is a confounder because it is a common cause of both “Coughing” and “Fatigue,” even though “Coughing” can also cause “Fatigue” directly. So, if someone is experiencing both fatigue and coughing, a third variable might be causing both symptoms. This is why correlations between variables must NOT be mistaken for causation. In our example, if we observe that patients with “Lung Cancer” also tend to have “Attention Disorder,” the association is likely due to a common cause such as “Genetics.” It would therefore be incorrect to conclude that lung cancer causes attention disorder!

Uncovering common causes
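A small simulation can show how a common cause produces correlation without causation. The sketch below generates synthetic data in which “Genetics” influences both “Lung Cancer” and “Attention Disorder,” while those two never influence each other. All probabilities are invented for illustration and are not taken from the LUCAS dataset; the function names are hypothetical.

```python
import random


def simulate(n=20000, seed=0):
    """Draw (genetics, lung_cancer, attention_disorder) triples.

    Genetics is the only cause of the other two variables. The
    probabilities below are made up purely for illustration.
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        genetics = rng.random() < 0.3
        lung_cancer = rng.random() < (0.6 if genetics else 0.1)
        attention = rng.random() < (0.7 if genetics else 0.1)
        rows.append((int(genetics), int(lung_cancer), int(attention)))
    return rows


def pearson(xs, ys):
    """Plain Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


rows = simulate()

# Correlation between lung cancer and attention disorder across everyone.
overall = pearson([r[1] for r in rows], [r[2] for r in rows])
print(f"overall correlation: {overall:.2f}")

# Same correlation, but computed separately for each value of Genetics.
within = {}
for g in (0, 1):
    group = [r for r in rows if r[0] == g]
    within[g] = pearson([r[1] for r in group], [r[2] for r in group])
    print(f"correlation when Genetics={g}: {within[g]:.2f}")
```

The overall correlation should come out clearly positive, while the correlations within each “Genetics” group should hover near zero. Once the common cause is held fixed, the apparent link between the two conditions disappears.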

Now that we’ve explored what a causal graph is, what it looks like, and the kind of information it can provide, let’s delve into how we can leverage causal discovery for our benefit in real-world business scenarios.

How can businesses benefit from causal discovery?

Causal discovery can be a powerful tool for businesses by providing insights that go beyond mere correlations. Causal graphs can help provide focus on the signals that matter!

By discerning the causal factors that influence our outcomes, causal graphs can empower decision-makers to identify key drivers of success and areas of improvement.

Let’s explore three main business aspects that can really gain from the power of causal discovery. We use a fictional candy-making company called “CandyCrushers” to better understand each aspect.

1. Strategic planning

Think of causal graphs as our business compass! They can help us identify key drivers of business outcomes. With this knowledge, we can plan our strategy more effectively, focusing on the key drivers.

CandyCrushers causal question: What factors can increase our candy sales?

Factors considered: Sales data, popularity of different candy types, effectiveness of advertising campaigns, pricing strategy, customer demographics, seasonal trends, and product placement in stores.

Fictional results: The analysis reveals that effective advertising, competitive pricing, and offering popular candy types are strong causal drivers of sales. This means CandyCrushers could focus on enhancing their advertising strategies, maintaining competitive prices, and producing more of the popular candies to boost sales.

2. Risk management

Causal graphs can also be our radar! They can help us identify the root causes of problems in our business operations (e.g., system crashes or performance bottlenecks). If a certain action or condition is causing issues, we can focus our efforts on fixing it.

CandyCrushers causal scenario: What factors can lead to a decrease in our candy sales and how can we mitigate them?

Factors considered: Frequency of candy machine breakdowns, ingredient availability, customer reviews, market trends, changes in consumer preferences, and economic conditions.

Fictional results: The analysis identifies frequent candy machine breakdowns, ingredient shortages, and shifts in consumer preferences as the main causes of decreased sales. This suggests that CandyCrushers should maintain their candy machines regularly, manage inventory efficiently to avoid ingredient shortages, and stay up to date with changing consumer preferences to mitigate these risks.

3. Product development

Causal discovery can help us identify our secret sauce! It can reveal the features or changes that make the biggest difference to user satisfaction or product performance, so we know where to focus our development efforts for the biggest impact.

CandyCrushers causal scenario: What new candies should we develop to increase customer satisfaction and boost sales?

Factors considered: Customer feedback, requests for new candies, ingredients, similarity to popular candies, sales data, market trends, and competitor products.

Fictional results: The analysis suggests that developing new candies similar to the already popular ones or those lacking in competitors’ product lines can increase customer satisfaction and boost sales. This insight can guide CandyCrushers’ product development efforts, helping them focus on creating candies that their customers will love.

Business applications of causal discovery

In short, by revealing the cause-and-effect relationships in our data, causal graphs can help us make smart, informed decisions about a product or business.

Hands on with causal discovery: A Python tutorial

Now let’s code and build a causal graph!

In this section we explore how to construct causal graphs from data. Remember, there’s a plethora of causal discovery methods available, each with its own assumptions. Choosing the right one is important, as it should align with the nature of the data and the specific problem being tackled. We won’t get too technical here, but just know that the selection process is significant.

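For our hands-on example, we revisit the LUCAS (ethz.ch) dataset mentioned at the beginning of the article, where we already know the ground truth causal graph. We employ the DirectLiNGAM method to learn the causal graph. DirectLiNGAM assumes linear relationships, non-Gaussian noise, and no hidden common causes, so treat its output here as an illustration rather than a definitive result.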

First, let’s set up our environment. If you’re following along on your computer, open Terminal or the Command Prompt (for Windows users) or a Jupyter Notebook. We start by installing the lingam package. Here’s the command to use:

$ pip install lingam

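To visualize the causal graph, you also need to download and install Graphviz on your machine. The Python graphviz package only wraps the Graphviz executables, so they must be available on your PATH.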
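If you are unsure whether Graphviz is set up correctly, a quick check like the one below can help. It is a minimal sketch that only looks for Graphviz’s `dot` executable on your PATH; the helper name `graphviz_available` is my own and not part of any package.

```python
import shutil


def graphviz_available():
    """Return True if the Graphviz `dot` executable can be found on the PATH."""
    return shutil.which("dot") is not None


if graphviz_available():
    print("Graphviz found:", shutil.which("dot"))
else:
    print("Graphviz not found; install it and make sure `dot` is on your PATH.")
```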

With our package installed, we’re ready to start coding. We start by importing the required libraries and loading our dataset. Then, we apply the DirectLiNGAM method to learn the causal graph from our data. Here’s the step-by-step process:

  1. Download the data from LUCAS (ethz.ch). Please note that I’ve omitted a variable from the original LUCAS dataset to stay consistent with the initial graph example.
  2. Open a Jupyter notebook and read the downloaded data into your script. Here’s how you can do it:
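# Core libraries for data handling and causal discovery
import numpy as np
import pandas as pd
import graphviz
import lingam
from lingam.utils import make_dot

# Print floats compactly and fix the random seed for reproducibility
np.set_printoptions(precision=3, suppress=True)
np.random.seed(100)

# Load the LUCAS training data (adjust the path to where you saved the file)
X = pd.read_csv('./lucas0_train.csv')
X.head()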

Now let’s proceed to learn the adjacency matrix that represents the causal relationships in the data.

# Fit DirectLiNGAM; the learned graph is stored in model.adjacency_matrix_
model = lingam.DirectLiNGAM()
model.fit(X)

Now that our model is fitted, we use make_dot, which relies on Graphviz, to visualize the output causal graph.

make_dot(model.adjacency_matrix_, labels=np.array(X.columns))

And we get:

Output causal graph from DirectLiNGAM causal discovery method

It’s not 100 percent correct, but it’s on the right track, capturing most connections accurately. I’d say it’s a pretty decent output causal graph.
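To go beyond eyeballing the picture, we can turn an adjacency matrix into a list of edges and compare it against a known graph. The sketch below assumes the LiNGAM convention that entry `[i][j]` holds the effect of variable `j` on variable `i`. The helper names and the toy 3-variable matrix are hypothetical and are not output from the LUCAS run above.

```python
def edges_from_adjacency(matrix, labels, threshold=0.0):
    """List (cause, effect) pairs from a LiNGAM-style adjacency matrix.

    Entry matrix[i][j] is read as the effect of variable j on variable i,
    so a nonzero entry becomes the edge labels[j] -> labels[i].
    """
    edges = set()
    for i, row in enumerate(matrix):
        for j, weight in enumerate(row):
            if abs(weight) > threshold:
                edges.add((labels[j], labels[i]))
    return edges


def compare_edges(learned, truth):
    """Split learned edges into correct and extra ones, and list the missing ones."""
    return {
        "correct": sorted(learned & truth),
        "extra": sorted(learned - truth),
        "missing": sorted(truth - learned),
    }


# Toy example with made-up weights (not results from the LUCAS data).
labels = ["Smoking", "Lung Cancer", "Coughing"]
toy_matrix = [
    [0.0, 0.0, 0.0],  # nothing points into Smoking
    [0.8, 0.0, 0.0],  # Smoking -> Lung Cancer
    [0.3, 0.5, 0.0],  # Smoking -> Coughing (not in truth), Lung Cancer -> Coughing
]
truth = {("Smoking", "Lung Cancer"), ("Lung Cancer", "Coughing")}

report = compare_edges(edges_from_adjacency(toy_matrix, labels), truth)
print(report)
```

With the fitted model from this tutorial, the same helper could be called as `edges_from_adjacency(model.adjacency_matrix_, list(X.columns))` and compared against the edges of the ground truth LUCAS graph.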

Now how do we make it better and more accurate? This is where we as humans come in with our domain knowledge! But, of course, if the domain experts knew what the true causal graph was, then we wouldn’t need algorithms to “discover” it. Conversely, without any domain knowledge, we’d have to rely solely on what the algorithms deduce from the data.

Typically, we find ourselves in a sweet spot: armed with partial domain knowledge, we lean on causal discovery algorithms to sketch out the causal graph from data. For instance, we’re certain that yellow fingers don’t lead to smoking. By integrating such insights as prior knowledge, we can fine-tune the graph to better reflect reality.
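Here is one way this could look in code. The sketch relies on the lingam package’s `make_prior_knowledge` utility and the `prior_knowledge` argument of `DirectLiNGAM`; as I understand the documentation, a `no_paths` pair `(i, j)` states that there is no directed path from variable `i` to variable `j`. The helper names and the column names `Yellow_Fingers` and `Smoking` are assumptions, so check them against your own DataFrame.

```python
def forbidden_pairs(columns, forbidden):
    """Translate (cause_name, effect_name) pairs into column-index pairs."""
    index = {name: i for i, name in enumerate(columns)}
    return [(index[cause], index[effect]) for cause, effect in forbidden]


def fit_with_prior_knowledge(X, forbidden):
    """Fit DirectLiNGAM on DataFrame X while ruling out the given causal paths."""
    # Imported here so the helper above can be used without lingam installed.
    import lingam
    from lingam.utils import make_prior_knowledge

    prior = make_prior_knowledge(
        n_variables=X.shape[1],
        no_paths=forbidden_pairs(list(X.columns), forbidden),
    )
    model = lingam.DirectLiNGAM(prior_knowledge=prior)
    model.fit(X)
    return model


# Hypothetical column names; yellow fingers must not cause smoking.
example_columns = ["Smoking", "Yellow_Fingers", "Lung_cancer"]
print(forbidden_pairs(example_columns, [("Yellow_Fingers", "Smoking")]))  # [(1, 0)]

# With the DataFrame X loaded earlier, usage could look like this:
# model = fit_with_prior_knowledge(X, [("Yellow_Fingers", "Smoking")])
# make_dot(model.adjacency_matrix_, labels=list(X.columns))
```

Translating names to indices in a separate step keeps the domain knowledge readable, since experts think in variable names rather than column positions.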

Conclusion

As I wrap up this article, I suggest taking a moment to appreciate the power and potential of causal discovery. It’s not just about algorithms and data; it’s about the insights we gain and the decisions we make based on those insights. And while the algorithms are impressive, they’re not perfect. That’s where we as human beings come in, with our expertise and domain knowledge to refine the output causal graph and align it more closely with reality.

I hope you found this article helpful; it’s my first. Feel free to leave reactions in the Comments section.

Ganga Meghanath is on LinkedIn.
