Hands-on Causal Discovery with Python

Jakob Runge
Causality in Data Science

--

A Gentle Guide to Causal Inference with Machine Learning Pt. 9

You have probably read all kinds of articles explaining the fundamentals of causal inference and its connection to machine learning by now. “This might sound nice”, you might think, “but how can I actually use it?!”. This is a valid and important question, and one that this blog post will answer.

More precisely, you will learn how to use the PCMCI algorithm to identify causal graphs from high-dimensional time-series data. This is indeed a difficult task. Difficult, but not impossible, especially with the right packages. So let’s do it!

Before you keep on reading, please make sure you are familiar with the following concepts:

  1. Causal Markov, Faithfulness and Sufficiency Assumption
  2. Fundamental idea behind Causal Inference
  3. The reason why Machine Learning needs Causality
  4. Causal Graphs & D-separation
  5. The PCMCI algorithm

Getting started:

Alright, we will get our hands dirty now. Finally, you will learn how to do algorithmic causal inference.

To do so, we are going to use the Tigramite package. As the go-to library for full-stack causal inference on time-series data, Tigramite provides continuously maintained, state-of-the-art methods with few package dependencies (GitHub repository).

You can install it via pip. Make sure the Python version in your environment matches the requirements, which you will find here (updated regularly).

pip install tigramite
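If you want to double-check that the installation worked, an optional sanity check is to print the installed version using only the standard library:

# Optional sanity check: print the installed Tigramite version
from importlib.metadata import version
print(version("tigramite"))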

This being done, you can now proceed to import all the packages you need for your analysis (the list below may be more than you need):

# Imports
import numpy as np
import matplotlib
from matplotlib import pyplot as plt
%matplotlib inline
## use `%matplotlib notebook` for interactive figures
# plt.style.use('ggplot')
import sklearn

import tigramite
from tigramite import data_processing as pp
from tigramite.toymodels import structural_causal_processes as toys

from tigramite import plotting as tp
from tigramite.pcmci import PCMCI
from tigramite.lpcmci import LPCMCI

from tigramite.independence_tests.parcorr import ParCorr
from tigramite.independence_tests.robust_parcorr import RobustParCorr
from tigramite.independence_tests.parcorr_wls import ParCorrWLS
from tigramite.independence_tests.gpdc import GPDC
from tigramite.independence_tests.cmiknn import CMIknn
from tigramite.independence_tests.cmisymb import CMIsymb
from tigramite.independence_tests.gsquared import Gsquared
from tigramite.independence_tests.regressionCI import RegressionCI

One last remark before we start. We will have to work with a couple of indices here, which can get messy, especially on Medium. If we write X(j)_(t-tau), we mean the value of the j-th variable in our system, tau timesteps in the past relative to the present time t. Or simply: $X^j_{t-\tau}$.

Alright, now we can start and do some fun causal discovery!

What we want to model:

Causal systems can be represented by SCMs (see here). Of course, in reality, you would not have perfect information about the SCM that underlies your data. Instead, you only have a bunch of observations of (if you assume causal sufficiency) all relevant variables of the underlying system. That is, if your system can be sufficiently described by 4 variables, you have data on each one of them. Causal discovery should now help you shed some light on this black box and identify the causal relationships.

We will work through an example now. In this example, contrary to reality, we already know which SCM generates the data, which makes it possible to check whether the results we obtain from PCMCI are correct. With η(j)_t denoting independent zero-mean, unit-variance random variables, our SCM looks like this:

X(0)_t = 0.7 X(0)_(t-1) - 0.8 X(1)_(t-1) + η(0)_t
X(1)_t = 0.8 X(1)_(t-1) + 0.8 X(3)_(t-1) + η(1)_t
X(2)_t = 0.5 X(2)_(t-1) + 0.5 X(1)_(t-2) + 0.6 X(3)_(t-3) + η(2)_t
X(3)_t = 0.4 X(3)_(t-1) + η(3)_t

Hopefully, your data is a “random” sample of data points from the distribution implied by the SCM. We will now generate such data from our SCM: we randomly draw a value of η(j)_t for each of the j variables and each time step t and, as a result, obtain values for each variable. As we observe a time-dependent system, each of those samples stands for one point in time, with the next point always “building upon” the previous one. In the code below, we do this a thousand times to generate a time series of 4 variables with length 1000, using Tigramite’s toys.structural_causal_process function.

np.random.seed(42)     # Fix random seed to make results reproducible
def lin_f(x): return x # A linear dependency function
# Store the functional dependencies of the SCM
links_coeffs = {0: [((0, -1), 0.7, lin_f), ((1, -1), -0.8, lin_f)],
                1: [((1, -1), 0.8, lin_f), ((3, -1), 0.8, lin_f)],
                2: [((2, -1), 0.5, lin_f), ((1, -2), 0.5, lin_f), ((3, -3), 0.6, lin_f)],
                3: [((3, -1), 0.4, lin_f)],
                }
T = 1000 # time series length
# Generate the time series
data, true_parents_neighbors = toys.structural_causal_process(links_coeffs, T=T)
T, N = data.shape

# Initialize dataframe object, specify time axis and variable names
var_names = [r'$X^0$', r'$X^1$', r'$X^2$', r'$X^3$']
dataframe = pp.DataFrame(data,
                         datatime={0: np.arange(len(data))},
                         var_names=var_names)

Alright, now that we have generated our data, let’s take a look at it with the tp.plot_timeseries function.

tp.plot_timeseries(dataframe); plt.show()

Yes, of course, the data is quite perfect. In real time-series data, you will usually face problems such as non-stationarity, missing data, or different time granularities. We won’t deal with those problems here and will just focus on the basics.

Alright, so now we have our data. The challenge at hand is to use PCMCI to detect the correct causal links, which requires that its underlying assumptions are fulfilled (see Causal Markov, Faithfulness and Sufficiency Assumption). PCMCI also assumes that there are no instantaneous (contemporaneous) causal links. Before getting to that, let’s quickly understand on a high level what PCMCI does.

A quick recap about PCMCI:

Causal discovery with PCMCI can be split into two phases:

  1. Identifying a superset of the parents of every variable X(j) in the system using a variant of the PC algorithm. As outlined in another blog post, PC is a flexible discovery algorithm that gives us an estimated set of parents per node and can account for nonlinear functional dependencies as well as continuous and discrete variables, depending on the chosen conditional independence test. Phase 1 is a kind of pre-selection step that helps make the second step more efficient.

  2. Using MCI (Momentary Conditional Independence) to iteratively test conditional independencies. Given our background knowledge about the Causal Markov condition, we know that the causal parents of a variable form a sufficient conditioning set when testing causal relationships as conditional dependencies, because no instantaneous causal links are assumed here. Consequently, we can take the parent (super)sets identified in phase one and use MCI to test whether X(i)_(t-tau) -> X(j)_t by testing whether X(i)_(t-tau) is conditionally independent of X(j)_t given the estimated parents of X(j)_t (excluding X(i)_(t-tau)) and the time-shifted estimated parents of X(i)_(t-tau).

Initializing & Running PCMCI

When using PCMCI we have to make a couple of decisions that you probably spotted in the quick recap above or the blog post on PCMCI.

First, we need to choose a conditional independence test for phase 1 and phase 2. Assuming that causes and effects are linearly related, we will use the partial correlation test, specified with significance='analytic', which assumes the null distribution to be Student’s t. We then initialize PCMCI with this independence test.

parcorr = ParCorr(significance='analytic')
pcmci = PCMCI(
    dataframe=dataframe,
    cond_ind_test=parcorr,
    verbosity=1)
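As a side note: if you suspected nonlinear dependencies instead, you could swap ParCorr for one of the nonparametric tests imported above, e.g., GPDC (Gaussian process regression combined with a distance correlation test on the residuals). A minimal sketch, not needed for our linear toy example and noticeably slower:

# Sketch: nonparametric alternative for nonlinear dependencies
gpdc = GPDC(significance='analytic', gp_params=None)
pcmci_gpdc = PCMCI(
    dataframe=dataframe,
    cond_ind_test=gpdc,
    verbosity=1)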

The next decision we need to make is to pick the maximum time lag tau_max, i.e., the largest lag at which we still expect a causal link.

Without theoretical a priori knowledge, this decision can be quite hard. What helps here is to plot the lagged correlations of the time series to see beyond which lag no correlations remain, making those lags irrelevant. This can be done with the following command (where tau_max is chosen very large to cover a wide range of lags):

correlations = pcmci.get_lagged_dependencies(tau_max=20, val_only=True)['val_matrix']
lag_func_matrix = tp.plot_lagfuncs(val_matrix=correlations,
                                   setup_args={'var_names': var_names,
                                               'x_base': 5, 'y_base': .5}); plt.show()

What we can see from the plots is that the correlations decay at a lag of around 8. Consequently, we pick tau_max=8. If you assume nonlinearities, you would have to use mutual information or a similar measure for this step.
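A rough sketch of such a nonlinear variant: get_lagged_dependencies uses whatever conditional independence test the PCMCI object was initialized with, so one option would be CMIknn (note that it is much slower than ParCorr, and the parameter values below are just examples):

# Sketch: lag selection with a mutual-information-based measure instead of correlation
cmi_knn = CMIknn(significance='shuffle_test', knn=0.1)
pcmci_cmi = PCMCI(dataframe=dataframe, cond_ind_test=cmi_knn)
cmi_vals = pcmci_cmi.get_lagged_dependencies(tau_max=20, val_only=True)['val_matrix']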

That’s almost it. We have now decided which independence test to use and over which time range. The last thing needed to determine significant links is to define when a dependence is strong enough for the null hypothesis to be rejected in the Phase 1 selection step. This significance level is set with the parameter pc_alpha (default 0.05), which controls the condition-selection step. Instead of just picking one value, we can also let PCMCI choose the optimal value by setting pc_alpha=None. PCMCI will then optimize this parameter (in the ParCorr case via the Akaike Information Criterion) among a reasonable default list of values (e.g., pc_alpha = [0.05, 0.1, 0.2, 0.3, 0.4, 0.5]). The last remaining parameter we need to specify is alpha_level=0.01, which means that the resulting p-value matrix of the Phase 2 step is thresholded at this significance level to obtain the graph.

With this, we are done and can run PCMCI to do the causal discovery for us. You can now run the algorithm and watch what happens:

pcmci.verbosity = 1
results = pcmci.run_pcmci(tau_max=8, pc_alpha=None, alpha_level=0.01)
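If you prefer to keep full control over the condition-selection step, you can of course pass a fixed value instead; the call below is just an illustrative variant of the one above. (And if your system might contain contemporaneous causal links, Tigramite’s run_pcmciplus method would be the one to reach for instead of run_pcmci.)

# Illustrative variant: fix pc_alpha instead of letting PCMCI optimize it
results_fixed = pcmci.run_pcmci(tau_max=8, pc_alpha=0.05, alpha_level=0.01)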

Interpreting the Output

The output you get is quite long, so let’s make sure we understand what we see.

First, you see an output related to the PC phase. This includes first the parameters you have chosen:

##
## Step 1: PC1 algorithm with lagged conditions
##
Parameters:
independence test = par_corr
tau_min = 1
tau_max = 8
pc_alpha = [0.05, 0.1, 0.2, 0.3, 0.4, 0.5]
max_conds_dim = None
max_combinations = 1

Next, you see the outcomes of the PC1 algorithm. This includes the discovered parents for each variable, sorted by their p-value.

## Resulting lagged parent (super)sets:
Variable $X^0$ has 7 link(s):
    [pc_alpha = 0.3]
    ($X^0$ -1): max_pval = 0.00000, min_val = 0.796
    ($X^1$ -1): max_pval = 0.00000, min_val = -0.748
    ($X^3$ -2): max_pval = 0.13715, min_val = 0.048
    ($X^3$ -1): max_pval = 0.19326, min_val = -0.042
    ($X^2$ -1): max_pval = 0.20477, min_val = -0.041
    ($X^3$ -7): max_pval = 0.25822, min_val = 0.036
    ($X^1$ -4): max_pval = 0.26548, min_val = 0.036
Variable $X^1$ has 4 link(s):
    [pc_alpha = 0.3]
    ($X^1$ -1): max_pval = 0.00000, min_val = 0.695
    ($X^3$ -1): max_pval = 0.00000, min_val = 0.484
    ($X^3$ -7): max_pval = 0.14660, min_val = 0.046
    ($X^2$ -1): max_pval = 0.22144, min_val = -0.039
Variable $X^2$ has 4 link(s):
    [pc_alpha = 0.2]
    ($X^2$ -1): max_pval = 0.00000, min_val = 0.450
    ($X^1$ -2): max_pval = 0.00000, min_val = 0.447
    ($X^3$ -3): max_pval = 0.00000, min_val = 0.183
    ($X^3$ -1): max_pval = 0.12516, min_val = 0.049
Variable $X^3$ has 1 link(s):
    [pc_alpha = 0.05]
    ($X^3$ -1): max_pval = 0.00000, min_val = 0.372

Of course, the parents above are not the final result of PCMCI; rather, they are the sets we condition on in the second phase. The second big part of the output shows the result of the MCI step. It is structured similarly to the PC-related output: first the parameters, then the discovered links for each variable, sorted by their p-values and shown together with the test statistic value (here, partial correlations).

##
## Step 2: MCI algorithm
##

Parameters:

independence test = par_corr
tau_min = 0
tau_max = 8
max_conds_py = None
max_conds_px = None

## Significant links at alpha = 0.01:

Variable $X^0$ has 3 link(s):
($X^1$ -1): pval = 0.00000 | val = -0.653
($X^0$ -1): pval = 0.00000 | val = 0.566
($X^1$ -5): pval = 0.00496 | val = -0.090

Variable $X^1$ has 2 link(s):
($X^3$ -1): pval = 0.00000 | val = 0.663
($X^1$ -1): pval = 0.00000 | val = 0.622

Variable $X^2$ has 3 link(s):
($X^3$ -3): pval = 0.00000 | val = 0.451
($X^1$ -2): pval = 0.00000 | val = 0.446
($X^2$ -1): pval = 0.00000 | val = 0.425

Variable $X^3$ has 1 link(s):
($X^3$ -1): pval = 0.00000 | val = 0.372

Alright, let's see the kind of links we discovered.

If we compare our learned graph with the true SCM, we discovered only one erroneous link, namely the one from X(1)_(t-5) to X(0)_t.

If we are a little stricter and lower the significance level further, for example to alpha_level=0.001, no false positives are left and we discover the correct graph. Try it yourself as an exercise!
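If you want a head start, a minimal sketch of that exercise could look like this, re-using the stored results (0.001 is just an example threshold):

# Re-threshold the stored p-values at a stricter significance level
pcmci.print_significant_links(p_matrix=results['p_matrix'],
                              val_matrix=results['val_matrix'],
                              alpha_level=0.001)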

Once you have run PCMCI and stored the results, you can always re-access them by printing out the p_matrix and val_matrix, which are of shape (N, N, tau_max+1), with entry (i, j, tau) denoting the test for the link between X(i)_(t-tau) and X(j)_t.

print("p-values")
print (results['p_matrix'].round(3))
print("MCI partial correlations")
print (results['val_matrix'].round(2))
p-values
[[[1. 0. 0.262 0.894 0.886 0.034 0.417 0.255 0.877]
[0.067 0.296 0.095 0.765 0.502 0.242 0.234 0.899 0.951]
[0.832 0.657 0.453 0.011 0.359 0.48 0.901 0.243 0.375]
[0.827 0.656 0.495 0.592 0.643 0.976 0.775 0.931 0.759]]

[[0.067 0. 0.15 0.51 0.041 0.005 0.559 0.533 0.392]
[1. 0. 0.911 0.07 0.576 0.556 0.175 0.732 0.741]
[0.651 0.764 0. 0.231 0.495 0.43 0.423 0.944 0.238]
[0.153 0.171 0.088 0.989 0.965 0.859 0.915 0.43 0.671]]

[[0.832 0.1 0.182 0.443 0.37 0.265 0.998 0.89 0.65 ]
[0.651 0.072 0.594 0.387 0.883 0.765 0.696 0.44 0.69 ]
[1. 0. 0.074 0.942 0.956 0.626 0.818 0.526 0.942]
[0.788 0.575 0.332 0.658 0.603 0.65 0.816 0.055 0.851]]

[[0.827 0.03 0.021 0.597 0.635 0.2 0.392 0.081 0.85 ]
[0.153 0. 0.558 0.534 0.72 0.284 0.798 0.016 0.162]
[0.788 0.114 0.781 0. 0.975 0.098 0.265 0.161 0.642]
[1. 0. 0.937 0.85 0.503 0.657 0.394 0.795 0.567]]]
MCI partial correlations
[[[ 0. 0.57 0.04 0. -0. 0.07 0.03 -0.04 -0. ]
[-0.06 -0.03 0.05 0.01 0.02 -0.04 0.04 0. -0. ]
[ 0.01 -0.01 0.02 0.08 -0.03 -0.02 -0. -0.04 0.03]
[ 0.01 0.01 0.02 0.02 0.01 -0. -0.01 -0. -0.01]]

[[-0.06 -0.65 0.05 0.02 0.07 -0.09 0.02 0.02 0.03]
[ 0. 0.62 0. -0.06 -0.02 -0.02 0.04 -0.01 0.01]
[ 0.01 -0.01 0.45 0.04 0.02 0.03 -0.03 -0. -0.04]
[-0.05 -0.04 -0.05 0. -0. -0.01 -0. -0.03 0.01]]

[[ 0.01 -0.05 -0.04 0.02 0.03 0.04 -0. 0. -0.01]
[ 0.01 0.06 0.02 -0.03 -0. -0.01 -0.01 -0.02 -0.01]
[ 0. 0.43 -0.06 0. -0. 0.02 0.01 0.02 0. ]
[-0.01 -0.02 0.03 -0.01 -0.02 -0.01 0.01 0.06 0.01]]

[[ 0.01 -0.07 0.07 -0.02 0.02 0.04 -0.03 0.06 -0.01]
[-0.05 0.66 0.02 -0.02 -0.01 0.03 0.01 0.08 -0.04]
[-0.01 0.05 -0.01 0.45 -0. -0.05 0.04 0.04 -0.01]
[ 0. 0.37 -0. -0.01 -0.02 -0.01 0.03 -0.01 0.02]]]
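To pull a single test out of these arrays, you can index them directly with the (i, j, tau) convention described above, for example (purely an illustrative lookup, nothing new is computed here):

# p-value and MCI value for the link X^1_{t-2} -> X^2_t
i, j, tau = 1, 2, 2
print(results['p_matrix'][i, j, tau], results['val_matrix'][i, j, tau])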

If you look at the test statistic value (val), you might wonder whether it serves as a good indicator of the strength of the causal effect. Your intuition is not misleading: it can indeed be a good indicator. Yet it does not replace a sound causal effect estimation procedure, which we will discuss in the next article of this series.

Give it some colour

Great, now you have a mess of numbers. Let’s visualize it to make it easier for people to see what kind of message you want to bring across.

To do so, Tigramite offers two kinds of plots, both created with very similar functions that take the graph array from the results, the val_matrix, and the var_names as inputs. More precisely, you can choose between a time series causal graph and a process causal graph that summarizes the former. Both can be very helpful, depending on your audience and on whether the time dependency of the process needs to be depicted.

If, for example, you want to show the causal system you have discovered with respect to its time dependency, you can plot a time series graph similar to those we discussed in the fundamental blog post about PCMCI:

# Plot time series graph
tp.plot_time_series_graph(
    figsize=(6, 4),
    val_matrix=results['val_matrix'],
    graph=results['graph'],
    var_names=var_names,
    link_colorbar_label='MCI',
    ); plt.show()

As you can see above, the ts-graph you obtain shows all lags up to the tau_max you selected. It shows the nodes representing the variables at different points in time, with the links between them visualizing the cross-MCI value.

While the ts-graph is undoubtedly more precise and informative regarding the time dependencies between the variables, it is also quite hard to grasp at first glance. This makes it a little less suitable for presenting to an audience that is not very familiar with causal inference.

The so-called process graph can help here. It gives a simple overview of your results by summarizing the information of the ts-graph into something less crowded. This is achieved by “shaving off” the time dependencies in the plot:

tp.plot_graph(
    val_matrix=results['val_matrix'],
    graph=results['graph'],
    var_names=var_names,
    link_colorbar_label='cross-MCI',
    node_colorbar_label='auto-MCI',
    ); plt.show()

In such process graphs, the node colour denotes the auto-MCI value and the link colours the cross-MCI values. If links occur at multiple lags between two variables, the link colour denotes the strongest one and the label lists all significant lags in order of their strength, so little information is lost even though the time dimension is more or less collapsed.

Wrapping it up

Congratulations! You have just done your first successful causal discovery in Python and at the same time found a way to analyze and visualize your findings intuitively.

Please be aware, though, that your causal discovery is only as good as the data you use and whether that data justifies the assumptions you make. If, for example, you cannot assume causal sufficiency, interpreting any detected link as a genuine causal link can be overly optimistic. Have a look at this guide/review paper for a more in-depth tour of causal discovery.

All in all, the kind of interpretation you can make largely depends on the application scenario. While we have mentioned certain strategies that all point toward scientific scepticism, there is no recipe that works every time. Rather, you have to make informed choices based on the data at hand.

To address this challenge, Tigramite offers a set of additional algorithms and independence tests that may help in your specific projects. Tutorials about these tests, algorithms, and caveats can be found on Tigramite’s GitHub page.
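For example, if you cannot defend the causal sufficiency assumption, the LPCMCI algorithm imported at the beginning is designed for time series with latent confounders. A rough sketch of how it could be called (the parameter values are illustrative; see the Tigramite tutorials for details):

# Sketch: causal discovery allowing for latent confounders
lpcmci = LPCMCI(dataframe=dataframe, cond_ind_test=parcorr, verbosity=0)
results_latent = lpcmci.run_lpcmci(tau_max=8, pc_alpha=0.01)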

But wait, that’s not it. Causal inference is about more than discovering causal links; it is also about estimating the size of those effects, right? So once you have discovered a causal graph, you are not at the end of the road. We hope you keep reading, because this part of the journey is material for another blog post.

Thanks!

About the authors:

Kenneth Styppa is part of the Causal Inference group at the German Aerospace Center’s Institute of Data Science. He has a background in Information Systems and Entrepreneurship from UC Berkeley and Zeppelin University, where he engaged in both startup and research projects related to Machine Learning. Besides working together with Jakob, Kenneth has worked as a data scientist in companies such as BMW and QuantumBlack and currently pursues his graduate degree in Applied Mathematics and Computer Science at Heidelberg University. More on: https://www.linkedin.com/in/kenneth-styppa-546779159/

Jakob Runge heads the Causal Inference group at the German Aerospace Center’s Institute of Data Science in Jena and is chair of computer science at TU Berlin. The Causal Inference group develops causal inference theory, methods, and accessible tools for applications in Earth system sciences and many other domains. Jakob holds a physics PhD from Humboldt University Berlin and started his journey in causal inference at the Potsdam Institute for Climate Impact Research. The group’s methods are distributed open-source on https://github.com/jakobrunge/tigramite.git. More about the group on www.climateinformaticslab.com
