Causal Discovery with Multivariate Time Series Data

A Gentle Guide to Causal Inference with Machine Learning Pt. 8

Kenneth Styppa
Causality in Data Science
11 min read · Jan 8, 2024


From what you’ve read about causal discovery so far, it might seem like an impossible, or at least very tedious, task to discover causal links given ubiquitous confounding, collider bias, statistical challenges, and other pitfalls that can lead to false negatives or false positives. The time series case offers more opportunities, but also more challenges, and there are several methods designed specifically for it. One of the state-of-the-art methods in this field is PCMCI, which enables robust causal discovery in high-dimensional time series settings. Let’s understand how it works!

You should be familiar with the following concepts from the previous articles before you keep on reading:

  1. Causal Markov, Faithfulness and Sufficiency Assumption
  2. The fundamental idea behind causal inference
  3. Causal Graphs & D-separation
  4. The PC algorithm

Time Series Causal Discovery Quick Start:

Before we start, let’s do a quick recap of causal discovery. As the second pillar of causal inference, it allows us to identify causal relationships between the variables of our observed system. Our goal is to identify as many true links as possible while avoiding false positives. To identify causal links, we must rely on a set of assumptions and choose a suitable algorithm. Constraint-based algorithms work with conditional independence tests. It is important to remember that we do not simply conclude that a causal link exists whenever a conditional independence hypothesis is rejected. Instead, such a conclusion is the result of smart algorithms together with a set of assumptions we are willing to make, sometimes also called enabling assumptions. In theory, only if these assumptions hold and all our conditional independence tests are correct can we be certain that the learned causal graph is correct; in between lie the typical challenges of any statistical method applied to finite data.

Although assumptions have to be chosen with great care, the actual implementation can be conducted algorithmically, as shown in the blog post on causal discovery with the PC algorithm. But how does this work in the time series setting?

First of all, we need a modified concept of causal graphs that include a time component.

Recall that causal graphs are directed acyclic graphs whose nodes represent our variables and whose edges are interpreted causally. To generalize this concept to the time series setting, we index each observation in time and allow (multiple) dependencies between variables at different time lags, including a link from a variable to itself at a later time. If, for example, X1 directly causes X2 after one hour and X3 directly causes X2 after two hours and all variables are caused by their immediate past, we obtain the following time series graph (assuming “t” has the resolution of hours):

The “…” implies that this graph structure is assumed to repeat for all time points t (or at least for the part of the time series sample from which you try to estimate it). This important assumption is called causal stationarity and holds if the underlying system is a structural causal model (SCM) of the form

Xj_t = f_j(P(Xj_t), ηj_t),

where f_j is a (non-)linear function, ηj_t are noise terms, and P(…) denotes the causal parents in the graph. As you can see, neither the parents nor the functions f_j depend on the time index t. If the process described by this SCM has a stable solution, the resulting time series will be stationary. This stationarity is a further assumption specific to time series causal discovery and is crucial for learning the graph from a time series sample generated by this process. Then we only need to test for the existence of links between nodes Xi_(t-τ) and Xj_t and can pool observations from different times to test for their conditional independence. The other standard assumptions of causal discovery, the Markov assumption and Faithfulness, are just like in the non-time-series setting. With mutually independent noise terms this SCM is Markovian with respect to the graph, and with non-degenerate functions f_j the process is also faithful to the graph. The causal sufficiency assumption has some slight further implications in the time series setting, as discussed below.
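
To make causal stationarity concrete, here is a minimal simulation of the example system from above (the coefficients and noise scales are purely illustrative): X1 drives X2 with a lag of one step, X3 drives X2 with a lag of two steps, and every variable also depends on its own immediate past.

```python
import numpy as np

rng = np.random.default_rng(42)
T = 1000
X1, X2, X3 = np.zeros(T), np.zeros(T), np.zeros(T)

# The same parents and functional forms apply at every time step t: causal stationarity.
for t in range(2, T):
    X1[t] = 0.6 * X1[t - 1] + rng.standard_normal()
    X3[t] = 0.5 * X3[t - 1] + rng.standard_normal()
    X2[t] = 0.4 * X2[t - 1] + 0.8 * X1[t - 1] + 0.6 * X3[t - 2] + rng.standard_normal()

data = np.column_stack([X1, X2, X3])  # shape (T, N): rows are time steps, columns are variables
```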

The time-dependent structure adds another assumption that simplifies our endeavour: causal influences cannot go backwards in time, so all links in the graph from a lagged variable Xi_(t-τ) to Xj_t are automatically oriented as Xi_(t-τ) → Xj_t. This allows us to drop the orientation step needed in the standard PC algorithm, at least for lagged links. We also require the parents to have a finite time lag, i.e., every Xi_(t-τ) in P(Xj_t) has τ ≤ τ_max for some finite maximum lag τ_max.

While an extension of PCMCI, called PCMCI+, allows us also to orient contemporaneous links between Xi_t and Xj_t, let us keep this blog post simple and assume that the SCM above only contains lagged links.

So let us think about how we could conduct causal discovery in this setting. Let’s keep it simple for the start and work with two variables, X1 and X2. Both depend on time t, say in units of hours. We now want to establish a method that checks whether there is a causal relationship between X1_(t-τ) and X2_t for some time lag τ, under the Markov and Faithfulness assumptions, causal sufficiency, causal stationarity, and no contemporaneous causation.

The easy way — Full CI:

The simplest way would be to condition on all past values of X1 and X2 (the total set “V”). In theory, we would like to condition on the entire infinite past. To limit the dimensionality, we must assume that all causal links in P(…) have a finite lag less than or equal to some maximum lag τ_max. So to test the dependence between X1_(t-τ) and X2_t, we check

X1_(t-τ) ⊥ X2_t | V⁻_t \ {X1_(t-τ)},

where V⁻_t includes the t-1 up to t-τ_max time steps of X1 and X2.

With only 2 variables and a small τ_max, this should work pretty well: any potential path between a non-causal X1_(t-τ) and X2_t (for some positive lag τ) would be blocked by that conditioning set because it contains the parents P(X2_t) of X2_t, which by the rules of d-separation are non-colliders and shield X2_t from any X1_(t-τ) that is not a parent of X2_t. But as we increase the dimensionality of the problem, both in terms of the number of time series and the maximum allowed time lag, our chances of success dwindle.
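
To illustrate, here is a rough sketch of such a FullCI-style test for the two-variable, linear case (the helper function and its name are mine, not from the paper): regress both X2_t and the candidate X1_(t-τ) on all other lagged values up to τ_max, then test whether the residuals are still correlated (a partial correlation test).

```python
import numpy as np
from scipy import stats

def full_ci_test(x1, x2, tau, tau_max):
    """Test X1_(t-tau) _||_ X2_t given all lags 1..tau_max of X1 and X2 (linear case)."""
    T = len(x1)
    target = x2[tau_max:]                    # X2_t for t = tau_max, ..., T-1
    candidate = x1[tau_max - tau:T - tau]    # X1_(t-tau) aligned with the same t
    # Conditioning set: every lag 1..tau_max of X1 and X2, except the candidate itself.
    conds = [np.ones(T - tau_max)]           # intercept
    for lag in range(1, tau_max + 1):
        if lag != tau:
            conds.append(x1[tau_max - lag:T - lag])
        conds.append(x2[tau_max - lag:T - lag])
    Z = np.column_stack(conds)
    # Residualize both series on the conditioning set, then correlate the residuals.
    def residual(y):
        beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
        return y - Z @ beta
    # Returns (correlation, p-value); the p-value is approximate since it ignores
    # the degrees of freedom used up by the conditioning set.
    return stats.pearsonr(residual(candidate), residual(target))

# e.g. with the toy data from above: full_ci_test(X1, X2, tau=1, tau_max=5)
```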

What is the problem?

Firstly, if τ_max * N (where N is the number of time series variables) becomes larger than the sample size, then conditional independence testing is a pretty ill-defined problem without further tricks. And even if the time series length T is larger, the effective sample size is roughly T - τ_max * N and can be very small. For example, with N = 10 variables, a maximum lag of τ_max = 20, and T = 500 time steps, we are left with only about 300 effective samples. This number is relevant for the probability of detecting true links (called the power of a conditional independence test).

Secondly, the above argument shows that while the entire past is sufficient to block all non-causal paths, it is not necessary: we only need some conditioning set that contains the parents P(X2_t) of X2_t. In the graph example above, conditioning on the past of Z is completely irrelevant, as it does not block any non-causal paths between X1 and X2.

So, instead of conditioning on all variables in our system, we need to find a way to condition on a set that contains the parents of X1 and X2 while avoiding, as far as we can, conditioning on irrelevant variables. This is one of the two major ideas behind PCMCI.

The other idea is statistically more subtle. Suppose we somehow have a conditioning set S that contains the parents P(X2_t) of X2_t and test:

X1_(t-τ) ⊥ X2_t | S

Such a test requires further assumptions about the parametric form of the dependencies among the involved variables. It requires knowing the null distribution under the hypothesis of conditional independence. However, the available analytical results on null distributions all assume that the data sample is independently and identically distributed (iid). But we deal with time series here, and, as shown above, our variables will typically depend on their past values, leading to autocorrelation, which violates the independence part of the iid assumption. The result is that our test will not be well-calibrated, which typically leads to false positives, i.e., we detect spurious links. PCMCI addresses this issue.

PCMCI — An Overview:

Causal Discovery with PCMCI can be split into two phases:

  1. Condition-selection: This uses a simplified (and faster) variant of the skeleton phase of the PC algorithm, called PC1, to learn a conditioning set P(Xj_t) that contains the parents of every variable Xj in the system. As outlined in another blog post, PC is a flexible causal discovery algorithm.
  2. MCI (Momentary Conditional Independence) tests: For every pair of variables and every lag τ, test

     Xi_(t-τ) ⊥ Xj_t | P(Xj_t) \ {Xi_(t-τ)}, P(Xi_(t-τ))

If this hypothesis is not rejected, the edge between Xi_(t-τ) and Xj_t is removed. While conditioning on P(Xj_t) would suffice to condition out confounded and indirect connections, the additional conditioning on P(Xi_(t-τ)) removes autocorrelation effects, so that the conditional independence tests are well-calibrated and false positives are better controlled at the desired level. Hence, PCMCI correctly learns the causal graph under the above assumptions. For the statistically inclined readers: the paper also provides a mathematical intuition for why the MCI tests are well-calibrated.

Independence tests

All the independence tests conducted in both the PC1 and MCI phases can be carried out with a linear test (e.g., ParCorr) or non-linear tests (GPDC, CMIknn). Explaining each of them would exceed the scope of this blog post. What is crucial is that you choose an independence test that fits the type of dependency you assume based on your understanding of the system.
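
As a preview of what this looks like in practice, here is a minimal usage sketch with the tigramite package and a linear ParCorr test, applied to the toy `data` array simulated above. Treat it as a sketch rather than copy-paste code: the exact import paths and default parameters can differ between tigramite versions.

```python
import numpy as np
from tigramite import data_processing as pp
from tigramite.pcmci import PCMCI
# In older tigramite versions: from tigramite.independence_tests import ParCorr
from tigramite.independence_tests.parcorr import ParCorr

# `data` is a (T, N) array, one column per time series variable.
dataframe = pp.DataFrame(data, var_names=["X1", "X2", "X3"])

# Choose the test to match the assumed dependency type:
# ParCorr (linear), GPDC or CMIknn (non-linear).
pcmci = PCMCI(dataframe=dataframe, cond_ind_test=ParCorr())
results = pcmci.run_pcmci(tau_max=2, pc_alpha=0.2)  # pc_alpha: significance level of the PC1 phase

pcmci.print_significant_links(p_matrix=results["p_matrix"],
                              val_matrix=results["val_matrix"],
                              alpha_level=0.01)
```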

PCMCI Step-by-Step

Red nodes: conditioning set for X³_t; Blue nodes: conditioning set for X¹_t; Star: False Positives

Phase 1

Let’s now understand how the algorithm works in more detail. Given a time-dependent system, we start by initializing the set of possible parents of each variable Xj_t with its whole (finite) past, i.e., all variables at all lags up to τ_max:

P(Xj_t) = {Xi_(t-τ) : i = 1, …, N; τ = 1, …, τ_max}

Then we iteratively sort out which of these nodes are actually plausible parents of Xj_t.

To do so we start our first iteration (p = 0) by testing unconditional independence between Xj_t and each of the possible parents. If the null hypothesis, i.e. that Xj_t and Xi_(t-τ) are independent, cannot be rejected, we remove the respective variable from our initial set of possible parents.

This serves as a first filter. In all following iterations, we increment p by one, sort the remaining possible parents, and select the p parents with the strongest association (measured by the highest absolute test statistic value, for example, partial correlation).

This set is then used as a conditioning set when testing whether Xj_t is independent of a possible parent node. After each p-loop, we remove the eliminated parents from P and only test links from the remaining parents. This process goes on until no more conditions can be tested, meaning all of the remaining sorted parents have a significant dependency with Xj_t conditional on the other parents.

As a result, we converge towards a set of sorted parents that contains the true causal parents with high probability (red and blue boxes in the figure above). This probability can be increased by conducting the tests at a larger significance level, as is recommended for this phase of PCMCI.

The following pseudo-code summarizes these steps:
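
Since the original pseudo-code is easiest to digest alongside something runnable, here is a simplified Python-style sketch of the condition-selection loop for a single target variable. It is not the tigramite implementation: `ci_test` is a placeholder for whatever conditional independence test you choose (assumed to return a test statistic and a p-value), and the default significance level is illustrative.

```python
import numpy as np

def pc1_parents(ci_test, n_vars, tau_max, alpha=0.2):
    """Learn a superset of the parents of one target variable X^j_t (illustrative sketch)."""
    # Start with the whole lagged past as candidates: every variable at lags 1..tau_max.
    candidates = [(i, -tau) for i in range(n_vars) for tau in range(1, tau_max + 1)]
    strength = {par: np.inf for par in candidates}  # association strength, updated after each test

    p = 0
    while p <= len(candidates) - 1:  # stop once no more conditions can be tested
        for par in list(candidates):
            # Condition on the p strongest *other* remaining candidates.
            others = sorted((c for c in candidates if c != par),
                            key=lambda c: strength[c], reverse=True)
            stat, p_value = ci_test(par, conditions=others[:p])
            strength[par] = abs(stat)
            if p_value > alpha:       # independence not rejected -> drop the candidate parent
                candidates.remove(par)
        p += 1

    # Return the surviving candidates, sorted by association strength.
    return sorted(candidates, key=lambda c: strength[c], reverse=True)
```

In PCMCI this loop is simply repeated for every variable Xj in the system to obtain all condition sets.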

Note that since these tests are all very low-dimensional compared to FullCI, they have higher detection power. Thus, the outcome of the PC phase is a conditioning set that contains the parents with high probability. For X3_t these are the red nodes and for X1_t these are the blue nodes in the figure above. However, as can be seen in the figure, both sets can contain false positives (marked with a red/blue star), not least because the conditional independence tests are not well-calibrated, as discussed above. These will be dealt with in the second phase, which starts after the convergence of PC.

Phase 2

In phase two we again test each of the candidate links for lagged conditional independence, now using the condition sets learned in phase one for both variables involved:

Xi_(t-τ) ⊥ Xj_t | P(Xj_t) \ {Xi_(t-τ)}, P(Xi_(t-τ))

For example, to test whether X1_(t-τ) causes X2_t, we condition on the sufficient conditioning set P(X2_t) \ {X1_(t-τ)}, which blocks non-causal confounding and indirect paths, and additionally on P(X1_(t-τ)), the parents of the lagged variable, to alleviate the effect of autocorrelation. For each test, we now have a lower-dimensional conditioning set than in FullCI. If the hypothesis of (conditional) independence is rejected, the link is added to the list of links that describe the causal graph of the system. Under the above assumptions of PCMCI, the result of these conditional independence tests converges to the true causal graph for an infinite sample size.
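
As a complement to the phase-one sketch above, here is an equally simplified sketch of the MCI phase. Again, `ci_test` is a placeholder for the chosen conditional independence test, and `parents[j]` denotes the condition set learned for variable j in phase one, given as (variable, lag) pairs.

```python
def mci_links(ci_test, parents, n_vars, tau_max, alpha=0.01):
    """Run the MCI tests for all candidate lagged links (illustrative sketch)."""
    links = []
    for j in range(n_vars):
        for i in range(n_vars):
            for tau in range(1, tau_max + 1):
                # Parents of X^j_t (without the tested link): blocks confounding and indirect paths.
                cond = [par for par in parents[j] if par != (i, -tau)]
                # Parents of X^i_(t-tau), shifted back by tau: removes autocorrelation effects.
                cond += [(k, lag - tau) for (k, lag) in parents[i]]
                stat, p_value = ci_test(target=(j, 0), candidate=(i, -tau), conditions=cond)
                if p_value <= alpha:  # dependence detected -> keep the causal link
                    links.append(((i, -tau), j, stat))
    return links
```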

Furthermore, the p-values can be used as a confidence measure and the MCI test statistic may be interpreted as a normalized measure of causal strength. However, this should not be confused with estimating causal effects that have a more physical meaning.

Wrapping it up

PCMCI is an algorithm for causal discovery in the multivariate time series setting. It addresses the problem of high dimensionality by conditioning only on a set that contains the parents of the variables under test, avoiding irrelevant conditions as far as possible when removing links, and its MCI tests additionally condition on the parents of the lagged variable so that they remain well-calibrated under autocorrelation.

So far so good. This should suffice as a foundational understanding of the algorithm. For further theoretical details, we refer to the original paper.

In the next article, we will get our hands dirty and use PCMCI to conduct causal discovery with high-dimensional time series data using Python. So stick around if you want to know how you can use this algorithm in your projects!

Or directly head to the tutorials on tigramite. There we also cover variants of PCMCI that allow us to drop some assumptions: PCMCI+ also detects contemporaneous links, and LPCMCI does not require causal sufficiency. There are now also methods that can deal with some types of non-stationarity (Regime-PCMCI) and jointly learn causal graphs from more than one dataset (J-PCMCI+). The tutorials also expand on the different types of conditional independence tests.

Thanks for reading!

About the authors:

Kenneth Styppa is part of the Causal Inference group at the German Aerospace Center’s Institute of Data Science. He has a background in Information Systems and Entrepreneurship from UC Berkeley and Zeppelin University, where he has engaged in both startup and research projects related to Machine Learning. Besides working together with Jakob, Kenneth worked as a data scientist at Quantum Black and is currently pursuing his graduate degree in Applied Mathematics and Computer Science at Heidelberg University. More on: https://www.linkedin.com/in/kenneth-styppa-546779159/

Jakob Runge heads the Causal Inference group at German Aerospace Center’s Institute of Data Science in Jena and is chair of computer science at TU Berlin. The Causal Inference group develops causal inference theory, methods, and accessible tools for applications in Earth system sciences and many other domains. Jakob holds a physics PhD from Humboldt University Berlin and started his journey in causal inference at the Potsdam Institute for Climate Impact Research. The group’s methods are distributed open-source on https://github.com/jakobrunge/tigramite.git. More about the group on www.climateinformaticslab.com

