Causal Inference in the Context of Social Network Metrics Analysis

Machine Learning Lab ITMO
27 min read · Sep 11, 2020


Image source: xkcd.com/552/

Hi, everyone. My name is Anver, and I’m part of the CoreML team at VK, one of Eastern Europe’s most popular social networks. One of our team’s goals is to improve the post recommendation system in the news feed. In this article, I will talk about some methods for increasing crucial social network metrics such as time spent, number of sessions, etc. I applied these approaches in practice while working on my bachelor’s thesis at the ITMO machine learning lab. My thesis is dedicated to causal inference (CI) between different social network metrics that are usually estimated by A/B testing. Here, I will talk not only about tricky CI algorithms but also about the non-trivial task of testing and comparing models.

Problem Setting

I’ll start with a short explanation of the problem I tried to solve in my work. Imagine a simple dependency model consisting of 3 layers:

The first layer is the model that ranks the posts users see in their VK news feed. Users respond to this ranking with their actions: for example, they might leave likes on news feed posts more often. Such changes affect short-term metrics. It’s important to note that the model can rank posts in a way that predictably optimizes these short-term metrics. However, short-term metrics don’t represent users’ satisfaction with VK, while long-term metrics do. An example of a long-term metric is time spent in the VK app.

The hypothesis is that short-term metrics can affect long-term metrics. For simplicity, I will call them short and long metrics respectively. The problem is to find the short metrics that affect long metrics in this causal representation.

This model might be considered too naïve to apply to real-world problems. Indeed, we have to keep in mind non-observable variables, confounders, that can affect both short and long metrics. While searching for relationships between short and long metrics, we have to consider the existence of these confounders.

Insufficiency of Correlation Analysis

Even in the presence of confounders, simple feature selection (FS) might be considered robust enough for the task at hand. Unfortunately, simple FS runs into problems in a case like this:

For the sake of this example, let’s imagine that the graph above is the true causal graph. We can also assume that there is a positive influence between every pair of connected nodes (e.g. the more likes there are, the more sessions per user (SPU) we observe). In this case, with simple linear models, we could infer two incorrect relationships:

  1. Likes → Photo opening (as both have one common consequence: SPU).
  2. Photo comments → SPU (as both have one common parent: Photo opening). This could happen if we use FS to find strong features of SPU.

Such links are called spurious correlations. A causal approach is designed to overcome these issues.
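
To make this concrete, here is a minimal simulation sketch (the variable names and the linear Gaussian data are my own assumptions, not VK data) showing how a purely correlational view produces both spurious links:

```python
# A toy simulation of the causal graph from the example above.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

likes = rng.normal(size=n)                           # no causal link to photo_opening
photo_opening = rng.normal(size=n)
spu = likes + photo_opening + rng.normal(size=n)     # common consequence (collider)
photo_comments = photo_opening + rng.normal(size=n)  # shares a parent with spu

# 1) Likes vs. photo opening: independent marginally, but correlated once we
#    "control for" their common consequence SPU (collider bias).
marginal = np.corrcoef(likes, photo_opening)[0, 1]
res_likes = likes - np.polyval(np.polyfit(spu, likes, 1), spu)
res_open = photo_opening - np.polyval(np.polyfit(spu, photo_opening, 1), spu)
conditional = np.corrcoef(res_likes, res_open)[0, 1]
print(f"corr(likes, photo_opening)    = {marginal:+.3f}")     # ~0
print(f"corr(... | spu) via residuals = {conditional:+.3f}")   # clearly negative

# 2) Photo comments vs. SPU: no direct effect, but strongly correlated
#    because photo_opening is a common parent (confounding).
print(f"corr(photo_comments, spu)     = {np.corrcoef(photo_comments, spu)[0, 1]:+.3f}")
```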

Fundamental Definitions of Causal Inference

To start, we need to go over some basic concepts of CI. Some of them can seem tricky at first, which is why we’ll give examples.

Independence. This definition should already be well known to you. Formally, random variables X and Y are independent i.f.f.

P(X, Y) = P(X) · P(Y)

One classic example is a die’s value after the first (X) and second (Y) throw. It is denoted by X ⊥ Y.

Conditional independence. Random variables X and Y are conditionally independent given Z i.f.f.

P(X, Y | Z) = P(X | Z) · P(Y | Z)

It is denoted by (X ⊥ Y | Z). Note that conditional independence of two variables doesn’t imply their independence, nor the other way around. Let’s look at two counterexamples:

1. X = height of a child, Y = vocabulary of a child, and Z = age. Obviously, X and Y are dependent. However, they are conditionally independent given Z.

2. X = a die’s value after the first throw, Y = its value after the second throw, and Z = X + Y. Despite the fact that X and Y are independent, they are conditionally dependent given Z.
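
Both counterexamples are easy to check numerically; here is a quick sketch of the second one:

```python
# Two dice are independent, but become dependent once we condition on their sum.
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000
x = rng.integers(1, 7, size=n)
y = rng.integers(1, 7, size=n)
z = x + y

# Unconditionally: P(X=6, Y=6) ≈ P(X=6) * P(Y=6) ≈ 1/36
print((x == 6).mean() * (y == 6).mean(), ((x == 6) & (y == 6)).mean())

# Given Z = 7: P(X=6, Y=6 | Z=7) = 0, while P(X=6 | Z=7) * P(Y=6 | Z=7) ≈ 1/36,
# so X and Y are dependent given Z.
given = z == 7
print((x[given] == 6).mean() * (y[given] == 6).mean(),
      ((x[given] == 6) & (y[given] == 6)).mean())
```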

Chain rule. It’s not a definition, but a well-known formula for expressing a joint distribution:

P(X₁, …, Xₙ) = P(X₁) · P(X₂ | X₁) · … · P(Xₙ | X₁, …, Xₙ₋₁)

Bayesian network. Bayesian networks are directed acyclic graphs (DAGs). Each node corresponds to a random variable and stores its conditional distribution given its parents; a node without parents stores the marginal distribution of its variable. The set of edges is chosen in such a way that the chain rule can be simplified as follows:

P(X₁, …, Xₙ) = ∏ᵢ P(Xᵢ | Parents(Xᵢ))

Let’s look at a classic example. Below is an assumed structure of a Bayesian network over 5 variables for US students.

Our assumption here is that Intelligence can affect a student’s Scholastic Assessment Test score (SAT score) and grade in some subject. The difficulty of the subject also affects the grade, and a reference letter from the teacher depends on the grade.

In this example, the chain rule simplifies a lot: two factors have unconditional distributions (Difficulty and Intelligence), two depend on only one variable (Reference Letter and SAT), and only one depends on two variables (Grade). In the general chain rule we would have the same 5 factors, but they would condition on anywhere from 0 to 4 variables. Take a look at the simplified factorization:

P(D, I, G, S, L) = P(D) · P(I) · P(G | D, I) · P(S | I) · P(L | G)

In short, Bayesian networks are data structures for simplifying the general chain rule.
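
As a toy illustration of this factorization (the structure follows the student example above, while the probabilities are invented), a Bayesian network can be coded as a set of CPDs indexed by parent values:

```python
# Each node stores a CPD over its parents; the joint probability is the
# product of the local factors (the simplified chain rule).
parents = {"D": [], "I": [], "G": ["D", "I"], "S": ["I"], "L": ["G"]}

cpds = {
    "D": {(): {0: 0.6, 1: 0.4}},                                  # P(Difficulty)
    "I": {(): {0: 0.7, 1: 0.3}},                                  # P(Intelligence)
    "G": {(0, 0): {0: 0.3, 1: 0.7}, (0, 1): {0: 0.1, 1: 0.9},
          (1, 0): {0: 0.8, 1: 0.2}, (1, 1): {0: 0.4, 1: 0.6}},    # P(Grade | D, I)
    "S": {(0,): {0: 0.95, 1: 0.05}, (1,): {0: 0.2, 1: 0.8}},      # P(SAT | I)
    "L": {(0,): {0: 0.9, 1: 0.1}, (1,): {0: 0.4, 1: 0.6}},        # P(Letter | G)
}

def joint_probability(assignment: dict) -> float:
    """P(x1, ..., xn) = prod_i P(x_i | parents(x_i))."""
    prob = 1.0
    for var, pars in parents.items():
        key = tuple(assignment[p] for p in pars)
        prob *= cpds[var][key][assignment[var]]
    return prob

print(joint_probability({"D": 0, "I": 1, "G": 1, "S": 1, "L": 1}))
```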

Capacity of Causal Inference

Given the definitions above, we can say that the main goal of causal inference is to infer the structure of a Bayesian network (which is a DAG). Formally, there is a definition of Markov compatibility: if a probability distribution P admits the factorization relative to a DAG G, we say that G and P are compatible, or that P is Markov relative to G.

However, this goal is unachievable when working with observational data alone, even without hidden variables. The reason is that one probability distribution can correspond to more than one DAG. Take a look at the two different structures below:

Factorization of the left one is:

Factorization of the right one is:

These two DAGs correspond to the same distribution. What’s more, the right structure with its edges inverted corresponds to the same distribution as well, which is why we’ll need one more definition. Two DAGs G and G’ are observationally equivalent i.f.f. they have the same skeleton and the same set of v-structures. A v-structure is a structure over 3 variables (x, y, z) where exactly two of the variables aren’t connected (say, x and z), and the edges are oriented from x to y and from z to y.

Here’s an example of a v-structure that looks implausible:

What’s interesting about v-structures is that they’re the only kind of structure over three variables that is not observationally equivalent to the other orientations of the same skeleton. Take a look:

This is possible only if Grade and SAT are independent. Here it’s implausible, but readers can think about a v-structure over Difficulty, Intelligence, and Grade. These structures are crucial in the orientation of a skeleton.

Therefore, we can’t set our goal as finding the DAG corresponding to the distribution. But we can aim to find the maximally oriented equivalence class (or pattern), meaning that no more edges can be oriented using observational data alone.

We need one more definition before we can describe the most popular CI algorithm.

An undirected path p in a DAG G is said to be d-separated (or blocked) by a set of nodes Z i.f.f. at least one of these statements is true:

  1. p contains a chain i → m → j or a fork i ← m → j such that m ∈ Z.
  2. p contains an inverted fork (or v-structure, or collider) i → m ← j such that m ∉ Z and descendants(m) ∩ Z = ∅.

This is one of the most important definitions out of the ones provided above, as well as the least intuitive one, which is why I’ll try to clarify this concept.

The first statement is the case when the path p contains two dependent variables (i and j), which become independent given a variable from Z (m). Here’s an example:

In the red path, variable X₅ depends on X₁ (as one is a descendant of the other). However, they’re independent given X₃. In the illustration above, the reference letter depends on the difficulty of the subject, as difficulty can affect the student’s grade.

In the black path, the variable X₃ could, in general, depend on X₄ (as we expect SAT and Grade to be correlated), but only because they have one common cause: X₂ (Intelligence). Given this cause, they are no longer dependent.

The second statement is the opposite: the path p contains two independent variables i and j, which become dependent given a variable from Z or its consequences.

Let’s continue examining the black path in the last example. Here X₁ and X₂ are independent (e.g. Difficulty of the subject and Intelligence of the student). Let’s put X₃ or X₅ in Z (e.g. Grade or Reference Letter). In this case, given the consequences of two independent variables, they are no longer independent. When we know that the subject is hard, but the student received a high grade or received an excellent reference letter, we can assume that the student is likely to be intelligent. So we don’t want to put these consequences in the blocking set.

Above, we considered d-separation of paths, but a more general definition is more useful. When every path p between variables X and Y is d-separated by a set of nodes Z, this set is said to d-separate X and Y. Less formally, it means that Z blocks every path between X and Y, so no information can pass between them. That’s how they become (in the first case of the definition) or stay (in the second case) independent given Z.

This definition has a valuable implication that connects graph theory and probability theory: if X and Y are d-separated by Z in a DAG G, then (X ⊥ Y | Z)ₚ in every distribution P compatible with G. In a sense, d-separation is the graphical counterpart of conditional independence.
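
Since d-separation is a purely graphical criterion, it’s easy to check on small graphs. Here is a sketch using networkx on the student example (the function name depends on your networkx version):

```python
# Checking d-separation in the student network with networkx
# (called d_separated in networkx 2.8-3.2, is_d_separator in newer releases).
import networkx as nx

G = nx.DiGraph([("Difficulty", "Grade"), ("Intelligence", "Grade"),
                ("Intelligence", "SAT"), ("Grade", "Letter")])

# Difficulty and Letter: dependent marginally, but blocked by Grade.
print(nx.d_separated(G, {"Difficulty"}, {"Letter"}, set()))          # False
print(nx.d_separated(G, {"Difficulty"}, {"Letter"}, {"Grade"}))      # True

# Difficulty and Intelligence: independent, but conditioning on the
# collider Grade unblocks the path between them.
print(nx.d_separated(G, {"Difficulty"}, {"Intelligence"}, set()))    # True
print(nx.d_separated(G, {"Difficulty"}, {"Intelligence"}, {"Grade"}))  # False
```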

For more detailed explanations of these fundamental concepts, I strongly recommend Daphne Koller’s course on Coursera: Probabilistic Graphical Models.

Peter and Clark Algorithm

Now we’ll look at the classic algorithms for finding causal structures. Generally, CI algorithms for structure inference have two stages:

  1. Build a skeleton.
  2. Orient the built skeleton.

The first stage is very time-consuming. The most naïve approach to skeleton inference is to check the dependence of each pair of nodes given every possible subset of the remaining nodes. The complexity of this approach is O(n² · 2ⁿ) conditional independence tests, where n is the number of observed random variables, and each test could itself be time-consuming. Of course, once we find a d-separating set for two nodes, we can stop testing other sets for them, but that doesn’t affect the asymptotic complexity at all.

Even though it seems unacceptable, that’s how the inductive causation (IC) algorithm works, and it can’t be applied to most real-world problems. The algorithm we’ll describe below uses a strong assumption that there are no hidden variables. That’s not exactly how the assumption is really worded; we will return to it later.

Given that we can observe every variable required for inferring a plausible structure, there is an important fact that significantly improves the brute-force algorithm: if two variables X and Y are d-separated by some set Z, then they are d-separated by some subset of Adjacent(X) or of Adjacent(Y), where Adjacent(X) is the set of vertices adjacent to X (ignoring direction).

We can apply this knowledge by searching for a d-separating set exclusively among the neighbors of these variables. You might note that the asymptotic complexity still hasn’t been reduced, since finding an optimal skeleton is an NP-hard problem. However, true causal graphs usually have low density: if the maximal vertex degree of the graph is bounded by a constant, this approach becomes polynomial. This algorithm is called the Peter and Clark (PC) algorithm.
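
To show the overall structure of the algorithm (this is just a sketch, not the pcalg implementation, and `ci_test` is an assumed conditional-independence oracle), the skeleton phase looks roughly like this:

```python
# Simplified PC skeleton phase: start from a complete graph and remove an edge
# whenever a separating set is found among a node's neighbors.
from itertools import combinations

def pc_skeleton(nodes, ci_test, max_degree):
    adj = {v: set(nodes) - {v} for v in nodes}        # complete undirected graph
    sep_set = {}
    for size in range(max_degree + 1):                # grow conditioning-set size
        for x in nodes:
            for y in list(adj[x]):
                candidates = adj[x] - {y}             # search only among neighbors
                if len(candidates) < size:
                    continue
                for z in combinations(candidates, size):
                    if ci_test(x, y, set(z)):         # (x ⊥ y | z) accepted
                        adj[x].discard(y)
                        adj[y].discard(x)
                        sep_set[frozenset((x, y))] = set(z)
                        break
    return adj, sep_set
```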

The next problem is how to orient the graph so that we get a maximally oriented pattern. A pattern is called maximally oriented when orienting any additional edge would change the set of v-structures or produce a cycle. Even though orientation turns out to be the weak point of causal inference in practice, the step itself is quite simple. It’s usually split into two phases:

  1. Orienting colliders (v-structures).
  2. Applying rules based on a limited amount of vertices where possible.

The first stage is easy to understand. We choose 3 variables (x, y, z) with exactly two edges between them, so they could potentially be oriented as a v-structure. After that, we check whether the vertex connected to the two others (say, y) is in the d-separating set of the other two vertices (x and z).
If y ∈ Sep(x, z), then y blocks the undirected path x — y — z, so y is not a collider. On the other hand, if y ∉ Sep(x, z), then y is a collider, and we orient this triple as x → y ← z.

The second stage consists of several rules applied repeatedly until no more edges can be oriented. There is a set of 4 such rules (it’s not the only known set) that guarantees a maximally oriented pattern. I won’t prove their correctness, and I don’t think it’s necessary to list them all. For the sake of example, however, here is one of them:

Given a triple a → b — c, where a and c are not adjacent, orient the edge b — c as b → c. Since b is not a collider (otherwise it would have been oriented during stage 1), it is in Sep(a, c). Therefore, this edge can’t point toward b, which is why we orient it toward c.

Both stages are polynomial in the number of variables: all we have to do is traverse sets of variables of limited size and test a few conditions.
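
And a matching sketch of the orientation phase on top of the skeleton above, with only the rule demonstrated earlier included:

```python
# Orientation sketch: first orient colliders, then repeatedly apply rule 1
# (a → b — c with a, c non-adjacent implies b → c). adj and sep_set come from
# the pc_skeleton sketch above.
from itertools import combinations

def orient(nodes, adj, sep_set):
    directed = set()                                   # arrows as (tail, head) pairs
    # Stage 1: orient v-structures x -> y <- z
    for y in nodes:
        for x, z in combinations(sorted(adj[y]), 2):
            if z not in adj[x] and y not in sep_set.get(frozenset((x, z)), set()):
                directed.update({(x, y), (z, y)})
    # Stage 2: apply rule 1 until nothing changes (other rules omitted for brevity)
    changed = True
    while changed:
        changed = False
        for a, b in list(directed):
            for c in adj[b] - {a}:
                undirected = (b, c) not in directed and (c, b) not in directed
                if c not in adj[a] and undirected:
                    directed.add((b, c))
                    changed = True
    return directed
```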

Main Hypotheses of Causal Inference

Now that we have examined the PC algorithm, we have a CI algorithm that is polynomial under some natural constraints. That’s great, but let’s take the time to think about what we assume in this algorithm.

  1. Causal Markov condition. Let G be a DAG and P a compatible distribution over a set of variables V. The causal Markov condition holds i.f.f. every variable is conditionally independent of its non-descendants given its parents.
    Less formally, it means that each variable depends only on its direct causes (parents).
  2. Causal minimality condition. Let G be a DAG and P a compatible distribution over a set of variables V. The causal minimality condition holds i.f.f. no proper subgraph of G satisfies the causal Markov condition with respect to P.
    Look at the example below: the left graph doesn’t satisfy the causal minimality condition. This condition puts constraints on the skeleton of the desired graph; we don’t want to build a complete graph and then orient it.
  3. Faithfulness. Let G be a DAG and P a compatible distribution over a set of variables V. The faithfulness hypothesis is satisfied i.f.f. any two vertices X and Y are conditionally independent given a subset of vertices Z i.f.f. X and Y are d-separated by Z.
    Less formally, it means we can detect conditional independence perfectly. Of course, in the real world we don’t know the true distribution; we only have the data and some ways of estimating a probability function from it. In causality, we don’t need to know the probability itself, but we do have to run conditional independence tests. This hypothesis assumes that we can do so perfectly, which is highly unlikely.
  4. Causal sufficiency. A set of variables V is called causally sufficient i.f.f. every common cause of at least two variables from V is also in V, or has the same value in every observation.
    This hypothesis was informally mentioned in the description of the PC algorithm: “there are no hidden variables”. The formal version is less strict, but it is still a strong assumption.
    For example, let’s assume V = {a, b, c} and (a ⊥ c) under the probability function P. Then both graphs below satisfy the causal Markov condition.

The first two conditions look reasonable: they describe exactly the kind of structure we expect to obtain as a result. The last two hypotheses, however, seem questionable. Fortunately, there are ways to overcome the causal sufficiency assumption.

Formal Causal Markov Condition
Example of Causal Minimality Condition
Formal faithfulness hypothesis

Inference Without Causal Sufficiency

Let’s consider, based on our previous example, what we want to achieve when causal sufficiency is not satisfied:

Imagine that we don’t know Intelligence of the student, which is very likely to be the case in real life. This is what the PC algorithm will return in this instance:

As you can see, it’s almost fine. Ideally, we’d want a result like this:

The edge between Grade and SAT should be of a type other than “non-oriented”: it should show that there is a common cause between Grade and SAT rather than a direct influence of one variable on the other.

You might assume that the PC algorithm at least builds the graph skeleton correctly. Unfortunately, that’s not the case. Let’s consider this graph:

Here’s an opportunity to test yourself: what is a d-separation set of X and Z?

The answer is {Y, H}. {Y} is not enough, as we have the path <X, Y, H, Z>, which is not d-separated by {Y}. That’s why, when we observe only V = {X, Y, Z}, the edge between X and Z will not be eliminated, even though, unlike in the previous example, there’s no common cause between them. Without causal sufficiency, the PC algorithm doesn’t work properly. A brute-force algorithm won’t help either, since some pairs of vertices may have no separating set among the observed variables at all. In other cases, a separating set exists among the observed variables, but not among the neighbors of the two vertices:

Here, T and S are hidden variables. The PC algorithm won’t eliminate the edges (A, B) and (D, E), and that’s fine. However, A and E are d-separated by the set {B, C, D}, while Adj(A) = Adj(E) = {B, D}. This means that, in the presence of hidden variables, we may keep edges that could have been eliminated using the observed variables (just not using the neighbors alone). That’s why we need to rethink what we want to achieve with an algorithm that doesn’t assume causal sufficiency.

Inducing Path Graphs

The most important definition in CI without causal sufficiency is the inducing path. Let G be a DAG over variables V, O ⊂ V the set of observed variables, and A, B ∈ O. An undirected path U between A and B is called inducing i.f.f.:

  1. Every vertex t ∈ U ∩ O is a collider on U, or t = A, or t = B.
  2. Every collider on U is an ancestor of A or B.

Intuitive, isn’t it? Let me try to clarify.

Consider the path U=<A, B, C, D, E, F>. Imagine that O = {A, B, F} or O = {A, B, D, F}. Here, U is an inducing path, but if O = {A, B, C, D, F}, then it’s not inducing anymore.

One might think that this is still a strange term, but it’s really not. Do you remember when a path is d-separated by some set Z? The first case is when there’s a fork or chain with its middle vertex in Z, and the first condition of the inducing path definition forbids exactly such structures. Formally, this leads to a key property of inducing paths: let G be a DAG over variables V, O ⊂ V, and A, B ∈ O. Then A and B are not d-separated by any subset of O \ {A, B} i.f.f. there is an inducing path between A and B over O. You can test it on the previous example.
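
The property is easy to check by brute force. The graph below is my reconstruction of the earlier hidden-variable example (X, Y, Z observed, H hidden), so treat the exact structure as an assumption:

```python
# A and B have an inducing path over O iff no subset of O \ {A, B} d-separates
# them in the full DAG; here we simply try every subset.
from itertools import chain, combinations
import networkx as nx

def separable_over(G, a, b, observed):
    """True iff some subset of observed \\ {a, b} d-separates a and b in G."""
    candidates = set(observed) - {a, b}
    subsets = chain.from_iterable(combinations(candidates, r)
                                  for r in range(len(candidates) + 1))
    return any(nx.d_separated(G, {a}, {b}, set(s)) for s in subsets)

G = nx.DiGraph([("X", "Y"), ("Y", "Z"), ("H", "Y"), ("H", "Z")])
print(separable_over(G, "X", "Z", {"X", "Y", "Z", "H"}))  # True: {Y, H} works
print(separable_over(G, "X", "Z", {"X", "Y", "Z"}))       # False: H is hidden
```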

Knowing the definition of an inducing path, we can define inducing path graphs. Let G be a DAG over variables V and O ⊂ V. The graph G’ (which is not a DAG) is an inducing path graph over O for G i.f.f. it has an edge A → B exactly when there is an inducing path between A and B in G that enters B (i.e., points into B).

Using the property of inducing paths, it’s easy to state the key property of inducing path graphs: let X, Y ∈ O and S ⊆ O \ {X, Y}. Then X and Y are d-separated by S in G’ i.f.f. they are d-separated by S in G.

This means that, without causal sufficiency, we want to build an inducing path graph instead of a Bayesian network structure. However, as previously mentioned, even in theory we can’t infer the inducing path graph itself, only its pattern (its equivalence class), which is called a partially oriented inducing path graph.

There is a representation challenge in this case. In a Bayesian network pattern, uncertainty about edge direction can be expressed by bidirected (effectively undirected) edges. An inducing path graph is not a DAG, and a bidirected edge there means something different: it indicates a hidden common cause of two observable variables. That is how new edge marks appear: A o — o B. The circle means that the endpoint could turn out to be either an arrowhead or an empty mark (a tail).

Causal Inference Algorithm Ideas Without Causal Sufficiency

Unfortunately, the full description of the CI algorithm is too bulky for this article. In short, it builds a skeleton the same way as the IC (brute-force) algorithm, orients every edge as A o — o B, and then starts the orientation phase with plenty of rules (there are 10, as opposed to 4 in PC).

Because of the skeleton-building stage, this algorithm is very impractical. The PC algorithm saved us by looking for separating sets only among a very limited subset of nodes; without causal sufficiency, this trick doesn’t work. But there is another way to limit the search space, and this idea appears in the FCI (Fast CI) algorithm.

The first idea relies on the fact that the PC algorithm never removes an edge it shouldn’t: every edge it eliminates is backed by an actual separating set. This makes it possible to use the PC algorithm to remove most of the edges between variables. The second idea is to use the orientation obtained after the PC algorithm to build the so-called Possible D-SEP: for each pair of nodes, a set of candidate nodes that excludes variables which definitely couldn’t d-separate this pair. How it’s built is quite involved and unnecessary in the scope of this article, but the main point is that it narrows the search for d-separating sets.

The FCI algorithm is basically the CI algorithm with a more efficient skeleton-building stage. However, it’s still hardly applicable in high dimensions, because its time complexity grows exponentially with the number of nodes. That’s why the RFCI (Colombo et al., 2011) and FCI+ (Claassen et al., 2013) algorithms were created. The first of them is not complete, meaning its output is not maximally oriented, but it is efficient in high dimensions. Like the FCI algorithm (whose completeness was proven in 2008), the FCI+ algorithm is complete, and it reduces Possible D-SEP in such a way that the algorithm becomes polynomial if the maximal degree of the nodes is bounded by a constant. And in practice, it’s fast indeed!

Proposed Modification of Classic Algorithms

Great! So we have an algorithm that can not only represent influence but show the presence of hidden variables as well. More than that, this algorithm is both correct and complete! What could go wrong?

Well, problems start to arise when we try to apply this algorithm without faithfulness being satisfied (and it can’t be satisfied in real-world problems). That’s why I decided to come up with a modified orientation algorithm for my bachelor’s thesis.

So far, I haven’t said anything about how to check conditional independence; we simply assumed that we could check it somehow. However, to apply an algorithm, we need to decide how to perform such tests. The most popular choice for these algorithms is the partial correlation test. The first reason for its popularity is efficiency: the algorithm only requires precalculating the correlation matrix of all variables, and that’s it. After that, each test (A ⊥ B | Z) can be performed in O(|Z|³), and |Z| is usually quite small (e.g. for the PC algorithm it’s bounded by the maximal vertex degree). So the queries don’t depend on the number of observations at all! The second reason is that it’s implemented in the popular pcalg library. This library contains all the aforementioned algorithms, and its kernel is implemented with Rcpp, which is why it’s really efficient. There might be a causal relationship between these two facts.
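
For reference, here is a sketch of the standard Fisher-z partial correlation test computed from a precomputed correlation matrix (not necessarily identical to the pcalg implementation):

```python
# Partial-correlation CI test: invert the relevant submatrix of the correlation
# matrix and apply Fisher's z-transform.
import numpy as np
from scipy.stats import norm

def partial_corr_test(corr, n_obs, i, j, cond, alpha=0.01):
    """Return True if (X_i ⊥ X_j | X_cond) is accepted at level alpha."""
    idx = [i, j] + list(cond)
    prec = np.linalg.pinv(corr[np.ix_(idx, idx)])    # O(|cond|^3)
    r = -prec[0, 1] / np.sqrt(prec[0, 0] * prec[1, 1])
    r = np.clip(r, -0.999999, 0.999999)
    z = 0.5 * np.log((1 + r) / (1 - r))              # Fisher z-transform
    stat = np.sqrt(n_obs - len(cond) - 3) * abs(z)
    p_value = 2 * (1 - norm.cdf(stat))
    return p_value > alpha

# Usage: corr = np.corrcoef(data, rowvar=False) is computed once, and every
# query reuses it without touching the raw observations again.
```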

As you might’ve noticed, the partial correlation test has little in common with a perfect conditional independence test. This is why the described algorithms struggle on real data, and the orientation stage suffers the most: almost every edge between variables ends up bidirected. Indeed, the PC algorithm fails to specify directions, producing patterns that aren’t very informative, while the FCI algorithm (and algorithms similar to it) suggests that there is a hidden variable affecting the two nodes. This is exactly what happened in my thesis on real data: almost every edge was oriented as bidirected. Even though this is possible, such an orientation doesn’t provide any valuable information, and that’s a problem.

After some analysis, I noticed that most of the arrows appear during the collider-orientation stage, so I decided to strengthen the condition for orienting a triple as a collider. Recall how colliders are oriented: we take a triple with exactly 2 edges, a — b — c, and check whether b is in the separating set of a and c. If not, we orient the triple as a collider. Simple enough, and probably too simple a basis for orienting an edge.

So I came up with the following idea: if we add the middle vertex to the separating set, do the two outer vertices become significantly less likely to be conditionally independent? If so, the middle vertex looks like a collider: adding it “unblocks” information between the two vertices, which is exactly what a collider does. So my orientation algorithm requires additional conditional independence tests, and it introduces a new hyperparameter that adjusts the number of colliders: it controls how strong the effect of adding the middle variable to an existing separating set must be.
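
Here is how such a check might look in code. This is only my sketch of the idea; the exact criterion and threshold used in the thesis may differ, and `p_value_of_ci_test` is an assumed helper returning the p-value of a conditional independence test:

```python
# Modified collider check: orient a -> b <- c only if adding b to the found
# separating set makes the CI test for (a, c) noticeably "worse". The threshold
# gamma is the new hyperparameter controlling how many colliders get oriented.
def is_collider(a, b, c, sep_set, p_value_of_ci_test, gamma=10.0):
    sep = sep_set[frozenset((a, c))]
    p_without = p_value_of_ci_test(a, c, sep)         # how plausible (a ⊥ c | sep)
    p_with = p_value_of_ci_test(a, c, sep | {b})      # ... after adding b
    return p_without > gamma * p_with                 # b "unblocks" the path
```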

Let’s consider an example.

Imagine that this is the true DAG from which the data is sampled. There is only one d-separation (as only two vertices are not connected): c and d are d-separated by {a, b}. It’s a simple exercise; you can verify it yourself. Under the faithfulness assumption, this means there is only one conditional independence: (c ⊥ d | {a, b}). In practice, due to limited data or imperfect tests, we may instead infer (c ⊥ d | {a}) or (c ⊥ d | {b}) and never observe the true d-separating set. Without loss of generality, assume the algorithm inferred (c ⊥ d | {a}). This might not seem like a big deal, but unfortunately it is: it won’t affect the skeleton, but it will affect edge orientation.

Let’s look at the potential v-structure (b, c, d). Both the PC and FCI algorithms will check the following: is b in the d-separating set of c and d? If not, they will orient both edges toward vertex b. That’s how the correctness of the output is violated! Below is the algorithms’ output:

So what output do we expect here? Since we can only infer a pattern of the true causal graph, and the true DAG has no v-structures, we can’t orient the edges even in theory. My algorithm is designed exactly for reducing the number of dubious colliders and consequently increasing the correctness of the output. In this case, it will check the following: if vertex b is added to the separating set of c and d (which is {a}), would that significantly decrease the probability that c and d are conditionally independent?

Here, the answer would be no; it wouldn’t decrease that probability. And no collider would be oriented here, so the output of my algorithm is below:

Estimation and Comparison of Causal Inference Algorithms

When people come up with an algorithm or create a new model, they compare it against already existing approaches. However, when it comes to causality, such comparisons are controversial, as there is no standard testing methodology. A NeurIPS 2019 paper entitled The Case for Evaluating Causal Models Using Interventional Measures and Empirical Data raises exactly this problem.

Testing, in general, can be split into three categories: on synthetic, semi-synthetic, and real data. The problem is that suitable real data is rare in causality problems, since the ground-truth structure is usually unknown. Real data for this task has to be very specific: the true structure must be known, and there must be a lot of observations.

In my field of research, there’s an opportunity to test some edges by running A/B tests. However, doing so isn’t cheap: A/B tests take a lot of time (approximately 2 weeks) and require careful manual setup. In other fields, such as medicine, these experiments can cost even more or are simply impossible to run. That’s why I tried to come up with offline methods for evaluating structure quality.

The first idea is to calculate the likelihood of the graph. For a Bayesian network with maximum-likelihood parameters, the log-likelihood decomposes as

log L(G) = N · Σ_X [ I(X, PAG(X)) − H(X) ]

Here, PAG(X) is the set of parents of the node X, I(X, Y) is the mutual information between random variables X and Y (a set of random variables can be treated as a single multidimensional variable), H(X) is the entropy of the random variable X, and N is the number of observations. Only the first term depends on the graph structure, and that’s exactly what we’ll calculate later. MI and entropy aren’t easy to estimate either, but suitable methods are described in the papers Estimating mutual information (Kraskov et al., 2003) and Nonparametric entropy estimation: an overview (Beirlant et al., 1997). They are based on the kNN algorithm, and I implemented them in my thesis because I didn’t find good alternatives for continuous multidimensional variables in existing libraries.
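
A sketch of the structure-dependent part of this score, assuming a kNN-based mutual information estimator `mi` is available as a callback (a hypothetical helper standing in for the Kraskov-style estimator):

```python
# Only the mutual-information term depends on the graph, so the structure score
# is the sum of I(X; parents(X)) over all nodes; the entropy term is constant.
import numpy as np

def structure_score(data, parents, mi):
    """data: dict node -> (n,) array; parents: dict node -> list of parent nodes."""
    score = 0.0
    for node, pars in parents.items():
        if pars:
            parent_matrix = np.column_stack([data[p] for p in pars])
            score += mi(data[node], parent_matrix)   # I(X; PA_G(X))
    return score
```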

When working with fully synthetic data, the more popular approach is to generate a Bayesian network, sample data from it, and then compare the inferred structure with the known ground truth. I chose a slightly different approach: instead of generating a Bayesian network, I inferred one with one of the algorithms (I used the PC algorithm for this). For sampling, I used an idea from “Towards Robust and Versatile Causal Discovery for Business Applications” (Borboudakis et al.): a linear parametrization whose coefficients are sampled uniformly at random from the range ±[0.2, 0.8], with an added error term. In my work, the data for vertices without parents was sampled from a normal distribution whose parameters were taken from the real data for each vertex.
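
A sketch of this sampling scheme, under my reading of the description (`root_params` holds the per-vertex mean and standard deviation estimated from the real data):

```python
# Semi-synthetic sampling: linear parametrization of a DAG with coefficients
# drawn uniformly from ±[0.2, 0.8], an additive noise term, and root vertices
# sampled from normal distributions fitted to the real data.
import numpy as np
import networkx as nx

def sample_from_dag(G, n, root_params, rng=None):
    rng = rng or np.random.default_rng(0)
    weights = {e: rng.choice([-1, 1]) * rng.uniform(0.2, 0.8) for e in G.edges}
    data = {}
    for v in nx.topological_sort(G):
        pars = list(G.predecessors(v))
        if not pars:
            mu, sigma = root_params[v]                # estimated from real data
            data[v] = rng.normal(mu, sigma, size=n)
        else:
            data[v] = sum(weights[(p, v)] * data[p] for p in pars)
            data[v] = data[v] + rng.normal(size=n)    # error term
    return data
```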

In some test configurations, I hid some of the variables after sampling. In this case, the ground-truth graph is obtained by adding directed edges from the parents of each hidden variable to its children and bidirected edges between the children of each hidden variable.

But how do we evaluate graphs once we have the ground truth? I chose a method described in Robust reconstruction of causal graphical models based on conditional 2-point and 3-point information (Affeldt et al., 2015). The authors use the F1 classification metric for edges, where the positive class is the existence of an edge. A true positive (TP) appears when we correctly guess the dependency between two variables, and a false positive (FP) when the inferred dependency doesn’t occur in the ground truth. Note that this method doesn’t take the edges’ orientation into account (and if applied to directed edges directly, it would penalize a wrong direction as much as a completely wrong skeleton edge). That’s why the authors propose a new class, TP*, for edges that were inferred correctly in terms of the skeleton but not in terms of direction, and calculate TP’ = TP − TP* and FP’ = FP + TP*.
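
As a sketch, the edge-wise F1 with the TP*/FP’ adjustment can be computed like this for DAGs (the pattern-specific 0.5 increments described below are omitted):

```python
# Edge-wise F1: TP/FP/FN are counted at the skeleton level; edges with a correct
# skeleton but wrong direction are moved from TP to FP via TP*.
def edge_f1(predicted, truth, count_direction=True):
    """predicted, truth: sets of directed edges given as (tail, head) pairs."""
    skel_pred = {frozenset(e) for e in predicted}
    skel_true = {frozenset(e) for e in truth}
    tp = len(skel_pred & skel_true)
    fp = len(skel_pred - skel_true)
    fn = len(skel_true - skel_pred)
    if count_direction:
        tp_star = sum(frozenset(e) in skel_true and e not in truth
                      for e in predicted)             # right skeleton, wrong direction
        tp, fp = tp - tp_star, fp + tp_star           # TP' and FP'
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0
```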

The authors compared DAGs, but I decided to compare patterns with the ground-truth graphs. That’s why, with causal sufficiency, I added 0.5 to TP* if one direction of a pattern edge exists in the ground truth, and 1 if the direction was inferred completely wrong. Without causal sufficiency, I added 1 in every case of misorientation, because bidirected edges have different semantics there.

The last option I considered, which isn’t that useful in my field of research but could be helpful in others, is using prior knowledge. For instance, we may know some dependencies between variables, or trickier facts like “Age doesn’t depend on anything in our system of variables”. Then it’s possible to check how well the inferred graph agrees with this prior knowledge.

I tried to apply this idea in my work, but it turned out to be both more complicated and less convincing, because it’s hard to know the true dependencies between metrics of a social network. The prior knowledge was collected by polling my colleagues. Unfortunately, I had to discard this idea because the assessments were too diverse relative to the strictly limited number of respondents. Although I didn’t succeed with this approach, I believe it is still worth considering in other fields.

Results of Comparing Algorithms

Let’s start with the analysis of the semi-synthetic data. First of all, I compared the classic PC algorithm to my approach to orientation:

As you can see, the two approaches show almost the same results in terms of F1 (both with and without considering orientation). That is to be expected: taking edge directions into account significantly reduces the F1 score for both methods, and the new orientation method didn’t change anything significantly here.

Now let’s hide some variables. I decided to hide 10 out of 50 variables, because hiding more than 10 leads to high variance. We can’t compare the PC and FCI+ algorithms directly, since their outputs have different semantics: a pattern of a Bayesian network and a pattern of an inducing path graph respectively. For instance, we wouldn’t know exactly what to do with bidirected edges. So let’s see how the modification performs:

As you might’ve noticed, orientation became much better with this new approach. The F1 scores for skeletons are the same, of course, since I didn’t change anything in the algorithm for building skeletons.

There is one more thing I decided to test. When introducing the problem of the unsatisfied causal sufficiency assumption, I said that it’s impossible to infer the correct skeleton in some cases. However, in the description of the FCI algorithm, I said that most of the edges are eliminated by the PC algorithm. So I compared how the modified orientation works with skeletons obtained by the PC algorithm and by the FCI algorithm:

In practice, there’s no difference! That means you can use the much faster algorithm for building skeletons, even though it can, in theory, leave some extra edges.

I also faced some problems with likelihood estimation, because inducing path graphs can’t be evaluated with such an approach at all: their edges have somewhat different semantics than Bayesian networks. One would hope that at least Bayesian network patterns could be compared without problems. Unfortunately, the likelihood score is a non-decreasing function of the set of edges, which means the complete graph gets the highest score. However, this doesn’t invalidate our evaluation, because the modified algorithm has a parameter that indirectly adjusts the number of oriented edges by adjusting the number of colliders.

The PC algorithm is order-dependent in practice. It might be hard to believe, but it’s true: the order of the variables affects which separating set is found first. The stable version of the PC algorithm (implemented in pcalg, which I used for all of my experiments) is order-independent for skeleton building, but not for the orientation stage. That’s why the likelihood differs depending on the order of variables. I’ll once again use box plots to visualize the results.

Surprisingly, the new orientation provides much better results with even fewer edges:

This is great news, but because of the difficulties in comparing models, you should remain skeptical about the results of such comparisons. Real interventions are the most truthful way to evaluate results, but even real experiments should be run very carefully to reduce the influence of random effects.

Conclusion

I hope that you are now familiar with the key ideas of causal inference, including:

  1. Why CI can sometimes be preferred over other fundamental approaches such as feature selection.
  2. Basic concepts of causality (e.g. d-separation).
  3. Classic CI algorithms (PC, FCI) and their more recent, more practical modifications.
  4. Issues with classic algorithms and how they can be overcome.
  5. How to compare the models obtained by CI algorithms.

Hope you enjoyed this article!
