Deciphering Causal Relationships in Data: A Journey with the PC Algorithm and NetworkX

Published in

The whispers of a data analyst

4 min readDec 26, 2023

Introduction

In the realm of data science, unraveling the intricate web of causal relationships in datasets is a formidable challenge. The PC Algorithm, a statistical method designed for causal inference, serves as a powerful tool in this quest. This article delves into the application of the PC Algorithm and NetworkX to extract possible causal relationships from data, using the Wine dataset from the UC Irvine Machine Learning Repository as a case study.

Navigating the Intricacies of Causal Inference: My Journey with SCM and Beyond

medium.com

Understanding the PC Algorithm

The PC Algorithm swiftly eliminates irrelevant links in datasets, employing principles of collider and Directed Acyclic Graphs (DAGs) to unearth potential causal connections. However, the resulting causal graph is often a Completed Partially Directed Acyclic Graph (CPDAG), characterized by edges with undetermined directions, rather than a definitive DAG.

Implementation and Challenges

In my approach, I generate a graph using the PC Algorithm and then utilize NetworkX’s is_directed and is_directed_acyclic_graph functions to ascertain if the graph qualifies as a DAG. More often than not, we don't directly obtain a DAG. To resolve indeterminate edges, we can rely on manual intervention or prior knowledge, though this demands additional effort.

Exploring the Wine Dataset with PC Algorithm

The exploration begins with the Wine dataset, comprising 178 entries and 15 distinct features like Alcohol, Malic Acid, and Flavanoids.

UCI Machine Learning Repository

Discover datasets around the world!

archive.ics.uci.edu

import utils, pc_algorithm, random, copy, statistics
import pandas as pd
import numpy as np
import networkx as nx
from itertools import product
from pgmpy.estimators import BicScore

data = utils.get_data_from_ucirepo(109)
data.info()

# <class 'pandas.core.frame.DataFrame'>
# RangeIndex: 178 entries, 0 to 177
# Data columns (total 15 columns):
#  #   Column                        Non-Null Count  Dtype  
# ---  ------                        --------------  -----  
#  0   Alcohol                       178 non-null    float64
#  1   Malicacid                     178 non-null    float64
#  2   Ash                           178 non-null    float64
#  3   Alcalinity_of_ash             178 non-null    float64
#  4   Magnesium                     178 non-null    int64  
#  5   Total_phenols                 178 non-null    float64
#  6   Flavanoids                    178 non-null    float64
#  7   Nonflavanoid_phenols          178 non-null    float64
#  8   Proanthocyanins               178 non-null    float64
#  9   Color_intensity               178 non-null    float64
#  10  Hue                           178 non-null    float64
#  11  0D280_0D315_of_diluted_wines  178 non-null    float64
#  12  Proline                       178 non-null    int64  
#  13  class_2                       178 non-null    int64  
#  14  class_3                       178 non-null    int64  
# dtypes: float64(11), int64(4)
# memory usage: 21.0 KB

labels = data.columns.tolist()
data = np.array(data)

Utilizing a custom PC Algorithm, I generate a CPDAG. For an enhanced experience, I suggest using an open-source version of the PC Algorithm, such as huawei-noah/trustworthyAI, which offers greater customization.

GitHub - huawei-noah/trustworthyAI: trustworthy AI related projects

trustworthy AI related projects. Contribute to huawei-noah/trustworthyAI development by creating an account on GitHub.

github.com

graph, sepetated_set = pc_algorithm.pc_get_graph(data)
cpdag_graph = pc_algorithm.pc_apply_rules(graph, sepetated_set)

Visualizing and Analyzing the Graph

The CPDAG is then transformed into a network graph using NetworkX’s DiGraph. In this representation, edges with undetermined directions are marked in red, highlighting the graph’s transition from CPDAG to a non-DAG state.

network_graph = nx.DiGraph(cpdag_graph)
pos = nx.kamada_kawai_layout(network_graph)
print("Is Directed?: " + str(nx.is_directed(network_graph)))
# Is Directed?: True
print("Is DAG?: " + str(nx.is_directed_acyclic_graph(network_graph)))
# Is DAG?: False

undirected_edges = {frozenset(i) for i in network_graph.edges() if network_graph.has_edge(*i[::-1])}
edge_colors = ["red" if frozenset(edge) in undirected_edges else "gray" for edge in network_graph.edges()]

nx.draw(network_graph, pos,
        edge_color = edge_colors,
        width=1,
        linewidths = 1,
        node_size = 600,
        font_size = 6,
        node_color="pink",
        alpha=0.9,
        labels=dict(zip(range(len(labels)), labels)))

When the number of such edges is manageable, we can enumerate all possible DAGs from the CPDAG. In this instance, from 128 potential configurations, 60 valid DAGs emerged.

possible_edges = list(product([0, 1], repeat = len(undirected_edges)))
print("All posiible graphs: " + str(len(possible_edges)))
# All posiible graphs: 128

dags = []

for i in possible_edges:
    
    network = nx.DiGraph(cpdag_graph)
    
    for j, k in enumerate(i):
        
        l = [*list(undirected_edges)[j]][0]
        m = [*list(undirected_edges)[j]][1]
        
        network.remove_edge(l, m)
        network.remove_edge(m , l)
        
        if k == 0:
            network.add_edge(l, m)
        else:
            network.add_edge(m , l)
    
    if nx.is_directed_acyclic_graph(network):
        
        dags.append(network)

print("All DAGs: " + str(len(dags))
# All DAGs: 60

Selecting the Optimal DAG

The next step involves identifying the best-fitting DAG among these possibilities. This is achieved using the Bayesian Information Criterion (BIC), calculated using pgmpy’s BicScore. The DAG with the lowest BIC score is likely the most accurate representation of our dataset’s underlying causal structure.

bic_calculator = BicScore(pd.DataFrame(data))
bics = [bic_calculator.score(i) for i in dags]

nx.draw(dags[bics.index(min(bics))], pos,
        edge_color = edge_colors,
        width=1,
        linewidths = 1,
        node_size = 600,
        font_size = 6,
        node_color = "pink",
        alpha=0.9,
        labels=dict(zip(range(len(labels)), labels)))

Conclusion

Through this exploration, we demonstrate how combining the PC Algorithm with NetworkX can effectively identify potential causal relationships in complex datasets. The process, though intricate, opens up new avenues for understanding the hidden dynamics within our data, providing invaluable insights for data scientists and researchers.