Deciphering Causal Relationships in Data: A Journey with the PC Algorithm and NetworkX

ChunYu Ko
The whispers of a data analyst
4 min readDec 26, 2023

Introduction

In the realm of data science, unraveling the intricate web of causal relationships in datasets is a formidable challenge. The PC Algorithm, a statistical method designed for causal inference, serves as a powerful tool in this quest. This article delves into the application of the PC Algorithm and NetworkX to extract possible causal relationships from data, using the Wine dataset from the UC Irvine Machine Learning Repository as a case study.

Understanding the PC Algorithm

The PC Algorithm swiftly eliminates irrelevant links in datasets, employing principles of collider and Directed Acyclic Graphs (DAGs) to unearth potential causal connections. However, the resulting causal graph is often a Completed Partially Directed Acyclic Graph (CPDAG), characterized by edges with undetermined directions, rather than a definitive DAG.

Implementation and Challenges

In my approach, I generate a graph using the PC Algorithm and then utilize NetworkX’s is_directed and is_directed_acyclic_graph functions to ascertain if the graph qualifies as a DAG. More often than not, we don't directly obtain a DAG. To resolve indeterminate edges, we can rely on manual intervention or prior knowledge, though this demands additional effort.

Exploring the Wine Dataset with PC Algorithm

The exploration begins with the Wine dataset, comprising 178 entries and 15 distinct features like Alcohol, Malic Acid, and Flavanoids.

import utils, pc_algorithm, random, copy, statistics
import pandas as pd
import numpy as np
import networkx as nx
from itertools import product
from pgmpy.estimators import BicScore

data = utils.get_data_from_ucirepo(109)
data.info()

# <class 'pandas.core.frame.DataFrame'>
# RangeIndex: 178 entries, 0 to 177
# Data columns (total 15 columns):
# # Column Non-Null Count Dtype
# --- ------ -------------- -----
# 0 Alcohol 178 non-null float64
# 1 Malicacid 178 non-null float64
# 2 Ash 178 non-null float64
# 3 Alcalinity_of_ash 178 non-null float64
# 4 Magnesium 178 non-null int64
# 5 Total_phenols 178 non-null float64
# 6 Flavanoids 178 non-null float64
# 7 Nonflavanoid_phenols 178 non-null float64
# 8 Proanthocyanins 178 non-null float64
# 9 Color_intensity 178 non-null float64
# 10 Hue 178 non-null float64
# 11 0D280_0D315_of_diluted_wines 178 non-null float64
# 12 Proline 178 non-null int64
# 13 class_2 178 non-null int64
# 14 class_3 178 non-null int64
# dtypes: float64(11), int64(4)
# memory usage: 21.0 KB

labels = data.columns.tolist()
data = np.array(data)

Utilizing a custom PC Algorithm, I generate a CPDAG. For an enhanced experience, I suggest using an open-source version of the PC Algorithm, such as huawei-noah/trustworthyAI, which offers greater customization.

graph, sepetated_set = pc_algorithm.pc_get_graph(data)
cpdag_graph = pc_algorithm.pc_apply_rules(graph, sepetated_set)

Visualizing and Analyzing the Graph

The CPDAG is then transformed into a network graph using NetworkX’s DiGraph. In this representation, edges with undetermined directions are marked in red, highlighting the graph’s transition from CPDAG to a non-DAG state.

network_graph = nx.DiGraph(cpdag_graph)
pos = nx.kamada_kawai_layout(network_graph)
print("Is Directed?: " + str(nx.is_directed(network_graph)))
# Is Directed?: True
print("Is DAG?: " + str(nx.is_directed_acyclic_graph(network_graph)))
# Is DAG?: False
undirected_edges = {frozenset(i) for i in network_graph.edges() if network_graph.has_edge(*i[::-1])}
edge_colors = ["red" if frozenset(edge) in undirected_edges else "gray" for edge in network_graph.edges()]

nx.draw(network_graph, pos,
edge_color = edge_colors,
width=1,
linewidths = 1,
node_size = 600,
font_size = 6,
node_color="pink",
alpha=0.9,
labels=dict(zip(range(len(labels)), labels)))

When the number of such edges is manageable, we can enumerate all possible DAGs from the CPDAG. In this instance, from 128 potential configurations, 60 valid DAGs emerged.

possible_edges = list(product([0, 1], repeat = len(undirected_edges)))
print("All posiible graphs: " + str(len(possible_edges)))
# All posiible graphs: 128

dags = []

for i in possible_edges:

network = nx.DiGraph(cpdag_graph)

for j, k in enumerate(i):

l = [*list(undirected_edges)[j]][0]
m = [*list(undirected_edges)[j]][1]

network.remove_edge(l, m)
network.remove_edge(m , l)

if k == 0:
network.add_edge(l, m)
else:
network.add_edge(m , l)

if nx.is_directed_acyclic_graph(network):

dags.append(network)

print("All DAGs: " + str(len(dags))
# All DAGs: 60

Selecting the Optimal DAG

The next step involves identifying the best-fitting DAG among these possibilities. This is achieved using the Bayesian Information Criterion (BIC), calculated using pgmpy’s BicScore. The DAG with the lowest BIC score is likely the most accurate representation of our dataset’s underlying causal structure.

bic_calculator = BicScore(pd.DataFrame(data))
bics = [bic_calculator.score(i) for i in dags]

nx.draw(dags[bics.index(min(bics))], pos,
edge_color = edge_colors,
width=1,
linewidths = 1,
node_size = 600,
font_size = 6,
node_color = "pink",
alpha=0.9,
labels=dict(zip(range(len(labels)), labels)))

Conclusion

Through this exploration, we demonstrate how combining the PC Algorithm with NetworkX can effectively identify potential causal relationships in complex datasets. The process, though intricate, opens up new avenues for understanding the hidden dynamics within our data, providing invaluable insights for data scientists and researchers.

--

--

ChunYu Ko
The whispers of a data analyst

Work is data, and hobby is also data, but I yearn for my roommate's two cats, lazily lounging at the doorway.