Navigating the Inconsistencies of the PC Algorithm: Insights and Improvements

ChunYu Ko
The whispers of a data analyst
5 min read · Dec 27, 2023

As data scientists and researchers delve into the complexities of causality, the PC (Peter-Clark) algorithm stands as a cornerstone in causal discovery. However, its users often encounter a puzzling phenomenon: why does the same dataset yield different results under this algorithm? This article aims to demystify this conundrum and shed light on several advancements in the PC algorithm.

At the heart of the issue is the classic PC algorithm's dependence on the order in which the input variables are processed. Studies show that the more variables a dataset contains, the more sensitive the results are to this ordering. Every algorithm that builds its skeleton with the classic PC procedure inherits this vulnerability, regardless of how it later orients the edges.

Variable order affects both the construction of the skeleton and the orientation of edges. The sequence in which pairs of variables are tested for independence determines which edges are judged irrelevant and removed, which in turn updates the skeleton and the adjacency sets used in later tests. Because independence tests are imperfect, this can lead to wrong decisions about keeping or removing edges, and hence to different skeletons. The separating sets are likewise order-dependent, so decisions about V-structures can be wrong, producing different final CPDAG (Completed Partially Directed Acyclic Graph) structures. In short, ordering affects the skeleton, the identification of V-structures, and the orientation of edges.
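All of these retain-or-remove decisions rest on a conditional-independence test. For linear-Gaussian data, the usual choice is a Fisher-z test on partial correlations. The sketch below is illustrative, not taken from gcastle: the function name and the toy chain dataset are my own, and the partial correlation is read off the inverse correlation matrix.

```python
import numpy as np
from scipy import stats

def fisher_z_independent(data, i, j, cond=(), alpha=0.05):
    """Return True if X_i and X_j test as independent given X_cond.

    Uses the partial correlation obtained from the inverse of the
    correlation matrix of the involved columns (Fisher-z test).
    """
    cols = [i, j, *cond]
    corr = np.corrcoef(data[:, cols], rowvar=False)
    prec = np.linalg.inv(corr)                           # precision matrix
    r = -prec[0, 1] / np.sqrt(prec[0, 0] * prec[1, 1])   # partial correlation
    n = data.shape[0]
    z = 0.5 * np.log((1 + r) / (1 - r)) * np.sqrt(n - len(cond) - 3)
    p_value = 2 * (1 - stats.norm.cdf(abs(z)))
    return bool(p_value > alpha)  # fail to reject => treated as independent

# Chain X -> Y -> Z: X and Z are dependent, but independent given Y
rng = np.random.default_rng(0)
x = rng.normal(size=5000)
y = 2 * x + rng.normal(size=5000)
z = 2 * y + rng.normal(size=5000)
data = np.column_stack([x, y, z])

print(fisher_z_independent(data, 0, 2))             # False: marginally dependent
print(fisher_z_independent(data, 0, 2, cond=(1,)))  # typically True: d-separated by Y
```

Because the test only controls error rates statistically, a fraction of such decisions will be wrong on any finite sample, and which wrong decisions get baked into the skeleton depends on the order in which the tests are run.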

Consider a true DAG (Directed Acyclic Graph) and data generated from it. Applying the PC algorithm after randomly reordering the variables can, for example, orient the edge (7, 2) incorrectly, fail to orient the edge (2, 0), and erroneously remove the edge (5, 2).

import networkx as nx
import numpy as np
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
from castle.datasets import DAG, IIDSimulation

# Simulate a random 10-node DAG and 2,000 linear-Gaussian observations
weighted_random_dag = DAG.erdos_renyi(n_nodes=10, n_edges=15, weight_range=(0.5, 2.0), seed=1)
dataset = IIDSimulation(W=weighted_random_dag, n=2000, method="linear", sem_type="gauss")
true_dag, data = dataset.B, dataset.X
true_network = nx.DiGraph(true_dag)
pos = nx.spring_layout(true_network)

def plotly_network(network, pos, title=None):
    # Node coordinates from the layout
    nodes = pd.DataFrame([{"x": pos[node][0], "y": pos[node][1], "label": node} for node in network.nodes()])
    # Arrow annotations: head (x, y) at the edge's target, tail (ax, ay) at its source
    edges = [(pos[v][0], pos[v][1], pos[u][0], pos[u][1]) for u, v in network.edges()]
    arrows = [go.layout.Annotation(dict(x=x, y=y, ax=ax, ay=ay, xref="x", yref="y",
                                        text="", showarrow=True, axref="x", ayref="y",
                                        arrowhead=2, arrowwidth=1, arrowcolor="rgba(140,31,40,.8)",
                                        standoff=10, startstandoff=10)) for x, y, ax, ay in edges]

    figure = px.scatter(nodes, x="x", y="y", text="label", width=600, height=400).\
        update_layout(
            xaxis=dict(showgrid=False, title="", zeroline=False, showticklabels=False),
            yaxis=dict(showgrid=False, title="", zeroline=False, showticklabels=False),
            title=title,
            template="plotly_white",
            annotations=arrows).\
        update_traces(marker=dict(size=30, color="rgba(242,242,242,1)"),
                      textfont=dict(color="rgba(4,64,64,1)"),
                      hovertemplate="%{text}")

    return figure

plotly_network(true_network, pos, "True DAG")
from castle.algorithms import PC

pc = PC()
pc.learn(data)
pc_1_network = nx.DiGraph(np.array(pc.causal_matrix))

plotly_network(pc_1_network, pos, "CPDAG by PC")

An experiment applying the classic PC algorithm to 1,000 random variable orderings of the same dataset showed substantial variability. The average True Positive Rate (TPR) of the resulting CPDAGs was only 17%, and even the best run reached just 50%.

from castle.metrics import MetricsDAG
from random import shuffle

data_test = pd.DataFrame(data)
label_text = list(data_test.columns)
pc = PC()
result_random_orders = []

for i in range(1000):
    shuffle(label_text)
    pc.learn(np.array(data_test[label_text]))
    result = MetricsDAG(pc.causal_matrix, true_dag).metrics
    result["order"] = ", ".join(str(c) for c in label_text)
    result_random_orders.append(result)

result_random_orders = pd.DataFrame(result_random_orders)
result_random_orders["id"] = [*result_random_orders.index]

colors = {"fpr": "rgba(242,102,139,.3)", "tpr": "rgba(3,166,136,.3)"}

px.scatter(result_random_orders, x="id", y=["tpr", "fpr"], width=800, height=400, marginal_y="histogram").\
    update_layout(
        xaxis=dict(title="Random Test Times", zeroline=False, tickformat=",.0f"),
        yaxis=dict(title="", zeroline=False, tickformat=".0%"),
        template="plotly_white", legend_title_text="DAG Metrics").\
    for_each_trace(lambda t: t.update(name=t.name.upper() + "%", marker=dict(color=colors[t.name])))

The ordering with the highest TPR starts at node 9, the node with the most children in the true DAG, proceeds to the isolated node 6, then to node 0, and ends at node 2.

update_orders = [i.split(", ") for i in result_random_orders.sort_values("tpr", ascending = False).head(10)["order"].tolist()]

update_order = update_orders[0]
title = "Best Order: " + " > ".join(update_order)
update_order = [int(i) for i in update_order]
nodes = pd.DataFrame([{"x": pos[node][0], "y": pos[node][1], "label": node} for node in true_network.nodes()])
# Arrowhead (x, y) at each edge's target node, tail (ax, ay) at its source
edges = [(pos[v][0], pos[v][1], pos[u][0], pos[u][1]) for u, v in true_network.edges()]

arrows = [go.layout.Annotation(dict(
x=x, y=y, ax=ax, ay=ay, xref="x", yref="y",
text="", showarrow=True, axref="x", ayref="y",
arrowhead=2, arrowwidth=1, arrowcolor="rgba(140,31,40, .8)",
standoff=10, startstandoff=10)) for x, y, ax, ay in edges]

nodes["color"] = "rgba(242,242,242,1)"

fig = go.Figure(data=[go.Scatter(x=nodes["x"], y=nodes["y"], text=nodes["label"], mode="markers+text",
marker=dict(color=nodes["color"], size=30), textfont=dict(color="rgba(4,64,64,1)"))],
layout=go.Layout(
xaxis=dict(showgrid=False, title="", zeroline=False, showticklabels=False),
yaxis=dict(showgrid=False, title="", zeroline=False, showticklabels=False),
template="plotly_white",
annotations=arrows))

frames = []
for step, node in enumerate(update_order):
    # Highlight each node in turn, following the best ordering
    nodes.loc[nodes["label"] == node, "color"] = "rgba(218,253,186,1)"
    frame = go.Frame(data=[go.Scatter(x=nodes["x"], y=nodes["y"], text=nodes["label"], mode="markers+text",
                                      marker=dict(color=nodes["color"], size=30),
                                      textfont=dict(color="rgba(4,64,64,1)"))])
    frames.append(frame)

fig.frames = frames

fig.update_layout(title = title, width=400, height=400,
updatemenus=[dict(type="buttons",
buttons=[dict(label="Play", method="animate",
args=[None, {"frame": {"duration": 500, "redraw": True},
"fromcurrent": True,
"transition": {"duration": 300, "easing": "quadratic-in-out"}
}
]
)]
)]
)

fig.show()

Several methods have been proposed to improve the PC algorithm:

  • PC-stable: Rather than removing an edge as soon as a conditional independence is found, the algorithm completes all independence tests at the current level (conditioning-set size), and only then removes every flagged edge at once, which makes the estimated skeleton independent of the variable ordering.
  • Conservative PC (CPC) / Majority-rule PC (MPC): For an unshielded triple X–Z–Y, all separating sets of X and Y are collected. CPC orients a V-structure only when Z appears in none of them (and marks the triple ambiguous when Z appears in some but not all); MPC orients a V-structure when Z appears in fewer than half of them.
  • L-method: Edge directions are not determined by a single processing order; instead, all possible orientations are enumerated, contested edges are left undirected, and only uncontested edges are oriented. L-CPC and L-MPC apply the L-method after CPC and MPC, respectively.
  • Fast Causal Inference (FCI) / Really Fast Causal Inference (RFCI): After forming the skeleton and identifying V-structures, FCI performs additional independence tests on non-adjacent nodes to correct the skeleton. RFCI avoids that expensive stage and instead adds extra rules and independence tests when determining V-structures and edge directions, to ensure the validity of the retained edges.
  • CRFCI-stable / MRFCI-stable: These methods are based on the PC-Stable version of CPC and FCI, and the PC-Stable version of MPC and RFCI, respectively.
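The deferred-removal idea behind PC-stable can be sketched in a few lines. The code below is a toy illustration, not the gcastle implementation: the partial-correlation threshold test and all function names are invented for clarity, and conditioning sets are drawn only from one endpoint's neighbours.

```python
import itertools
import numpy as np

def partial_corr_indep(data, i, j, S, threshold=0.1):
    """Toy CI test: |partial correlation of X_i, X_j given X_S| < threshold."""
    prec = np.linalg.inv(np.corrcoef(data[:, [i, j, *S]], rowvar=False))
    return abs(-prec[0, 1] / np.sqrt(prec[0, 0] * prec[1, 1])) < threshold

def prune_level(data, adj, level, ci_test, stable=True):
    """One level of PC skeleton pruning (conditioning sets of size `level`).

    Classic PC shrinks `adj` as soon as a test accepts independence, so later
    tests at the same level see a graph that depends on the visiting order.
    PC-stable freezes the adjacency sets for the whole level and applies all
    removals afterwards, making each level order-independent.
    """
    frozen = {v: set(adj[v]) for v in adj} if stable else adj
    removals = []
    for i, j in itertools.combinations(sorted(adj), 2):
        if j not in adj[i]:
            continue
        for S in itertools.combinations(sorted(frozen[i] - {j}), level):
            if ci_test(data, i, j, list(S)):
                if stable:
                    removals.append((i, j))   # defer the removal
                else:
                    adj[i].discard(j); adj[j].discard(i)
                break
    for i, j in removals:
        adj[i].discard(j); adj[j].discard(i)
    return adj

# Chain 0 -> 1 -> 2: the spurious 0-2 edge should disappear at level 1
rng = np.random.default_rng(1)
x = rng.normal(size=3000)
y = 2 * x + rng.normal(size=3000)
z = 2 * y + rng.normal(size=3000)
data = np.column_stack([x, y, z])

adj = {v: {0, 1, 2} - {v} for v in range(3)}
for level in (0, 1):
    adj = prune_level(data, adj, level, partial_corr_indep, stable=True)
print(adj)  # the 0-2 edge is gone: {0: {1}, 1: {0, 2}, 2: {1}}
```

On this three-node chain both modes agree, but on larger graphs running `stable=False` under different variable orders can produce different skeletons, which is exactly the inconsistency described above.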

By understanding and implementing these advancements, data scientists can navigate and mitigate the inherent inconsistencies of the PC algorithm, paving the way for more accurate and reliable causal discovery.
