**A Primer on Explainers, Explanations, and their Metrics in PyG: or how to explain explanations**

By Samy Cherfaoui, Eric Tang as part of the Stanford CS224W Course Project

*A full Colab with our code and additional commentary can be viewed **here**.*

Explainability is a relatively novel field in the world of graph neural networks. However, it is an important one nonetheless. As GNNs get increasingly larger and more complex, it becomes more and more intractable to understand how exactly a model makes its predictions. Sometimes, it is unclear how a prediction is being made and whether or not the model simply made a “lucky” guess or actually incorporated relevant information from other nodes as expected. Other times, we would simply like a level of model transparency to increase model fairness and trust. To that end, the PyG team has developed a suite of explainer algorithms which seek to produce explanations given a model, a dataset, and a specific node index. There are some excellent tutorials on how to generate explanations already so we will not dive into this particular facet of the pipeline in depth [1, 2, 3]. Instead, we will focus on how to evaluate and assess explainer algorithms and their explanations; after all, there is no use in utilizing an explainer algorithm if we have no way of determining whether or not the explanation itself satisfies some threshold of correctness. Therefore, the agenda for our tutorial is as follows:

- First, we will motivate explainers and explanations by implementing a simple model on a small homogeneous graph dataset. We will use this to show how explanations are constructed, how to interpret them, and then how to visualize their outputs.
- We will then compare and contrast different explainer algorithms and their metrics on explanations on our models. We will explain what each algorithm is, what each metric is, and how each explainer performs on a given metric with respect to a dummy baseline.
- We will show how to write your own metric to evaluate explainers.
- Finally, we will show how to create synthetic datasets, which many explainer designers use to test newly constructed explainer algorithms and metrics. We will test our new metric from Step 3 on a prediction using this synthetic dataset.

Let’s get started!

*Explainers Motivation*

In order to proceed into explanation metrics, we must first motivate why explainers and explanations are even useful and where they lie in the traditional ML pipeline.

In the above simplified ML pipeline, we say that the proposed new ‘explain’ phase is a post-prediction phase. However, we can use these explanations to loop back into multiple previous phases. If we have thoroughly validated our model, our explanations can be used as additional markers for trust and transparency every time we generate a prediction. We can also use explanations to debug our models if we notice odd explanatory subgraphs (the subgraph which explains the predictions the most). We can use this data to inform tweaks to our model to retrain. We may even have to go back to the Pre-Process step if we learn some irrelevant feature has an outsized influence on our predictions. In that case, we may want to tweak our input features. Here is how the internals of this Explain box look:

There are four core classes that the PyG team has constructed to address each phase of this process. We can visualize how these specific classes interconnect in PyG with the following flowchart:

The light blue boxes represent the core classes, the arrows represent parameters to a given class, and the green circles represent additional information to provide. All that is left to do now is to explain what exactly *are* explanations? Are they text summaries? Are they some numbers we need to interpret? In reality, these explanations consist of a series of node masks (which mask out nodes or their features) and edge masks (which mask out edges or their features). These masks zero out unimportant node/edge features, nodes, or edges depending on the parameters you specify. This image from [1] presents a concise depiction of the available masking parameters:

By masking out unessential features or entire vectors themselves, we are able to better understand what structures or features make up the predictions in our model. We can then use Explanation’s visualize features to apply these masks to either networkx or graphviz. Now, let’s show how to actually implement this functionality in practice!

*Explainer Implementation*

First off, we need a simple model and dataset. We chose a 2-layer GAT network as our playground testing model. Two layers make the model sufficiently complex yet not so large as to make the visualizations too dense (recall that a k-layer GNN utilizes information from its k-hop neighborhood, so as k goes up, the larger the explanation subgraph becomes). We chose specifically to base this model on GATConv because one ExplainerAlgorithm, AttentionExplainer, specifically uses attention coefficients to generate explanations, and we wanted to be able to test that explainer in this tutorial as well. Each GATConv will have 3 heads to resemble a real production-ready GAT. We also added ReLU and dropout layers after the first GATConv to increase robustness and expressivity. Our end goal is node classification, so we added a log_softmax layer after our final GATConv to output log probabilities.

As our testing dataset, we chose the Planetoid CORA dataset. We chose CORA because it is a small, homogeneous dataset primarily used for node classification. The CORA dataset consists of 2708 scientific publications, and we aim to predict which one of seven categories each publication belongs to. Each of the 5429 links represents a citation between two papers. Each node is represented by a 1433-dimensional binary vector, where the i-th feature is 1 if the i-th word in the dictionary appears in the paper and 0 otherwise. Right off the bat, we can see how feature masks and edge masks can be useful: a feature mask can tell us which words in the publication specifically influenced the prediction, and an edge mask will tell us which other publications factored the most into our predictions.

We considered using ogbn-arxiv, which is a similar dataset but with 169,343 nodes and over a million edges. However, we learned that trying to extract one explanation from this dataset can take hours to days in a traditional Colab environment. Therefore, it is important to remember to use a smaller dataset or some sort of clever sampling or subgraph strategy to use the explainability features. Generating explanations on larger datasets is an ongoing research area. We also specifically focused on homogeneous graphs because, as of the time of this writing, only one explainer (CaptumExplainer) offers support for heterogeneous data. We will briefly discuss how to generate explanations on HeteroData objects but we will limit our overall analysis to homogeneous datasets.
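To make the idea of a subgraph strategy concrete, here is a rough sketch of extracting a k-hop neighborhood in plain Python so an explainer only has to run on the induced subgraph. The `k_hop_nodes` helper is our own illustration for this tutorial (PyG ships a `k_hop_subgraph` utility that does this properly on `edge_index` tensors):

```python
from collections import deque

def k_hop_nodes(edges, node, k):
    """Return the set of nodes within k hops of `node`, given a list of
    (source, target) pairs. We treat edges as undirected here, mirroring
    how a k-layer GNN gathers messages from its k-hop neighborhood."""
    # Build an adjacency list once up front.
    adj = {}
    for src, dst in edges:
        adj.setdefault(src, set()).add(dst)
        adj.setdefault(dst, set()).add(src)
    # Standard BFS, stopping after k levels.
    seen = {node}
    frontier = deque([(node, 0)])
    while frontier:
        current, depth = frontier.popleft()
        if depth == k:
            continue
        for neighbor in adj.get(current, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                frontier.append((neighbor, depth + 1))
    return seen

# A small path graph 0-1-2-3-4: the 2-hop neighborhood of node 0 is {0, 1, 2}.
edges = [(0, 1), (1, 2), (2, 3), (3, 4)]
print(k_hop_nodes(edges, 0, 2))  # → {0, 1, 2}
```

Explaining a node on this much smaller neighborhood is equivalent for a k-layer model, since everything outside the k-hop neighborhood cannot influence the prediction anyway.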

Let us start by downloading the necessary packages. As of the time of this writing, you will need to install PyG from source on Github since the last supported version on PyPi is 2.2 which does not support many of the exciting new explainer features we will discuss below.

```
# Let us first install PyG.
!pip install torch-scatter -f https://pytorch-geometric.com/whl/torch-1.13.1+cu116.html
!pip install torch-sparse -f https://pytorch-geometric.com/whl/torch-1.13.1+cu116.html
# Be aware that we need to install PyG from source since the latest version
# that supports explainability is not available on pip yet.
!pip install git+https://github.com/pyg-team/pytorch_geometric.git
!pip install ogb
```


The next few blocks of code should be fairly standard if you have experience with developing a model via PyTorch. You can also follow along via our colab. To create our model:

```python
import torch
import torch.nn.functional as F
from torch_geometric.datasets import Planetoid
from torch_geometric.nn.conv.gat_conv import GATConv
from torch_geometric.explain import Explanation
import matplotlib.pyplot as plt
import pandas as pd

dataset = 'Cora'  # We want to use the CORA dataset.
dataset = Planetoid('.', dataset)  # There is only one graph in the CORA dataset.
data = dataset[0]
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
data = data.to(device)

class GAT(torch.nn.Module):
    def __init__(self, hidden_dim=32, heads=3):
        super().__init__()
        # Recall that our architecture is a 2-layer GATConv network.
        # We will use 3 attention heads though you may use any head number
        # that you wish.
        self.conv1 = GATConv(dataset.num_features, hidden_dim, heads)
        self.conv2 = GATConv(hidden_dim * heads, dataset.num_classes, 1)

    """
    We apply ReLU and dropout after our first GATConv layer to make our results
    more expressive and robust. We then apply our second GATConv layer and finally
    apply a log softmax layer to output log probabilities across the 7 CORA
    classes.
    """
    def forward(self, x, edge_index):
        x = F.relu(self.conv1(x, edge_index))
        x = F.dropout(x, training=self.training)
        x = self.conv2(x, edge_index)
        return F.log_softmax(x, dim=1)

model = GAT().to(device)
# We will use the Adam optimizer.
optimizer = torch.optim.Adam(model.parameters(), lr=0.01, weight_decay=5e-4)
```

To train the model:

```python
# Setup Dataloader
from torch_geometric.data import DataLoader

train_loader = DataLoader(dataset, batch_size=32, shuffle=False)

# Training Loop
losses = []
for epoch in range(1, 250):
    total_loss = 0
    model.train()
    for batch in train_loader:
        optimizer.zero_grad()
        pred = model(batch.x, batch.edge_index)
        label = batch.y
        # Node classification: only train on nodes in the train set.
        pred = pred[batch.train_mask]
        label = label[batch.train_mask]
        loss = F.nll_loss(pred, label)
        loss.backward()
        optimizer.step()
        total_loss += loss.item() * batch.num_graphs
    total_loss /= len(train_loader.dataset)
    losses.append(total_loss)
```

To test the model:

```python
# Get Test dataloader
test_loader = DataLoader(dataset, batch_size=32, shuffle=False)

model.eval()
correct = 0
# Note: we iterate with a `batch` variable rather than `data` so we do not
# shadow the global `data` object, which we still need for the explainers below.
for batch in test_loader:
    with torch.no_grad():
        # max(dim=1) returns a (values, indices) tuple; we only need the indices.
        pred = model(batch.x, batch.edge_index).max(dim=1)[1]
        label = batch.y
        # Node classification: only evaluate on nodes in the test set.
        pred = pred[batch.test_mask]
        label = label[batch.test_mask]
        correct += pred.eq(label).sum().item()

total = 0
for graph in test_loader.dataset:
    total += torch.sum(graph.test_mask).item()
print(f"Test set accuracy: {correct / total}")
```

This has all been fairly standard so far. If you follow the steps above, you should have a model that hovers around an 80% test set accuracy. However, now we finally add our explainers:

```python
from torch_geometric.explain import Explainer, Explanation, GNNExplainer
from torch_geometric.explain.config import ExplainerConfig, ModelMode

# Here, we initialize our explainer.
explainer = Explainer(
    # It takes in your model.
    model=model,
    # You also need to pass in what ExplainerAlgorithm you would like to use.
    # In this case, we used GNNExplainer for this demo. You can optionally
    # provide an epochs parameter (like we do here) to tell the explainer how
    # many epochs you would like to iterate over.
    algorithm=GNNExplainer(epochs=200),
    # The explanation_type is either 'model' or 'phenomenon'.
    # 'model' explains your model's prediction whereas 'phenomenon' explains
    # the phenomenon that the model is trying to predict by using the target
    # output instead of the model's predicted output.
    explanation_type='model',
    # The type of node mask to apply. None will do nothing, 'attributes' will
    # mask features on a per-node basis, 'common_attributes' will mask features
    # across all nodes, and 'object' will mask entire nodes (good for structural
    # explanations). In this case, we want our node mask to identify the most
    # important features for a given node index.
    node_mask_type='attributes',
    # Accepts the same options as node_mask_type but for edges. For our dataset,
    # edges do not have features, so we will use 'object' to instead obtain a
    # structural explanation of which edges were incorporated into our prediction.
    edge_mask_type='object',
    # Model config to specify what type of task we are running: in this case,
    # 7-class node classification using log probabilities. Recall that we used
    # log softmax in our model so our return type is 'log_probs' here.
    model_config=dict(mode='multiclass_classification', task_level='node',
                      return_type='log_probs'),
)

# Specify the index of the node we want to make a prediction/explanation over.
node_index = 10

# In order to generate explanations for heterogeneous data, all we need to do
# is have the type of `data` be HeteroData! All the steps above stay the same,
# but instead of an Explanation we will now get a HeteroExplanation.
# Keep in mind, however, that we will need to use a CaptumExplainer (we cannot
# use GNNExplainer for heterogeneous data).
explanation = explainer(data.x, data.edge_index, index=node_index)

# This print statement lists the masks generated by the explanation.
print(f'Generated explanations in {explanation.available_explanations}')

# Here we construct a visualization of the top 10 most important features.
# Keep in mind that we can also call explanation.visualize_feature_importance(top_k=10).
# However, we explicitly write out the code here for those on PyTorch Geometric 2.2,
# which does not support the above method.
node_mask = explanation.node_mask  # you may need explanation.node_feat_mask on PyG 2.2
feat_importance = node_mask.sum(dim=0).cpu().numpy()
feat_labels = range(feat_importance.shape[0])
df = pd.DataFrame({'feat_importance': feat_importance}, index=feat_labels)
df = df.sort_values("feat_importance", ascending=False)
df = df.round(decimals=3)
df = df.head(10)
title = f"Feature importance for top {len(df)} features"
ax = df.plot(
    kind='barh',
    figsize=(10, 7),
    title=title,
    ylabel='Feature label',
    xlim=[0, float(feat_importance.max()) + 0.3],
    legend=False,
)
plt.gca().invert_yaxis()
ax.bar_label(container=ax.containers[0], label_type='edge')
plt.show()
```

Notice here how we were able to construct a node mask that tells us which features were most important in making this prediction! We learned that word 1263 in the dictionary was most influential in determining the prediction for node 10. Notice in our comments above how we can replicate this process for heterogeneous data; it is the exact same procedure! We just need to be sure we are using CaptumExplainer for heterogeneous datasets. Let us now visualize the explanatory subgraph.

```python
# Visualize the explanatory subgraph. The image is saved to the file path given
# as the first parameter. There are two supported backends: graphviz and
# networkx, but we found graphviz to be more aesthetically pleasing.
explanation.visualize_graph('subgraph_10.png', backend="graphviz")
```

And here is the resulting subgraph:

As a refresher on graph neural networks, recall that a node’s prediction is determined by aggregating state information from its k-hop neighborhood (given a model with k layers). However, without tracing through our computation graphs and having knowledge of the weights in our models, it is extremely difficult to see what specific edges in this neighborhood contributed the most to our prediction. Here, we immediately can view this through our explanations: edges with higher opacity correspond to more influence in the prediction. It is interesting to note that the node we make predictions over (node 10) influences itself by passing its own state to its neighbors, who then incorporate this information into their own state in the next round before passing it back. We then re-ran the above code for another node index 476 and took a look at the explanatory subgraph:

Here, we notice one drawback of explanation visualizations. If the k-hop neighborhood of a node is quite large, as is the case with node 476, the visualization becomes almost useless since it is so difficult to read. However, recall that these visualizations come from our node and edge masks: we can perform operations on these masks to keep only the most influential edges rather than every edge with nonzero influence, as visualize_graph does. A possible code contribution to PyG could be modifying visualize_graph to take in a top_k parameter and only draw the subgraph induced by the top k edges by influence.
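As a sketch of what such a top_k filter might look like, here is a plain-Python version that keeps only the k highest-scoring edges from an edge mask. The `top_k_edges` helper is our own illustration, not an existing PyG API:

```python
def top_k_edges(edge_index, edge_mask, k):
    """Keep only the k most influential edges according to an edge mask.
    `edge_index` is a list of (source, target) pairs and `edge_mask` holds
    one importance score per edge, as in an Explanation's edge_mask."""
    # Rank edge positions by importance, highest first.
    ranked = sorted(range(len(edge_mask)), key=lambda i: edge_mask[i], reverse=True)
    # Keep the top k positions, restoring their original order.
    keep = sorted(ranked[:k])
    return [edge_index[i] for i in keep], [edge_mask[i] for i in keep]

edges = [(0, 1), (1, 2), (2, 3), (0, 3)]
mask = [0.9, 0.1, 0.7, 0.05]
print(top_k_edges(edges, mask, 2))  # → ([(0, 1), (2, 3)], [0.9, 0.7])
```

Drawing only the subgraph induced by the surviving edges would keep the visualization legible even for hub nodes like 476.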

Here, we have seen the usefulness of explanations. However, what use are explanations if we have no way of evaluating their effectiveness? And how would these explanations be different if we used a different type of explainer algorithm? In the next section, we will compare and contrast the existing explainer algorithms against all the evaluation metrics PyG has to offer.

*Battle Royale: Comparing and Contrasting Explainer Algorithms*

We provide visualizations of the outputs of each of these explainers as well as their metric evaluations on a sample node index in our colab.

**DummyExplainer**

DummyExplainer is an explainer that simply returns random values for node and edge masks. We can use it as a baseline for metrics when we compare and contrast different explainer algorithms.

**GNNExplainer**[5]

GNNExplainer is an explainer that takes in a GNN model and a node prediction or set of node predictions as input, and outputs a compact subgraph of the GNN’s computational graph that explains the given node predictions.

GNNExplainer additionally generates a node feature mask that masks out unimportant features. Both of these are learned via maximization of the mutual information objective, which empirically is done by learning masks to be applied to the adjacency matrices of computational graphs for predictions, and the node level features of the subgraphs of the computational graphs.
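Concretely, the objective from [5] selects the subgraph $G_S$ and feature subset $X_S$ that maximize mutual information with the model's prediction $Y$:

$$\max_{G_S}\; MI\big(Y, (G_S, X_S)\big) = H(Y) - H\big(Y \mid G = G_S,\, X = X_S\big)$$

Since $H(Y)$ is fixed once the model is trained, this amounts to minimizing the conditional entropy of the prediction given the candidate explanation.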

**CaptumExplainer**[6]

Captum is a library for PyTorch that supports interpretability of models. It consists of primary attribution, layer attribution, and neuron attribution algorithms. Currently in PyTorch Geometric, the algorithms supported for explainability are all primary attribution methods, which determine the contribution of each input feature to the output of the model. The following primary attribution methods are currently supported to be used in an Explainer module:

- Integrated Gradients: represents the integral of gradients along the model from output to input
- Saliency: returns the gradient of the output with respect to the input
- Input X Gradient: an extension of saliency that returns the gradient of the output with respect to the input multiplied by the input feature values
- Deconvolution: the gradient of the output with respect to the input is calculated, but with backprop of ReLU functions overridden by applying ReLU to the output gradients and directly back propagated.
- Guided Backpropagation: similar to deconvolution, but the ReLU function is applied to the input gradients
- Shapley Value Sampling: Shapley values are generally computed by taking each permutation of input features and adding them one by one to a given baseline to compute their contribution to the output. Since this method of attribution is computationally expensive, Shapley value sampling samples random permutations, and averages the marginal contribution of each feature based on these samples. We noticed this attribution method was quite slow still (with a feature vector size of 1433) so use with caution.
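To make the sampling idea behind Shapley Value Sampling concrete, here is a minimal, self-contained sketch over a generic value function. This is our own illustration of the technique, not Captum's implementation, and the helper names are our own:

```python
import random

def shapley_sampling(value_fn, features, num_samples=200, seed=0):
    """Estimate each feature's Shapley value by sampling random permutations
    and averaging the marginal contribution of adding the feature to the
    coalition of features that precede it in the permutation."""
    rng = random.Random(seed)
    totals = {f: 0.0 for f in features}
    for _ in range(num_samples):
        perm = features[:]
        rng.shuffle(perm)
        coalition = set()
        prev = value_fn(coalition)
        for f in perm:
            coalition.add(f)
            cur = value_fn(coalition)
            totals[f] += cur - prev  # marginal contribution of f
            prev = cur
    return {f: totals[f] / num_samples for f in features}

# For an additive value function, the Shapley values recover each weight exactly.
weights = {'a': 1.0, 'b': 2.0, 'c': -0.5}
attributions = shapley_sampling(lambda s: sum(weights[f] for f in s), list(weights))
print(attributions)  # → {'a': 1.0, 'b': 2.0, 'c': -0.5}
```

Each sampled permutation requires one model evaluation per feature, which is why this method was so slow on CORA's 1433-dimensional feature vectors.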

As of the time of this writing, one drawback of CaptumExplainer is that it does not support many of the PyG-provided evaluation metrics, since it acts as a wrapper for the Captum library, which represents its outputs in ways that are unfamiliar to PyG.

**PGExplainer**[7]

PGExplainer focuses on explanations on graph structures rather than node features by learning a generative probabilistic graph model that models the underlying structure as edge distributions, where the explanatory graph is sampled. This allows for a more global view of explanations, as well as use in both transductive and inductive settings.

Since PGExplainer is a parameterized model, it requires training before being applied to produce explanations. We cannot simply reuse the explanation code from above. Here is the code we used to implement and train the explainer. Note how we do not provide node masks because they are not supported!

```python
import numpy as np

from torch_geometric.explain import PGExplainer

pg_explainer = Explainer(
    model=model,
    algorithm=PGExplainer(epochs=30, lr=0.003),
    # PGExplainer only supports the 'phenomenon' explanation type, which means
    # it only generates explanations over the expected output, not the
    # predicted output.
    explanation_type='phenomenon',
    # PGExplainer does not support node masks! This means that we cannot
    # generate the node-feature bar chart we were able to for the previous
    # explainers.
    edge_mask_type='object',
    model_config=dict(mode='multiclass_classification', task_level='node',
                      return_type='log_probs'),
)

# Unlike the previous explainers, we must train our ExplainerAlgorithm.
# This is because PGExplainer is a parametrized explainer which must learn
# parameters to determine the best explanation possible for a given input and
# output. Notice how we provide the actual target (data.y) during training.
for epoch in range(30):
    for index in torch.LongTensor(np.random.randint(0, len(data.x), 20)):
        loss = pg_explainer.algorithm.train(epoch, model, data.x, data.edge_index,
                                            target=data.y, index=index.item())
```
The PGExplainer takes in node representations and the original graph to compute the latent variables for the edge distributions, which are used to generate the sampled graph to be passed into the trained GNN model. The cross entropy of the modified prediction and the original prediction is then used to train the PGExplainer.

**AttentionExplainer**

AttentionExplainer uses the attention coefficients produced by an attention-based GNN as an edge explanation. As such, node-level explanations are not supported. More precisely, the edge mask that other explainer modules learn is, for AttentionExplainer, obtained directly as a reduction over the attention coefficients of each GAT layer. By default, max reduction is used to gather the attention coefficient for a particular edge to explain a given prediction.
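As a toy illustration of that reduction (not the actual AttentionExplainer internals), imagine we collected the attention coefficients for one edge across all layers and heads:

```python
def reduce_attention(coefficients, reduce='max'):
    """Collapse per-layer, per-head attention coefficients for a single edge
    into one importance score, mimicking in spirit how an attention-based
    edge mask is formed. `coefficients` is a list of lists: one inner list
    of head coefficients per GAT layer."""
    flat = [c for layer in coefficients for c in layer]
    if reduce == 'max':
        return max(flat)
    if reduce == 'mean':
        return sum(flat) / len(flat)
    raise ValueError(f"unknown reduction: {reduce}")

# Two layers: the first GATConv has 3 heads, the second has 1 (as in our model).
coeffs = [[0.2, 0.8, 0.4], [0.5]]
print(reduce_attention(coeffs))          # → 0.8
print(reduce_attention(coeffs, 'mean'))  # → 0.475
```

The max reduction makes an edge look important if *any* head in *any* layer attended to it strongly, which is why it is a reasonable default.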

*Metrics*

There exist a variety of metrics that can be used to evaluate the quality of an explanation. Since explanations primarily consist of node and edge masks that are used to generate explanatory subgraphs or feature subsets, metrics seek to quantify how well these subsets of the original graph are able to generate the original model prediction. The currently supported metrics in Pytorch Geometric are detailed below.

**Ground Truth Metrics**

Ground truth metrics allow you to compare an explanation’s masks with some target masks. We will see in later sections how to use synthetic datasets to construct graphs that have explicit target masks to compare against. If your dataset does have a target mask, ground truth metrics let us compare the two via accuracy, recall, precision, F1 score, and/or AUROC.
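As a plain-Python sketch of what such a comparison computes after thresholding a soft mask against a binary target (the `mask_metrics` helper below is our own illustration, not PyG's groundtruth_metrics API):

```python
def mask_metrics(pred_mask, target_mask, threshold=0.5):
    """Compare a predicted (soft) mask against a binary ground-truth mask.
    Returns accuracy, precision, recall, and F1 after thresholding."""
    pred = [1 if p >= threshold else 0 for p in pred_mask]
    tp = sum(1 for p, t in zip(pred, target_mask) if p and t)
    fp = sum(1 for p, t in zip(pred, target_mask) if p and not t)
    fn = sum(1 for p, t in zip(pred, target_mask) if not p and t)
    tn = sum(1 for p, t in zip(pred, target_mask) if not p and not t)
    accuracy = (tp + tn) / len(pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {'accuracy': accuracy, 'precision': precision, 'recall': recall, 'f1': f1}

pred = [0.9, 0.2, 0.7, 0.1]   # soft edge-mask scores from an explanation
target = [1, 0, 0, 0]         # ground-truth explanatory edges
print(mask_metrics(pred, target))
```

Here the explanation recovers the true edge (recall 1.0) but also flags one spurious edge (precision 0.5), which is exactly the kind of trade-off these metrics surface.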

**Fidelity**[8]

Fidelity evaluates the contribution of a particular subgraph to the initial prediction. For model predictions, fidelity is calculated as two values, which are calculated by either evaluating the prediction solely on the explanatory subgraph, or on the original graph with the explanatory subgraph removed. For a good explanation, we expect the former value to be high, and the latter to be low. Formally, fidelity is expressed as follows:
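For reference, the two fidelity scores from [8], for model explanations over $N$ explained predictions, can be written roughly as:

$$\mathrm{fid}_{+} = \frac{1}{N}\sum_{i=1}^{N}\Big(\mathbb{1}\big[\hat{y}_i = y_i\big] - \mathbb{1}\big[\hat{y}_i^{\,G_{C \setminus S}} = y_i\big]\Big)$$

$$\mathrm{fid}_{-} = \frac{1}{N}\sum_{i=1}^{N}\Big(\mathbb{1}\big[\hat{y}_i = y_i\big] - \mathbb{1}\big[\hat{y}_i^{\,G_S} = y_i\big]\Big)$$

where $\hat{y}_i$ is the prediction on the full graph, $\hat{y}_i^{\,G_S}$ the prediction on the explanation subgraph alone, and $\hat{y}_i^{\,G_{C \setminus S}}$ the prediction on the graph with the explanation removed.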

Calculating the fidelity score on our model using the explanation on node index 10 from above, we learned that GNNExplainer has a perfect fidelity score whereas the other explainers have a fid+ and fid- score of 0. This seems to indicate that the model generates the same predictions between the masked subgraph, the graph as a whole, and the rest of the graph minus the subgraph.

**Characterization Score**[8]

Characterization score incorporates the two values of the fidelity into a single metric as shown below.
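Following [8], the characterization score is the weighted harmonic mean of $\mathrm{fid}_{+}$ and $1 - \mathrm{fid}_{-}$:

$$\mathrm{charact} = \frac{w_{+} + w_{-}}{\dfrac{w_{+}}{\mathrm{fid}_{+}} + \dfrac{w_{-}}{1 - \mathrm{fid}_{-}}}$$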

It assigns weights to each of the fidelity scores, which are default 0.5 for both. The closer the characterization score is to 1, the better it is, since it indicates that both fidelity values are close to their desired values of 1 and 0 respectively for fid+ and fid-. Following from above, GNNExplainer had a perfect score of 1 whereas the other explainers had the worst possible score of 0 since fid+ is 0, leading to a characterization score of 0 as well.

**Fidelity Curve AUC**[8]

The Fidelity Curve AUC metric computes the area under the curve of the following expression.
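Roughly speaking, the curve whose area is computed is the ratio of the two fidelity values evaluated over a range of inputs $x$:

$$f(x) = \frac{\mathrm{fid}_{+}(x)}{1 - \mathrm{fid}_{-}(x)}$$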

This is an alternative to the characterization score for combining the two fidelity values and it allows you to determine a metric for explanations across all inputs to your model. Following from above, GNNExplainer had a perfect score of 1 whereas all the other explainers had a score of 0 (recall that the size of our tensor x is only 1 since we only ran one explanation).

**Unfaithfulness**[9]

Unfaithfulness attempts to quantify how faithful an explanation is to an underlying GNN predictor by using the KL divergence between the original prediction probability vector y and the prediction vector ŷ obtained with the explanation masks applied. Formally, it is defined as follows:
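Following [9], the unfaithfulness (GEF) score can be written as:

$$\mathrm{GEF}(y, \hat{y}) = 1 - e^{-\mathrm{KL}(y \,\|\, \hat{y})}$$

so a perfectly faithful explanation (zero KL divergence) yields a score of 0, while large divergence pushes the score toward 1.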

Unlike previous metrics, we want unfaithfulness to be close to 0, since we want the KL divergence between the two predictions to be low. In our case study, we noticed that even though GNNExplainer had the best fidelity-based scores so far, PGExplainer had the best unfaithfulness score. GNNExplainer and AttentionExplainer also both had low scores while DummyExplainer understandably performed very poorly. This shows how we need a wide variety of metrics that measure different distribution comparisons before we can assess the performance of any given explainer over another.

*Writing your own metric*

All the explanation evaluation metrics discussed above are extremely new: in fact, the oldest one dates to late 2022 (with the exception of groundtruth_metrics). There may still be better metrics out there! Luckily, PyG makes writing your own very simple. Unlike explainer algorithms, which all inherit from the base ExplainerAlgorithm class, metrics can be whatever you would like. If you have a local fork of the PyG repo, all you would need to do is put the code in `torch_geometric/explain/metric` and register it as a class under `pytorch_geometric/torch_geometric/explain/metric/__init__.py`. You may remember how, in the diagram of the explainability framework above, we noted that metrics take in an explanation and an explainer. This need not necessarily be true; however, unless your metric is a second-order metric that relies on other ones, it is best to include both the explanation and the explainer for reasons that will soon be made clear. Let us come up with a new metric called difference_metric:

```python
def difference_metric(explainer, explanation):
    """
    Let us come up with a somewhat contrived metric called the difference metric
    to motivate using Explainer and Explanation features in metrics. This metric
    is similar to fidelity, but instead of cutting out or only keeping the
    explanatory subgraph, we check where the predictions with the masks applied
    match the predictions on the full graph; the metric is the number of matching
    predictions over the total number of nodes. Ideally, this metric should be
    close to 1, since that would mean the smaller explanatory subgraph makes the
    same predictions as our full graph.
    """
    node_mask = explanation.get('node_mask')
    edge_mask = explanation.get('edge_mask')
    # Explainer has a handy feature called `get_masked_prediction` which you may
    # want to use for your own metrics. We can see here how Explainers act as a
    # wrapper around your model. In the following statement, we generate a
    # prediction given x and an edge_index, and this should output the same
    # results as model(data.x, data.edge_index).
    log_soft_with_full_graph = explainer.get_masked_prediction(
        explanation.x, explanation.edge_index)
    # However, what makes get_masked_prediction special is that we can also see
    # how the model would predict if we applied our explanation's node mask and
    # edge mask!
    log_soft_with_mask = explainer.get_masked_prediction(
        explanation.x, explanation.edge_index, node_mask, edge_mask)
    # get_target is used to come up with the final class label given our log
    # probabilities. Explainer knows how to do this because we provided the
    # return_type and the task mode and level in the Explainer.
    pred_with_full_graph = explainer.get_target(log_soft_with_full_graph)
    pred_with_mask = explainer.get_target(log_soft_with_mask)
    # See where the values are the same.
    same = pred_with_full_graph == pred_with_mask
    # Divide by the total number of nodes.
    return same.sum() / len(same)

print(f"Difference metric on our GNN Explainer: {difference_metric(explainer, explanation)}")
```

The difference metric is similar to fidelity, but instead of removing or only keeping the explanation subgraph, it compares the predictions made with the masks applied (including the node feature mask) against the predictions from the entire graph. It counts how often the two predictions agree and divides by the total number of nodes. We want a value close to 1, since that indicates the subgraph alone can produce the same predictions as the entire graph.

With this example, we can see why we would want *both *the explanation and explainer to be passed into the metric. Explanations contain information about the node and edge masks but explainers preserve information about the model itself, serving as a wrapper for the model to make predictions with and without the existence of a given mask.

*Putting it all together*

So far, we’ve seen how to explain model predictions, how to interpret and visualize these results, how to choose an explainer, and even how to evaluate the success of your explanations with either your own metrics or custom metrics. So what’s next? Well, an important thing to keep in mind is that we’ve been evaluating our explanations on one node index in one dataset. It is entirely possible that we could have generated an outlier explanation with this minimal sample size. Luckily, PyG has a set of features that enable you to create your own synthetic benchmark datasets!

PyG uses a construct known as ExplainerDataset to support developing your own synthetic data. PyG supports three right out of the box: InfectionDataset, BA2MotifDataset, and BAMultiShapesDataset. The latter two are mainly used for graph classification so let’s generate a sample InfectionDataset.

```python
# Helper function to visualize
def visualize(dataset):
    # Construct a networkx representation.
    graph_data = dataset[0]
    graph = nx.DiGraph()
    graph.add_nodes_from([node for node in range(graph_data.num_nodes)])
    graph.add_edges_from([(a.item(), b.item())
                          for a, b in zip(graph_data.edge_index[0],
                                          graph_data.edge_index[1])])
    color_map = []
    # Identify nonzero-label (infected) nodes with red.
    for node in graph_data.y:
        if node > 0:
            color_map.append('red')
        else:
            color_map.append('green')
    # Draw the graph.
    nx.spring_layout(graph)
    nx.draw(graph, node_color=color_map, node_size=400, with_labels=True)

# This is the example given from
# https://pytorch-geometric.readthedocs.io/en/latest/generated/torch_geometric.datasets.InfectionDataset.html#torch_geometric.datasets.InfectionDataset
from torch_geometric.datasets import InfectionDataset
from torch_geometric.datasets.graph_generator import ERGraph
from torch_geometric.utils import to_networkx
import networkx as nx

# We want 10 nodes total, of which 5 are infected, with an edge probability
# between arbitrary nodes of 0.4. The max path length in the graph is 3.
# InfectionDataset can also take a num_graphs argument but we used the default 1.
dataset = InfectionDataset(
    graph_generator=ERGraph(num_nodes=10, edge_prob=0.4),
    num_infected_nodes=5,
    max_path_length=3,
)

visualize(dataset)
```

Notice the level of customizability: we were able to specify the graph type, the number of nodes, the number/probability of edges, and the maximum path length. We could even have generated more than one graph if we wished. However, possibly the most powerful feature is the ability to build your own ExplainerDataset; you are not constrained to just the three datasets mentioned above!

```python
from torch_geometric.datasets import ExplainerDataset
from torch_geometric.datasets.graph_generator import GridGraph

"""
Using ExplainerDataset we can create any arbitrary synthetic dataset we wish!
ExplainerDataset takes in a graph_generator, which can currently generate either
random graphs or 2D grid graphs (though you can implement your own by extending
GraphGenerator). It also takes a motif_generator and num_motifs, which you can
use to plant a specific motif a set number of times (there is even support for
CustomMotif to import your own). It also takes num_graphs, which we leave at the
default of 1.
"""

dataset = ExplainerDataset(
    graph_generator=GridGraph(height=6, width=5),
    motif_generator='house',
    num_motifs=1,
)

visualize(dataset)
```

Here we were able to create our own grid graph with a single house motif on the side! Since you can extend GraphGenerator, MotifGenerator, and even ExplainerDataset itself, there is no limit to the ExplainerDatasets you can construct. One feature worth pointing out is that a motif generator can be built from a CustomMotif that you draw with networkx. This allows you to define unique motifs that are exclusively applicable to your specific domain.

Now that we have every building block, we can finally show the full explainer evaluation pipeline from start to finish. First, we create a synthetic dataset. Then, we train and test our model on it. We then construct an explainer from this model (in this case, we will use GNNExplainer) and generate an explanation over an arbitrary node. We can then use the difference metric we designed above to assess the quality of this explanation given our explainer!

```python
from sklearn.model_selection import train_test_split
from torch_geometric.datasets.graph_generator import BAGraph
import torch_geometric.transforms as T

# Let's use a slightly larger, less structured dataset.
dataset = ExplainerDataset(
    graph_generator=BAGraph(num_nodes=100, num_edges=1),
    motif_generator='house',
    num_motifs=80,
    # We add a transform argument to add a feature matrix for our nodes.
    transform=T.Constant(),
)

data = dataset[0].to(device)

# We need our own train/test split since ExplainerDataset doesn't
# generate its own train_mask and test_mask.
train_idx, test_idx = train_test_split(
    torch.arange(data.num_nodes), train_size=0.8, stratify=data.y)

model = GAT()
train(model, data, train_idx)
test(model, data, test_idx)

explainer = Explainer(
    model=model,
    algorithm=GNNExplainer(epochs=200),
    explanation_type='model',
    node_mask_type='attributes',
    edge_mask_type='object',
    model_config=dict(
        mode='multiclass_classification',
        task_level='node',
        return_type='log_probs',
    ),
)

node_index = 10
explanation = explainer(data.x, data.edge_index, index=node_index)
explanation.visualize_graph("final_gnn_graph_10.png", backend="graphviz")
plt.imshow(plt.imread("final_gnn_graph_10.png"))

print(f"The difference metric result of our model and explanation is: "
      f"{difference_metric(explainer, explanation)}")
```

Our difference metric outputs a low score, indicating that our explainer algorithm’s explanations are not great. However, this makes sense: we used a 2-layer GAT, but a node’s label in our synthetic graph is determined by its placement in the house motif, which requires a minimum of 3 layers (hops) to learn. Our evaluation metric was therefore extremely helpful in identifying shortcomings in our explainer (and, by extension, the model). This shows the importance of an additional explanation and evaluation pipeline during the post-prediction phase.

Through this tutorial, we hope to have empowered you to incorporate explainability into your graph ML workflow. It bears repeating that explainability in graph ML is an exciting new field; to that end, we also hope we’ve inspired you to get your hands dirty by constructing your own explainers, metrics, and synthetic datasets!

Citations:

[1] https://medium.com/@pytorch_geometric/graph-machine-learning-explainability-with-pyg-ff13cffc23c2

[2] https://arshren.medium.com/explainability-of-graph-neural-network-52e9dd43cf76

[3] https://pytorch-geometric.readthedocs.io/en/latest/tutorial/explain.html

[4] https://graphsandnetworks.com/the-cora-dataset/

[5] Ying, Rex, et al. “GNN Explainer: A tool for post-hoc explanation of graph neural networks.” *arXiv preprint arXiv:1903.03894* (2019).

[7] Luo, Dongsheng, et al. “Parameterized explainer for graph neural network.” *Advances in neural information processing systems* 33 (2020): 19620–19631.

[8] Amara, Kenza, et al. “GraphFramEx: Towards Systematic Evaluation of Explainability Methods for Graph Neural Networks.” *arXiv preprint arXiv:2206.09677* (2022).

[9] Agarwal, Chirag, et al. “Evaluating explainability for graph neural networks.” *arXiv preprint arXiv:2208.09339* (2022).

[10] Veličković, Petar, et al. “Graph attention networks.” *arXiv preprint arXiv:1710.10903* (2017).