Summary: Pitfalls of Graph Neural Network Evaluation

Pouya Pezeshkpour
UCI NLP
Nov 21, 2018

Authors: Oleksandr Shchur, Maximilian Mumme, Aleksandar Bojchevski, Stephan Günnemann

In this paper, the authors consider the problem of semi-supervised node classification on graph data and study how existing methods are evaluated. The node classification task is defined as labeling all the nodes in a graph using three pieces of information: 1) the graph structure, i.e., the links between the nodes, 2) the feature vector assigned to each node, and 3) the known labels of a few nodes. Consequently, we can consider two scenarios for node classification on graphs: 1) inductive, defined as learning a function f(d(v)) = L(v) that assigns a label to each node from its description vector (which captures the node's neighborhood information and feature vector), and 2) transductive, defined as inferring the labels of every node in the graph from the known labels, the graph structure, and the feature vectors.
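
To make the transductive setting concrete, here is a minimal sketch (my own illustration, not the authors' code) of scoring all nodes of a small graph at once with a single GCN-style propagation step; the adjacency matrix, features, labels, and train mask are random placeholders standing in for a real dataset.

```python
# Minimal transductive node classification sketch; all inputs are random placeholders.
import numpy as np

rng = np.random.default_rng(0)
n_nodes, n_feats, n_classes = 6, 4, 2

A = (rng.random((n_nodes, n_nodes)) < 0.4).astype(float)  # links between the nodes
A = np.maximum(A, A.T)                                     # make the graph undirected
X = rng.normal(size=(n_nodes, n_feats))                    # feature vector per node
labels = rng.integers(0, n_classes, size=n_nodes)          # true labels (mostly unknown)
train_mask = np.zeros(n_nodes, dtype=bool)
train_mask[:2] = True                                      # only a few known labels

# Symmetric normalization with self-loops: A_hat = D^{-1/2} (A + I) D^{-1/2}
A_tilde = A + np.eye(n_nodes)
d = A_tilde.sum(axis=1)
A_hat = A_tilde / np.sqrt(np.outer(d, d))

# One linear layer on top of neighborhood aggregation; logits are produced for
# every node jointly, which is what makes this setting transductive.
W = rng.normal(scale=0.1, size=(n_feats, n_classes))
logits = A_hat @ X @ W

# During training, the loss would be computed only on the few labeled nodes.
print(logits[train_mask].shape, labels[train_mask].shape)
```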

This work focuses on the transductive scenario, comparing four well-known methods previously introduced for node classification: GCN [1], MoNet [2], GAT [3], and GraphSAGE [4]. To motivate the comparison, the authors point out two shortcomings of previous evaluations: 1) most of the proposed methods have been tested exclusively on the same train/validation/test splits of the same three datasets (CORA, CiteSeer, and PubMed), and 2) most of these methods use a training procedure that differs considerably from the one used for the baselines. In response, they introduce 4 new datasets and report the average performance of each method over 2,000 random settings (100 random splits of each dataset, with 20 random parameter initializations per split), arguing that this setup more accurately assesses the generalization performance of the different models instead of simply selecting the model that overfits one fixed test set; a rough sketch of this protocol is given below. Furthermore, they provide a fair training environment by enforcing the same setting for every method and tuning the hyperparameters via grid search over the same sets of candidate values. With this setup, they report the average ranking (for each dataset, all the methods are ranked from 1 to 10 by their average performance) of the four methods and a few more baselines.
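
Here is a rough sketch of that evaluation protocol: average each method's accuracy over 100 random splits × 20 random initializations, then rank the methods per dataset by that average. The accuracies below are simulated placeholders; in the paper, each number comes from actually training the model on the corresponding split.

```python
# Sketch of the multi-split, multi-seed evaluation protocol; accuracies are simulated.
import numpy as np

def rank_methods(mean_acc):
    """Rank methods on one dataset (1 = best) by their average accuracy."""
    ordered = sorted(mean_acc, key=mean_acc.get, reverse=True)
    return {method: rank + 1 for rank, method in enumerate(ordered)}

rng = np.random.default_rng(0)
methods = ["GCN", "MoNet", "GAT", "GraphSAGE"]           # the four compared methods
datasets = ["CORA", "CiteSeer", "PubMed", "NewDataset"]  # "NewDataset" is a placeholder name
n_splits, n_inits = 100, 20

avg_rank = {m: [] for m in methods}
for dataset in datasets:
    mean_acc = {}
    for method in methods:
        # One accuracy per (random split, random initialization) pair: 2,000 runs in total.
        accs = rng.normal(loc=0.8, scale=0.02, size=n_splits * n_inits)  # simulated
        mean_acc[method] = float(accs.mean())
    for method, rank in rank_methods(mean_acc).items():
        avg_rank[method].append(rank)

# Average ranking across datasets, analogous to the paper's comparison table.
print({m: float(np.mean(r)) for m, r in avg_rank.items()})
```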

Interestingly, the results show that the simplest method (GCN) achieves the best overall result. Furthermore, the paper also reports the performance of each method on the three common datasets under different splits of the data.

As these results show, splitting the data differently leads to different performance for each method.

I personally find this paper very interesting, because most research these days focuses only on achieving state-of-the-art results on a fixed dataset with predetermined evaluation metrics. This problem is even more severe for tasks on graph data, such as node classification and link prediction. As an example of this exclusive focus on performance, it was recently shown that two of the most well-known knowledge graph benchmarks, FreeBase and WordNet, contain inverse relations that make their evaluations untrustworthy. As for this paper, although both arguments about the evaluation process of previous works are sound, I believe enforcing the same training setting on all the methods is not entirely fair either, because those settings may favor some models more than others, leading to incorrect comparisons between models.

[1] T. N. Kipf and M. Welling. Semi-supervised classification with graph convolutional networks. ICLR, 2017.

[2] F. Monti, D. Boscaini, J. Masci, E. Rodolà, J. Svoboda, and M. M. Bronstein. Geometric deep learning on graphs and manifolds using mixture model CNNs. CVPR, 2017.

[3] P. Velickovic, G. Cucurull, A. Casanova, A. Romero, P. Liò, and Y. Bengio. Graph attention networks. ICLR, 2018.

[4] W. L. Hamilton, R. Ying, and J. Leskovec. Inductive representation learning on large graphs. NIPS, 2017.
