Graphormer — the generalization of the transformer architecture to graphs

Jesse Jing · Towards NeSy · Nov 1, 2021
Image: chemical graph representation of a molecule (source: https://en.m.wikipedia.org/wiki/File:Chemical_Graph_Representation.svg)

This article is a brief introduction to recent work by Microsoft researchers; feel free to jump to the paper directly here. Their generalization of the Transformer architecture to molecular graph modeling works remarkably well and won them first place in the quantum chemistry prediction track of the Open Graph Benchmark Large-Scale Challenge (KDD CUP 2021).

The Transformer is famous for modeling sequential information. Its core module, the self-attention block, captures the semantic similarity between tokens in a sentence or in sentence-like structures such as the SMILES string of a molecule. Graphormer (Transformer for graphs) incorporates several structural encodings to model the additional information a graph carries, namely centrality encoding, spatial encoding, and edge encoding.

Let’s start with some background knowledge on graphs.

The Centrality of a node

Graphs consist of nodes and edges, where nodes are abstractions of entities and edges are the relationships between them. In some applications (social networks, for instance), graphs are so large that we cannot clearly visualize what we are dealing with. In complex network analysis, when it comes to a specific node, we usually ask a few questions about it:

  • What is the position of a node in a graph?
  • How important is this node?

Centrality sits at the heart of those questions. We will introduce three kinds of centrality: betweenness, closeness, and degree centrality.

The first two are defined in terms of shortest paths in the graph. For betweenness centrality, ask this question: considering all pairs of nodes in the graph, how many of them must use a specific node as the communication bridge on the shortest path between them? The higher that count, the higher the betweenness centrality of the node. For closeness centrality, we sum the shortest-path distances from the node to all other nodes; the smaller the sum, the more central the node.
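As a quick illustration (my own toy example, not from the paper), all three centrality measures can be computed with networkx:

```python
import networkx as nx

# A small toy graph: a path 0-1-2-3 plus an extra edge 1-3
G = nx.Graph([(0, 1), (1, 2), (2, 3), (1, 3)])

# Betweenness: how often a node lies on shortest paths between other node pairs
print(nx.betweenness_centrality(G))

# Closeness: based on the sum of shortest-path distances to all other nodes
print(nx.closeness_centrality(G))

# Degree: number of neighbors, normalized by (n - 1)
print(nx.degree_centrality(G))
```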

Degree centrality, the one used in this paper, is simpler in concept: the more neighbors a node has, the more central it is. The authors assign a learnable vector to each node based on its degree and add it to the node features. For directed graphs (such as the influencer/follower relation), two vectors are assigned to model the in-degree and the out-degree.
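A minimal PyTorch sketch of this centrality encoding (the module and argument names are mine, not from the official implementation):

```python
import torch
import torch.nn as nn

class CentralityEncoding(nn.Module):
    """Adds learnable in-/out-degree embeddings to the input node features."""
    def __init__(self, max_degree: int, hidden_dim: int):
        super().__init__()
        self.in_degree_emb = nn.Embedding(max_degree + 1, hidden_dim)
        self.out_degree_emb = nn.Embedding(max_degree + 1, hidden_dim)

    def forward(self, x, in_degree, out_degree):
        # x: [num_nodes, hidden_dim] node features
        # in_degree, out_degree: [num_nodes] integer degrees (clamp to max_degree beforehand)
        return x + self.in_degree_emb(in_degree) + self.out_degree_emb(out_degree)

# Toy usage with 4 nodes and hidden dimension 16
enc = CentralityEncoding(max_degree=8, hidden_dim=16)
h0 = enc(torch.randn(4, 16), torch.tensor([1, 2, 0, 3]), torch.tensor([2, 1, 1, 0]))
```

For an undirected graph, a single degree embedding would suffice.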

The spatial relation between nodes

What sets graph-structured data apart from other structured data (images, text sequences) is that nodes live in a non-Euclidean space and are linked by edges. The familiar properties of Euclidean geometry, such as angles and circles (meaning geometric circles defined by π, not cycles in a graph), no longer apply.

The paper argues that not only the embeddings of the two nodes are useful; the information along the shortest path between them matters as well. For each node pair, they assign a learnable embedding indexed by the pair's spatial relation (the shortest-path distance). In addition, they compute an average of dot products between the edge features along the shortest path and learnable embeddings (the edge encoding), and both terms are used as biases in the attention module.
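Putting the pieces together, the attention score between a node pair roughly takes the form below (following the paper's notation; see the paper for the exact definitions):

```latex
A_{ij} = \frac{(h_i W_Q)(h_j W_K)^\top}{\sqrt{d}} + b_{\phi(v_i, v_j)} + c_{ij},
\qquad
c_{ij} = \frac{1}{N} \sum_{n=1}^{N} x_{e_n} \, (w_n^{E})^{\top}
```

Here φ(v_i, v_j) is the shortest-path distance between the two nodes, b_{φ(v_i, v_j)} is the learnable spatial-encoding scalar indexed by that distance, x_{e_n} is the feature of the n-th edge on the shortest path, and w_n^E is the n-th learnable edge-encoding vector.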

Figure from the Graphormer paper (source: https://arxiv.org/pdf/2106.05234.pdf)

In the figure above, the spatial and edge encodings act as bias terms in the softmax attention calculation, while the centrality encoding (the learnable vector based on node degree) is added to the node features.
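As an unofficial sketch, a single attention head with these bias terms could look like the following, where spatial_bias and edge_bias are assumed to be precomputed [num_nodes, num_nodes] tensors:

```python
import math
import torch
import torch.nn as nn

class BiasedSelfAttention(nn.Module):
    """Single-head self-attention with additive graph-structural biases (illustrative)."""
    def __init__(self, hidden_dim: int):
        super().__init__()
        self.q = nn.Linear(hidden_dim, hidden_dim)
        self.k = nn.Linear(hidden_dim, hidden_dim)
        self.v = nn.Linear(hidden_dim, hidden_dim)
        self.scale = 1.0 / math.sqrt(hidden_dim)

    def forward(self, h, spatial_bias, edge_bias):
        # h: [num_nodes, hidden_dim]; biases: [num_nodes, num_nodes]
        scores = (self.q(h) @ self.k(h).T) * self.scale
        scores = scores + spatial_bias + edge_bias  # add the graph biases before softmax
        attn = torch.softmax(scores, dim=-1)
        return attn @ self.v(h)
```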

Implementation of Graphormer

Here are the implementation details, thrown together as a quick-reference list.

  • Layer normalization is applied before the multi-head self-attention (MHA) and feed-forward (FFN) blocks instead of after, unlike the original Transformer design (see the sketch after this list). reference link
  • The FFN has three layers, each with hidden dimension d for simplicity (the same as the node embedding dimension).
  • A VirtualNode is added to hold the graph-level representation (similar to the CLS token in BERT models). reference link
  • The VirtualNode is connected to every existing node and is trained like a normal node, except that its spatial encoding to the other nodes is a distinct learnable scalar. reference link The idea originally appeared as the Graph Warp Module → link
  • Another technique for replacing positional encoding and modeling spatial information is to use Laplacian eigenvectors. reference link
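As referenced in the first bullet, a pre-LN block normalizes before MHA and FFN rather than after; here is an illustrative PyTorch sketch (not the official implementation):

```python
import torch.nn as nn

class PreLNBlock(nn.Module):
    """Transformer block with layer normalization applied before MHA and FFN."""
    def __init__(self, hidden_dim: int, num_heads: int):
        super().__init__()
        self.ln1 = nn.LayerNorm(hidden_dim)
        self.mha = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(hidden_dim)
        self.ffn = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, hidden_dim),
        )

    def forward(self, h):
        # h: [batch, num_nodes, hidden_dim]
        x = self.ln1(h)                      # normalize before attention
        attn_out, _ = self.mha(x, x, x)
        h = h + attn_out                     # residual connection
        h = h + self.ffn(self.ln2(h))        # normalize before FFN, then residual
        return h
```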

Other readings:

A Medium article introducing the over-smoothing issue in graph neural networks.

For larger graphs, calculating node features such as centrality could be intractable, so we need graph sampling techniques to derive representative sub-graphs. A paper by Jure Leskovec published in 2006 discusses this thoroughly.

The official GitHub repo has been released:

References:

Jure Leskovec and Christos Faloutsos. 2006. Sampling from large graphs. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '06). Association for Computing Machinery, New York, NY, USA, 631–636. DOI: https://doi.org/10.1145/1150402.1150479

Jesse Jing (Towards NeSy)

CS PhD student at ASU pushing the frontier of Neural Symbolic AI. We host the publication Towards NeSy; try submitting your technical blogs as well!