Graphormer — the generalization of the transformer architecture to graphs

Jesse Jing · Towards NeSy · Nov 1, 2021
Image: chemical graph representation of a molecule (source: https://en.m.wikipedia.org/wiki/File:Chemical_Graph_Representation.svg)

This article is a brief introduction to recent work by Microsoft researchers; feel free to jump to the paper directly here. Their generalization of the Transformer architecture to molecular graph modeling works remarkably well and won them first place in the quantum chemistry prediction track of the Open Graph Benchmark Large-Scale Challenge (KDD CUP 2021).

The Transformer is famous for modeling sequential information. Its core module, the self-attention block, captures the semantic similarity between tokens in a sentence or in sentence-like structures such as the SMILES string of a molecule. Graphormer (Transformer for graphs) incorporates several structural encodings to model the additional information a graph carries, namely centrality encoding, spatial encoding, and edge encoding.

Let’s start with some background knowledge on graphs.

The Centrality of a node

Graphs consist of nodes and edges, where nodes are abstractions of entities and edges are the relationships between them. In some applications (social networks, for instance), graphs are so large that we cannot clearly visualize what we are dealing with. In complex network analysis, when it comes to a specific node, we usually ask a few questions about it:

  • What is the position of a node in a graph?
  • How important is this node?

Centrality sits at the heart of those questions. We will introduce three kinds of centrality: betweenness, closeness, and degree centrality.

The first two are defined in terms of shortest paths in the graph. For betweenness centrality, ask this question: considering all pairs of nodes in the graph, how many of them must use a specific node as the communication bridge on the shortest path between them? The higher that count, the higher the betweenness centrality of the node. For closeness centrality, we sum the shortest-path distances from the node to all other nodes; the smaller the sum, the more central the node.
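As a quick illustration (my own toy example, not from the paper), all three centrality measures can be computed with networkx:

```python
import networkx as nx

# A small toy graph: a path 0-1-2-3 plus an extra edge 1-3
G = nx.Graph([(0, 1), (1, 2), (2, 3), (1, 3)])

# Betweenness: how often a node lies on shortest paths between other node pairs
print(nx.betweenness_centrality(G))

# Closeness: based on the sum of shortest-path distances to all other nodes
print(nx.closeness_centrality(G))

# Degree: number of neighbors, normalized by (n - 1)
print(nx.degree_centrality(G))
```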

Degree centrality, the one used in this paper, is simpler in concept: the more neighbors a node has, the more central it is. The authors assign a learnable vector to each node based on its degree and add it to the node features. For directed graphs (such as the influencer/follower relation), two vectors are assigned to model the in-degree and the out-degree.
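A minimal PyTorch sketch of this centrality encoding (the module and argument names are mine, not from the official implementation):

```python
import torch
import torch.nn as nn

class CentralityEncoding(nn.Module):
    """Adds learnable in-/out-degree embeddings to the input node features."""
    def __init__(self, max_degree: int, hidden_dim: int):
        super().__init__()
        self.in_degree_emb = nn.Embedding(max_degree + 1, hidden_dim)
        self.out_degree_emb = nn.Embedding(max_degree + 1, hidden_dim)

    def forward(self, x, in_degree, out_degree):
        # x: [num_nodes, hidden_dim] node features
        # in_degree, out_degree: [num_nodes] integer degrees (clamp to max_degree beforehand)
        return x + self.in_degree_emb(in_degree) + self.out_degree_emb(out_degree)

# Toy usage with 4 nodes and hidden dimension 16
enc = CentralityEncoding(max_degree=8, hidden_dim=16)
h0 = enc(torch.randn(4, 16), torch.tensor([1, 2, 0, 3]), torch.tensor([2, 1, 1, 0]))
```

For an undirected graph, a single degree embedding would suffice.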

The spatial relation between nodes

What sets graph-structured data apart from other structured data (images, text sequences) is that nodes live in a non-Euclidean space and are linked by edges. The familiar properties of Euclidean geometry, such as angles and circles (meaning geometric circles defined by π, not cycles in a graph), no longer apply.

The paper argues that not only the embeddings of the two nodes are useful; the information along the shortest path between them matters as well. For each node pair, they assign a learnable embedding indexed by the pair's spatial relation (the shortest-path distance). In addition, they compute an average of dot products between the edge features along the shortest path and learnable embeddings (the edge encoding), and both terms are used as biases in the attention module.
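Putting the pieces together, the attention score between a node pair roughly takes the form below (following the paper's notation; see the paper for the exact definitions):

```latex
A_{ij} = \frac{(h_i W_Q)(h_j W_K)^\top}{\sqrt{d}} + b_{\phi(v_i, v_j)} + c_{ij},
\qquad
c_{ij} = \frac{1}{N} \sum_{n=1}^{N} x_{e_n} \, (w_n^{E})^{\top}
```

Here φ(v_i, v_j) is the shortest-path distance between the two nodes, b_{φ(v_i, v_j)} is the learnable spatial-encoding scalar indexed by that distance, x_{e_n} is the feature of the n-th edge on the shortest path, and w_n^E is the n-th learnable edge-encoding vector.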

Figure from the Graphormer paper (source: https://arxiv.org/pdf/2106.05234.pdf)

In the figure above, the spatial and edge encodings act as bias terms in the softmax attention calculation, while the centrality encoding (the learnable vector based on node degree) is added to the node features.
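As an unofficial sketch, a single attention head with these bias terms could look like the following, where spatial_bias and edge_bias are assumed to be precomputed [num_nodes, num_nodes] tensors:

```python
import math
import torch
import torch.nn as nn

class BiasedSelfAttention(nn.Module):
    """Single-head self-attention with additive graph-structural biases (illustrative)."""
    def __init__(self, hidden_dim: int):
        super().__init__()
        self.q = nn.Linear(hidden_dim, hidden_dim)
        self.k = nn.Linear(hidden_dim, hidden_dim)
        self.v = nn.Linear(hidden_dim, hidden_dim)
        self.scale = 1.0 / math.sqrt(hidden_dim)

    def forward(self, h, spatial_bias, edge_bias):
        # h: [num_nodes, hidden_dim]; biases: [num_nodes, num_nodes]
        scores = (self.q(h) @ self.k(h).T) * self.scale
        scores = scores + spatial_bias + edge_bias  # add the graph biases before softmax
        attn = torch.softmax(scores, dim=-1)
        return attn @ self.v(h)
```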

Implementation of Graphormer

Here are the implementation details, thrown together as a quick-reference list.

  • Layer normalization is applied before the multi-head self-attention (MHA) and feed-forward (FFN) blocks instead of after, unlike the original Transformer design (see the sketch after this list). reference link
  • The FFN has three layers, each with hidden dimension d for simplicity (the same as the node embedding dimension).
  • A VirtualNode is added to hold the graph-level representation (similar to the CLS token in BERT models). reference link
  • The VirtualNode is connected to every existing node and is trained like a normal node, except that its spatial encoding to the other nodes is a distinct learnable scalar. reference link The idea originally appeared as the Graph Warp Module → link
  • Another technique for replacing positional encoding and modeling spatial information is to use Laplacian eigenvectors. reference link
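As referenced in the first bullet, a pre-LN block normalizes before MHA and FFN rather than after; here is an illustrative PyTorch sketch (not the official implementation):

```python
import torch.nn as nn

class PreLNBlock(nn.Module):
    """Transformer block with layer normalization applied before MHA and FFN."""
    def __init__(self, hidden_dim: int, num_heads: int):
        super().__init__()
        self.ln1 = nn.LayerNorm(hidden_dim)
        self.mha = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(hidden_dim)
        self.ffn = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, hidden_dim),
        )

    def forward(self, h):
        # h: [batch, num_nodes, hidden_dim]
        x = self.ln1(h)                      # normalize before attention
        attn_out, _ = self.mha(x, x, x)
        h = h + attn_out                     # residual connection
        h = h + self.ffn(self.ln2(h))        # normalize before FFN, then residual
        return h
```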

Other readings:

A Medium article introducing the over-smoothing issue in graph neural networks.

For larger graphs, calculating node features such as centrality could be intractable, so we need graph sampling techniques to derive representative sub-graphs. A paper by Jure Leskovec published in 2006 discusses this thoroughly.

The official GitHub repo has been released:

References:

Jure Leskovec and Christos Faloutsos. 2006. Sampling from large graphs. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '06). Association for Computing Machinery, New York, NY, USA, 631–636. DOI: https://doi.org/10.1145/1150402.1150479

Jesse Jing (Towards NeSy)

CS PhD student at ASU pushing the frontier of Neural Symbolic AI. We host the publication Towards NeSy; try submitting your technical blogs as well!