Paper Summary — A Generalization of Transformer Networks to Graphs

Aditya Chinchure
Published in Technonerds · Jun 5, 2021

Graph neural networks (GNNs) have been the most popular method for training models on real-world graphs. What are real-world graphs? The example I like best is a social network. You are a node in the graph, and you are connected to all the other nodes that represent your friends. Plus, these connections, aka the edges of the graph, can have an associated measure, like how close you are with that friend. In a graph neural network, after defining a graph model, we can train it to learn certain attributes of each node from its neighbouring nodes via message-passing. You can think of this as a machine that learns what your interests are based on what your friends are interested in.
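To make the idea concrete, here is a minimal NumPy sketch of a single message-passing step, where each node averages its neighbours' features weighted by the edge weights. This is a toy illustration of the general mechanism, not code from the paper or any particular GNN library.

```python
import numpy as np

# Toy illustration of one message-passing step: each node's new feature
# vector is a weighted average of its neighbours' features, with edge
# weights playing the role of "how close you are with that friend".

rng = np.random.default_rng(0)
num_nodes, feat_dim = 4, 8

node_feats = rng.normal(size=(num_nodes, feat_dim))    # one feature vector per node
adj = np.array([[0, 1, 1, 0],                          # weighted adjacency: adj[i, j] > 0
                [1, 0, 0, 1],                          # means nodes i and j are friends
                [1, 0, 0, 1],
                [0, 1, 1, 0]], dtype=float)

deg = adj.sum(axis=1, keepdims=True)                   # total edge weight per node
messages = adj @ node_feats / np.maximum(deg, 1.0)     # aggregate neighbours' features
node_feats = np.tanh(messages)                         # simple update with a nonlinearity
```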

At the same time, the attention mechanism in the transformer model has proven effective on a wide range of NLP (and now vision) tasks. Generalizing the transformer architecture to work on graph data could enable advancements in ML for graphs, and that is exactly what this paper (Dwivedi & Bresson, 2020) from NTU proposes to do.

A summary of the method

The original transformer architecture splits the data into multiple heads and applies self-attention in each head by multiplying the queries (Q) with the keys (K). The resulting scores are scaled and normalized with a softmax, and the attention weights are then multiplied with the values (V). To understand the architecture better, you can read this post or watch this video.
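For reference, here is a small sketch of single-head scaled dot-product attention, the building block the paper starts from; multi-head attention simply runs several of these in parallel on split feature dimensions.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Standard single-head attention: softmax(Q K^T / sqrt(d)) V."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                      # pairwise query-key similarity
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # softmax over each row
    return weights @ V                                 # weighted sum of the values

# Self-attention over 5 tokens with 16-dimensional features (Q = K = V = X)
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 16))
out = scaled_dot_product_attention(X, X, X)
```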

In this paper, edge information (E) is carried in a weighted adjacency structure that is fed into each head of the multi-headed attention block alongside K, Q, and V. In each head, after the raw attention scores are computed, they are modulated by this edge information: the pairwise attention between two nodes is increased or decreased depending on whether they are connected by an edge and what the edge weight is. Furthermore, the edge information is itself refined at each layer using the updated attention values. This way, the model learns better node and edge representations simultaneously at each layer of the transformer.
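Here is a rough NumPy sketch of the idea: attention scores are masked and rescaled by a weighted adjacency matrix, and the edge weights are in turn nudged by the resulting attention values. The exact formulation in the paper uses learned edge projections and its own per-layer update for the edge representations, so treat the function name and the refinement rule below as illustrative assumptions rather than the paper's equations.

```python
import numpy as np

def graph_attention_step(X, A, Wq, Wk, Wv):
    """One attention step in which a weighted adjacency matrix A modulates
    pairwise attention. Simplified illustration, not the paper's exact
    formulation (the paper learns edge representations with their own
    projections and updates them with a feed-forward block)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d = Q.shape[-1]
    scores = (Q @ K.T) / np.sqrt(d)
    scores = np.where(A > 0, scores * A, -1e9)          # keep connected pairs only,
                                                        # scaled by the edge weight
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)      # softmax over each node's neighbours
    new_X = weights @ V                                 # updated node representations
    new_A = A * (1.0 + weights)                         # crude "refinement" of edge weights,
                                                        # purely illustrative
    return new_X, new_A

# Tiny example; self-loops ensure every node has something to attend to
rng = np.random.default_rng(0)
n, d = 4, 8
X = rng.normal(size=(n, d))
A = np.array([[1, 1, 1, 0],
              [1, 1, 0, 1],
              [1, 0, 1, 1],
              [0, 1, 1, 1]], dtype=float)
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
X, A = graph_attention_step(X, A, Wq, Wk, Wv)
```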

In addition, the paper proposes using Laplacian eigenvectors as positional encodings for the nodes, in place of the sinusoidal positional encodings the original Transformer uses for token positions.
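The sketch below shows how such encodings can be computed: take the eigenvectors of the symmetric normalized graph Laplacian corresponding to the k smallest non-trivial eigenvalues and use them as per-node position vectors. The helper name and the toy graph are my own; the paper additionally flips eigenvector signs at random during training because the signs are arbitrary.

```python
import numpy as np

def laplacian_positional_encoding(A, k):
    """Positional encodings from the k smallest non-trivial eigenvectors of
    the symmetric normalized graph Laplacian L = I - D^{-1/2} A D^{-1/2}."""
    deg = A.sum(axis=1)
    d_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(deg, 1e-12)))
    L = np.eye(len(A)) - d_inv_sqrt @ A @ d_inv_sqrt
    eigvals, eigvecs = np.linalg.eigh(L)                # eigenvalues in ascending order
    return eigvecs[:, 1:k + 1]                          # skip the trivial first eigenvector

# Example: a 4-node graph with 2-dimensional positional encodings per node,
# which would then be added to the node features before the first layer.
A = np.array([[0, 1, 1, 0],
              [1, 0, 0, 1],
              [1, 0, 0, 1],
              [0, 1, 1, 0]], dtype=float)
pos_enc = laplacian_positional_encoding(A, k=2)         # shape (4, 2)
```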

What I found most interesting

If you are familiar with the transformer model, you will find this model very easy to understand, and that simplicity is what I find most interesting. As a baseline architecture for building attention-based graph networks, this approach is appealing, although it may not be the best performing compared to certain other methods.

Why should you be excited (and why am I excited)?

The original transformer model was built for NLP, but it has already crossed borders into vision and graph networks. Most works in the last two years use transformers as they are, only modifying the inputs to fit the structure of the data. It’s exciting to see research where the transformer model itself is modified to work on a variety of data types and tasks. I’m on the lookout for such modifications too, because I think there are many ways to specialize the transformer model for new domains and build state-of-the-art models.

I am writing a series of summaries of papers that I have been reading, mostly involving multimodal computer vision and NLP tasks. These summaries are written in layman’s terms and are not exhaustive. You can find all the papers I have summarized here.

I am a student researcher at The University of British Columbia working on Vision and NLP tasks. If you are interested in these topics as well, let’s get in touch!
