A comprehensive introduction to GNNs — Part 2

Nicolas Raymond
Published in Analytics Vidhya
Oct 22, 2021 · 11 min read

Why GNNs?

Before you start reading

As indicated in the title, this is the second part of a series of publications intended to introduce you to GNNs. Hence, I strongly recommend that you have a look at the first part, in which I discussed graphs, directed graphs and information networks.

Where we at?

The series of publications aims to cover the following topics:

  • Graph
  • Directed Graph
  • Information networks
  • Motivations behind GNNs
  • Node embedding
  • The vanilla GNN
  • Graph Convolutional Networks (GCNs)
  • Graph Attention Networks (GATs)
  • Learning on a heterogeneous graph

This part will entirely focus on the fourth topic, the motivations behind GNNs.

Motivations behind GNNs

Why GNNs? I think there is no better way to introduce their purpose than with a motivation case! Once the motivation case is established, I will walk you through different approaches that could be used to tackle the problem and analyze their advantages and disadvantages. Spoiler alert! The first approaches might have disadvantages that GNNs solve.

The approaches that will be discussed are the following: features-based methods, graph-based methods, representation learning methods and, finally, GNNs.

Motivation case

Here you are, doing some analyses on a social network. The network is pretty simple, since it is only made up of users connected to each other via friendship links. Put another way, it is a homogeneous information network where the only element in the set of possible object types (the cursive A) is “user”, and the only element in the set of possible link types (the cursive R) is “friendship”. In addition to this, you know that each user is described by two numerical facts (a vector with two numerical features).

One of your annoying colleagues, doing analyses on the same social network, comes by, interrupting you while you’re enjoying a sip of your coffee, and says: “I just sent a survey to all the users to ask them whether they like dogs or not. Everyone answered except the fourth one. Do you have any idea how we could predict his answer?”. Fueled by your love for dogs and your passion for research, you decide to start this new quest of answering your colleague’s question.

What are the possible ways to approach this problem, knowing the structure of the graph, the features of the users, and the fact that users 1 and 5 do not like dogs (t1 = t5 = 0) while users 2 and 3 do (t2 = t3 = 1)?

Motivation case.

Features-based methods

One option for this kind of situation would be to consider what I call a “features-based method”. By features-based method, I mean a solution that considers only the features related to the objects in the graph and hence ignores the links between them.

To name only a few, you can consider methods such as:

  • The K-nearest neighbors algorithm (KNN)
  • Training a feedforward neural network (FNN)

In the motivation case, using the KNN algorithm, you could simply find the k nearest users to user 4 according to the distances calculated between them from their numerical features and your favorite metric (e.g. Euclidean distance, Mahalanobis distance, cosine similarity, etc.). You would then make these k nearest users, who answered the survey, vote to decide whether user 4 likes dogs or not. The majority among the count of 0’s and 1’s would win!
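
To make this concrete, here is a minimal scikit-learn sketch of this vote. The feature values are made-up placeholders (the real ones appear in the figure), but the known answers follow the motivation case.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical 2D feature vectors for users 1 to 5 (the real values live in the figure).
features = np.array([
    [0.2, 1.1],   # user 1
    [0.9, 0.3],   # user 2
    [1.0, 0.4],   # user 3
    [0.5, 0.7],   # user 4 (unknown answer)
    [0.1, 1.3],   # user 5
])

# Known answers: users 1 and 5 dislike dogs (0), users 2 and 3 like them (1).
labeled_idx = [0, 1, 2, 4]
labels = np.array([0, 1, 1, 0])

# Vote among the k nearest labeled users (Euclidean distance by default).
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(features[labeled_idx], labels)
print(knn.predict(features[[3]]))  # predicted answer for user 4
```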

The idea of features-based methods.

Now, if you are familiar with the topic of neural networks, you might wonder why I suggested training an FNN as a potential solution for our problem, since such a model (e.g. a multi-layer perceptron) would not be a really convenient option in our toy problem considering the lack of training data.

Well, since this solution could be viable in a context with more users, I want to take a moment to present the idea of an FNN from a perspective that you have probably never seen before. It will also be helpful to make deeper comparisons with GNNs later.

I like to see an FNN as two parts. The first part is a function (f) that takes the original features (x) as inputs and processes them through a series of layers of linear and nonlinear transformations. The second part is another function (g) that takes the processed features (f(x)) and uses them to make a prediction about the label associated with the original input features. Here, a label refers to the target associated with an input.

Definitions of the functions composing a FNN.

In the end, both parts train and work together in order to make the best predictions. The function f will strive to encode enriched versions of the inputs, while the function g will do its best to use them to make the most accurate predictions.

For example, in the current context, f would learn how to take the two-feature vector (2D vector) of each user and turn it into another vector (not necessarily of the same length) that is potentially more meaningful to describe a user. From this new representation, g would learn how to predict a value between 0 and 1 that represents the probability that a user likes dogs.

Visualization of a FNN as a composition of functions.

If the field of mathematics is not your cup of tea, you can have a look at the figure below where I pointed out the encoding function f and the prediction function g in a simple neural network with a single hidden layer.

Visualization of a simple FNN structure.
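
And if you prefer code to figures, here is a minimal PyTorch sketch of the same two-part view; the hidden layer size is an arbitrary choice for illustration.

```python
import torch
import torch.nn as nn

# f: encodes the raw 2D user features into a richer hidden representation.
f = nn.Sequential(
    nn.Linear(2, 8),
    nn.ReLU(),
)

# g: maps the encoded features to the probability of liking dogs.
g = nn.Sequential(
    nn.Linear(8, 1),
    nn.Sigmoid(),
)

# The whole FNN is simply the composition g(f(x)).
x = torch.tensor([[0.5, 0.7]])   # hypothetical features of user 4
p_likes_dogs = g(f(x))           # a value between 0 and 1
```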

To conclude this section, let us have a look at the main advantages and disadvantages of features-based methods when we apply them to problems where the data takes the form of an information network where each object/node has features.

Advantages:

  • They are well documented on the web. You can easily find code snippets to help you achieve the task you want.
  • They are efficient when it is time to infer the label associated with a new input. We could quickly predict whether a sixth user likes dogs if we have their features.

Disadvantage:

  • They do not consider the existing relations between objects in the graph.

Graph-based methods

A second option for our motivation case would be to adopt a “graph-based method”. This means we could assume that connected users are more likely to have the same labels and only care about the graph structure behind our network.

The idea of graph-based methods.

For this section, we will focus entirely on a technique called label propagation. The idea behind this technique is simply to infer the missing labels in the graph by propagating the labels of the surrounding labeled neighbors. There are many ways we could do it in the current motivation case. However, I will focus on a single recipe, since label propagation is not the main topic of this post.

Let us say that the label associated with each user in the social network represents their probability of liking dogs. This makes sense since we used a 0 to identify people who do not like dogs and a 1 otherwise. Our first step could be to set the same temporary label (delta) for all the users who did not answer the survey. Here, for example, we will say that everyone who did not answer the survey is equally likely to like or dislike dogs and assign them 0.5. Nonetheless, you could set another value between 0 and 1 if you have prior knowledge that makes you think a different number is more accurate.

Temporary label assignation.

Then, from this graph, we can calculate the transition matrix (M).

As you can see in the figure below, the transition matrix of a directed graph is the product of the inverse of the out-degree matrix and the adjacency matrix (M = D⁻¹A). In other words, the transition matrix is like the adjacency matrix in which each row has been normalized so that its elements sum to 1.

In order to better understand its meaning, you can interpret the element in the i-th row and j-th column of the transition matrix as the probability of randomly picking node j among the neighbors of node i.

Definition and visualization of the transition matrix.
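
As a quick numerical sketch, this is how the transition matrix could be computed with NumPy; the adjacency matrix below is a made-up stand-in for the friendship links drawn in the figure.

```python
import numpy as np

# Hypothetical adjacency matrix of the 5-user friendship graph.
A = np.array([
    [0, 1, 0, 1, 0],
    [1, 0, 1, 1, 0],
    [0, 1, 0, 1, 0],
    [1, 1, 1, 0, 1],
    [0, 0, 0, 1, 0],
], dtype=float)

# Out-degree matrix: a diagonal matrix holding each node's number of outgoing links.
D_out = np.diag(A.sum(axis=1))

# Transition matrix: each row of A normalized so that it sums to 1.
M = np.linalg.inv(D_out) @ A
```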

Now, with all these ingredients in place, we can proceed to the following iterative label propagation process. The idea of this algorithm is to consecutively update the labels of the users by taking a weighted average of their own original labels and the current average of their neighbors’ labels. The number of iterations can be predefined, or the process can be stopped when the absolute differences between the labels from two consecutive iterations fall below a certain threshold. It is also possible to increase or decrease the importance of the original labels by using different values of alpha in the weighted average.

An example of label propagation algorithm.
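
Here is one possible way to write this recipe in NumPy. The graph, the value of alpha and the number of iterations are assumptions made for illustration, so the numbers will not exactly match those of the figure.

```python
import numpy as np

# Same hypothetical friendship graph as above: transition matrix M = D_out^{-1} A.
A = np.array([[0, 1, 0, 1, 0],
              [1, 0, 1, 1, 0],
              [0, 1, 0, 1, 0],
              [1, 1, 1, 0, 1],
              [0, 0, 0, 1, 0]], dtype=float)
M = A / A.sum(axis=1, keepdims=True)

# Initial labels: 0 = dislikes dogs, 1 = likes dogs, 0.5 = unknown (user 4).
y0 = np.array([0.0, 1.0, 1.0, 0.5, 0.0])
y = y0.copy()

alpha = 0.5          # relative weight given to the original labels
for _ in range(2):   # two iterations, as in the example below
    # Weighted average of the original labels and the neighbors' current average.
    y = alpha * y0 + (1 - alpha) * (M @ y)

print(y[3])  # estimated probability that user 4 likes dogs
```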

Once the algorithm is done, you will be able to find the missing labels you were looking for in the resulting column matrix.

Here you can see the results obtained after two iterations. In this situation, we would expect the fourth user to like dogs, since 29/50 > 0.5.

Label propagation execution.

In order to wrap up this section, let me identify the principal advantages and disadvantages of the graph-based methods.

Advantage:

  • They consider graph structural properties.

Disadvantages:

  • They do not consider the features related to the objects in the graph.
  • The label propagation result no longer holds if new, unseen nodes join the graph later. We would have to run the whole thing again.

Representation learning methods

A third way to tackle the problem of the motivation case and answer your colleague’s question would be to use a representation learning method.

The idea behind this type of method is, for each node, to learn a low-dimensional vector representation (an embedding z) that captures the structural properties of the graph and the information from the features at the same time. Said differently, these methods aim to find a way of representing each user with a series of numbers (an embedding z) that combines meaningful information about its features and its position in the graph.

I like to divide the representation learning methods into 2 main categories:

1- Graph auto-encoders

2- Skipgram-based algorithms

The first category contains strategies that focus on learning node representations (theta) considering only the graph structure. These representations can then be concatenated with the original features in order to obtain the final embeddings (z), which can be handled by one of the features-based methods we visited earlier.

Usage of an embedding with a FNN.
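
Concretely, the concatenation step can be as simple as the sketch below, where the structural representations are random placeholders standing in for the output of a graph auto-encoder.

```python
import numpy as np

# theta: structural representations learned from the graph (random placeholders here).
theta = np.random.rand(5, 4)   # one 4-dimensional vector per user
x = np.random.rand(5, 2)       # the original 2D user features (also placeholders)

# Final embeddings z: structure and features side by side.
z = np.concatenate([theta, x], axis=1)   # shape (5, 6)
# z can now be fed to any features-based method (KNN, FNN, ...) from earlier.
```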

The second category contains strategies that use an adaptation of a language model named Skipgram in order to produce representations of the nodes considering the graph structure and the features all at once.

Although this last category is really interesting, and I strongly suggest you read about methods such as DeepWalk and Node2Vec, I will only explain the functioning of graph auto-encoders in the remainder of this section.

A graph auto-encoder is composed of two parts: the encoder and the decoder. The encoder is a function whose purpose is to map each node to a vector representation. The decoder is a function that aims to recover, from the vector representation of a node, the complete row that was originally associated with it in the adjacency matrix of the graph.

Encoder and decoder of a graph auto-encoder.

A good example of a graph auto-encoder is the Structural Deep Network Embedding (SDNE) where the encoder and the decoder are neural networks.

Visualization of SDNE’s encoder and decoder.

In order to train both neural networks together, the loss function is composed of the two parts presented below.

The first part trains the model so that two connected nodes give similar outputs from the encoder. The second part trains the model so that the decoder is able to reconstruct the corresponding adjacency matrix row from an encoding as accurately as possible.

First part of SDNE’s loss function.
Second part of SDNE’s loss function.
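
To make these two objectives more tangible, here is a simplified PyTorch sketch in the spirit of SDNE’s loss. It omits the paper’s extra weighting of non-zero entries and its regularization term, and the adjacency matrix is a made-up example.

```python
import torch
import torch.nn as nn

# Simplified sketch of an SDNE-style graph auto-encoder for a graph with n nodes.
n, d = 5, 2   # number of nodes, embedding size
A = torch.tensor([[0., 1., 0., 1., 0.],
                  [1., 0., 1., 1., 0.],
                  [0., 1., 0., 1., 0.],
                  [1., 1., 1., 0., 1.],
                  [0., 0., 0., 1., 0.]])   # hypothetical adjacency matrix

encoder = nn.Sequential(nn.Linear(n, 8), nn.ReLU(), nn.Linear(8, d))
decoder = nn.Sequential(nn.Linear(d, 8), nn.ReLU(), nn.Linear(8, n))

Y = encoder(A)       # one embedding per node (each input is an adjacency row)
A_hat = decoder(Y)   # attempted reconstruction of the adjacency rows

# Part 1: connected nodes should receive similar embeddings.
diff = Y.unsqueeze(0) - Y.unsqueeze(1)              # pairwise embedding differences
loss_1st = (A * diff.pow(2).sum(dim=-1)).sum()

# Part 2: the decoder should recover each node's adjacency row.
loss_2nd = (A_hat - A).pow(2).sum()

loss = loss_2nd + 0.1 * loss_1st   # combine both objectives (0.1 is an arbitrary weight)
```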

Here, as I did for the other methods, I am presenting the advantages and disadvantages of representation learning methods.

Advantages:

  • Combine the original features and the network structural information.
  • The training of the functions generating latent representations is independent of the task, and the resulting embeddings (z) can be reused for different purposes.

Disadvantages:

  • They do not learn an embedding method that generalizes to new unseen nodes joining the graph.
  • The embeddings could be more meaningful if they were learnt precisely for our task.

GNNs

Now that we have looked at many methods that could have been employed to handle the motivation case, and have seen their pros and cons, we are able to draw up a list of criteria that should be respected by the model of our dreams. Well, maybe not your dreams, but still the model you would enjoy having around to get you through this journey!

The criteria are:

  • the model should consider both the features and the network so that, when we are predicting the label of a node, we can benefit from the information of similar/connected nodes.
  • once trained, the model should be able to generalize to other unseen nodes without training the whole thing again.
  • the model should have a name that makes it sound cool when you discuss about it during meetings.

But how can we do that? As explained in the figure below, we could take the embedding idea of the representation learning methods to another level to have a perfect recipe!

The first step could be to build a function (q) that creates node embeddings that encapsulate information from their own features and also from their neighbors’ features. Hence, q would consider both the features and the graph structural properties.

Expectations from the GNN’s embedding function.

Then, the second step would be to connect this fantastic function to the original FNN framework presented earlier so that q can be trained to generate task-specific embeddings.

Visualization of a GNN as a composition of functions.
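
To give you a rough preview before the next posts, here is a minimal sketch of this composition, where q mixes each user’s own features with the average of their neighbors’ features. The graph, the features and the layer sizes are all made up for illustration.

```python
import torch
import torch.nn as nn

A = torch.tensor([[0., 1., 0., 1., 0.],
                  [1., 0., 1., 1., 0.],
                  [0., 1., 0., 1., 0.],
                  [1., 1., 1., 0., 1.],
                  [0., 0., 0., 1., 0.]])   # hypothetical friendship links
X = torch.rand(5, 2)                       # hypothetical 2D user features

W_self = nn.Linear(2, 8)                   # transforms a user's own features
W_neigh = nn.Linear(2, 8)                  # transforms the neighborhood average
g = nn.Sequential(nn.Linear(8, 1), nn.Sigmoid())

def q(X, A):
    # Average the features of each user's neighbors, then mix them with its own features.
    neigh_mean = (A @ X) / A.sum(dim=1, keepdim=True)
    return torch.relu(W_self(X) + W_neigh(neigh_mean))

p_likes_dogs = g(q(X, A))                  # one probability per user
```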

Well, I have good news for you. This delicious recipe exists and is called a graph neural network (GNN)!

Now that you are aware of the motivations behind GNNs, I invite you to read my next post, where we will explore different GNN models, starting from the vanilla GNN and then making our way to the Graph Attention Network (GAT).

Before you go

Thank you for reading! Feel free to visit my LinkedIn page.

References

Wang, Daixin, et al. “Structural Deep Network Embedding.” Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining — KDD ’16, ACM Press, 2016, pp. 1225–1234, doi:10.1145/2939672.2939753.

Grover, Aditya, and Jure Leskovec. “Node2vec: Scalable Feature Learning for Networks.” ArXiv:1607.00653 [Cs, Stat], July 2016, http://arxiv.org/abs/1607.00653.

Huang, Qian, et al. “Combining Label Propagation and Simple Models Out-Performs Graph Neural Networks.” ArXiv:2010.13993 [Cs], Nov. 2020, http://arxiv.org/abs/2010.13993.

Perozzi, Bryan, et al. “DeepWalk: Online Learning of Social Representations.” Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining — KDD ’14, ACM Press, 2014, pp. 701–10, doi:10.1145/2623330.2623732.
