Graph Neural Networks Series | Part 4 | GNNs, Message Passing & Over-smoothing

Omar Hussein
The Modern Scientist
15 min read · May 16, 2023

Introduction

In the first part of this series, we focused on learning low-dimensional embeddings for nodes in a graph using shallow embedding approaches. These approaches optimized individual embedding vectors for each node. However, in this article, we shift our attention to more complex encoder models specifically designed for graph data.

Graph Neural Networks (GNNs) serve as a general framework for defining deep neural networks on graph-structured data. The main idea behind GNNs is to generate node representations that consider both the graph’s structure and any available feature information.

Optional point

One possible approach is to use the adjacency matrix as input for a deep neural network. For example, we can flatten the adjacency matrix A and feed it into a Multi-Layer Perceptron (MLP) to obtain an embedding of the entire graph (a small code sketch follows the symbol definitions below):

z_G = MLP(A[1] ⊕ A[2] ⊕ … ⊕ A[|V|])

z_G: The embedding vector representing the entire graph.

MLP: Multi-Layer Perceptron, a type of neural network architecture.

A[i]: The i-th row of the adjacency matrix A, which represents the connections between nodes in the graph.

⊕: Vector concatenation operator, used to combine the row vectors into a single input vector for the MLP.

|V|: The total number of nodes in the graph.
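
As a concrete illustration, here is a minimal sketch of this flatten-and-MLP baseline in PyTorch. The graph size of 5 nodes, the hidden width of 32, and the embedding size of 8 are arbitrary choices for the example, not values from the text.

import torch
import torch.nn as nn

# Toy graph: |V| = 5 nodes with a random binary adjacency matrix
num_nodes = 5
A = torch.randint(0, 2, (num_nodes, num_nodes)).float()

# Flatten the rows A[1] ⊕ A[2] ⊕ ... ⊕ A[|V|] into one long vector
flat_A = A.flatten()  # shape: (num_nodes * num_nodes,)

# MLP mapping the flattened adjacency matrix to a graph embedding z_G
mlp = nn.Sequential(
    nn.Linear(num_nodes * num_nodes, 32),
    nn.ReLU(),
    nn.Linear(32, 8),
)
z_G = mlp(flat_A)
print(z_G.shape)  # torch.Size([8])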

However, this approach depends on the arbitrary ordering of nodes used to build the adjacency matrix: permuting the rows and columns of A changes the flattened input and hence the output, so the model is not permutation invariant.

To address this limitation, we aim for permutation invariance or equivariance in the model. Permutation invariance means that the function’s output remains the same regardless of the ordering of rows or columns in the adjacency matrix:

f(PAP^T) = f(A) (Permutation Invariance)

f: The function that takes the adjacency matrix A as input and produces an output.

P: Permutation matrix that represents the reordering of rows/columns in the adjacency matrix.

^T: Transpose operation; multiplying by P on the left reorders the rows of A, and multiplying by P^T on the right reorders the columns in the same way.

Permutation equivariance ensures that the output is consistently permuted when the adjacency matrix is permuted: f(PAP^T) = Pf(A) (Permutation Equivariance)

Here, P is again a permutation matrix.

f: The function that takes the adjacency matrix A as input and produces an output.

Pf(A): The output of f with the same permutation P applied to it, so that node-level outputs are reordered consistently with the input.

By designing models that exhibit permutation equivariance or invariance, we can effectively capture the structure and features of graphs, enabling powerful deep learning techniques for graph data analysis.
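
A quick numerical check makes these two properties concrete. Below, the total edge weight serves as an example of a permutation-invariant function and the vector of node degrees as a permutation-equivariant one; both functions are illustrative choices, not ones used elsewhere in this article.

import torch

torch.manual_seed(0)
num_nodes = 4
A = torch.randint(0, 2, (num_nodes, num_nodes)).float()

# Build a random permutation matrix P by shuffling the rows of the identity
perm = torch.randperm(num_nodes)
P = torch.eye(num_nodes)[perm]

# Permutation-invariant function: total edge weight
f_inv = lambda A: A.sum()
print(torch.allclose(f_inv(P @ A @ P.T), f_inv(A)))  # True

# Permutation-equivariant function: row sums (node degrees)
f_eq = lambda A: A.sum(dim=1)
print(torch.allclose(f_eq(P @ A @ P.T), P @ f_eq(A)))  # True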

Why are GNNs “Hard”?

Unfortunately, you cannot just throw jello at the wall and hope it sticks; that is simply not how GNNs work. Developing complex encoders for graph data poses a challenge because traditional deep learning techniques are not directly applicable. Convolutional Neural Networks (CNNs) work well on grid-structured inputs like images, while Recurrent Neural Networks (RNNs) are suited to sequential data such as text. To define deep neural networks on general graphs, a new type of architecture is needed.

Neural Message Passing

Neural Message Passing is a crucial concept in Graph Neural Networks (GNNs) because it enables information exchange and aggregation among nodes in a graph. In many real-world applications, data is structured as graphs, where entities are represented as nodes, and relationships between entities are represented as edges. Examples include social networks, biological networks, recommendation systems, and knowledge graphs.

The key challenge in modeling graph-structured data is effectively capturing the dependencies and interactions among nodes. Traditional deep learning techniques, such as convolutional neural networks (CNNs) for images or recurrent neural networks (RNNs) for sequences, are not directly applicable to graphs. Graphs lack the inherent grid structure of images or the linear order of sequences.

Neural Message Passing addresses this challenge by providing a framework for modeling dependencies and interactions in graph data. It allows nodes to exchange information with their neighboring nodes and aggregate that information to update their own representations. This message passing process is akin to nodes in a graph exchanging messages or signals, hence the name “Neural Message Passing.”

By passing messages through the graph, GNNs can leverage the local neighborhood information of each node to refine and update their representations. This enables GNNs to capture both the structural information of the graph and the features associated with each node. As a result, GNNs can learn rich and expressive representations of nodes that incorporate both local and global context.

Neural Message Passing has gained significant attention in the field of graph representation learning due to its ability to model complex relationships and dependencies in graph-structured data. It provides a powerful framework for solving various graph-related tasks, such as node classification, link prediction, graph classification, and recommendation.

At each iteration k of message passing in a GNN, a hidden embedding h(k)_u for each node u is updated based on information aggregated from its graph neighborhood N(u). This update can be expressed as follows:

h(k)_u = UPDATE(k)( h(k-1)_u, AGGREGATE(k)({h(k-1)_v, ∀v ∈ N(u)}) )

Here, UPDATE and AGGREGATE are arbitrary differentiable functions, typically implemented as neural networks. The AGGREGATE function takes the embeddings of the nodes in the neighborhood N(u) and condenses them into a message m(k)_N(u). The UPDATE function combines this message with the previous embedding h(k-1)_u to produce the updated embedding h(k)_u. The initial embeddings at iteration k = 0 are the input features, h(0)_u = x_u, for all nodes u. After K iterations of message passing, the output of the final layer defines the embedding of each node, z_u = h(K)_u.

It’s important to note that GNNs are permutation equivariant by design: because the AGGREGATE function operates on a set of neighbor embeddings, its output does not depend on the order in which the neighbors are listed.

Regarding node features, GNNs require input node features x_u, ∀u ∈ V, unlike the shallow embedding methods discussed earlier in the series. In cases where rich node features are available, such as gene expression features or text features, they can be used directly. If no node features are present, alternative options can be employed: one option is to use simple node statistics (for example, node degree) as features, and another is to use identity features, i.e., a one-hot indicator vector for each node.

Pseudocode

Here are the steps of neural message passing broken down, with notes on how each step maps onto PyTorch Geometric; a minimal MessagePassing sketch follows the list.

Each node in the graph receives an initial embedding, which can be the node’s input features.

  • PyTorch Geometric: Assign initial node embeddings using the x attribute of the Data object.

During each iteration of message passing:

  • The node aggregates information from its neighboring nodes.
  • PyTorch Geometric: Subclass the MessagePassing class; the message method builds the messages sent along each edge, and the aggr argument (e.g. 'add' or 'mean') controls how incoming messages are aggregated.

This aggregated information is combined with the node’s current embedding using an update function.

  • PyTorch Geometric: Implement the update function within the MessagePassing class to combine the aggregated messages with the current node embeddings.

The node’s embedding is updated based on the combined information.

  • PyTorch Geometric: The tensor returned by the update method becomes each node’s new embedding for this round of message passing.

The updated embeddings are passed to the next iteration of message passing.

  • PyTorch Geometric: Each call to a MessagePassing layer’s propagate method runs one round of message passing; stack several layers (or call the layer repeatedly) to perform multiple iterations.

After a certain number of iterations, the final embeddings are used to represent the nodes in the graph.

  • PyTorch Geometric: The output tensor of the final layer holds the node embeddings (it can also be written back to the x attribute of the Data object if convenient).

The aggregation, update, and iteration steps are performed by neural networks, which can learn to capture complex patterns in the graph.

  • PyTorch Geometric: Define the neural network models for the aggregation and update functions, which will be used within the MessagePassing class.

The resulting node embeddings can be used for various downstream tasks, such as node classification or link prediction.

  • PyTorch Geometric: Utilize the generated node embeddings for downstream tasks, such as feeding them into a classifier or predictor for node-related tasks.
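
Putting these steps together, here is a minimal sketch of a custom layer built on PyTorch Geometric’s MessagePassing class. It assumes torch_geometric is installed; the layer name MeanConv, the toy edge list, and the feature dimensions are illustrative choices rather than anything prescribed by the library.

import torch
import torch.nn as nn
from torch_geometric.data import Data
from torch_geometric.nn import MessagePassing

class MeanConv(MessagePassing):
    def __init__(self, in_channels, out_channels):
        super().__init__(aggr='mean')  # AGGREGATE: mean over incoming messages
        self.msg_lin = nn.Linear(in_channels, out_channels)
        self.update_lin = nn.Linear(in_channels + out_channels, out_channels)

    def forward(self, x, edge_index):
        # Runs message -> aggregate -> update for one round of message passing
        return self.propagate(edge_index, x=x)

    def message(self, x_j):
        # x_j holds the features of the neighboring (source) node for each edge
        return self.msg_lin(x_j)

    def update(self, aggr_out, x):
        # Combine the aggregated message with the node's current embedding
        return self.update_lin(torch.cat([x, aggr_out], dim=-1))

# Toy graph with 3 nodes and 4 directed edges
edge_index = torch.tensor([[0, 1, 1, 2],
                           [1, 0, 2, 1]])
data = Data(x=torch.randn(3, 16), edge_index=edge_index)

layer = MeanConv(in_channels=16, out_channels=32)
h = layer(data.x, data.edge_index)
print(h.shape)  # torch.Size([3, 32])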

Code

The code below looks different from the PyTorch Geometric workflow above, but it implements the same concept, so do not be alarmed: it is a plain PyTorch version that operates on a dense adjacency matrix.

import torch
import torch.nn as nn


class GNNLayer(nn.Module):
    def __init__(self, input_dim, output_dim):
        super(GNNLayer, self).__init__()
        self.update_fn = nn.Linear(input_dim, output_dim)
        self.aggregate_fn = nn.Linear(input_dim, output_dim)

    def forward(self, h, adj_matrix):
        messages = torch.matmul(adj_matrix, h)    # Aggregating messages from neighbors
        aggregated = self.aggregate_fn(messages)  # Applying the aggregate function
        updated = self.update_fn(h) + aggregated  # Updating the node embeddings
        return updated


class GNN(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim, num_layers):
        super(GNN, self).__init__()
        self.layers = nn.ModuleList()
        self.layers.append(GNNLayer(input_dim, hidden_dim))
        for _ in range(num_layers - 2):
            self.layers.append(GNNLayer(hidden_dim, hidden_dim))
        self.layers.append(GNNLayer(hidden_dim, output_dim))

    def forward(self, features, adj_matrix):
        h = features.clone()
        for layer in self.layers:
            h = layer(h, adj_matrix)
        return h


# Example usage
input_dim = 16
hidden_dim = 32
output_dim = 8
num_layers = 3

# Generating random node features and adjacency matrix
num_nodes = 10
features = torch.randn(num_nodes, input_dim)
adj_matrix = torch.randn(num_nodes, num_nodes)

# Creating a GNN model
gnn = GNN(input_dim, hidden_dim, output_dim, num_layers)

# Forward pass through the GNN
embeddings = gnn(features, adj_matrix)
print("Node embeddings:")
print(embeddings)

In this code, we define a simplified GNN with GNNLayer as the building block. Each GNNLayer consists of an update function and an aggregate function implemented as linear layers (nn.Linear). The forward pass of GNNLayer takes the node embeddings h and the adjacency matrix adj_matrix as inputs, aggregates messages from neighboring nodes, applies the aggregate function, and updates the node embeddings.

In this implementation, the update_fn and aggregate_fn functions in the GNNLayer class are learnable linear layers (nn.Linear modules) rather than fixed, hand-written operations. Learnable transformations give the model flexibility and capacity: each layer can learn its own way of combining self and neighbor information during message passing.

Note that a single linear layer is still a linear map. The non-linearity in a full GNN typically comes from elementwise activation functions (e.g., ReLU) applied between layers, which this simplified sketch omits; with activations added, stacking such layers lets the model capture higher-order interactions and more intricate patterns in the graph structure.

By using neural networks as the update and aggregate functions, the GNN can learn a different transformation at every layer, yielding more expressive node representations. The GNN class composes multiple such layers: in the forward pass, the node embeddings are propagated through the layers, and the final layer's output is used as the node embeddings.

To demonstrate the usage, we create random node features and an adjacency matrix. Then, we instantiate a GNN model with the specified input, hidden, and output dimensions and the number of layers. Finally, we pass the features and adjacency matrix through the GNN model, resulting in node embeddings.

In the example code provided, the number of layers in the GNN model is determined by the num_layers parameter.

In the given code snippet, the num_layers is set to 3, which means there are 3 GNN layers in total. The first layer takes the input node features and transforms them to the hidden dimension (hidden_dim). The subsequent layers take the hidden dimension as input and apply the GNN operations iteratively until reaching the final layer, which outputs node embeddings of dimension output_dim.

You can adjust the value of num_layers to add or remove layers in the GNN model according to your specific needs.

Node-level vs. graph-level equations provide two different ways to write the same operations in graph neural networks (GNNs). Here is a simplified explanation of each:

Node-level equations: In GNNs, we often define the core message-passing operations at the node level. This approach involves each node aggregating information from its neighbors.

import torch
import torch.nn as nn


class GNNLayer(nn.Module):
    def __init__(self, input_dim, output_dim):
        super(GNNLayer, self).__init__()
        self.update_fn = nn.Linear(input_dim, output_dim)
        self.aggregate_fn = nn.Linear(input_dim, output_dim)

    def forward(self, h, adj_matrix):
        messages = torch.matmul(adj_matrix, h)    # Aggregating messages from neighbors
        aggregated = self.aggregate_fn(messages)  # Applying the aggregate function
        updated = self.update_fn(h) + aggregated  # Updating the node embeddings
        return updated


# Example usage
input_dim = 16
output_dim = 8
num_nodes = 10
features = torch.randn(num_nodes, input_dim)
adj_matrix = torch.randn(num_nodes, num_nodes)
gnn_layer = GNNLayer(input_dim, output_dim)
embeddings = gnn_layer(features, adj_matrix)

Graph-level equations: Some GNNs can be defined using graph-level equations, which provide a more concise representation. These equations operate on the entire graph rather than individual nodes.

import torch
import torch.nn as nn


class GNN(nn.Module):
    def __init__(self, input_dim, output_dim):
        super(GNN, self).__init__()
        self.linear = nn.Linear(input_dim, output_dim)

    def forward(self, h, adj_matrix):
        # Aggregate over the whole graph and transform in one matrix operation
        updated = self.linear(torch.matmul(adj_matrix, h))
        return updated


# Example usage
input_dim = 16
output_dim = 8
num_nodes = 10
features = torch.randn(num_nodes, input_dim)
adj_matrix = torch.randn(num_nodes, num_nodes)
gnn = GNN(input_dim, output_dim)
embeddings = gnn(features, adj_matrix)

In both examples, the node-level and graph-level equations are implemented using PyTorch. The node-level equation involves aggregating messages from neighbors and updating node embeddings using neural network layers. The graph-level equation directly operates on the entire graph by applying linear transformations.

Self-loops

Adding self-loops to the input graph and simplifying the message passing by aggregating information from the node’s neighbors as well as the node itself can be helpful for several reasons:

  1. Implicit Update: With self-loops, there is no need for an explicit update function. The update step is implicitly defined through the aggregation method. This simplification can make the model more concise and alleviate overfitting.
  2. Parameter Sharing: Adding self-loops in the case of the basic GNN is equivalent to sharing parameters between the matrices responsible for self-information (W_self) and neighbor information (W_neigh). This parameter sharing can reduce the number of trainable parameters in the model and improve computational efficiency.
  3. Regularization: Self-loops can provide regularization effects by incorporating the node’s own information in the aggregation process. This regularization can enhance the model’s generalization capabilities and prevent overfitting.

However, it’s important to note that this simplification comes at the cost of limiting the expressivity of the GNN. By treating the node’s own information and the neighbor information equally, the model may struggle to differentiate between the two sources of information. This limitation should be considered when applying the simplified message passing approach.

import torch
import torch.nn as nn


class GNN(nn.Module):
    def __init__(self, input_dim, output_dim):
        super(GNN, self).__init__()
        self.linear = nn.Linear(input_dim, output_dim)

    def forward(self, h, adj_matrix):
        # Add self-loops by adding the identity matrix to the adjacency matrix
        adj_matrix_with_self = adj_matrix + torch.eye(adj_matrix.size(0))
        updated = self.linear(torch.matmul(adj_matrix_with_self, h))
        return updated


# Example usage
input_dim = 16
output_dim = 8
num_nodes = 10
features = torch.randn(num_nodes, input_dim)
adj_matrix = torch.randn(num_nodes, num_nodes)
gnn = GNN(input_dim, output_dim)
embeddings = gnn(features, adj_matrix)

In this modified code, adj_matrix_with_self is created by adding an identity matrix (torch.eye(adj_matrix.size(0))) to the original adjacency matrix (adj_matrix). This addition of self-loops ensures that each node includes its own information during the message passing process.

You can consider adding self-loops to the adjacency matrix when you want to incorporate a node’s own information into the message passing process. Here are a few scenarios where adding self-loops can be beneficial:

  1. Node Self-Information: If the information of a node itself is relevant and should influence its own embedding, adding self-loops allows the node to aggregate information from its neighbors as well as itself. This can be useful when the node’s own attributes or characteristics play a significant role in the task at hand.
  2. Regularization: Adding self-loops can act as a form of regularization by including the node’s own information during message passing. This can help prevent overfitting and enhance the model’s generalization capabilities.
  3. Graph Structure Importance: In certain graph-based tasks, capturing the overall graph structure and preserving connectivity patterns is crucial. By including self-loops, you ensure that each node considers its own connections when aggregating information, leading to better preservation of the graph structure.

It’s important to note that not all GNN architectures or tasks require the addition of self-loops. It depends on the specific characteristics of your dataset and the nature of the task you are trying to solve. It is often a hyperparameter choice that you can experiment with and evaluate based on the performance and behavior of your model.

Neighborhood aggregation methods

The basic GNN model can be improved and generalized by exploring different neighborhood aggregation methods. In this section, we discuss the concept of neighborhood normalization to address the instability and sensitivity to node degrees in the basic aggregation operation. Normalization ensures that the aggregation is not biased towards nodes with higher degrees.

Neighborhood Normalization:

  • The basic aggregation operation sums the neighbor embeddings but can be unstable and sensitive to node degrees.
  • One solution is to normalize the aggregation based on the degrees of the nodes involved.
  • The simplest approach is to take an average of the embeddings.
  • Another successful normalization method is symmetric normalization, which considers the degrees of both the node being aggregated and its neighbors.

Graph Convolutional Networks (GCNs):

  • One popular GNN model, the graph convolutional network (GCN), employs symmetric normalization and self-loop update.
  • The message passing function in GCN combines symmetric-normalized aggregation and an elementwise non-linearity.

To Normalize or Not to Normalize?

  • Proper normalization is essential for stable and strong GNN performance.
  • However, normalization can lead to a loss of information and can obscure structural graph features.
  • The decision to use normalization depends on the specific application and the balance between the importance of node features and structural information.

Code examples:

import torch
import torch.nn as nn


class GNN(nn.Module):
    def __init__(self, input_dim, output_dim):
        super(GNN, self).__init__()
        self.linear = nn.Linear(input_dim, output_dim)

    def forward(self, h, adj_matrix):
        # Divide each node's aggregated messages by its degree (the row sum of the adjacency matrix)
        normalized_agg = torch.matmul(adj_matrix, h) / adj_matrix.sum(dim=1, keepdim=True)
        updated = self.linear(normalized_agg)
        return updated


# Example usage
input_dim = 16
output_dim = 8
num_nodes = 10
features = torch.randn(num_nodes, input_dim)
adj_matrix = torch.randn(num_nodes, num_nodes)
gnn = GNN(input_dim, output_dim)
embeddings = gnn(features, adj_matrix)

In this example, we normalize the aggregation by dividing each node’s aggregated messages by its degree, i.e., the sum of its row in the adjacency matrix. This ensures that the aggregation is normalized based on the degrees of the nodes involved. The normalized aggregation is then passed through a linear layer to update the node embeddings.
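
For comparison, here is a sketch of the symmetric (GCN-style) normalization mentioned above, which also adds self-loops and an elementwise non-linearity. The random symmetric binary adjacency matrix is only for demonstration.

import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    def __init__(self, input_dim, output_dim):
        super(GCNLayer, self).__init__()
        self.linear = nn.Linear(input_dim, output_dim)

    def forward(self, h, adj_matrix):
        # Add self-loops so each node also aggregates its own features
        A_hat = adj_matrix + torch.eye(adj_matrix.size(0))
        # Symmetric normalization: D^{-1/2} A_hat D^{-1/2}
        deg = A_hat.sum(dim=1)
        D_inv_sqrt = torch.diag(deg.pow(-0.5))
        A_norm = D_inv_sqrt @ A_hat @ D_inv_sqrt
        # Aggregate, transform, and apply an elementwise non-linearity
        return torch.relu(self.linear(A_norm @ h))

# Example usage with a random symmetric binary adjacency matrix
num_nodes, input_dim, output_dim = 10, 16, 8
features = torch.randn(num_nodes, input_dim)
adj_matrix = (torch.rand(num_nodes, num_nodes) > 0.7).float()
adj_matrix = ((adj_matrix + adj_matrix.t()) > 0).float()
gcn_layer = GCNLayer(input_dim, output_dim)
embeddings = gcn_layer(features, adj_matrix)
print(embeddings.shape)  # torch.Size([10, 8])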

Generalized Neighborhood Aggregation:

  • The AGGREGATE operator in GNN models can be improved and generalized.
  • Neighborhood normalization can be used to address instability and sensitivity to node degrees.
  • Set pooling, based on permutation invariant neural networks, can provide a more sophisticated aggregation function.
  • Janossy pooling, using permutation-sensitive functions, is another alternative approach that is more powerful than simple sum or mean aggregations.
  • Neighborhood attention, assigning attention weights to neighbors, is a popular strategy to improve the aggregation layer in GNNs (a toy sketch follows this list).
  • GNN models with multi-headed attention are closely related to the transformer architecture used in NLP and computer vision.
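
As referenced in the list above, here is a toy single-head attention aggregation. This is a simplified sketch for intuition, not the exact GAT or transformer formulation; the class name and dimensions are illustrative.

import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionAggregation(nn.Module):
    def __init__(self, input_dim, output_dim):
        super().__init__()
        self.query = nn.Linear(input_dim, output_dim)
        self.key = nn.Linear(input_dim, output_dim)
        self.value = nn.Linear(input_dim, output_dim)

    def forward(self, h, adj_matrix):
        q, k, v = self.query(h), self.key(h), self.value(h)
        scores = q @ k.t() / (k.size(-1) ** 0.5)  # pairwise attention scores
        # Mask out non-neighbors so attention only flows along existing edges
        scores = scores.masked_fill(adj_matrix == 0, float('-inf'))
        attn = F.softmax(scores, dim=-1)          # attention weights per neighborhood
        return attn @ v                           # attention-weighted aggregation

# Example usage
num_nodes, input_dim, output_dim = 10, 16, 8
features = torch.randn(num_nodes, input_dim)
adj_matrix = (torch.rand(num_nodes, num_nodes) > 0.5).float()
adj_matrix.fill_diagonal_(1.0)  # self-loops so no row is entirely masked out
layer = AttentionAggregation(input_dim, output_dim)
embeddings = layer(features, adj_matrix)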

Over-smoothing and Neighborhood Influence:

  • Over-smoothing is a common issue in GNNs where node representations become too similar after multiple iterations, limiting the ability to capture longer-term dependencies.
  • The influence of a node’s input feature on the final embeddings of other nodes can be quantified using the Jacobian matrix.
  • The influence of a node in a GNN with self-loop update or basic GNN models decreases as the number of layers increases, leading to over-smoothing.
  • Deeper models can hurt performance as they lose information about local neighborhood structures and embeddings become over-smoothed.

Over-smoothing and neighborhood influence in GNNs can arise in various scenarios. Here are a couple of examples:

Social Network Analysis:

  • Scenario: Analyzing a social network to predict user preferences or behaviors.
  • Over-smoothing: As the GNN iteratively aggregates information from the social connections of each user, the representations of different users become increasingly similar, losing individuality and specific characteristics.
  • Neighborhood Influence: The influence of a user’s initial features (e.g., demographics, interests) on the final representations of other users decreases as the number of GNN layers increases. This means that local neighborhood information becomes less influential, and the learned embeddings become more influenced by the overall structure of the social network.

Recommendation Systems:

  • Scenario: Building a recommendation system based on user-item interaction data.
  • Over-smoothing: After multiple iterations of GNN message passing, the representations of items become highly similar, leading to difficulties in distinguishing between different items. The recommendations may become generic and fail to capture the nuances and specific characteristics of individual items.
  • Neighborhood Influence: The influence of an item’s initial features (e.g., genre, popularity) on the final representations of other items diminishes with increasing GNN layers. This means that the contributions of nearby items in the user-item interaction graph become less significant, potentially leading to inadequate recommendations for niche or less-connected items.

In both scenarios, over-smoothing and neighborhood influence can limit the expressive power of the GNN model, making it challenging to capture fine-grained details, individual differences, and local relationships within the graph data.
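
The effect is easy to see numerically. The sketch below repeatedly applies pure mean aggregation (no learned weights) and tracks the average pairwise cosine similarity between node embeddings; the exact numbers depend on the random graph, but the similarity typically climbs toward 1 as the number of rounds grows, which is over-smoothing in miniature.

import torch
import torch.nn.functional as F

torch.manual_seed(0)
num_nodes, dim = 10, 16
h = torch.randn(num_nodes, dim)

# Random symmetric binary adjacency matrix with self-loops
adj = (torch.rand(num_nodes, num_nodes) > 0.6).float()
adj = ((adj + adj.t()) > 0).float()
adj.fill_diagonal_(1.0)
P = adj / adj.sum(dim=1, keepdim=True)  # row-normalized (mean) aggregation

def mean_pairwise_cosine(x):
    x = F.normalize(x, dim=1)
    sim = x @ x.t()
    off_diag = sim[~torch.eye(num_nodes, dtype=torch.bool)]
    return off_diag.mean().item()

for k in range(1, 9):
    h = P @ h  # one round of mean aggregation
    print(f"round {k}: mean pairwise cosine similarity = {mean_pairwise_cosine(h):.3f}")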

Supplementary points

  • Over-smoothing is a common issue in GNNs where node-specific information is lost after multiple iterations of message passing.
  • Concatenation and skip-connections are techniques used to alleviate over-smoothing by preserving information from previous rounds of message passing during the update step.
import torch
import torch.nn as nn


class GNN(nn.Module):
    def __init__(self, input_dim, output_dim):
        super(GNN, self).__init__()
        self.linear1 = nn.Linear(input_dim, output_dim)
        self.linear2 = nn.Linear(input_dim + input_dim, output_dim)

    def forward(self, h, adj_matrix):
        # Initial node representations
        h0 = self.linear1(h)

        # Message passing with concatenation-based skip connections
        m = torch.matmul(adj_matrix, h)              # aggregated messages from neighbors
        h1 = self.linear2(torch.cat((h, m), dim=1))  # combine current features with messages
        h_updated = h1 + h0  # Skip connection: add previous layer representation

        return h_updated

In this example, we define a GNN module with two linear layers: linear1 and linear2. In the forward method, we perform the initial linear transformation (linear1) on the input node features h to obtain h0.

  • During message passing, we calculate the aggregated messages m by multiplying the adjacency matrix adj_matrix with the current node representations h. Then, we concatenate the current node representations h with the aggregated messages m along the feature dimension using torch.cat. The concatenated tensor is passed through the second linear layer (linear2) to obtain h1.
  • To preserve information from the previous round of message passing, we add a skip connection by adding the initial node representations h0 to the updated node representations h1. This helps in alleviating over-smoothing and ensures that information from previous rounds is retained.
  • Concatenation-based skip connections involve concatenating the output of the base update function with the node’s previous-layer representation.
  • Linear interpolation skip connections interpolate between the previous representation and the updated representation based on neighborhood information.
  • These skip-connection methods help disentangle information and improve the numerical stability of optimization, allowing for deeper GNN models.
  • Gated updates apply techniques from recurrent neural networks (RNNs) to update the hidden state of each node based on observations from the neighbors.
  • Gated updates are effective at preventing over-smoothing and facilitating deep GNN architectures, particularly in tasks requiring complex reasoning over the graph structure.
  • Jumping knowledge connections leverage representations at each layer of message passing, allowing for improved final node representations by combining representations from multiple layers.
  • Adding jumping knowledge connections is a useful strategy to employ, where node embeddings from each layer are concatenated or combined using different functions to enhance performance across various tasks (a small sketch follows below).
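
To make the last point concrete, here is a sketch of jumping-knowledge-style connections in the same dense-adjacency style used throughout this article: embeddings from every layer (plus the input features) are concatenated and projected to form the final node representation. The class name JKGNN and all dimensions are illustrative.

import torch
import torch.nn as nn

class JKGNN(nn.Module):
    def __init__(self, input_dim, hidden_dim, num_layers, output_dim):
        super().__init__()
        dims = [input_dim] + [hidden_dim] * num_layers
        self.layers = nn.ModuleList(
            nn.Linear(dims[i], dims[i + 1]) for i in range(num_layers)
        )
        # Final projection over the concatenation of all layer outputs
        self.jk_proj = nn.Linear(input_dim + hidden_dim * num_layers, output_dim)

    def forward(self, h, adj_matrix):
        layer_outputs = [h]                        # keep the input features as well
        for layer in self.layers:
            h = torch.relu(layer(adj_matrix @ h))  # simple aggregation + transform
            layer_outputs.append(h)
        # Jumping knowledge: concatenate every layer's embedding before the final projection
        return self.jk_proj(torch.cat(layer_outputs, dim=1))

# Example usage
num_nodes = 10
features = torch.randn(num_nodes, 16)
adj_matrix = (torch.rand(num_nodes, num_nodes) > 0.5).float()
model = JKGNN(input_dim=16, hidden_dim=32, num_layers=3, output_dim=8)
embeddings = model(features, adj_matrix)
print(embeddings.shape)  # torch.Size([10, 8])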
