Scaling GNNs with Graph Rewiring

vishnu pradeep
8 min read · Mar 12, 2022


An intuitive guide to rewiring Graph Neural Networks with Ricci flows

Prerequisites: general concepts of GNNs and an understanding of spectral graph theory. This article is aimed at machine learning engineers.

In most of classical deep learning, models show an increase in performance as more layers are added. In graph neural networks, however, adding more layers leads to a substantial drop in performance compared to shallow models. Ideally, we would want more layers to model complex relationships and long-range information: when we add more layers to a GNN, each node gains access to information from nodes that are farther away, which might help model important long-range dependencies. But this doesn't work as expected, due to over-squashing in GNNs. Let's define it formally.

The distortion of information flowing from distant nodes, as a factor limiting the efficiency of message passing for tasks relying on long-distance interactions, is referred to as 'over-squashing'.

Now let's look at a formulation from our existing knowledge of message-passing GNNs (MPNNs) to understand the problem better. Consider the (l + 1)-th layer output of a generic MPNN:
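The update can be written as follows (reconstructed here in the notation of [1], as a sketch standing in for the missing figure):

```latex
h_i^{(l+1)} = \phi_l\!\left( h_i^{(l)},\; \sum_{j} \hat{A}_{ij}\, \psi_l\!\left(h_i^{(l)}, h_j^{(l)}\right) \right),
\qquad
\hat{A} = \tilde{D}^{-1/2}\,(A + I)\,\tilde{D}^{-1/2}
```

where A is the adjacency matrix and D̃ is the diagonal degree matrix of A + I.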

(here ψ is the message function and φ the update function; we use the augmented normalized adjacency matrix Â to propagate messages from each node to its neighbors, which simply amounts to a degree normalization of the messages ψ)

The functions φ and ψ are Lipschitz, with gradients bounded by α and β at every layer l:

|∇φ_l| ≤ α, |∇ψ_l| ≤ β

Consider a node i and another node s that is r hops away from it.

Over-squashing can then be understood as the representation of node i failing to be affected by some input feature of node s at distance r from node i. To quantify this, we use the Jacobian:
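Concretely (restating Lemma 1 of [1]), the sensitivity of node i's representation after r + 1 layers to the input feature x_s of node s is bounded by:

```latex
\left| \frac{\partial h_i^{(r+1)}}{\partial x_s} \right|
\;\le\;
(\alpha \beta)^{\,r+1} \left( \hat{A}^{\,r+1} \right)_{is}
```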

Smaller values of the Jacobian indicate poor information propagation. Since the gradients of φ and ψ are bounded, the value of the Jacobian is controlled by the entries of the powers of Â. In other words, it is the structure of the graph that determines the over-squashing behavior of the nodes. We refer to the structural properties of the graph that lead to over-squashing as a bottleneck.

Graph rewiring attempts to produce a new graph with a different edge structure that reduces the bottleneck and hence potentially alleviates the over-squashing of information.

The over-squashing problem in graphs can be addressed effectively using the concept of Ricci curvature from Riemannian geometry and the idea of reverse Ricci flows. Let's understand this in detail:

Let us start by understanding the notion of sectional curvature from the diagram below:

Sectional curvature (a curvature where the parallel displacement is restricted to a 2-dimensional facet) describes the curvature of a Riemannian manifold restricted to a lower-dimensional space. When we average sectional curvature values over all directions w, we obtain Ricci curvature; that is, Ricci curvature is the average of sectional curvature over all facets.

Ricci curvature can be thought of as a quantitative measure of how the space is curved.

But how do we use these ideas with graphs? We need a discrete analogue of Ricci curvature on graphs.

Consider an analogue of the vector xy from the continuous case (figure 5) in the graph domain. In the case of graphs, xy is an edge.

For every edge xy, consider the distance from x's neighbors to y's neighbors and compare it with the length of xy. To compare the neighbors of x and y, we place distributions µ1 and µ2 on x's and y's neighbors respectively. We then use the optimal transport distance to measure how far x's neighbors are from y's neighbors.

Let's make it simple: given nodes x and y connected by the edge xy, look at how much the neighbors of x and y have either moved closer to each other or farther apart, compared to the initial distance xy.

Let's take an example graph to understand this: consider a graph with nodes x0, x1, x2, x3, x4 and y0, y1, y2, y3, y4. There exists an edge x0y0 (the curvature is computed with respect to this edge). We transport the neighborhood of x0, i.e. {x1, x2, x3, x4}, to the neighborhood of y0, i.e. {y1, y2, y3, y4}. The total cost to transport the neighborhood of x0 to the neighborhood of y0 is 1, since we need to move each corresponding node by one unit. This is the same as the initial distance between x0 and y0 (unit distance), which means the curvature value is zero.
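To make this concrete, here is a minimal, self-contained sketch of the Ollivier-Ricci computation for the example above, with the distributions taken as uniform over the neighborhoods and W1 computed by a small linear program (in practice you would reach for a dedicated library such as GraphRicciCurvature or POT):

```python
import networkx as nx
import numpy as np
from scipy.optimize import linprog

def ollivier_ricci(G, x, y):
    """Ollivier-Ricci curvature of edge (x, y):
    kappa = 1 - W1(mu_x, mu_y) / d(x, y),
    with mu_x, mu_y uniform over the neighbors of x and y."""
    Nx, Ny = list(G.neighbors(x)), list(G.neighbors(y))
    mu_x = np.full(len(Nx), 1.0 / len(Nx))
    mu_y = np.full(len(Ny), 1.0 / len(Ny))
    dist = dict(nx.all_pairs_shortest_path_length(G))
    # Cost of moving mass from each neighbor of x to each neighbor of y.
    C = np.array([[dist[u][v] for v in Ny] for u in Nx], dtype=float)
    n, m = C.shape
    # W1 as a linear program over transport plans T >= 0
    # with row sums mu_x and column sums mu_y.
    A_eq = np.zeros((n + m, n * m))
    for a in range(n):
        A_eq[a, a * m:(a + 1) * m] = 1.0   # row-sum constraints
    for b in range(m):
        A_eq[n + b, b::m] = 1.0            # column-sum constraints
    res = linprog(C.ravel(), A_eq=A_eq, b_eq=np.concatenate([mu_x, mu_y]))
    return 1.0 - res.fun / dist[x][y]

# The grid-like example above: edge x0-y0 with matched neighbors xi-yi.
G = nx.Graph([("x0", "y0")])
for i in range(1, 5):
    G.add_edges_from([("x0", f"x{i}"), ("y0", f"y{i}"), (f"x{i}", f"y{i}")])
print(ollivier_ricci(G, "x0", "y0"))  # -> 0.0, zero curvature
```

Every unit of mass has to travel exactly one hop here, so W1 equals the length of x0y0 and the curvature comes out to zero, matching the argument above. (In this standard formulation each neighborhood also includes the opposite endpoint of the edge, which does not change the result in this example.)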

In such a uniform, grid-like case the curvature is zero. In a tree-shaped graph the neighboring nodes diverge, resulting in negative curvature; such tree-like graphs also exhibit exponential growth of volume.

In discrete spherical geometry, the edges meet to form a triangle, corresponding to positive curvature. In the Euclidean case, the edges stay parallel and form 4-cycles, corresponding to zero curvature. In the discrete hyperbolic case, the mutual distance of the edge endpoints grows, corresponding to a tree-like geometry, i.e. negative curvature.

Now let us understand Ricci flow, which we view as a curvature-guided process.

Under the Ricci flow, regions of the manifold with positive sectional curvature tend to shrink, while regions of negative sectional curvature tend to expand and spread out. The analogue holds for graphs, where the endpoints of positively curved edges tend to come closer together and those of negatively curved edges move apart.

Over the years, mathematicians and computer science researchers have proposed various approaches to defining Ricci curvature on graphs. The definition we started with used a notion of optimal transport based on Ollivier's original formulation. For GNNs, we are more interested in a different formulation of Ricci curvature: our goal is the formulation proposed last year by the team of graph deep learning OG Michael Bronstein.

Geometrically, Ricci curvature controls how fast the volume of an object grows. It also controls the volume of the overlap of two objects in terms of their radii and the distance between their centers. Following Ollivier's original optimal-transport formulation, consider two balls in Riemannian geometry that overlap heavily; then it is much easier to transport the mass of one ball to the other, i.e. the Wasserstein distance is lower. Analogously, in the graph case, when two vertices share many triangles, the transportation distance is smaller, and the curvature is therefore correspondingly larger. When we formulate a transport plan, the existence of triangles containing neighboring vertices saves a lot of transport cost. A possible explanation for using triangles, given in early papers, is that on a graph the analog of geodesics starting in different directions but eventually approaching each other again is a triangle. Wasserstein distance can also be explained in terms of overlaps of quadrilaterals, pentagons, etc.

Having covered the earlier formulations, let's dive into the details of this recent work [1]. It proposes the Balanced Forman curvature, improving upon the existing formulations:

Triangles relate to positive curvature, 4-cycles correspond to zero curvature, and the remaining outgoing edges relate to negative curvature (as in a tree). This intuition was used to arrive at the formulation stated in the paper.
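For reference, the Balanced Forman curvature of an edge i ∼ j, restated here from Definition 1 of [1] (see the paper for the precise conventions), is:

```latex
\mathrm{Ric}(i,j) :=
\frac{2}{d_i} + \frac{2}{d_j} - 2
+ 2\,\frac{|\sharp_{\Delta}(i,j)|}{\max\{d_i, d_j\}}
+ \frac{|\sharp_{\Delta}(i,j)|}{\min\{d_i, d_j\}}
+ \frac{(\gamma_{\max})^{-1}}{\max\{d_i, d_j\}}
  \left( |\sharp_{\square}^{i}| + |\sharp_{\square}^{j}| \right)
```

where d_i, d_j are the degrees, ♯Δ(i, j) counts the triangles based at the edge, ♯□^i counts the neighbors of i forming 4-cycles based at the edge (without diagonals), and γmax normalizes by the maximal number of such 4-cycles through a common node; Ric(i, j) is set to 0 when min{d_i, d_j} = 1.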

The edges that have negative curvature are responsible for the over-squashing problem. These edges are analogous to bridges: for nodes in two parts of the network to interact, messages need to go through these bridge-like edges. This connects to the notion of the Cheeger constant, a numerical measure of whether or not a graph has a bottleneck. Ideally we want a large Cheeger constant, which indicates fewer bottlenecks and therefore a lower chance of the network failing if a few nodes fail; a small Cheeger constant indicates a higher chance of bottlenecks.
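For concreteness, one standard (edge-expansion) form of the Cheeger constant is:

```latex
h(G) \;=\; \min_{\substack{S \subset V \\ 0 < |S| \le |V|/2}} \frac{|\partial S|}{|S|}
```

where ∂S is the set of edges with exactly one endpoint in S: a small h(G) means some small cut ∂S separates a large chunk of the graph, i.e. a bottleneck.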

The graph rewiring mechanism:

The paper proposes Stochastic Discrete Ricci Flow (SDRF) to rewire the graph, following the Ricci flow formulation from the paper, analogous to the continuous domain. Edges are added and removed based on their curvature values, and the rewired graph is compared with the current graph at every iteration. At each iteration, this preprocessing step adds an edge to support the graph's most negatively curved edge, and then removes the most positively curved edge. The algorithm is given as:
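Below is a minimal sketch of that loop. The curvature(G, u, v) helper (assumed, not shown) would implement the Balanced Forman curvature above; tau is the softmax temperature and c_plus the upper curvature bound, both hyperparameters of SDRF in [1]. This illustrates the control flow, not the paper's reference implementation:

```python
import numpy as np

def sdrf(G, curvature, max_iters=100, tau=1.0, c_plus=None):
    """Sketch of Stochastic Discrete Ricci Flow (SDRF) rewiring
    on a networkx graph G, given an edge-curvature function."""
    G = G.copy()
    for _ in range(max_iters):
        # 1. Find the most negatively curved edge: the worst bottleneck.
        curvs = {e: curvature(G, *e) for e in G.edges()}
        i, j = min(curvs, key=curvs.get)
        # 2. Score candidate edges (k, l) around (i, j) by how much
        #    adding them would raise the curvature of (i, j).
        candidates, gains = [], []
        for k in list(G.neighbors(i)) + [i]:
            for l in list(G.neighbors(j)) + [j]:
                if k != l and not G.has_edge(k, l):
                    G.add_edge(k, l)
                    gains.append(curvature(G, i, j) - curvs[(i, j)])
                    G.remove_edge(k, l)
                    candidates.append((k, l))
        if not candidates:
            break
        # 3. Sample one candidate through a softmax over the gains
        #    (tau -> 0 recovers a greedy choice).
        p = np.exp(np.array(gains) / tau)
        G.add_edge(*candidates[np.random.choice(len(candidates), p=p / p.sum())])
        # 4. Remove the most positively curved edge if it exceeds c_plus.
        curvs = {e: curvature(G, *e) for e in G.edges()}
        u, v = max(curvs, key=curvs.get)
        if c_plus is not None and curvs[(u, v)] > c_plus:
            G.remove_edge(u, v)
    return G
```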

Using the Ricci flow metric comes with some notable advantages.

With spectral embeddings, the removal or addition of a node leads to chaotic changes in the embedding; the Ricci flow metric eliminates this problem, remaining stable under node removal or addition.

The Ricci flow metric also does fairly well at reconstructing the graph when nodes are removed.

Another popular mechanism for graph rewiring uses diffusion (DIGL, Diffusion Improves Graph Learning), but this method suffers from a few disadvantages compared to the curvature-based method discussed above.

DIGL has a limited effect on bottlenecks while changing the graph topology significantly. It tends to connect nodes at small diffusion distance, which may belong to different groups. It works well in the homophilic case (where node features/labels are aligned with the graph structure) but fails in the heterophilic case by injecting noise.
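For comparison, DIGL-style diffusion rewiring is available in PyTorch Geometric as the GDC transform. A sketch (the parameter values here are illustrative, and the API is as of recent torch_geometric versions):

```python
import torch_geometric.transforms as T
from torch_geometric.datasets import Planetoid

# Rewire Cora with personalized-PageRank diffusion (DIGL / GDC),
# keeping the top-128 incoming edges per node after diffusion.
transform = T.GDC(
    self_loop_weight=1,
    normalization_in='sym',
    normalization_out='col',
    diffusion_kwargs=dict(method='ppr', alpha=0.05),
    sparsification_kwargs=dict(method='topk', k=128, dim=0),
    exact=True,
)
dataset = Planetoid(root='/tmp/Cora', name='Cora', transform=transform)
data = dataset[0]  # edge_index/edge_attr now come from the diffused graph
```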

[1] J. Topping, F. Di Giovanni, B. P. Chamberlain, X. Dong, M. M. Bronstein. Understanding Over-Squashing and Bottlenecks on Graphs via Curvature. arXiv:2111.14522 (https://arxiv.org/abs/2111.14522)
