Enhancing Vanilla Graph Neural Networks with Convolutional Neural Networks: The Basis of GCN

AMA
14 min read · Jul 9, 2024


I’ve been busy lately getting my hands dirty with Gemini Pro and LLMs, and keeping up with the pace of ML and AI these days while also writing insightful articles is a challenging task. That aside, I am delighted to continue our exploration of GNNs from our previous article on vanilla graph neural networks. I hope you are as excited as I am to dissect the Graph Convolutional Network (GCN), which is the standard benchmark in most research on graph-structured data.

Introduction to GCN

The GCN architecture was introduced by Kipf and Welling in 2017 to overcome the limitations of the vanilla graph neural network by borrowing concepts from the Convolutional Neural Network (CNN) architecture and applying them to graph data. As we all know, CNNs have been applied extensively to analyzing and processing image data and are foundational for most deep learning on images. We covered the vanilla graph neural network in the previous article using the Facebook Page-Page dataset; we will use the same dataset here to explore how the GCN architecture overcomes some of the limitations of the earlier architecture.

Understanding the GCN Formula

Like most ML enthusiasts, I like to examine the formulas behind models, but at first glance they often look daunting and confusing. Once you understand a formula, however, it saves you a great deal of reading about the architecture: it packs a thousand words into a couple of dozen characters. Below is the original formula of the GCN architecture.

Forward propagation equation for graph convolutional networks. Image credit: Sid Arcidiacono
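Written out, the propagation rule from the Kipf and Welling paper is:

H^{(l+1)} = \sigma\left( \tilde{D}^{-1/2} \, \tilde{A} \, \tilde{D}^{-1/2} \, H^{(l)} \, W^{(l)} \right)

Here Ã = A + I is the adjacency matrix with self-loops added, D̃ is its diagonal degree matrix, H^(l) holds the node features at layer l (with H^(0) being the input features X), W^(l) is the layer's learnable weight matrix, and σ is a non-linear activation such as ReLU.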

Though the formula may seem self-explanatory, it becomes more intuitive when approached through the lens of linear algebra, with visualizations that illustrate the notation and reduce this long formula to the much simpler elements below.

Simplifying the GCN Formula

Some of these formulas may not make sense to you yet, but as you read on we will piece them together and connect the dots, so that the next time you look at the formula you can assemble it like Lego bricks snapping together into a castle. Let's analyze how computer vision models like CNNs influence the GCN architecture, so that the next time you see an image of a connected graph you will have a mental model of how convolutions are applied to graph data, which is at the core of Graph Convolutional Networks.

From CNNs to GCNs: An Analogy

Left: image in Euclidean space. Right: graph in non-Euclidean space. (Zhou et al. 2020)

From the image above, our goal is to teach a computer to make sense of data in the form shown on the right, based on how it has already learned to analyze the image on the left. Note that this is not transfer learning; we use the analogy only because it makes it easy to understand how the GCN architecture is built on ideas from the CNN model. Let's explore how convolutional neural networks influence the architecture of the GCN in the following order:

1. A brief review of convolutional neural networks

2. Explanation and implementation of graph convolutional layers

A Brief Review of Convolutional Neural Networks

Convolutional Neural Networks are widely used for pattern recognition on data with a grid-like structure, such as images. At the core of a CNN are the convolutional layers, which apply a set of learnable filters, also known as kernels, to the input to produce feature maps. The main components of the CNN architecture are listed below, followed by a minimal example.

i. Convolutional layers: Apply a set of learnable filters (kernels) to the input, producing feature maps.

ii. Activation layers (e.g. ReLU): Introduce non-linearity to the network.

iii. Pooling layers: Reduce the spatial size of the feature maps, providing some degree of translation invariance.

iv. Fully connected layers: Perform classification based on the learned features.
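To make these four components concrete, here is a minimal, purely illustrative PyTorch model that stacks them in the order listed; the layer sizes are arbitrary and assume a 1 x 28 x 28 grayscale input:

import torch.nn as nn

# Minimal CNN: convolution -> activation -> pooling -> fully connected classifier
tiny_cnn = nn.Sequential(
    nn.Conv2d(in_channels=1, out_channels=8, kernel_size=3, padding=1),  # i. convolutional layer
    nn.ReLU(),                                                           # ii. activation layer
    nn.MaxPool2d(kernel_size=2),                                         # iii. pooling layer (28x28 -> 14x14)
    nn.Flatten(),
    nn.Linear(8 * 14 * 14, 10),                                          # iv. fully connected layer (10 classes)
)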

The CNN Kernel and Convolution Process

As a quick reminder, the CNN's kernel, also known as a filter, is a small matrix of numbers used to extract particular features from the input image. The kernel convolves across the input image, and at each position a dot product is computed between the kernel and the corresponding patch of the image, represented below by the grayscale matrix. This lets the kernel detect features anywhere in the input image, with the kernel values learned through backpropagation and gradient descent.

If we look at the image of the dog below, a portion of the image is selected as a 3x3 grid of pixels, and a kernel, whose values are updated through backpropagation, is applied to it. The element-wise products are summed to create a single value, which is stored at the center of the output feature map, and this process is repeated for the remaining positions shown in the image below.

Convolutional neural network. Image credit: Soroush Mehraban
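To make that single convolution step concrete, here is a tiny numerical sketch; the patch and kernel values are made up for illustration, not taken from the figure:

import numpy as np

# A hypothetical 3x3 grayscale patch and a 3x3 kernel (illustrative values only)
patch = np.array([[12, 40, 33],
                  [25, 52, 47],
                  [18, 36, 29]])
kernel = np.array([[ 0.1, -0.2,  0.3],
                   [ 0.5,  0.0, -0.4],
                   [-0.1,  0.2,  0.6]])

# Element-wise multiply and sum: this single number becomes one entry of the
# output feature map; the kernel then slides to the next patch of the image
output_value = np.sum(patch * kernel)
print(output_value)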

Mapping CNN Concepts to Graph Data

From the diagram above, we can view the selected 3 x 3 portion of the image as a connected graph, with the squares representing nodes that are all connected to the center node. So if we want to apply convolution to a graph, how do we go about it? The answer is simply to look at the grid-like feature map of an image as a graph, where the square boxes are nodes connected by edges to the center node. This simple analogy lets us treat graph data the same way CNNs treat the interconnected pixels of an image, as seen below.

Thus, from the graph above, we observe that all nodes are connected to the center node, with the node values being the pixel values. The weights of the edges are given by the values of the kernel, and when we take the dot product of the pixel values and the kernel weights, we end up with the same result as the CNN.

Applying Convolution to Graphs

Now that we have a mental model of a graph built from an image, where the nodes represent pixels and the directed edge weights are kernel values, we have a graph network. We could view this network as a friendship network, which we used in previous articles to illustrate how our friends influence our success: the center node represents you, the individual, and the other nodes are your friends. We calculate the weighted average of the neighboring nodes (your friends), weighted by the strength of each relationship, to determine your level of success. Looking at the kernel values, we notice that the value of the center node, 52, is not included in this weighted average. This is the tricky part: that value represents the current state of the node, and it is not being factored in. From the friendship-network perspective, 52 represents your present level of success, while the weighted sum of the neighboring nodes measures your friends' cumulative influence on it. Leaving out your present state amounts to saying that only your friends' influence determines your success. To take your present state into account, we add a self-loop by multiplying the corresponding kernel value, 3.11, by the present state (52 * 3.11) and including that term in the result.
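Here is a tiny numerical sketch of that idea. The center-node value 52 and the self-loop weight 3.11 come from the discussion above; the neighbor values and edge weights are made up purely for illustration:

import numpy as np

center_value = 52.0                                      # the node's own (current) state
self_weight = 3.11                                       # kernel value applied to the center node

neighbor_values = np.array([10.0, 47.0, 63.0, 28.0])     # hypothetical neighbor states
edge_weights = np.array([0.5, -1.2, 0.8, 0.3])           # hypothetical kernel / edge weights

# Aggregation without the self-loop: only the friends' influence is counted
neighbors_only = np.dot(neighbor_values, edge_weights)

# Adding the self-loop keeps the node's own current state in the update
with_self_loop = neighbors_only + self_weight * center_value
print(neighbors_only, with_self_loop)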

Challenges with Non-Uniform Graph Structures

The illustrations above showed how a CNN can be mapped onto graph-like data: it is simply information passing from the neighboring nodes, and the node itself, to the node. But now we have two problems:

i. First, only the center node is updated; none of the other nodes' values are updated.

ii. Not all graphs have a uniform structure as in the case above, or the same number of neighbors per node, the way image or tabular data do.

Graphs come in different shapes and sizes, known as the topology of a graph, and this would be quite challenging to model under the previous assumptions, as seen with the different topologies below.

Applying GCN to Graphs with Varying Topologies

Let's use the second graph above to illustrate how we can apply a GCN to a graph with three nodes. In this case each node's value is not a single number but a 2-dimensional vector. Let's label the nodes 1, 2, and 3. One crucial question is how we represent the relationships between the nodes in this graph. To do so, we use the adjacency matrix, which we covered in the second article of this series on GNNs (see Exploring Graph Properties). The adjacency matrix encodes the edge relationship between each pair of nodes, and between a node and itself (the self-loop), with a 1 where a relationship exists and a 0 where there is none; it is denoted A in the diagram below. The two-dimensional vectors holding the node values are collected in the matrix H.

Going forward, we simply do matrix multiplication of AH, which is analogous to information passing between the nodes and itself, given that the adjacency matrix maps the relationships between the nodes and the self-loop to incorporate the present state. So we proceed to multiply the first row of matrix A by the first column of matrix H.

Matrix Multiplication and Information Passing

From here on, this is just a simple matrix multiplication that passes information between the nodes connected in the adjacency matrix. From the operation below, we observe that the result for N1 (shown in green) combines N1 and N2, which reflects the fact that N1 is connected to N2 and to itself, as encoded in the adjacency matrix A. The same reading can be made for node N2, whose result combines the values of the nodes it is connected to, plus its own value (the self-loop).
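In code, this is a single matrix multiplication. The node features below are the ones used in the questions that follow (Node 1 = [3, 7], Node 2 = [8, -1], Node 3 = [5, 5]); the adjacency matrix assumes, purely for illustration, a three-node path N1-N2-N3 with a self-loop on every node (the figure in the article may use a different topology):

import numpy as np

# Adjacency matrix with self-loops (assumed topology: N1-N2 and N2-N3 edges)
A = np.array([[1, 1, 0],
              [1, 1, 1],
              [0, 1, 1]])

# Node feature matrix H: one row per node, one column per channel
H = np.array([[3, 7],
              [8, -1],
              [5, 5]])

# AH: each node's new features are the sum of its own and its neighbors' features
print(A @ H)
# Under this assumed topology: row 1 = N1 + N2, row 2 = N1 + N2 + N3, row 3 = N2 + N3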

So far, we have treated all the channels and nodes the same. However, this is not the case with real graphs, and we have to account for the following two scenarios:

1. What if the information in some channels is more important than in others? For example, is the information in the first channel, containing 3, 8, and 5, more important than that in the second, containing 7, -1, and 5?

2. What if the information of one node is more important than that of the others, for example Node 1 with values 3 and 7 being more important than Node 2 and Node 3?

To account for the first scenario, where one channel is more important than another, we simply apply a weight matrix, as seen below, and let backpropagation teach the algorithm which channels matter more. To map from one feature space to another, we just change the number of columns in the weight matrix, growing from two columns to three or shrinking to one, and multiply AHW.
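Continuing the sketch from above (same illustrative adjacency and node features), the weight matrix W maps the two input channels to however many output channels we want; its values would normally be learned through backpropagation, so the random values below are only placeholders:

import numpy as np

A = np.array([[1, 1, 0], [1, 1, 1], [0, 1, 1]])   # same assumed adjacency (with self-loops) as above
H = np.array([[3, 7], [8, -1], [5, 5]])           # same node features as above

rng = np.random.default_rng(0)
W = rng.standard_normal((2, 3))   # maps 2 input channels to 3 output channels

# AHW: message passing (A @ H) followed by a learned linear projection (W)
print(A @ H @ W)                  # shape: (3 nodes, 3 output channels)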

The Simplified GCN Formula

Well, my dear friend, this is the graph convolutional layer: AHW, which is simply the adjacency matrix multiplied by the node values and the weights. I hope the formula from the beginning looks less confusing from this perspective. Let's break down the original formula in terms of the simplified one.

This is the original formula seen above, which is composed of the following pieces:

1. Ã (A tilde): the adjacency matrix plus self-loops (A + I), which, as explained earlier, preserves each node's own information (its current state).

2. D̃ (D tilde): the degree matrix of Ã, which counts each node's neighbors (including the self-loop). This is the main difference from the GNN layer we saw in the previous article. It acts as a normalization over the neighbors and thus mitigates the gap between nodes with many neighbors and nodes with few: dividing each node's aggregate by its neighbor count resolves this problem, with N1 being divided by 3, N2 by 2, and N3 by 2.

3. σ (sigma): the function at the beginning of the formula is simply the activation function, ReLU in our case, which adds non-linearity to the model.
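To show that the formula really is just a few matrix operations, here is a minimal from-scratch sketch of a single GCN layer in PyTorch. It uses the symmetric normalization D̃^(-1/2) Ã D̃^(-1/2) from the original paper (the simpler "divide by the number of neighbors" view above corresponds to D̃^(-1) Ã); the graph and weight values are illustrative only, and our actual implementation below relies on PyTorch Geometric's GCNConv:

import torch

def gcn_layer(A, H, W):
    """One GCN propagation step: relu(D̃^-1/2 Ã D̃^-1/2 H W)."""
    A_tilde = A + torch.eye(A.size(0))            # add self-loops (Ã = A + I)
    d_tilde = A_tilde.sum(dim=1)                  # node degrees including the self-loop
    D_inv_sqrt = torch.diag(d_tilde.pow(-0.5))    # D̃^-1/2
    A_norm = D_inv_sqrt @ A_tilde @ D_inv_sqrt    # symmetric normalization
    return torch.relu(A_norm @ H @ W)

# Toy example with the three-node graph from above (values are illustrative)
A = torch.tensor([[0., 1., 0.],
                  [1., 0., 1.],
                  [0., 1., 0.]])
H = torch.tensor([[3., 7.],
                  [8., -1.],
                  [5., 5.]])
W = torch.randn(2, 4)                             # maps 2 input channels to 4 hidden channels
print(gcn_layer(A, H, W).shape)                   # torch.Size([3, 4])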

Implementation and Performance Comparison

In our previous article, where we introduced the basic GNN architecture that takes node features into consideration, we saw a significant boost in performance compared to the Node2Vec model. Let's now implement the GCN model on the same datasets to determine its performance.

Graph Convolutional Network compared to Vanilla Graph Neural Network.

We used the Citeseer and Facebook Page-Page datasets to implement the graph linear layer model in our previous article, so we will use the same datasets for this comparison. We will analyze the node degrees in these datasets because a key difference between the two architectures is that the GCN uses node degrees to normalize the aggregated features, as seen earlier. Thus, we expect node degrees to vary significantly for this architecture to show a clear benefit.

Implementing GCN with PyTorch on Citeseer and Facebook Page-to-Page Dataset

If you've made it this far in the article, it's time to get your hands dirty with some coding, where we turn our learnings about the architecture and theoretical foundation of the GCN into a PyTorch implementation. We will then compare this model against the vanilla GNN architecture from the previous article in this series on both the Citeseer and Facebook Page-Page datasets.

Setting Up the Environment

First, we need to install the necessary libraries. We’ll be using torch_geometric, a library that extends PyTorch to handle graph-related computations efficiently.

!pip install torch_geometric

Importing Necessary Libraries

We start by importing the necessary libraries and modules for our implementation.

from torch_geometric.datasets import Planetoid
from torch_geometric.utils import degree
from collections import Counter
import matplotlib.pyplot as plt

Loading the Citeseer Dataset

Next, we load the Citeseer dataset using Planetoid, a popular benchmark dataset in graph neural networks research.

# import Citeseer dataset
datasets = Planetoid(root=".", name="Citeseer")
data = datasets[0]

Analyzing Node Degrees

To better understand the dataset, we compute the number of neighbors (degree) for each node and visualize it.

# Compute the number of neighbors for each node
degrees = degree(data.edge_index[0]).numpy()
numbers = Counter(degrees)

fig, ax = plt.subplots()
ax.set_xlabel("Node Degree")
ax.set_ylabel("Number of nodes")
ax.bar(numbers.keys(), numbers.values())
plt.show()

Number of nodes per node degree in the Citeseer dataset

From the plot above, we see that the degree distribution is heavily right-skewed, with a long tail: most nodes have very few neighbors (1380 nodes have a single neighbor) while the tail stretches out to a single node with 440 neighbors. Because of this uneven distribution, the normalization step is used to handle the difference in scale.

Loading the Facebook Page-to-Page Dataset

Similarly, we load the Facebook Page-to-Page dataset.

from torch_geometric.datasets import FacebookPagePage
from torch_geometric.transforms import NormalizeFeatures

dataset_fb = FacebookPagePage(root='./data')
data_fb = dataset_fb[0]

Analyzing Node Degrees in Facebook Dataset

We perform the same degree analysis on the Facebook dataset.

degrees_fb = degree(data_fb.edge_index[0]).numpy()
numbers_fb = Counter(degrees_fb)

fig, ax = plt.subplots()
ax.set_xlabel("Node Degree")
ax.set_ylabel("Number of nodes")
ax.bar(numbers_fb.keys(), numbers_fb.values())
plt.show()

Number of nodes per node degree in the Facebook Page-Page dataset

We can observe that this dataset is even more skewed than the Citeseer dataset, with node degrees ranging from 1 to 709.

Defining the GCN Model

We will not build the GCN layer from scratch (though that can be done), because PyTorch Geometric already provides one, which we simply import.

Here, we define the GCN model using PyTorch. The model consists of two graph convolutional layers with a ReLU activation function in between.

import torch
import torch.nn.functional as F
from torch_geometric.nn import GCNConv

def accuracy(pred_y, y):
    # Fraction of predictions whose argmax matches the true label
    pred_y = pred_y.argmax(dim=1)
    return ((pred_y == y).sum() / len(y)).item()


class GCN(torch.nn.Module):
    def __init__(self, in_channels, hidden_channels, out_channels):
        super(GCN, self).__init__()
        # Two graph convolutional layers: input -> hidden -> output classes
        self.gconv1 = GCNConv(in_channels, hidden_channels)
        self.gconv2 = GCNConv(hidden_channels, out_channels)

    def forward(self, x, edge_index):
        x = self.gconv1(x, edge_index)
        x = F.relu(x)
        x = self.gconv2(x, edge_index)
        return F.log_softmax(x, dim=1)

    def fit(self, data, epochs):
        criterion = torch.nn.CrossEntropyLoss()
        optimizer = torch.optim.Adam(self.parameters(), lr=0.01, weight_decay=5e-4)

        # Standard training loop; validation metrics are reported every 20 epochs
        self.train()
        for epoch in range(epochs + 1):
            optimizer.zero_grad()
            out = self(data.x, data.edge_index)
            loss = criterion(out[data.train_mask], data.y[data.train_mask])
            acc = accuracy(out[data.train_mask], data.y[data.train_mask])
            loss.backward()
            optimizer.step()
            if epoch % 20 == 0:
                val_loss = criterion(out[data.val_mask], data.y[data.val_mask])
                val_acc = accuracy(out[data.val_mask], data.y[data.val_mask])
                print(f'Epoch: {epoch:03d}, Loss: {loss:.4f}, Val Loss: {val_loss:.4f}, Acc: {acc:.4f}, Val Acc: {val_acc:.4f}')

    @torch.no_grad()
    def test(self, data):
        self.eval()
        out = self(data.x, data.edge_index)
        acc = accuracy(out[data.test_mask], data.y[data.test_mask])
        return acc

Training and Evaluating the GCN Model on Citeseer Dataset

We instantiate the GCN model and train it on the Citeseer dataset.

gcn = GCN(datasets.num_features, 16, datasets.num_classes)
print(gcn)
gcn.fit(data, epochs=100)

Training output of the GCN model on the Citeseer dataset

We can already observe a significant improvement compared to the previous GNN architecture.
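Although the original run does not show it, the test() helper defined earlier can be used to report accuracy on the held-out test split:

# Evaluate the trained model on the Citeseer test split
test_acc = gcn.test(data)
print(f'GCN test accuracy on Citeseer: {test_acc * 100:.2f}%')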

Preparing and Training on the Facebook Page-to-Page Dataset

We apply the necessary transformations and train the GCN model on the Facebook Page-to-Page dataset.

from torch_geometric.transforms import NormalizeFeatures, RandomNodeSplit
dataset_fb = FacebookPagePage(root='./data', transform=RandomNodeSplit(num_train_per_class=20, num_val=0.1, num_test=0.1))
data_fb = dataset_fb[0]

gcn = GCN(dataset_fb.num_features, 16, dataset_fb.num_classes)
print(gcn)
gcn.fit(data_fb, epochs=100)

Training output of the GCN model on the Facebook Page-Page dataset

As can be observed from the table below, the GCN architecture gives a significant boost on this classification task on both datasets: roughly a 7% to 9% improvement over the GNN architecture when you run the model 10 times.
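If you want to reproduce that comparison yourself, one simple way (a sketch, not part of the original notebook) is to train the model several times and average the test accuracy:

import numpy as np

# Train the GCN ten times on the Facebook dataset and average the test accuracy
accuracies = []
for _ in range(10):
    model = GCN(dataset_fb.num_features, 16, dataset_fb.num_classes)
    model.fit(data_fb, epochs=100)
    accuracies.append(model.test(data_fb))
print(f'Mean test accuracy: {np.mean(accuracies) * 100:.2f}% ± {np.std(accuracies) * 100:.2f}%')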

It is important to note that although we used the GCN architecture here for a classification task, it can also be used for regression.

One question we did not answer in the section "Applying GCN to Graphs with Varying Topologies" was the importance of one node over another. We will explore this in the next article on Graph Attention Networks, another variant of GNNs.

See you in the next article, and happy graphing.

Previous article in series:

From graph topology to Node Features with Neural Networks https://medium.com/@amaboh/from-graph-topology-to-node-features-with-neural-networks-83b51c3a4019

Practical Deep Learning for Coders: Lesson 8, https://course.fast.ai/Lessons/lesson8.html

References

Aleksa Gordić (2021). Graph Convolutional Networks (GCN): GNN Paper Explained. Available at: https://www.youtube.com/watch?v=VyIOfIglrUM&list=PLBoQnSflObckArGNhOcNg7lQG_f0ZlHF5&index=3&ab_channel=AleksaGordi%C4%87-TheAIEpiphany

T. Kipf (2016). Graph Convolutional Networks, T. Kipf's website. Available at: https://tkipf.github.io/graph-convolutional-networks/

T. N. Kipf and M. Welling (2016). Semi-Supervised Classification with Graph Convolutional Networks. arXiv. DOI: 10.48550/ARXIV.1609.02907. Available at: https://arxiv.org/abs/1609.02907

Soroush Mehraban (2023). Graph Convolutional Networks (GCN). Available at: https://www.youtube.com/watch?v=eLcGehfjvgs&t=407s&ab_channel=SoroushMehraban

Resources

Colab Notebook of GCN on Citeseer and Facebook dataset: https://docs.google.com/presentation/d/1ifJaf0QbYfWeEiIAux-2exTVa4X4lzH2ULbxv50ug1I/edit?usp=sharing

Sketches and illustrations on GCN: https://docs.google.com/presentation/d/1ifJaf0QbYfWeEiIAux-2exTVa4X4lzH2ULbxv50ug1I/edit?usp=sharing
