Comprehensive Guide to GNN, GAT, and GCN: A Beginner’s Introduction to Graph Neural Networks After Reading 11 GNN Papers

pamperherself
14 min read · Aug 12, 2024


shot by pamperherself

Last week, I briefly explored GraphRAG. This week, I opened a folder with over 50 downloaded papers, randomly picked one to read, and quickly found myself delving into Graph Neural Networks (GNNs). Interestingly, GNN is somewhat related to the trending Graph RAG, which I also mentioned in my previous post on AI Information Search Methods.

After reading more than 10 papers and having several days of conversations with my GPTs, I’ve finally gathered enough material to write this article.

01

Before diving into the specific technologies and mechanisms related to Graph Neural Networks, let’s first clarify what we mean by “Graph” in Graph RAG and Graph Neural Networks.

The “Graph” here is not an image as we typically understand it — not a chart or a photograph.

As shown in the comparison images below, the Graph on the right is 3D, irregular, with nodes of varying sizes and edges of different lengths. These can exist on different planes and are non-Euclidean.

On the left, the nodes are uniformly sized, and the edges are equidistant, sometimes with directions and sometimes without, all on a single plane. This is Euclidean, similar to the mature sequence tasks we’re familiar with today, like text-to-image generation, summarization, and text generation.

Since graphs have varying node sizes and edge lengths, occupy 3D space, and can even incorporate a temporal dimension, training Graph Neural Networks is more complex and computationally expensive. However, this also brings better performance, stronger adaptability, and greater flexibility.
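Concretely, a graph like the ones described above can be stored as an adjacency matrix plus a per-node feature matrix. The toy graph and feature sizes below are purely illustrative, not taken from any of the papers:

```python
import numpy as np

# A toy undirected graph with 4 nodes and edges 0-1, 0-2, 1-2, 2-3
edges = [(0, 1), (0, 2), (1, 2), (2, 3)]
n = 4

# Adjacency matrix: A[i, j] = 1 if an edge connects nodes i and j
A = np.zeros((n, n))
for i, j in edges:
    A[i, j] = A[j, i] = 1  # undirected, so the matrix is symmetric

# Each node carries a feature vector (here 3 arbitrary features per node)
X = np.random.rand(n, 3)

# Node degrees fall straight out of the adjacency matrix
degrees = A.sum(axis=1)
print(degrees)  # node 2 touches three edges, so its degree is 3
```

Varying node "sizes" and edge "lengths" then show up as node features and edge weights rather than anything geometric.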

02

Among GNNs, there is the graph kernel method, suited to small-scale, fixed graph data. This method relies on manually designed features, which are pre-defined rather than learned.

Overall, the significant development of GNNs began around 2017, starting with the most basic forms. By 2018, many papers, such as FastGCN, R-GCN, and GraphSage, emerged to further enhance computational power and aggregation algorithms on the initial GNN architecture. These studies also explored applications in fields like social media recommendation systems, activity recognition, and biomolecular detection. Another notable phase of GNN development was around 2021.

The emergence of Graph RAG is likely to spark even more research in this area.

Let’s clarify the names and concepts: GNN is the broadest concept, referring to Graph Neural Networks. GCN is a Convolutional Graph Neural Network, while GAT introduces an attention mechanism into GCN, and GraphSAGE optimizes the aggregation algorithm on top of GCN. These three sub-concepts represent the mainstream technologies within GNN.

Less common variants include RGNNs (Recurrent Graph Neural Networks), GGNNs (Gated Graph Neural Networks), and Graph Autoencoders (GAEs). The most prevalent techniques in image processing remain convolution and Transformer-based attention mechanisms, as seen in tools like MidJourney and Stable Diffusion. Graph Neural Networks represent a further advancement beyond these.

The difference between RGNN and GGNN is illustrated in the image above: GGNN adds a gating mechanism, much as LSTM and GRU enhance RNNs to improve long-distance memory. Research on recurrent architectures in the GNN domain is relatively sparse, with most studies focusing on convolution.

Convolution: placing a small filter over each part of the data (such as an image) to compute a new value representing that region’s features. Filter: think of it as a magnifying glass that helps us pick out details in an image; the process of sliding this filter/magnifying glass over the data is called convolution. Applications: convolution is mainly used in image processing, helping us recognize features like edges and shapes in an image.
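As a minimal illustration of this sliding-filter idea (the image and filter here are made up for demonstration), a 2×2 edge-detecting filter passed over a tiny image responds exactly where the brightness changes:

```python
import numpy as np

# A tiny 4x4 "image": left half dark (0), right half bright (1)
image = np.array([[0, 0, 1, 1],
                  [0, 0, 1, 1],
                  [0, 0, 1, 1],
                  [0, 0, 1, 1]], dtype=float)

# A 2x2 filter that responds to left-to-right brightness changes
kernel = np.array([[-1, 1],
                   [-1, 1]], dtype=float)

# Slide the filter over every 2x2 patch (a "valid" cross-correlation)
h = image.shape[0] - kernel.shape[0] + 1
w = image.shape[1] - kernel.shape[1] + 1
out = np.zeros((h, w))
for i in range(h):
    for j in range(w):
        out[i, j] = (image[i:i+2, j:j+2] * kernel).sum()

print(out)  # the middle column lights up: that's where the edge is
```

Each output value summarizes one small patch, which is exactly the "new value representing that part’s feature" described above.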

Above, we distinguish between spectral and spatial convolution:

Spectral convolution is not directly performed on the graph structure. Instead, it involves transforming the graph to the frequency domain using the Laplacian Matrix and Fourier Transform, performing convolution there, and then transforming it back. This method is suitable for global analysis and structured graphs.
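A small sketch of this frequency-domain view: the eigenvectors of the normalized graph Laplacian play the role of Fourier basis vectors, so transforming a node signal into that basis and back is the graph analogue of a Fourier transform and its inverse. The path graph below is just a toy example:

```python
import numpy as np

# Toy path graph 0-1-2-3
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)

# Symmetric normalized Laplacian: L = I - D^{-1/2} A D^{-1/2}
D_inv_sqrt = np.diag(1.0 / np.sqrt(A.sum(axis=1)))
L = np.eye(4) - D_inv_sqrt @ A @ D_inv_sqrt

# Its eigenvectors act as the graph Fourier basis;
# eigenvalues are the corresponding "frequencies" (between 0 and 2)
eigvals, eigvecs = np.linalg.eigh(L)

x = np.array([1.0, 2.0, 3.0, 4.0])   # a signal living on the nodes
x_hat = eigvecs.T @ x                # graph Fourier transform
x_back = eigvecs @ x_hat             # inverse transform recovers x
print(np.round(eigvals, 3))          # smallest eigenvalue is ~0
```

Spectral convolution then amounts to scaling the components of `x_hat` with a learned filter before transforming back, which is why it requires the whole graph (the full eigendecomposition) up front.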

However, because GCNs consider the entire graph, they become impractical as the graph grows. FastGCN was proposed to address this. Using social networks as an example:

Original GCNs required knowing all classmates’ answers, which became difficult as new classmates joined since all had to be considered. FastGCN introduced a method where only a portion of the classmates’ answers is needed (importance sampling using Monte Carlo methods) to complete the task effectively.

Spatial convolution, on the other hand, processes directly on the graph structure, aggregating node features from previous layers and neighboring nodes to propagate and learn information. GNN models like GAT and GraphSage fall into this category, focusing more on local aggregation, which is suitable for complex graph structures.
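A minimal one-layer spatial propagation step, using the widely cited GCN update H' = σ(Â H W) with Â the degree-normalized adjacency including self-loops (the graph and matrix sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy graph and node features
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
H = rng.random((4, 5))        # 4 nodes, 5 input features each
W = rng.random((5, 2))        # learnable weights mapping 5 -> 2 features

# Add self-loops so each node keeps its own information
A_hat = A + np.eye(4)
D_inv_sqrt = np.diag(1.0 / np.sqrt(A_hat.sum(axis=1)))
A_norm = D_inv_sqrt @ A_hat @ D_inv_sqrt

# One propagation step: every node mixes its neighbors' features
H_next = np.maximum(0, A_norm @ H @ W)   # ReLU activation
print(H_next.shape)  # (4, 2)
```

Everything here is local: a node only ever touches its direct neighbors, which is why this family scales to complex graph structures better than the spectral route.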

Sampling methods for aggregation fall into three categories:

  1. Node Sampling:
  • GraphSAGE: Aggregates features by sampling a fixed number of neighboring nodes.
  • VR-GCN: Reduces the variance of node sampling so that much smaller samples still train stably.
  • PinSAGE: Designed for large-scale graphs, utilizing random walks and node sampling for feature aggregation, especially in recommendation systems.
  2. Layer Sampling:
  • FastGCN: Reduces computation by sampling nodes at each layer and performing convolution using importance sampling.
  • LADIES: Samples nodes at each layer while considering local neighbor information to improve efficiency and effectiveness.
  3. Subgraph Sampling:
  • ClusterGCN: Clusters the graph and operates within subgraphs to reduce computation.
  • GraphSAINT: Uses various subgraph sampling methods, like node, edge, or random walk sampling, to enhance model training efficiency and effectiveness.
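The node-sampling idea behind GraphSAGE can be sketched in a few lines. This is a simplification with an invented fan-out of 2 neighbors per node, sampling with replacement when a node has fewer neighbors than that:

```python
import random

random.seed(42)

# Adjacency lists for a small toy graph
neighbors = {
    0: [1, 2, 3],
    1: [0, 2],
    2: [0, 1, 3],
    3: [0, 2],
}

def sample_neighbors(node, fanout=2):
    """Draw a fixed-size neighbor sample, with replacement if needed."""
    nbrs = neighbors[node]
    if len(nbrs) >= fanout:
        return random.sample(nbrs, fanout)
    return [random.choice(nbrs) for _ in range(fanout)]

for node in neighbors:
    print(node, sample_neighbors(node))  # always exactly 2 neighbors each
```

The fixed sample size is what keeps per-node computation bounded no matter how large or skewed the real neighborhood is.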

03

Next, let’s delve into GraphSage and GAT.

GCNs include both training and test data in their training process, leading to poor performance when dealing with new nodes and a lack of generalization ability. GCNs are transductive and are better suited for static graphs that do not change.

GraphSAGE, on the other hand, is inductive, using only training data during training while handling new test data, improving the model’s generalization ability. It can generate embeddings for new nodes and edges with similar features that it hasn’t seen before.

For example, if you’ve trained a model on a protein interaction graph of a model organism (like mice), the model learns how to transform each protein node’s representation. When you obtain a new protein interaction graph of a different organism (like humans), you can quickly generate new embeddings for the nodes using the pre-trained model.

During training, the aggregators learn how to aggregate information from neighboring nodes to generate useful node embeddings.

During inference (testing), we use the trained aggregators to generate embeddings for nodes that have never been seen before.

GraphSAGE is spatial: it samples a fixed-size neighborhood for each node and applies learnable aggregation functions within those neighborhoods rather than training on each node individually (though the node’s own information is also included in the aggregation).

Each layer’s aggregator function further aggregates the previous layer, extracting more complex features layer by layer. For example, one layer aggregates the nearest neighbors, the second layer aggregates the neighbors’ neighbors, and so on, expanding the features from the smallest community to even revealing the entire graph’s structural characteristics.
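The layer-by-layer expansion above can be made concrete with two rounds of mean aggregation on a toy path graph. One-hot features make it easy to watch information travel one extra hop per layer (the graph and features are invented for illustration):

```python
import numpy as np

# Path graph 0-1-2-3: node 2 is two hops away from node 0
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
H = np.eye(4)   # one-hot features: node i "knows" only about itself

def mean_aggregate(A, H):
    """Each node averages itself together with its direct neighbors."""
    A_hat = A + np.eye(len(A))
    return (A_hat / A_hat.sum(axis=1, keepdims=True)) @ H

H1 = mean_aggregate(A, H)    # after layer 1: one-hop information
H2 = mean_aggregate(A, H1)   # after layer 2: two-hop information

# Node 0 knows nothing about node 2 after one layer, something after two
print(H1[0, 2], H2[0, 2])  # 0.0 first, positive after the second layer
```

Stacking k such layers lets each node see its k-hop neighborhood, which is exactly the "smallest community up to the whole graph" progression described above.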

Quick side note: I found Distill (distill.pub) to be a treasure trove of hardcore content while researching GCN. From 2016 to 2021, it published articles on fundamental ML topics with interactive figures, created by people from OpenAI, Google, Apple, and elsewhere to make ML and DL concepts easier for the general public to understand. It doesn’t have many articles, but it’s worth checking out if you’re interested in foundational AI techniques.

How are the weights in aggregation considered? In purely convolutional mechanisms like GCN and GraphSAGE, node updates are achieved by averaging or summing the features of neighboring nodes, a method that ignores the varying importance of different neighboring nodes.

GAT, based on the Attention mechanism, introduces a core concept: not all neighboring nodes contribute equally to the current node’s update. By allowing the model to dynamically focus on more important neighboring nodes for the current task, GAT can more effectively capture the complex structures and relationships in the graph.

In a social recommendation system, for example, a colleague’s opinion might carry less weight than a family member’s.

GAT has two types of attention mechanisms:

Global Attention: Ignores graph structure because each node connects to all other nodes in the graph without considering the graph’s original topology. Each node computes attention weights with all other nodes, capturing global correlations and similarities across the graph, suitable for tasks requiring full-graph information, like graph classification.

However, as each node must compute attention with all others, global attention is inefficient and computationally expensive for large-scale graphs.

Mask Attention: Only computes attention between neighboring nodes, respecting the graph’s original topology. A node’s state and its relationships with neighboring nodes are computed from weights and similarities, making this suitable for tasks with strong local aggregation needs, such as link prediction and node classification.
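A numpy sketch of this masked attention for a single GAT-style head, with LeakyReLU scoring and softmax restricted to each node's neighbors. All sizes and weight values are illustrative stand-ins for learned parameters:

```python
import numpy as np

rng = np.random.default_rng(0)

A = np.array([[0, 1, 1],
              [1, 0, 0],
              [1, 0, 0]], dtype=float)   # node 0 linked to nodes 1 and 2
H = rng.random((3, 4))                   # node features
W = rng.random((4, 2))                   # shared linear transform
a = rng.random(4)                        # attention vector for [Wh_i || Wh_j]

Wh = H @ W
# Raw scores e_ij = LeakyReLU(a^T [Wh_i || Wh_j]) for every node pair
pairs = np.concatenate(
    [np.repeat(Wh, 3, axis=0), np.tile(Wh, (3, 1))], axis=1).reshape(3, 3, 4)
e = pairs @ a
e = np.where(e > 0, e, 0.2 * e)          # LeakyReLU, slope 0.2

# The mask: attention only over actual neighbors (plus the node itself)
mask = A + np.eye(3)
e = np.where(mask > 0, e, -np.inf)

# Row-wise softmax turns scores into weights that sum to 1 over neighbors
alpha = np.exp(e - e.max(axis=1, keepdims=True))
alpha = alpha / alpha.sum(axis=1, keepdims=True)

H_new = alpha @ Wh                       # attention-weighted aggregation
print(np.round(alpha, 2))                # non-neighbors get weight 0
```

Dropping the mask (attending over all node pairs) turns this into the global-attention variant; the `-np.inf` trick is what enforces the graph's topology.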

Here’s a detailed explanation of link prediction, node classification, and graph classification, which helps in understanding the concept of graphs and the basic functions of GNNs.

Link prediction refers to predicting whether a certain relationship exists between two nodes in the graph. In this graph, the missing “citizen_of” relationship is a typical link prediction problem.

Entity classification refers to assigning a category label to a node. In this graph, the missing “:ballet_dancer” label is an entity classification problem.

04

Applications of GCN in various scenarios:

  • Biochemistry and Drug Development: Representing molecular structures as graphs, such as protein interaction networks, GCNs capture interactions between atoms and local environmental features.
  • Recommendation Systems and Social Networks: Multi-dimensional integration of different node types; users and creators live on different planes, and citation relationships in a paper graph are similarly multi-dimensional.
  • Skeletal Motion Capture and Recognition: Convolution operations on human skeleton motion sequences provide more flexibility and generality in learning human movements.

Below is an e-commerce recommendation system GraphRec, studied by Hong Kong Polytechnic University and JD.com in 2019. Pinterest had already adopted GCNs for its recommendation system in 2018, as detailed in the paper “Graph Convolutional Neural Networks for Web-Scale Recommender Systems.”

Platforms like Xiaohongshu feature relationships not only between people but also between people and products or posts, where these relationships influence each other. For instance, products you like are likely to be recommended to your friends, and posts your friends like are likely to be ones you’d enjoy as well.

User Modeling: Two aggregation methods (item aggregation and social aggregation) are used to handle data from these two types of graphs, ultimately combining them to obtain the user’s latent features.

Item Modeling: A user aggregation method is introduced, where user ratings and opinions are used to learn item feature representations.

Rating Prediction: The latent features of users and items are utilized through a neural network layer to predict the user’s rating for an item.

In GCN-based recommendation system designs, there are separate aggregation methods for items and users. For graphs like this, with more than one type of node and edge, there is a concept called Heterogeneous GNNs.

Heterogeneous Graph Neural Networks (Heterogeneous GNNs) target graphs containing multiple types of nodes and edges. In many real-world networks, such as knowledge graphs and social networks, the graph structure is often heterogeneous, containing rich types of information. Heterogeneous GNNs, through specially designed message-passing mechanisms, can handle different types of nodes and edges, accurately capturing the complex structure of heterogeneous graphs. Due to the complexity of the data structure, the corresponding computational cost is also high.

Heterogeneous relationship strengths: this is where GAT brings attention mechanisms into GCN to handle the varying relationship strengths in more complex graphs.

One example of a Heterogeneous Attention Network (HAN) is a graph involving movies, directors, and actors, where the relationships are mixed. Instead of a single entity graph like just directors or actors, it’s a mix of different types of entities, making the graph more complex.

To organize relationships and sequences, the concept of a meta-path is introduced. For instance, to find movies similar to “The Avengers,” the system might search along the movie-director-movie path; that path is the meta-path. It could also search along movie-actor-movie paths, and different meta-paths offer varying levels of accuracy.
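Meta-path neighbors can be found by chaining adjacency matrices: multiplying a movie-to-director matrix by its transpose links movies that share a director. The tiny matrices below are made up for illustration:

```python
import numpy as np

# Rows = 3 movies, columns = 2 directors; 1 means "directed by"
movie_director = np.array([[1, 0],    # movie 0: director 0
                           [1, 0],    # movie 1: director 0 (shared!)
                           [0, 1]])   # movie 2: director 1

# movie-director-movie meta-path: movies reachable via a shared director
mdm = movie_director @ movie_director.T
np.fill_diagonal(mdm, 0)              # every movie trivially links to itself

print(mdm)  # movies 0 and 1 are meta-path neighbors; movie 2 is isolated
```

A movie-actor-movie meta-path works the same way with a movie-to-actor matrix, and HAN's semantic attention then learns how much to trust each such path.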

The attention layers in the paper are divided into node attention and semantic attention:

Node Attention: Learns and aggregates relationships between nodes and their neighbors, processing nodes and their neighbors under a single meta-path.

Semantic Attention: Learns and integrates semantic information from different meta-paths, processing node embeddings under multiple meta-paths.

GNN’s application in skeletal motion capture and recognition includes both temporal and spatial dimensions. Human movements unfold over time, so this is not just 2D recognition of a pose at a single moment but processing with both spatial and temporal considerations. The corresponding technical terms are Spatio-temporal Graph Neural Networks (ST-GNNs), Dynamic Graph Convolutional Networks (DGCNs), or ST-GCNs.

ST GNNs combine graph structural data (spatial dimension) and time-series data (temporal dimension) to capture complex dependencies over time and space. For example, in traffic flow prediction, ST GNN can use past traffic data (temporal dimension) and road network structure (spatial dimension) to predict future traffic conditions. The design of such GNN variants usually involves complex time-series analysis and graph representation learning techniques to efficiently process spatio-temporal data.

A 2018 paper by Peking University, “Spatio-Temporal Graph Convolutional Networks: A Deep Learning Framework for Traffic Forecasting,” studies the application of ST GCN in traffic.

Here’s a specific explanation using 2018 Hong Kong Polytechnic’s study on human skeletal action recognition.

ST-GCNs combine temporal and spatial information to perform convolution operations on skeleton sequences. The middle part of the figure shows the structure of multiple spatio-temporal graph convolutional network layers.

In each layer of convolution, a node aggregates the features of its neighboring nodes. This includes features from neighboring nodes in the same frame and corresponding nodes in adjacent frames. For example, the feature of a knee node in the current frame will aggregate not only the features of adjacent nodes (like the thigh and shin) in the same frame but also the knee node’s features in the preceding and following frames.
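That spatio-temporal aggregation can be sketched as follows: a sequence of frames, each carrying the same skeleton graph, where one joint averages its same-frame neighbors plus itself in the previous and next frames. The joint connectivity and feature sizes here are invented, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(3)

T, J, F = 5, 4, 3                 # 5 frames, 4 joints, 3 features per joint
X = rng.random((T, J, F))

# Spatial skeleton edges, identical in every frame (hypothetical chain:
# joint 0 "hip" - 1 "knee" - 2 "ankle", plus 0 - 3 "other hip")
spatial = {0: [1, 3], 1: [0, 2], 2: [1], 3: [0]}

def st_aggregate(X, t, j):
    """Average a joint's own feature, its same-frame neighbors,
    and the same joint in the adjacent frames."""
    feats = [X[t, j]] + [X[t, k] for k in spatial[j]]
    if t > 0:
        feats.append(X[t - 1, j])          # knee in the previous frame
    if t < X.shape[0] - 1:
        feats.append(X[t + 1, j])          # knee in the next frame
    return np.mean(feats, axis=0)

out = np.stack([[st_aggregate(X, t, j) for j in range(J)] for t in range(T)])
print(out.shape)  # (5, 4, 3): same layout, now spatio-temporally mixed
```

Real ST-GCN layers replace this plain average with learned weights per partition, but the neighborhood structure, spatial within a frame and temporal across frames, is the same.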

Here’s a detailed explanation of the graph’s partitioning strategy.

  • Nodes (blue dots) represent the joints of the human body, and edges (red dashed circles) represent the connections between joints.
  • Different partitioning strategies are used to label and group nodes and edges in various ways (such as uniform labeling, distance partitioning, and spatial configuration partitioning).

(a) Initial Image:

  • Description: This is a diagram of a human skeleton, with each joint represented by a blue dot. The red dashed circle represents the receptive field of the convolution operation, which is a joint and its directly adjacent joints.
  • Understanding: Imagine a person standing there, with the red dashed circle surrounding one joint and its neighboring joints.

(b) Uniform Labeling Partitioning Strategy (Uni-labeling):

  • Description: In this strategy, all nodes within the same receptive field are labeled with the same label, here marked in green.
  • Understanding: Imagine you have a group of friends, and everyone is wearing the same color clothing because they are all your friends, representing that everyone is part of the same group.

(c) Distance Partitioning Strategy (Distance Partitioning):

  • Description: In this strategy, nodes are partitioned based on their distance from the root node. Green represents the root node (distance of 0), and blue represents neighboring nodes at a distance of 1 from the root node.
  • Understanding: Imagine you are standing in the middle (green), with your closest friends standing around you (blue), and friends standing farther away are in a more distant position.

(d) Spatial Configuration Partitioning Strategy (Spatial Configuration):

  • Description: Nodes are partitioned based on their distance from the skeleton’s center of gravity (black cross). Nodes closer to the center are marked in blue (centripetal nodes), and those farther away are marked in yellow (centrifugal nodes), with the root node marked in green.
  • Understanding: Imagine you are standing at the center of gravity, with your closest friends wearing blue and farther friends wearing yellow, while you wear green.

Epilogue

This 4000+ word introduction should give you a basic understanding of GNN. As Graph RAG gains popularity, more attention is being given to research on graph recognition algorithms. Abroad, articles like “2024 AI Trend: Graph Neural Networks” are starting to appear, though the work on organizing Graphs is still in its early stages. Hopefully, we’ll see more relevant research and projects emerge.

With the support of Graph, we can take the quality of RAG-based text conversations to the next level.

Below, the green box shows the materials used for this article. My previous writings on Introduction to Transformers and Advanced Transformers are also stored on this whiteboard.

I could have finished this article yesterday, but I spent half the day agonizing over whether to switch from my whiteboard software to Procrafts or pay for iCloud, as my iCloud storage was full. In the end, I paid for iCloud to keep using Freeform, because migrating content to a new note-taking app is too much of a hassle. Even when I ran out of phone storage for photos I didn’t buy iCloud, but Freeform is too much of a black box: I don’t even know where the whiteboard file is stored, and it can’t be transferred or downloaded.

Lastly, I want to share a few thoughts on knowledge management and building a second brain. An American book titled Building a Second Brain mentions four steps to building a second brain:

  1. Capture: Select the information that truly moves you and is useful.
  2. Organize: Decide where to place the information, under which theme or framework, and on which app.
  3. Distill: Further refine the information, filtering out the unnecessary and deepening the essence.
  4. Express: The process of linking, processing, and creating something new from the “semi-processed materials.”

These four steps explain how I approach writing these AI-related articles.

The process begins with gathering various “semi-finished” materials — the key points and ideas distilled from different sources. At first, these materials may seem unimportant, but once you’ve accumulated enough, you only need to piece together these “semi-finished” materials to build a complete project or work.

That’s exactly how this article came together, with a clear purpose in mind. From the first AI paper I read, I determined the theme and began Thematic Active Search, which allowed me to compile a complete article in just a few days. In contrast, topics like gradient descent and LLM training remain underdeveloped due to a lack of thematic research and search.

What I care most about is the impact these four steps have on my brain — all these processes are tools to help me understand better. Paying for tools is just a means to facilitate my learning and progress.

I’ve spent a bit too much time lately on learning methods, so it’s time to return to AI-related content.

By: pamperherself




pamperherself

AI and Fashion blogger | Portrait Photographer Youtube | Instagram : @pamperherself