Graph Neural Nets Explained: Summary of different graph embedding methods.

Manish Chablani
13 min read · Dec 5, 2023


Node2Vec:

https://arxiv.org/abs/1607.00653

An older paper from 2016 that applies word2vec-style embedding learning to a graph. It mainly uses node connectivity information to create node embeddings, and it has largely been replaced by the more recent methods summarized below because it cannot capture node features or relationship features.

Advantages: Simple and fast. Robust to perturbations in the form of noisy or missing edges. It can scale to large networks with millions of nodes in a few hours.

Limitations: The node2vec algorithm does not differentiate between node or relationship types; if your graph has many node or relationship types, node2vec will not take that into account. Secondly, node properties or features cannot be added to the algorithm's input. A minimal sketch of the core idea follows below.
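To make the idea concrete, here is a minimal DeepWalk-style sketch (uniform random walks fed to gensim's Word2Vec on a toy networkx graph). Node2vec proper additionally biases the walks with its return and in-out parameters p and q, which are omitted here, so treat this as an illustration rather than a faithful implementation.

```python
import random
import networkx as nx
from gensim.models import Word2Vec

def random_walks(G, walks_per_node=10, walk_length=40, seed=0):
    """Generate uniform random walks; node2vec would bias these with p/q."""
    rng = random.Random(seed)
    walks = []
    for _ in range(walks_per_node):
        for start in G.nodes():
            walk = [start]
            while len(walk) < walk_length:
                nbrs = list(G.neighbors(walk[-1]))
                if not nbrs:
                    break
                walk.append(rng.choice(nbrs))
            walks.append([str(n) for n in walk])  # gensim expects string tokens
    return walks

G = nx.karate_club_graph()                # small toy graph for illustration
walks = random_walks(G)
# Skip-gram (sg=1) over the walks, exactly like word2vec over sentences.
model = Word2Vec(walks, vector_size=64, window=5, min_count=0, sg=1, workers=2, epochs=5)
embedding_of_node_0 = model.wv["0"]       # learned embedding for node 0
```

Note that only the walk sequences go into the model; no node features or node/edge types are used, which is exactly the limitation described above.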

GNNs (Graph Neural Networks)

A family of neural networks that can operate naturally on graph-structured data. By extracting and utilizing features from the underlying graph, GNNs can make more informed predictions about these entities and their interactions.

Modern popular GNN algorithms:

  • Graph Convolutional Networks (GCN)
  • Graph Attention Networks (GAT)
  • Graph Sample and Aggregate (GraphSAGE)
  • Graph Isomorphism Network (GIN)

Before we describe these algorithms, let's go over two categories of learning.

Inductive and transductive learning

These are two approaches to graph learning in graph neural networks (GNNs). The main difference is that an inductive model only encounters the training data, while a transductive model encounters both the training and test data: it sees the features and graph structure of the test nodes (but not their labels), and predictions on those test nodes are used to evaluate the algorithm. Inductive learning learns a function that can be applied to entirely new inputs and is tested on unseen nodes, so it is better suited to improving the generalization power of a model.

GraphSAGE

https://arxiv.org/abs/1706.02216

GraphSAGE is a general inductive framework that leverages node feature information (e.g., text attributes) to efficiently generate node embeddings for previously unseen data.

The basic idea behind node embedding approaches is to use dimensionality reduction techniques to distill the high-dimensional information about a node’s graph neighborhood into a dense vector embedding. These node embeddings can then be fed to downstream machine learning systems and aid in tasks such as node classification, clustering, and link prediction.

There are also recent approaches to learning over graph structures using convolution operators that offer promise as an embedding methodology [17]. So far, graph convolutional networks (GCNs) have only been applied in the transductive setting with fixed graphs [17, 18]. In this work we both extend GCNs to the task of inductive unsupervised learning and propose a framework that generalizes the GCN approach to use trainable aggregation functions (beyond simple convolutions).

Instead of training a distinct embedding vector for each node, we train a set of aggregator functions that learn to aggregate feature information from a node’s local neighborhood (Figure 1). Each aggregator function aggregates information from a different number of hops, or search depth, away from a given node.

We first describe the GraphSAGE embedding generation (i.e., forward propagation) algorithm, which generates embeddings for nodes assuming that the GraphSAGE model parameters are already learned (Section 3.1). We then describe how the GraphSAGE model parameters can be learned using standard stochastic gradient descent and backpropagation techniques.

Embedding generation (i.e., forward propagation) algorithm

The intuition behind Algorithm 1 is that at each iteration, or search depth, nodes aggregate information from their local neighbors, and as this process iterates, nodes incrementally gain more and more information from further reaches of the graph. Algorithm 1 describes the embedding generation process in the case where the entire graph, G = (V, E), and features for all nodes, x_v, ∀v ∈ V, are provided as input.

k denotes the current step in the outer loop (the depth of the search) and h_v^k denotes node v's representation at this step. First, each node v ∈ V aggregates the representations of the nodes in its immediate neighborhood, {h_u^(k−1), ∀u ∈ N(v)}, into a single vector h_N(v)^(k−1). Note that this aggregation step depends on the representations generated at the previous iteration of the outer loop (i.e., k − 1), and the k = 0 ("base case") representations are defined as the input node features. After aggregating the neighboring feature vectors, GraphSAGE then concatenates the node's current representation, h_v^(k−1), with the aggregated neighborhood vector, h_N(v)^(k−1), and this concatenated vector is fed through a fully connected layer with nonlinear activation function σ, which transforms the representations to be used at the next step of the algorithm (i.e., h_v^k, ∀v ∈ V). The aggregation of the neighbor representations can be done by a variety of aggregator architectures.
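A minimal sketch of one such step with a mean aggregator, assuming PyTorch and a precomputed list of sampled neighbor indices per node (the class and variable names here are illustrative, not the authors' reference code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SAGEMeanLayer(nn.Module):
    """One GraphSAGE step: mean-aggregate neighbor states, concatenate with the
    node's own state, apply a dense layer + nonlinearity, then L2-normalize."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.linear = nn.Linear(2 * in_dim, out_dim)

    def forward(self, h, neighbors):
        # h:         [N, in_dim]  node representations h^{k-1}
        # neighbors: list of LongTensors; neighbors[v] = sampled neighbor ids of v
        h_neigh = torch.stack([h[nbrs].mean(dim=0) for nbrs in neighbors])  # h_N(v)^{k-1}
        h_cat = torch.cat([h, h_neigh], dim=1)         # CONCAT(h_v, h_N(v))
        h_next = F.relu(self.linear(h_cat))            # sigma(W^k * ...)
        return F.normalize(h_next, p=2, dim=1)         # normalization as in Algorithm 1
```

Stacking K such layers gives each node information from its K-hop neighborhood, matching the "search depth" intuition above.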

Neighborhood definition: In this work, we uniformly sample a fixed-size set of neighbors, instead of using full neighborhood sets in Algorithm 1, in order to keep the computational footprint of each batch fixed.
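A tiny helper illustrating that uniform fixed-size sampling, assuming `adj` is a dict mapping a node id to its neighbor list (a hypothetical representation):

```python
import random

def sample_neighbors(adj, v, sample_size, rng=random):
    """Uniformly sample a fixed number of neighbors (with replacement when the
    neighborhood is smaller), keeping the per-batch cost constant."""
    nbrs = adj[v]
    if len(nbrs) >= sample_size:
        return rng.sample(nbrs, sample_size)
    return [rng.choice(nbrs) for _ in range(sample_size)]
```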

Learning the parameters of GraphSAGE

In order to learn useful, predictive representations in a fully unsupervised setting, we apply a graph-based loss function to the output representations, z_u, ∀u ∈ V, and tune the weight matrices, W^k, ∀k ∈ {1, …, K}, and parameters of the aggregator functions via stochastic gradient descent. The graph-based loss function encourages nearby nodes to have similar representations, while enforcing that the representations of disparate nodes are highly distinct:
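For reference, this unsupervised objective (Equation 1 in the paper) takes the following form, where v is a node that co-occurs with u on a fixed-length random walk, σ is the sigmoid, P_n is a negative-sampling distribution, and Q is the number of negative samples:

```latex
J_{\mathcal{G}}(\mathbf{z}_u) = -\log\big(\sigma(\mathbf{z}_u^\top \mathbf{z}_v)\big)
  \;-\; Q \cdot \mathbb{E}_{v_n \sim P_n(v)} \log\big(\sigma(-\mathbf{z}_u^\top \mathbf{z}_{v_n})\big)
```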

This unsupervised setting emulates situations where node features are provided to downstream machine learning applications, as a service or in a static repository. In cases where representations are to be used only on a specific downstream task, the unsupervised loss (Equation 1) can simply be replaced, or augmented, by a task-specific objective (e.g., cross-entropy loss).

Aggregator Architecture: Mean aggregator, LSTM aggregator, Pooling aggregator.

PinSage: Graph Convolutional Neural Networks for Web-Scale Recommender Systems — 2018

Paper: https://arxiv.org/pdf/1806.01973.pdf

An extension of GraphSAGE with some key enhancements.

Key highlights from the paper:

Traditional GCN algorithms perform graph convolutions by multiplying feature matrices by powers of the full graph Laplacian. In contrast, our PinSage algorithm performs efficient, localized convolutions by sampling the neighborhood around a node and dynamically constructing a computation graph from this sampled neighborhood. These dynamically constructed computation graphs (Fig. 1) specify how to perform a localized convolution around a particular node, and alleviate the need to operate on the entire graph during training.

Constructing convolutions via random walks: Taking full neighborhoods of nodes to perform convolutions (Fig. 1) would result in huge computation graphs, so we resort to sampling. However, random sampling is suboptimal, and we develop a new technique using short random walks to sample the computation graph. An additional benefit is that each node now has an importance score, which we use in the pooling/aggregation step.

We introduce a method to weigh the importance of node features in this aggregation based upon random-walk similarity measures, leading to a 46% performance gain in offline evaluation metrics.

Despite the successes of GCN algorithms, no previous works have managed to apply them to production-scale data with billions of nodes and edges — a limitation that is primarily due to the fact that traditional GCN methods require operating on the entire graph Laplacian during training.

Our task is to generate high-quality embeddings or representations of pins that can be used for recommendation (e.g., via nearest-neighbor lookup for related pin recommendation, or for use in a downstream re-ranking system). In order to learn these embeddings, we model the Pinterest environment as a bipartite graph consisting of nodes in two disjoint sets, I (containing pins) and C (containing boards). Note, however, that our approach is also naturally generalizable, with I being viewed as a set of items and C as a set of user-defined contexts or collections. In the case of Pinterest, we have that pins are associated with both rich text and image features. Our goal is to leverage both these input attributes as well as the structure of the bipartite graph to generate high-quality embeddings.

3.2 Model Architecture

Forward propagation algorithm. This is similar to GraphSAGE described above, except for the following enhancements:

The core of our PinSage algorithm is a localized convolution operation, where we learn how to aggregate information from u's neighborhood (Figure 1). This procedure is detailed in Algorithm 1 (convolve). The basic idea is that we transform the representations z_v, ∀v ∈ N(u) of u's neighbors through a dense neural network and then apply an aggregator/pooling function (e.g., an element-wise mean or weighted sum, denoted as γ) on the resulting set of vectors. We then concatenate the aggregated neighborhood vector n_u with u's current representation h_u and transform the concatenated vector through another dense neural network layer (Line 2). We observe significant performance gains when using the concatenation operation instead of the average operation. Additionally, the normalization in Line 3 makes training more stable, and it is more efficient to perform approximate nearest neighbor search for normalized embeddings.
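A sketch of that convolve step in PyTorch, assuming the importance weights for the sampled neighbors are given as input (the layer names Q and W mirror the paper's Algorithm 1, but the sizes and details here are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PinSageConvolve(nn.Module):
    """Localized convolution: dense layer on neighbors, importance-weighted
    pooling (gamma), concatenation with h_u, another dense layer, normalize."""
    def __init__(self, node_dim, neigh_dim, out_dim):
        super().__init__()
        self.Q = nn.Linear(node_dim, neigh_dim)            # transform neighbor representations
        self.W = nn.Linear(node_dim + neigh_dim, out_dim)  # transform the concatenated vector

    def forward(self, h_u, h_neighbors, alpha):
        # h_u:         [node_dim]      current representation of u
        # h_neighbors: [T, node_dim]   representations of the T sampled neighbors
        # alpha:       [T]             L1-normalized random-walk visit counts (importance weights)
        z = F.relu(self.Q(h_neighbors))                # dense layer applied to each neighbor
        n_u = (alpha.unsqueeze(1) * z).sum(dim=0)      # gamma: importance-weighted mean
        h_new = F.relu(self.W(torch.cat([h_u, n_u])))  # concatenate, then another dense layer
        return h_new / h_new.norm().clamp_min(1e-12)   # normalize for stability / ANN search
```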

Importance-based neighborhoods. An important innovation in our approach is how we define node neighborhoods N(u), i.e., how we select the set of neighbors to convolve over in Algorithm 1. The neighborhood of a node u is defined as the T nodes that exert the most influence on node u: we simulate random walks starting from node u and compute the L1-normalized visit count of the nodes visited by the random walk [14]. The neighborhood of u is then defined as the top T nodes with the highest normalized visit counts with respect to node u. The advantages of this importance-based neighborhood definition are two-fold. First, selecting a fixed number of nodes to aggregate from allows us to control the memory footprint of the algorithm during training [18]. Second, it allows Algorithm 1 to take into account the importance of neighbors when aggregating the vector representations of neighbors. In particular, we implement γ in Algorithm 1 as a weighted mean, with weights defined according to the L1-normalized visit counts. We refer to this new approach as importance pooling.
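A possible sketch of building such an importance-based neighborhood in plain Python, assuming `adj` maps each node to its neighbor list (a hypothetical input format); the weights are normalized over the selected top-T nodes:

```python
import random
from collections import Counter

def importance_neighborhood(adj, u, T=50, num_walks=200, walk_length=10, rng=random):
    """Simulate short random walks from u and keep the T most-visited nodes,
    with normalized visit counts used as the weights for importance pooling."""
    counts = Counter()
    for _ in range(num_walks):
        node = u
        for _ in range(walk_length):
            nbrs = adj.get(node, [])
            if not nbrs:
                break
            node = rng.choice(nbrs)
            counts[node] += 1
    counts.pop(u, None)                       # exclude u itself
    top = counts.most_common(T)
    total = sum(c for _, c in top) or 1
    neighbors = [n for n, _ in top]
    weights = [c / total for _, c in top]     # normalized visit counts (sum to 1)
    return neighbors, weights
```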

Stacking convolutions. Each time we apply the convolve operation (Algorithm 1) we get a new representation for a node, and we can stack multiple such convolutions on top of each other in order to gain more information about the local graph structure around node u. In particular, we use multiple layers of convolutions, where the inputs to the convolutions at layer k depend on the representations output from layer k − 1 (Figure 1) and where the initial (i.e., "layer 0") representations are equal to the input node features. Note that the model parameters in Algorithm 1 (Q, q, W, and w) are shared across the nodes but differ between layers. Algorithm 2 details how stacked convolutions generate embeddings for a minibatch set of nodes, M. We first compute the neighborhoods of each node and then apply K convolutional iterations to generate the layer-K representations of the target nodes. The output of the final convolutional layer is then fed through a fully connected neural network to generate the final output embeddings.

Loss function. In order to train the parameters of the model, we use a max-margin-based loss function. The basic idea is that we want to maximize the inner product of positive examples, i.e., the embedding of the query item and the corresponding related item. At the same time we want to ensure that the inner product of negative examples — i.e., the inner product between the embedding of the query item and an unrelated item — is smaller than that of the positive sample by some pre-defined margin.
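A hedged sketch of that max-margin objective in PyTorch, assuming the query, positive, and negative embeddings are already computed (the margin value here is illustrative):

```python
import torch

def max_margin_loss(z_q, z_pos, z_neg, margin=0.1):
    """Hinge loss: the positive inner product should exceed every negative
    inner product by at least `margin`."""
    # z_q, z_pos: [B, d]     query and positive-item embeddings
    # z_neg:      [B, M, d]  M (possibly shared) negative items per query
    pos = (z_q * z_pos).sum(dim=1, keepdim=True)         # [B, 1] positive inner products
    neg = torch.bmm(z_neg, z_q.unsqueeze(2)).squeeze(2)  # [B, M] negative inner products
    return torch.clamp(neg - pos + margin, min=0).mean()
```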

Multi-GPU training with large minibatches. To make full use of multiple GPUs on a single machine for training, we run the forward and backward propagation in a multi-tower fashion. With multiple GPUs, we first divide each minibatch (Figure 1 bottom) into equal-sized portions. Each GPU takes one portion of the minibatch and performs the computations using the same set of parameters. After backward propagation, the gradients for each parameter across all GPUs are aggregated together, and a single step of synchronous SGD is performed. Due to the need to train on an extremely large number of examples (on the scale of billions), we run our system with large batch sizes, ranging from 512 to 4096.

Sampling negative items. To improve efficiency when training with large batch sizes, we sample a set of 500 negative items to be shared by all training examples in each minibatch. This drastically saves the number of embeddings that need to be computed during each training step, compared to running negative sampling for each node independently. Empirically, we do not observe a difference between the performance of the two sampling schemes.

In the simplest case, we could just uniformly sample negative examples from the entire set of items. However, ensuring that the inner product of the positive example (pair of items (q, i)) is larger than that of the q and each of the 500 negative items is too “easy” and does not provide fine enough “resolution” for the system to learn. In particular, our recommendation algorithm should be capable of finding 1,000 most relevant items to q among the catalog of over 2 billion items. In other words, our model should be able to distinguish/identify 1 item out of 2 million items. But with 500 random negative items, the model’s resolution is only 1 out of 500. Thus, if we sample 500 random negative items out of 2 billion items, the chance of any of these items being even slightly related to the query item is small. Therefore, with large probability the learning will not make good parameter updates and will not be able to differentiate slightly related items from the very related ones.

"Hard negative items" are generated by ranking items in the graph according to their Personalized PageRank scores with respect to the query item q [14]. Items ranked at 2000-5000 are randomly sampled as hard negative items. The hard negative examples are more similar to the query than random negative examples, and are thus challenging for the model to rank, forcing the model to learn to distinguish items at a finer granularity. Using hard negative items throughout the training procedure doubles the number of epochs needed for training to converge. To help with convergence, we develop a curriculum training scheme [4]. In the first epoch of training, no hard negative items are used, so that the algorithm quickly finds an area in the parameter space where the loss is relatively small. We then add hard negative items in subsequent epochs, focusing the model on learning how to distinguish highly related pins from only slightly related ones. At epoch n of the training, we add n − 1 hard negative items to the set of negative items for each item.
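A small sketch of this curriculum negative-sampling schedule, assuming `all_items` and a per-query list of items ranked by Personalized PageRank are available (both are hypothetical inputs):

```python
import random

def sample_negatives(all_items, ppr_ranked_items, epoch, num_random=500, rng=random):
    """Curriculum sampling: 500 shared random negatives, plus (epoch - 1) hard
    negatives drawn from items ranked 2000-5000 by Personalized PageRank."""
    negatives = rng.sample(all_items, num_random)   # shared random negatives
    num_hard = max(0, epoch - 1)                    # no hard negatives in epoch 1
    if num_hard > 0:
        hard_pool = ppr_ranked_items[2000:5000]
        negatives += rng.sample(hard_pool, min(num_hard, len(hard_pool)))
    return negatives
```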

Features used for learning. Each pin at Pinterest is associated with an image and a set of textual annotations (title, description). To generate feature representation xq for each pin q, we concatenate visual embeddings (4,096 dimensions), textual annotation embeddings (256 dimensions), and the log degree of the node/pin in the graph. The visual embeddings are the 6-th fully connected layer of a classification network using the VGG-16 architecture [28]. Textual annotation embeddings are trained using a Word2Vec-based model [23], where the context of an annotation consists of other annotations that are associated with each pin.
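A minimal sketch of assembling that input feature vector, assuming the visual and textual embeddings are already computed (log1p is used here simply to keep the value finite for degree-0 pins, a small deviation from plain log degree):

```python
import numpy as np

def pin_features(visual_emb, text_emb, degree):
    """Concatenate the visual embedding (4096-d), the textual annotation
    embedding (256-d), and the log degree of the pin into one feature vector."""
    return np.concatenate([visual_emb, text_emb, [np.log1p(degree)]])
```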

Offline Evaluation

For each positive pair of pins (q, i) in the test set, we use q as a query pin and then compute its top-K nearest neighbors NN_q from a sample of 5 million test pins. We then define the hit-rate as the fraction of queries q where i was ranked among the top K of the test sample (i.e., where i ∈ NN_q). This metric directly measures the probability that recommendations made by the algorithm contain the items related to the query pin q. In our experiments K is set to 500. We also evaluate the methods using Mean Reciprocal Rank (MRR), which takes into account the rank of the item i among the items recommended for query item q.
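A rough sketch of computing hit-rate@K with NumPy, assuming the positive item is included in the candidate pool and scoring candidates by inner product over the learned embeddings (the paper uses nearest-neighbor lookup, which is equivalent for normalized embeddings):

```python
import numpy as np

def hit_rate_at_k(query_embs, pos_embs, candidate_embs, k=500):
    """Fraction of (q, i) pairs where i lands in q's top-k scored candidates."""
    scores = query_embs @ candidate_embs.T               # [num_queries, num_candidates]
    pos_scores = (query_embs * pos_embs).sum(axis=1)     # score of the true related item
    # rank of the positive = number of candidates scoring strictly higher
    ranks = (scores > pos_scores[:, None]).sum(axis=1)
    return float((ranks < k).mean())
```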

Due to the large pool of candidates (more than 2 billion), we use a scaled version of the MRR in Equation (2), where R_{i,q} is the rank of item i among recommended items for query q, and n is the total number of labeled item pairs. The scaling factor 100 ensures that, for example, the difference between rank 1,000 and rank 2,000 is still noticeable, instead of being very close to 0.
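Reconstructed from that description, the scaled MRR has (approximately) the form below, with L denoting the set of labeled (q, i) pairs:

```latex
\mathrm{MRR} = \frac{1}{n} \sum_{(q,i) \in \mathcal{L}} \frac{1}{\left\lceil R_{i,q} / 100 \right\rceil}
```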

Embedding similarity distribution. Another indication of the effectiveness of the learned embeddings is that the distances between random pairs of item embeddings are widely distributed. If all items are at about the same distance (i.e., the distances are tightly clustered) then the embedding space does not have enough “resolution” to distinguish between items of different relevance.

The distribution of cosine similarities between random pairs of items, computed using annotation, visual, and PinSage embeddings, demonstrates the effectiveness of PinSage: it has the most spread-out distribution. In particular, the kurtosis of the cosine similarities of PinSage embeddings is 0.43, compared to 2.49 for annotation embeddings and 1.20 for visual embeddings.
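A quick sketch of measuring that spread: sample random pairs of embeddings and report the (Fisher) kurtosis of their cosine similarities, using NumPy and SciPy (a lower kurtosis here indicates a wider, less peaked distribution):

```python
import numpy as np
from scipy.stats import kurtosis

def similarity_spread(embeddings, num_pairs=100_000, seed=0):
    """embeddings: [N, d] array. Returns cosine similarities of random pairs
    and their kurtosis."""
    rng = np.random.default_rng(seed)
    n = len(embeddings)
    i = rng.integers(0, n, num_pairs)
    j = rng.integers(0, n, num_pairs)
    a = embeddings[i] / np.linalg.norm(embeddings[i], axis=1, keepdims=True)
    b = embeddings[j] / np.linalg.norm(embeddings[j], axis=1, keepdims=True)
    sims = (a * b).sum(axis=1)
    return sims, kurtosis(sims)
```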

Another important advantage of having such a wide spread in the embeddings is that it reduces the collision probability in the subsequent LSH algorithm, thus increasing the efficiency of serving the nearest-neighbor pins during recommendation.

Resources and Credits:

A Gentle Introduction to Graph Neural Networks

Understanding Convolutions on Graphs

https://www.youtube.com/playlist?list=PLBoQnSflObckArGNhOcNg7lQG_f0ZlHF5
