Graph Neural Networks Series | Part 3 | Node embedding

Omar Hussein
The Modern Scientist
Nov 28, 2022

Introduction

Graph neural networks (GNNs) are a type of neural network that can operate on graphs. A GNN can be used to learn a representation of the nodes in a graph, known as a node embedding. Node embeddings can be used for a variety of tasks, such as link prediction and node classification.

What is an Embedding?

An embedding is a mapping from an input space to a lower-dimensional space. For example, we can map a 3D vector to a 2D vector. We can think of an embedding as a function that takes an input (such as a node in a graph) and outputs a vector. Node embeddings are the special case where that input is a node in a graph, and they can be used for a variety of tasks, such as link prediction and node classification.
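To make this concrete, here is a minimal sketch (in PyTorch; the node count and dimensions are made up for illustration) of an embedding as a learnable lookup table that maps node IDs to low-dimensional vectors:

import torch
import torch.nn as nn

# A lookup table that maps each of 6 node IDs to a 2-dimensional vector.
# The vectors start out random and would normally be learned during training.
num_nodes, embedding_dim = 6, 2
embedding = nn.Embedding(num_nodes, embedding_dim)

node_ids = torch.tensor([0, 3, 5])  # three example nodes
vectors = embedding(node_ids)       # shape: (3, 2)
print(vectors)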

Why Use Node Embeddings?

Node embeddings have a number of advantages over other methods for learning node representations. First, node embeddings can be trained using standard neural network optimization methods. Second, node embeddings can be trained on graphs with arbitrary structure. Finally, node embeddings can be used to represent nodes in multiple graphs simultaneously.

How are Node Embeddings Learned?

Node embeddings are learned by training a neural network to map nodes in a graph to low-dimensional vectors. The network is trained using a loss function that encourages the vectors for similar nodes to be close together and the vectors for dissimilar nodes to be far apart.

To explain this a bit more: when we say that node embeddings are learned, we mean that we train a neural network to generate low-dimensional vector representations (embeddings) for each node in a graph. The process of learning node embeddings involves training the neural network using a specific objective, or loss, function.

The neural network takes the graph as input and applies a series of operations to transform the node features and capture the relationships between nodes. These operations can include graph convolutional layers, pooling layers, or attention mechanisms, depending on the specific architecture of the neural network.

During training, the network aims to optimize its parameters to produce embeddings that capture meaningful information about the nodes in the graph. This optimization is achieved by minimizing a loss function that quantifies the dissimilarity between the predicted node embeddings and the desired target embeddings.

The loss function encourages the network to make embeddings of similar nodes closer together in the embedding space, while pushing embeddings of dissimilar nodes farther apart. This process helps the network to learn representations that capture the underlying structure, characteristics, and relationships within the graph.

By iteratively updating the network’s parameters and minimizing the loss, the neural network gradually learns to generate node embeddings that encode useful information about the nodes in the graph. These embeddings can then be used for various downstream tasks, such as link prediction, node classification, or visualization.

In summary, learning node embeddings involves training a neural network to map nodes in a graph to low-dimensional vectors through an optimization process that encourages similar nodes to have close embeddings and dissimilar nodes to have distant embeddings.
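To give a rough idea of what such a loss can look like, here is a toy contrastive loss on pairs of embeddings (the Euclidean distance, the margin value, and the random data are all illustrative choices, not the only way to do this):

import torch
import torch.nn.functional as F

def pairwise_embedding_loss(z_u, z_v, label, margin=1.0):
    # z_u, z_v: (batch, dim) embeddings of the two nodes in each pair
    # label: 1.0 for similar (e.g. connected) pairs, 0.0 for dissimilar pairs
    dist = torch.norm(z_u - z_v, dim=-1)               # distance per pair
    pos = label * dist.pow(2)                          # pull similar pairs together
    neg = (1 - label) * F.relu(margin - dist).pow(2)   # push dissimilar pairs apart
    return (pos + neg).mean()

# Example with random embeddings for 4 pairs in 16 dimensions
z_u, z_v = torch.randn(4, 16), torch.randn(4, 16)
label = torch.tensor([1.0, 0.0, 1.0, 0.0])
print(pairwise_embedding_loss(z_u, z_v, label))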

Example of the process:

It is okay if you don't fully understand the example here; I just want you to get a bird's-eye view of the process.

Let’s consider a transportation network consisting of several cities connected by roads. We want to learn node embeddings for each city in the network using a graph neural network (GNN) approach. Here’s how the process might unfold:

  1. Data Preparation: We collect data about the transportation network, including information about cities, their geographical coordinates, and the roads connecting them. This data forms the basis of our graph representation.
  2. Graph Construction: We represent the transportation network as a graph, where each city is a node and the roads between cities are edges. The graph structure captures the connections and relationships between cities.
  3. Node Features: We can enrich the graph by including additional features for each city, such as population size, transportation infrastructure, or economic indicators. These features provide context and additional information about the cities.
  4. Node Embedding Learning: We train a graph neural network on this transportation graph to learn node embeddings. The GNN takes the graph structure and node features as input and applies neural network operations to capture the relationships and patterns in the data.
  5. Loss Function: During training, we define a loss function that measures the dissimilarity between the predicted node embeddings and the desired target embeddings. The loss function encourages similar cities (e.g., cities with similar transportation characteristics or geographical proximity) to have embeddings that are close together in the embedding space.
  6. Optimization: The GNN is optimized by iteratively updating its parameters to minimize the loss function. This optimization process fine-tunes the neural network to generate embeddings that capture relevant information about each city’s transportation characteristics.
  7. Node Embedding Interpretation: Once the GNN is trained, we can interpret the learned node embeddings. The embeddings encode meaningful representations of the cities in a lower-dimensional space. We can analyze the embeddings to identify clusters of similar cities based on transportation characteristics, uncover central or influential cities based on their embedding proximity to others, or even use the embeddings as inputs for downstream tasks such as predicting traffic patterns or recommending optimal routes.

By learning node embeddings in this transportation network, we can gain insights into the structural properties, relationships, and characteristics of the cities within the transportation system. The node embeddings capture important information about the cities’ transportation features, allowing us to perform various analyses and make informed decisions related to transportation planning, infrastructure improvements, or optimizing travel experiences for individuals within the network.

I want to go back and expand a bit on point 4, Node Embedding Learning, because there is a lot more to talk about here. Here are some of the things you need to know.

Node Embedding Learning

Input Representation: In our transportation network, the input to the GNN consists of the graph structure, represented by an adjacency matrix or edge list, and node features such as population size, transportation infrastructure, or economic indicators. In PyTorch Geometric (a popular Python framework for graph learning), the edge structure of the graph can be represented using edge_index. edge_index is a 2D tensor that contains the indices of the nodes connected by an edge; each column of the edge_index tensor represents an edge in the graph. For example, if edge_index has shape (2, E), where E is the number of edges, the first row contains the indices of the source nodes and the second row contains the indices of the target nodes.

The edge_index list is a common way to represent the graph structure in many graph neural network libraries, including PyTorch Geometric. It provides a compact and efficient representation that allows the GNN to efficiently process the edge connections during the message passing and aggregation steps.

In addition to the edge structure, the GNN also requires node features to learn meaningful embeddings. These node features can include various attributes or properties of the nodes, such as population size, transportation infrastructure, or economic indicators. The node features are typically represented as a node feature matrix, where each row corresponds to a node and each column represents a specific feature. The node feature matrix is then combined with the edge structure (i.e., edge_index) to form the input representation for the GNN.

Here’s an example of representing a transportation network using the edge_index list and node features in PyTorch Geometric:

import torch
from torch_geometric.data import Data

# Define the edge connections (edge_index)
edge_index = torch.tensor([[0, 1, 1, 2, 3, 3, 4, 4], [1, 0, 2, 1, 2, 4, 3, 5]], dtype=torch.long)

# Define the node features
node_features = torch.tensor([[100, 5.0, 20], [200, 4.2, 15], [150, 4.8, 18], [180, 4.6, 25], [120, 4.9, 12], [250, 4.1, 30]], dtype=torch.float)

# Create the PyTorch Geometric Data object
data = Data(x=node_features, edge_index=edge_index)

# Print the data object
print(data)

In this example, we have a transportation network with 6 nodes and 8 edges. The edge_index represents the connections between the nodes, where each column of the edge_index tensor corresponds to an edge. The node_features tensor contains the features of each node, such as population size, transportation infrastructure, and economic indicators.

By creating a Data object with the node_features and edge_index, we can easily pass this representation to a GNN model in PyTorch Geometric for further processing, such as learning node embeddings or performing downstream tasks on the transportation network data.

Message Passing (We will revisit this point later): The GNN propagates information across the graph through a series of message passing steps. At each step, information is aggregated from a node's neighbors and combined with the node's own features to update its representation.
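To preview what one such step can look like, here is a minimal sketch in plain PyTorch that averages neighbor features and combines them with each node's own features (the tiny graph and the mean aggregation are illustrative choices):

import torch

# Toy graph: 4 nodes with 3 features each; edges stored as (source, target) columns.
x = torch.randn(4, 3)
edge_index = torch.tensor([[0, 1, 2, 3],
                           [1, 0, 3, 2]])
src, dst = edge_index

# Each target node sums the features of its neighbours ...
agg = torch.zeros_like(x).index_add_(0, dst, x[src])
# ... divides by its degree (mean aggregation) ...
deg = torch.zeros(x.size(0)).index_add_(0, dst, torch.ones(dst.size(0))).clamp(min=1)
agg = agg / deg.unsqueeze(-1)
# ... and combines the aggregated message with its own features.
updated = torch.relu(x + agg)
print(updated.shape)  # torch.Size([4, 3])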

Graph Convolutional Layers (We will revisit this point later): The GNN typically consists of multiple graph convolutional layers, where each layer applies a graph-based operation to refine the node representations. These operations often involve aggregating information from neighboring nodes and applying non-linear transformations.

Non-Linearity and Activation Functions: Non-linear activation functions such as ReLU (Rectified Linear Unit) or sigmoid are applied to introduce non-linearity and capture complex relationships between nodes and their features.

Parameter Learning: During training, the GNN’s parameters (weights and biases) are adjusted through an optimization process. This process involves forward propagation to compute the predicted node embeddings, followed by backpropagation to calculate gradients and update the parameters using techniques like stochastic gradient descent (SGD) or Adam optimizer.
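As a tiny, self-contained sketch of this parameter-learning loop (using a plain embedding table in place of a full GNN, and a made-up objective that simply asks two nodes to be similar):

import torch
import torch.nn as nn

num_nodes, dim = 6, 8
embeddings = nn.Embedding(num_nodes, dim)            # the learnable parameters
optimizer = torch.optim.Adam(embeddings.parameters(), lr=0.01)

u, v = torch.tensor([0]), torch.tensor([1])          # pretend these two nodes are connected
for epoch in range(100):
    optimizer.zero_grad()                            # reset gradients
    score = (embeddings(u) * embeddings(v)).sum()    # forward pass
    loss = -torch.log(torch.sigmoid(score))          # edge log-likelihood loss
    loss.backward()                                  # backpropagation
    optimizer.step()                                 # Adam parameter update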

An Encode-Decoder perspective

This way of viewing graph representation learning is very common in how node embedding methods are presented. In the encoder-decoder framework, we view learning node embeddings as learning a mapping from nodes in a graph to low-dimensional vectors, together with a mapping from those vectors back to information about the graph. The encoder is a neural network that maps nodes in a graph to low-dimensional vectors. The decoder is a neural network that maps the vectors back to graph information, such as which nodes are neighbors. The encoder and decoder are trained jointly so that the vectors produced by the encoder can be used by the decoder to reconstruct the original graph. The loss function used to train the encoder-decoder pair is based on the idea that similar nodes should have similar embeddings: it encourages the embeddings of similar nodes to be close together and the embeddings of dissimilar nodes to be far apart.

The encoder-decoder framework can be used to learn node embeddings for any type of graph, including directed graphs, undirected graphs, and graphs with multiple types of nodes and edges.

The encoder

Nodes in the graph get encoded into an embedding space.

The encoder plays a crucial role in learning meaningful representations of nodes in a graph. The encoder is a neural network component responsible for mapping nodes in the graph to low-dimensional vectors, often referred to as node embeddings. The encoder operates on the input graph data, which typically includes the graph structure (e.g., adjacency matrix or edge lists) and node features (e.g., attributes, properties, or contextual information associated with each node).

The encoder performs the following key tasks:

  1. Feature Extraction: The encoder analyzes the input graph data and extracts relevant features from each node. This can involve processing node attributes, incorporating neighborhood information, and considering the global graph structure. The goal is to capture informative characteristics of the nodes that are indicative of their roles, relationships, or importance within the graph.
  2. Non-linear Transformation: The encoder applies non-linear transformations to the extracted features, allowing for complex relationships and patterns in the graph to be captured. This can be achieved through the use of activation functions, such as ReLU (Rectified Linear Unit), which introduce non-linearities into the encoding process.
  3. Embedding Generation: The encoder combines the transformed features to generate low-dimensional vector representations, known as node embeddings. These embeddings aim to capture essential information about each node in a compact and meaningful way. The encoder learns to map nodes to embeddings in a manner that maximizes the representation’s ability to capture relevant graph properties, such as connectivity patterns or community structures.
  4. Latent Space Encoding: The encoder projects the node embeddings into a latent space, where nodes with similar properties or behaviors are expected to be close together. The latent space representation enables efficient comparisons, clustering, and downstream tasks that rely on similarity measures or dimensionality reduction techniques.

The Decoder

The role of the decoder is to reconstruct certain graph statistics from the node embeddings that are generated by the encoder. Given a node embedding Zu, the decoder should be able to recover structural information about node u, for example which other nodes it is connected to. In the case of link prediction, the decoder is a function that takes node embeddings as input and outputs the set of neighbors, or equivalently the corresponding entries of the graph adjacency matrix. While there are many ways of building decoders, the standard method is to define a pairwise decoder. A pairwise decoder predicts the relationship or similarity between a pair of nodes in the graph, so from here on we will refer to this pairwise decoder simply as the decoder. The pair of embeddings (Zu, Zv) is passed to the decoder, which outputs a score for the connection or similarity between the two nodes. The goal is to optimize the encoder and decoder jointly to minimize the reconstruction loss of that relationship. Formally, the encoder-decoder relationship is: dec(enc(u), enc(v)) = dec(Zu, Zv) ≈ S[u, v]. In words: applying the decoder to the encoder outputs for two nodes u and v (which are exactly their embeddings Zu and Zv) should approximately reproduce S[u, v], a graph-based similarity measure between the two nodes.

Optimizing an encoder-decoder model

The encoder-decoder model can be optimized using different methods. The most common approach is a reconstruction loss, which measures how well the decoder can predict the similarity between two nodes in the graph. For a pair of nodes (u, v), the reconstruction loss compares the decoder's prediction dec(enc(u), enc(v)) = dec(Zu, Zv) with the actual similarity S[u, v], where S[u, v] is often simply the adjacency entry A[u, v]. To optimize the model, the goal is to minimize this loss over all training pairs, i.e., to make the predicted similarities match the actual ones as closely as possible. Exactly how the discrepancy is measured, and what `similarity` means, depends on how you define the decoder: the loss function might be a mean-squared error or a classification loss such as cross-entropy.
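Here is a minimal sketch of this optimization, using a freely learnable embedding matrix as the encoder, a dot-product decoder, and a mean-squared reconstruction error against the adjacency matrix; all of these are just one possible set of choices:

import torch
import torch.nn as nn

num_nodes, dim = 5, 4
Z = nn.Parameter(torch.randn(num_nodes, dim))   # encoder output: one embedding per node

# Target similarity: here simply the adjacency matrix, i.e. S[u, v] = A[u, v].
A = torch.tensor([[0., 1., 1., 0., 0.],
                  [1., 0., 1., 0., 0.],
                  [1., 1., 0., 1., 0.],
                  [0., 0., 1., 0., 1.],
                  [0., 0., 0., 1., 0.]])

optimizer = torch.optim.Adam([Z], lr=0.05)
for step in range(200):
    optimizer.zero_grad()
    S_pred = Z @ Z.t()                          # pairwise decoder: dec(Zu, Zv) = Zu · Zv
    loss = ((S_pred - A) ** 2).mean()           # reconstruction loss (mean squared error)
    loss.backward()
    optimizer.step()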

Factorization-Based approaches in Graphs.

Factorization-based methods for learning node representations are based on the idea of factorizing a matrix that represents the relationships between nodes in a graph. For example, given a graph with 𝑁 nodes and 𝑀 edges, we can represent the relationships between the nodes using an 𝑁×𝑁 adjacency matrix 𝐴. Factorizing the adjacency matrix 𝐴 (approximately, 𝐴 ≈ 𝑊𝑉ᵀ) results in two 𝑁×𝐾 matrices 𝑊 and 𝑉, where 𝐾 is the number of dimensions of the node embeddings. Each row of 𝑊 is the embedding of a node, and 𝑉 is a second factor (often interpreted as a set of "context" embeddings) used to reconstruct the edges. There are a number of different ways to factorize the adjacency matrix 𝐴, including singular value decomposition (SVD) and non-negative matrix factorization (NMF).
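A small sketch of the SVD variant (the toy adjacency matrix and K = 2 are made up; real methods differ in which matrix they factorize and how they post-process the factors):

import torch

# Adjacency matrix of a small undirected toy graph.
A = torch.tensor([[0., 1., 1., 0.],
                  [1., 0., 1., 0.],
                  [1., 1., 0., 1.],
                  [0., 0., 1., 0.]])

K = 2                                   # embedding dimension
U, S, Vh = torch.linalg.svd(A)          # A = U diag(S) Vh
W = U[:, :K] * S[:K].sqrt()             # N x K matrix of node embeddings
V = Vh[:K, :].t() * S[:K].sqrt()        # the second N x K factor

print(W.shape, V.shape)                 # torch.Size([4, 2]) torch.Size([4, 2])
print(W @ V.t())                        # low-rank reconstruction of A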

Random walks embedding

Random walk-based methods for learning node representations are based on the idea of randomly walking through the graph and using the sequences of nodes visited by the random walks to learn node embeddings. For example, given a graph with 𝑁 nodes and 𝑀 edges, we can represent the relationships between the nodes using an 𝑁×𝑁 adjacency matrix 𝐴. We can then use the adjacency matrix to define a transition matrix 𝑇, an 𝑁×𝑁 matrix that defines the probabilities of transitioning from one node to another, and use that transition matrix to generate random walks through the graph. The visited node sequences are then used to learn an embedding vector for each node: intuitively, nodes that frequently co-occur on the same walks (i.e., that are likely to reach each other via short random walks) should end up with similar embedding vectors.

Here is an analogy for random walks: imagine you are in a city and you want to go to the grocery store. You could take a bus, which would take you from your current location to the grocery store along a fixed route. Or you could walk, and you would likely take a slightly different path each time, depending on the sidewalks, crosswalks, and other obstacles in your way. In this analogy, the city is the graph, the grocery store is the target node, the bus route is the transition matrix, and the path you take on foot is the random walk. There are a number of different ways to generate a random walk through a graph, including uniform sampling, Metropolis-Hastings sampling, and Gibbs sampling. But simply put, the random walk's purpose is to generate a sequence of nodes that can be used to learn a node embedding. The advantage of random walk-based methods over traditional methods for learning node representations is that they can be applied to any type of graph, including directed graphs, undirected graphs, graphs with multiple types of nodes and edges, and graphs with arbitrary structure.
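Here is a minimal sketch of generating uniform random walks over a toy adjacency list (the graph, walk length, and number of walks are arbitrary); methods like DeepWalk and node2vec feed such node sequences into a skip-gram style model to learn the embeddings:

import random

# Adjacency list of a small toy graph.
neighbors = {
    0: [1, 2],
    1: [0, 2],
    2: [0, 1, 3],
    3: [2],
}

def random_walk(start, length):
    # Uniform random walk: at each step, hop to a randomly chosen neighbour.
    walk = [start]
    for _ in range(length - 1):
        walk.append(random.choice(neighbors[walk[-1]]))
    return walk

# These sequences play the role of "sentences" for a skip-gram style model.
walks = [random_walk(node, length=5) for node in neighbors for _ in range(3)]
print(walks[:3])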

Link Prediction Example

We use node embeddings here for link prediction.

import os.path as osp

import torch
from sklearn.metrics import roc_auc_score

import torch_geometric.transforms as T
from torch_geometric.datasets import Planetoid
from torch_geometric.nn import GCNConv
from torch_geometric.utils import negative_sampling

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
transform = T.Compose([
    T.NormalizeFeatures(),
    T.ToDevice(device),
    T.RandomLinkSplit(num_val=0.05, num_test=0.1, is_undirected=True,
                      add_negative_train_samples=False),
])
path = osp.join(osp.dirname(osp.realpath(__file__)), '..', 'data', 'Planetoid')
dataset = Planetoid(path, name='Cora', transform=transform)
# After applying the `RandomLinkSplit` transform, the data is transformed from
# a data object to a list of tuples (train_data, val_data, test_data), with
# each element representing the corresponding split.
train_data, val_data, test_data = dataset[0]


class Net(torch.nn.Module):
    def __init__(self, in_channels, hidden_channels, out_channels):
        super().__init__()
        self.conv1 = GCNConv(in_channels, hidden_channels)
        self.conv2 = GCNConv(hidden_channels, out_channels)

    def encode(self, x, edge_index):
        x = self.conv1(x, edge_index).relu()
        return self.conv2(x, edge_index)

    def decode(self, z, edge_label_index):
        return (z[edge_label_index[0]] * z[edge_label_index[1]]).sum(dim=-1)

    def decode_all(self, z):
        prob_adj = z @ z.t()
        return (prob_adj > 0).nonzero(as_tuple=False).t()


model = Net(dataset.num_features, 128, 64).to(device)
optimizer = torch.optim.Adam(params=model.parameters(), lr=0.01)
criterion = torch.nn.BCEWithLogitsLoss()


def train():
    model.train()
    optimizer.zero_grad()
    z = model.encode(train_data.x, train_data.edge_index)

    # We perform a new round of negative sampling for every training epoch:
    neg_edge_index = negative_sampling(
        edge_index=train_data.edge_index, num_nodes=train_data.num_nodes,
        num_neg_samples=train_data.edge_label_index.size(1), method='sparse')

    # Concat positive and negative edge indices.
    edge_label_index = torch.cat(
        [train_data.edge_label_index, neg_edge_index],
        dim=-1,
    )
    # Label for positive edges: 1, for negative edges: 0.
    edge_label = torch.cat([
        train_data.edge_label,
        train_data.edge_label.new_zeros(neg_edge_index.size(1))
    ], dim=0)

    # Note: The model is trained in a supervised manner using the given
    # `edge_label_index` and `edge_label` targets.
    out = model.decode(z, edge_label_index).view(-1)
    loss = criterion(out, edge_label)
    loss.backward()
    optimizer.step()
    return loss


@torch.no_grad()
def test(data):
    model.eval()
    z = model.encode(data.x, data.edge_index)
    out = model.decode(z, data.edge_label_index).view(-1).sigmoid()
    return roc_auc_score(data.edge_label.cpu().numpy(), out.cpu().numpy())


# Train/Test Loop
best_val_auc = final_test_auc = 0
for epoch in range(1, 101):
    loss = train()
    val_auc = test(val_data)
    test_auc = test(test_data)
    if val_auc > best_val_auc:
        best_val_auc = val_auc
        final_test_auc = test_auc
    print(f'Epoch: {epoch:03d}, Loss: {loss:.4f}, Val: {val_auc:.4f}, '
          f'Test: {test_auc:.4f}')

print(f'Final Test: {final_test_auc:.4f}')

z = model.encode(test_data.x, test_data.edge_index)
final_edge_index = model.decode_all(z)

This code performs link prediction using a Graph Convolutional Network (GCN) on the Cora dataset. Here’s an explanation of the key parts of the code:

  1. Dataset Preparation:
  • The Planetoid dataset class from torch_geometric.datasets is used to load the Cora dataset.
  • The data is transformed using the RandomLinkSplit transform, which splits the edges into train, validation, and test sets. Negative samples are added to the validation and test splits, while negative training edges are sampled anew in every epoch (since add_negative_train_samples=False).
  • The train, validation, and test data are extracted from the transformed dataset.

2. Model Definition:

  • The Net class defines the GCN model. It consists of two GCN layers (GCNConv), which are used for encoding the node features.
  • The encode method applies the GCN layers to the input features and edge indices to obtain node embeddings.
  • The decode method calculates the link scores between nodes using the node embeddings.
  • The decode_all method scores every pair of nodes via the inner product of their embeddings and returns the edge indices of all pairs with a positive score, i.e., a predicted adjacency structure for the whole graph.

3. Model Training:

  • The train function performs a single training iteration. It sets the model to training mode, resets the optimizer gradients, encodes the node features using the GCN layers, performs negative sampling to generate negative examples, concatenates positive and negative edge indices, creates edge labels, calculates the link scores using the node embeddings, computes the loss using the BCEWithLogitsLoss criterion, performs backpropagation, and updates the model parameters.
  • The test function evaluates the model on the provided data. It sets the model to evaluation mode, encodes the node features, calculates the link scores, applies the sigmoid function to obtain probabilities, and computes the ROC AUC score.

4. Train/Test Loop:

  • The code runs a loop for a specified number of epochs (100 in this case).
  • In each epoch, it trains the model on the training data, evaluates it on the validation and test data, and keeps track of the best validation and final test ROC AUC scores.
  • The loss, validation ROC AUC, and test ROC AUC are printed for each epoch.

Shallow Embedding for Multi-Relational Data and Knowledge Graphs

In the realm of graph-based machine learning, shallow embedding methods have proven effective at capturing the relationships between nodes. While we have previously explored shallow embeddings (node embeddings) for simple graphs, this section delves deeper into the exciting world of multi-relational graphs. We will explore the challenges associated with multi-relational data and knowledge graphs, discuss the concept of knowledge graph completion, and present various approaches to shallow embedding in this context.

Understanding Multi-Relational Graphs and Knowledge Graph Completion

Multi-relational graphs are complex networks consisting of nodes and edges, where each edge represents a specific relation between two nodes. These graphs are often referred to as knowledge graphs since they encode factual information as tuples of the form (u, τ, v), signifying a relation τ between nodes u and v. For instance, in a biomedical knowledge graph, an edge (u, TREATS, v) could indicate that the drug associated with node u treats the disease associated with node v. The primary task in knowledge graph completion is to predict missing edges or relations within the graph.

Reconstructing Multi-Relational Data

Similar to the case of simple graphs, embedding multi-relational graphs can be viewed as a reconstruction task. The goal is to learn low-dimensional embeddings for nodes that can accurately reconstruct the relationships between them. However, in multi-relational graphs, we face the additional challenge of dealing with diverse types of edges or relations.

To tackle this challenge, we enhance the decoder component of our model to make it multi-relational. Instead of solely considering pairs of node embeddings, the decoder now takes into account both node embeddings and relation types. This enables the model to predict the likelihood of the existence of an edge between two nodes, given their embeddings and the relation type. Various decoder functions have been proposed, including RESCAL, TransE, TransX, DistMult, ComplEx, and RotatE. Each decoder offers a unique approach to encoding and decoding relations between nodes.
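To make one of these concrete, here is a hedged sketch of a DistMult-style decoder, which scores a triple (u, τ, v) as the sum of the element-wise product of the two node embeddings and a relation embedding (the sizes and initialization below are purely illustrative):

import torch
import torch.nn as nn

num_nodes, num_relations, dim = 100, 5, 32
node_emb = nn.Embedding(num_nodes, dim)       # one embedding z_u per node
rel_emb = nn.Embedding(num_relations, dim)    # one embedding r_tau per relation type

def distmult_score(u, tau, v):
    # Higher score = the triple (u, tau, v) is judged more plausible.
    return (node_emb(u) * rel_emb(tau) * node_emb(v)).sum(dim=-1)

# Example: how plausible is relation 2 holding between nodes 4 and 7?
u, tau, v = torch.tensor([4]), torch.tensor([2]), torch.tensor([7])
print(distmult_score(u, tau, v))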

Loss Functions for Multi-Relational Node Embedding

Choosing an appropriate loss function is crucial for training the model in multi-relational node embedding. One popular and efficient choice is the cross-entropy loss with negative sampling. This loss function compares the predicted probabilities of true and false edges in the graph. By sampling negative examples and employing logistic regression, the model can learn to differentiate between true and false relations more effectively.
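A minimal sketch of such a loss, reusing `distmult_score` and `num_nodes` from the sketch above and corrupting the tail node to create negative triples (the number of negatives and the corruption scheme are illustrative assumptions):

import torch
import torch.nn.functional as F

def kg_negative_sampling_loss(u, tau, v, num_neg=5):
    # Assumes `distmult_score` and `num_nodes` from the previous sketch.
    pos_score = distmult_score(u, tau, v)                         # observed triples -> label 1
    v_neg = torch.randint(0, num_nodes, (u.size(0) * num_neg,))   # corrupted (random) tails
    neg_score = distmult_score(u.repeat_interleave(num_neg),
                               tau.repeat_interleave(num_neg),
                               v_neg)                             # corrupted triples -> label 0
    scores = torch.cat([pos_score, neg_score])
    labels = torch.cat([torch.ones_like(pos_score), torch.zeros_like(neg_score)])
    return F.binary_cross_entropy_with_logits(scores, labels)

# Example batch of two observed triples (u, tau, v)
u, tau, v = torch.tensor([4, 10]), torch.tensor([2, 0]), torch.tensor([7, 33])
print(kg_negative_sampling_loss(u, tau, v))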


References

[1] Graph Representation Learning, William L. Hamilton

[2] PyTorch Geometric: pytorch_geometric/link_pred.py at master · pyg-team/pytorch_geometric · GitHub
