Graph Representational Learning: Creating node and graph embeddings

Dhaval Taunk


In the previous blog post, we covered traditional methods for extracting features from input graphs. In this post, we’ll delve into various approaches for generating node and graph-level embeddings. This includes techniques such as DeepWalk and Node2Vec for node embeddings, as well as Anonymous Walk for graph-level embeddings. Let’s dive in!

Node Embedding

The primary challenge with traditional hand-crafted features lies in their time-consuming creation process and their task-specific nature. Consequently, there is a pressing need for a more efficient method to encode input features. What are our next steps?

In natural language processing, learned embeddings such as Word2Vec are well established. We can apply a similar idea to graphs. Let's delve into this further.

1. Encoder-Decoder Approach

The primary objective of this approach is to generate node-level embeddings in such a way that nodes with similar characteristics are represented by embeddings that are closer to each other, while embeddings of dissimilar nodes are farther apart.

Perozzi et al. DeepWalk: Online Learning of Social Representations. KDD 2014

Let’s explore the process to achieve this. Assume we have a graph G with:

  • V as the vertex set,
  • A as the adjacency matrix.

The objective is to encode nodes in such a way that their similarity in the embedding space (e.g., measured by dot product) reflects their similarity in the graph structure.

Stanford CS224W: Machine Learning for Graphs https://cs224w.stanford.edu

Process:

  1. The encoder ENC maps each node to a low-dimensional embedding.
  2. Define a node similarity function that measures similarity in the original network.
  3. The decoder DEC maps pairs of embeddings to a similarity score.
  4. Optimize the parameters of the encoder so that the similarity of the embeddings (as computed by the decoder) matches the similarity of the nodes in the original network.

similarity(u, v) ≈ zᵤᵀzᵥ

Let’s clarify the roles of the encoder and the similarity function mentioned earlier:

Encoder: The encoder’s role is to transform each node into a low-dimensional vector representation, known as an embedding.

Stanford CS224W: Machine Learning for Graphs https://cs224w.stanford.edu

Similarity Function: Specifies how the relationships in the vector space correspond to those in the original network. It plays a key role in the decoder section of the approach.

As for initializing these embeddings, a common approach is:

Shallow Encodings: A straightforward approach is to treat the encoder as an embedding lookup.

Stanford CS224W: Machine Learning for Graphs https://cs224w.stanford.edu

Each node is assigned a unique embedding vector, meaning we directly optimize the embedding for each individual node.
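
To make this concrete, here is a minimal sketch (assuming PyTorch; the graph size and embedding dimension are illustrative, not from the post): the shallow encoder is just an embedding matrix indexed by node ID, and the dot product plays the role of the decoder.

```python
import torch
import torch.nn as nn

# Minimal sketch of a shallow encoder: one learnable vector per node.
num_nodes, dim = 1000, 64          # illustrative sizes
Z = nn.Embedding(num_nodes, dim)   # encoder = embedding lookup

def decode(u: int, v: int) -> torch.Tensor:
    """Decoder: dot-product similarity between two node embeddings."""
    return (Z(torch.tensor(u)) * Z(torch.tensor(v))).sum()

score = decode(0, 1)  # similarity(0, 1) ≈ z_0ᵀ z_1
```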

2. Random Walk Approach

Given a graph and a starting point, we randomly select a neighbor of the starting point and move to this neighbor. We continue this process by selecting random neighbors of the current point and moving accordingly. This sequence of points visited in this manner constitutes a random walk on the graph.

zᵤᵀzᵥ ≈ probability that u and v co-occur on a random walk over the graph
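
As a small sketch of such a walk (assuming a NetworkX graph and uniform neighbor selection):

```python
import random
import networkx as nx

def random_walk(G: nx.Graph, start, length: int) -> list:
    """Uniform random walk of fixed length starting from `start`."""
    walk = [start]
    for _ in range(length - 1):
        neighbors = list(G.neighbors(walk[-1]))
        if not neighbors:              # dead end: stop the walk early
            break
        walk.append(random.choice(neighbors))
    return walk

G = nx.karate_club_graph()             # toy example graph
print(random_walk(G, start=0, length=5))
```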

Steps:

  • Estimate the probability of visiting node v on a random walk starting from node u using a specific random walk strategy R.
  • Optimize embeddings to encode these statistics derived from random walks.

We will explore how to optimize the embeddings. Before that, let’s delve into why we employ the Random Walk approach.

  1. Expressivity: This method offers a flexible stochastic definition of node similarity, integrating both local and higher-order neighborhood details. The idea is that if a random walk originating from node 𝒖 frequently visits 𝒗, then 𝒖 and 𝒗 are deemed similar, leveraging high-order multi-hop information.
  2. Efficiency: By focusing solely on pairs of nodes that co-occur during random walks, we eliminate the need to consider all possible node pairs during training.

Next, let’s explore the process of learning the embeddings. Clearly, from its description, this process is an unsupervised feature learning procedure.

  1. Intuition: The goal is to find embeddings of nodes in a d-dimensional space that preserve their similarities.
  2. Idea: The objective is to learn node embeddings such that nodes that are close in the network are also close in the embedding space.
  3. For a given node 𝑢, the notion of nearby nodes is defined by 𝑁ᵣ(u), the neighborhood of 𝑢 obtained through some random walk strategy 𝑅.
  4. Given a graph 𝐺 = (𝑉, 𝐸), our aim is to learn a mapping 𝑓: 𝑢 → ℝᵈ, where 𝑓(𝑢) = 𝐳ᵤ is the embedding vector of node 𝑢.

The objective function is structured as follows:

max𝑓 ∑ᵤ∈V log P(𝑁ᵣ(u) | 𝐳ᵤ)
Stanford CS224W: Machine Learning for Graphs https://cs224w.stanford.edu

Given node 𝑢, our aim is to learn feature representations that predict nodes within its random walk neighborhood 𝑁ᵣ(u).

Optimization Steps:

  1. Conduct short fixed-length random walks starting from each node 𝑢 in the graph using a specified random walk strategy R.
  2. For each node 𝑢, gather 𝑁ᵣ(𝑢), the multiset of nodes visited during random walks initiated from 𝑢 (a node may appear multiple times if it is visited more than once).
  3. Optimize embeddings based on the following principle: given a node 𝑢, predict its neighbors 𝑁ᵣ(u), i.e., maximize ∑ᵤ∈V log P(𝑁ᵣ(u) | 𝐳ᵤ).
Stanford CS224W: Machine Learning for Graphs https://cs224w.stanford.edu

which is equivalent to:

ℒ = ∑ᵤ∈V ∑ᵥ∈𝑁ᵣ(u) −log P(𝑣 | 𝐳ᵤ)
Stanford CS224W: Machine Learning for Graphs https://cs224w.stanford.edu

Intuition: We optimize embeddings 𝑧ᵤ to maximize the likelihood of co-occurrences in random walks.

We parameterize 𝑃(𝑣|𝐳ᵤ) using the softmax.

P(𝑣 | 𝐳ᵤ) = exp(𝐳ᵤᵀ𝐳ᵥ) / ∑ₙ∈V exp(𝐳ᵤᵀ𝐳ₙ)
Stanford CS224W: Machine Learning for Graphs https://cs224w.stanford.edu

Why softmax? We aim for node 𝑣 to be most similar to node 𝑢 among all nodes 𝑛.

Intuition: ∑ᵢ exp(xᵢ) ≈ maxᵢ exp(𝑥ᵢ)
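
As a quick numerical illustration (plain NumPy with made-up embeddings, not part of the original post), the softmax turns dot-product scores into a probability over all candidate nodes:

```python
import numpy as np

rng = np.random.default_rng(0)
Z = rng.normal(size=(10, 16))          # toy embeddings for 10 nodes

def p_v_given_u(u: int, v: int) -> float:
    """Softmax probability P(v | z_u), normalized over every node n."""
    scores = Z @ Z[u]                  # z_uᵀ z_n for all n
    scores -= scores.max()             # for numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()
    return float(probs[v])

print(p_v_given_u(0, 3))
```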

Now, let’s combine all these insights. The overall equation looks like this:

ℒ = ∑ᵤ∈V ∑ᵥ∈𝑁ᵣ(u) −log( exp(𝐳ᵤᵀ𝐳ᵥ) / ∑ₙ∈V exp(𝐳ᵤᵀ𝐳ₙ) )
Stanford CS224W: Machine Learning for Graphs https://cs224w.stanford.edu

Therefore, we can conclude that:

Optimizing random walk embeddings involves finding embeddings 𝐳ᵤ that minimize L.

However, for large graphs, optimizing the softmax becomes impractical, because its normalizing term sums over every node in the graph. So, what is the solution?

This is where Negative Sampling comes into play. First, let’s explain what negative sampling is, and then we’ll explore how to apply it for our purposes.

Negative Sampling: Negative sampling is a method in machine learning, often used in neural networks like word embeddings, where instead of learning from all possible examples, the model trains on a subset of “negative” examples alongside positive ones to improve efficiency and performance.

Now, let’s discuss how to use it in our case.

log( exp(𝐳ᵤᵀ𝐳ᵥ) / ∑ₙ∈V exp(𝐳ᵤᵀ𝐳ₙ) ) ≈ log σ(𝐳ᵤᵀ𝐳ᵥ) − ∑ᵢ₌₁ᵏ log σ(𝐳ᵤᵀ𝐳ₙᵢ), where nᵢ ~ P_V (a random distribution over nodes) and σ is the sigmoid function.
Stanford CS224W: Machine Learning for Graphs https://cs224w.stanford.edu

Sample 𝑘 negative nodes, each selected with a probability proportional to its degree. There are two considerations regarding 𝑘 (# of negative samples):

  • A higher 𝑘 provides more robust estimates.
  • However, a higher 𝑘 also introduces a higher bias towards negative events. In practice, 𝑘 typically ranges from 5 to 20.
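
Putting this together, here is a minimal sketch of the resulting per-pair loss (assuming PyTorch; `z_u`, `z_v`, and `z_neg` are embeddings produced by whatever encoder you use):

```python
import torch
import torch.nn.functional as F

def neg_sampling_loss(z_u: torch.Tensor, z_v: torch.Tensor, z_neg: torch.Tensor) -> torch.Tensor:
    """Negative-sampling loss for one (u, v) random-walk co-occurrence.

    z_u, z_v: embeddings of the co-occurring pair, shape (dim,)
    z_neg:    embeddings of k sampled negative nodes, shape (k, dim)
    """
    pos = F.logsigmoid(z_u @ z_v)              # pull u and v together
    neg = F.logsigmoid(-(z_neg @ z_u)).sum()   # push the k negatives away
    return -(pos + neg)
```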

After formulating the objective function, the next step is to optimize (minimize) it. This is usually accomplished using iterative optimization methods such as stochastic gradient descent (SGD) or its variants, which adjust the embeddings to gradually reduce the objective function’s value until convergence.

Gradient Descent: a straightforward method to minimize ℒ:

  1. Initialize 𝑧ᵢ to random values for all 𝑖.
  2. Iterate until convergence.

a. Compute the derivative ∂ℒ/∂zᵢ for all 𝑖.

b. Update each 𝑧ᵢ by taking a step in the opposite direction of its derivative:

𝑧ᵢ ← 𝑧ᵢ − 𝜂 ∂ℒ/∂𝑧ᵢ, where 𝜂 is the learning rate.
Stanford CS224W: Machine Learning for Graphs https://cs224w.stanford.edu
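
A tiny illustrative loop (plain NumPy; `grad_fn` is a hypothetical function that returns ∂ℒ/∂Z for the objective above):

```python
import numpy as np

def gradient_descent(Z: np.ndarray, grad_fn, lr: float = 0.01, steps: int = 100) -> np.ndarray:
    """Minimize the loss over the embedding matrix Z with plain gradient descent."""
    for _ in range(steps):
        Z = Z - lr * grad_fn(Z)   # step against the gradient
    return Z
```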

That’s the essence of how we perform random walks over graphs to learn embeddings for them. The DeepWalk concept is built upon this intuition. Let’s delve into it a bit.

3. DeepWalk

DeepWalk is a method for learning representations of nodes in a graph by leveraging techniques from natural language processing. It uses the random walk technique explained above to generate sequences of nodes in the graph and then applies a Skip-gram model (similar to word2vec) to learn embeddings that capture the structural context of nodes.

By treating nodes as words and sequences of nodes as sentences, DeepWalk learns distributed representations that encode the graph’s topology and connectivity patterns. These embeddings can be used for various tasks such as node classification, link prediction, and community detection in complex networks.
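
A compact sketch of this pipeline (assuming gensim and NetworkX are available, and reusing the `random_walk` helper sketched earlier; the walk counts and hyperparameters are illustrative):

```python
from gensim.models import Word2Vec
import networkx as nx

G = nx.karate_club_graph()
# "Sentences" = random walks; node IDs are cast to strings for Word2Vec.
walks = [
    [str(n) for n in random_walk(G, start=node, length=10)]
    for node in G.nodes()
    for _ in range(20)             # 20 walks per node (illustrative choice)
]
# sg=1 selects the Skip-gram variant, as in the original DeepWalk paper.
model = Word2Vec(walks, vector_size=64, window=5, min_count=1, sg=1, workers=4)
z_u = model.wv["0"]                # learned embedding of node 0
```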

4. Node2Vec

Node2vec draws inspiration from the Random Walk approach. The key distinction lies in its use of biased walks rather than random walks to generate and learn embeddings. This approach aims to employ flexible biased random walks that balance between local and global perspectives of the network.

Stanford CS224W: Machine Learning for Graphs https://cs224w.stanford.edu

In the image above, two strategies are employed to calculate walks of length 3. The first strategy, Breadth First Search (BFS), provides a local microscopic view by traversing neighboring nodes. The second strategy, Depth First Search (DFS), offers a global macroscopic view by exploring distant nodes within the graph.

A biased fixed-length random walk strategy 𝑅, given a node 𝒖, generates a neighborhood 𝑁ᵣ(𝒖). Two key parameters are involved:

  • Return parameter 𝒑: Determines the likelihood of returning to the previous node.
  • In-out parameter 𝒒: Balances moving outward, away from the previous node (DFS-like), versus staying close to it (BFS-like). Intuitively, 𝒒 represents the “ratio” of BFS to DFS behavior.

Biased 2nd-order random walks explore network neighborhoods in the following manner:

  1. After traversing the edge (𝑠₁, 𝑤), the random walk is now at node 𝑤.
  2. Insight: The neighbors of 𝑤 can only be closer to 𝑠₁ (i.e., 𝑠₁ itself), at the same distance from 𝑠₁, or farther away from 𝑠₁.
Stanford CS224W: Machine Learning for Graphs https://cs224w.stanford.edu

Make sure to remember the origin of the walk.

Stanford CS224W: Machine Learning for Graphs https://cs224w.stanford.edu
  1. 𝑝, 𝑞 model transition probabilities
    - 𝑝 … return parameter
    - 𝑞 … ”walk away” parameter

Walker came over edge (𝐬𝟏, 𝐰) and is at 𝐰. Where to go next?

The unnormalized transition probabilities are 1/𝑝 for returning to 𝑠₁, 1 for moving to a node at the same distance from 𝑠₁, and 1/𝑞 for moving farther away from 𝑠₁.
Stanford CS224W: Machine Learning for Graphs https://cs224w.stanford.edu
  • BFS-like walk: Low value of 𝑝
  • DFS-like walk: Low value of 𝑞

𝑁ᵣ(𝑢) are the nodes visited by the biased walk.
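
A minimal sketch of the biased transition itself (assuming a NetworkX graph; `prev` is the node the walker just came from):

```python
import random
import networkx as nx

def node2vec_step(G: nx.Graph, prev, curr, p: float, q: float):
    """One biased 2nd-order step: pick the next node given (prev, curr)."""
    neighbors = list(G.neighbors(curr))
    weights = []
    for x in neighbors:
        if x == prev:                  # return to the previous node: weight 1/p
            weights.append(1.0 / p)
        elif G.has_edge(x, prev):      # same distance from prev: weight 1
            weights.append(1.0)
        else:                          # move farther away: weight 1/q
            weights.append(1.0 / q)
    return random.choices(neighbors, weights=weights, k=1)[0]
```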

Algorithm:

  1. Compute random walk probabilities.
  2. Simulate 𝑟 random walks of length 𝑙 starting from each node 𝑢.
  3. Optimize the node2vec objective using Stochastic Gradient Descent.

Advantages:

  • Linear-time complexity
  • All 3 steps are individually parallelizable

Note: It has been observed in different works that node2vec performs better on node classification while alternative methods perform better on link prediction.

Graph Embeddings

The first two approaches are fundamental, so I’ll provide a brief overview without diving into details. Then, we’ll discuss popular graph embedding algorithms such as Anonymous Walk Embeddings and Graph2Vec.

1. First Approach:

  • Apply a standard node embedding technique (e.g., node2vec) to the (sub)graph 𝐺.
  • Then simply aggregate the node embeddings in the (sub)graph 𝐺 by summing or averaging them.
Stanford CS224W: Machine Learning for Graphs https://cs224w.stanford.edu
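
A one-line sketch of this aggregation (NumPy; `node_embeddings` could be, for example, the vectors learned by DeepWalk or node2vec above):

```python
import numpy as np

def graph_embedding(node_embeddings: np.ndarray) -> np.ndarray:
    """Approach 1: average (or sum) the node embeddings into one graph vector."""
    return node_embeddings.mean(axis=0)
```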

2. Second Approach

  • Introduce a “virtual node” connected to all nodes of the (sub)graph to represent it, then apply a standard node embedding technique to the augmented graph.
Stanford CS224W: Machine Learning for Graphs https://cs224w.stanford.edu
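
A quick sketch of the virtual-node idea (NetworkX; the label "virtual" is arbitrary):

```python
import networkx as nx

def add_virtual_node(G: nx.Graph, label="virtual") -> nx.Graph:
    """Approach 2: add a virtual node connected to every node of the (sub)graph."""
    H = G.copy()
    H.add_node(label)
    H.add_edges_from((label, n) for n in G.nodes())
    return H
```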

3. Anonymous Walk Embeddings

In an anonymous walk, each state is the index of the first visit to that node during the random walk, rather than the node’s identity. For example, the walk A → B → C → B → C becomes the anonymous walk 1-2-3-2-3.

Stanford CS224W: Machine Learning for Graphs https://cs224w.stanford.edu

Steps:

  • Simulate anonymous walks of length 𝑙 and record their frequencies.
  • Represent the graph using a probability distribution based on these walks.

For example:

  • Let 𝑙 = 3
  • We represent the graph with a 5-dimensional vector, since there are 5 anonymous walks 𝑤ᵢ of length 3: 111, 112, 121, 122, 123.
  • 𝑍𝓰[𝑖] = probability of anonymous walk 𝑤ᵢ in 𝐺
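
A short sketch of both steps, anonymization and the resulting distribution (plain Python):

```python
from collections import Counter

def anonymize(walk):
    """Replace node IDs by the index of their first visit, e.g. A-B-C-B-C -> (1,2,3,2,3)."""
    first_seen = {}
    return tuple(first_seen.setdefault(node, len(first_seen) + 1) for node in walk)

def walk_distribution(walks):
    """Represent a graph by the empirical distribution over its anonymous walks."""
    counts = Counter(anonymize(w) for w in walks)
    total = sum(counts.values())
    return {aw: c / total for aw, c in counts.items()}
```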

Now, let’s discuss how to sample the anonymous walks:

  1. Generate a set of 𝑚 independent random walks.
  2. Represent the graph using a probability distribution based on these walks.
  3. How many random walks 𝑚 do we need?
    - We require 𝑚 such that the estimated distribution has an error of more than 𝜀 with probability less than 𝛿.
Stanford CS224W: Machine Learning for Graphs https://cs224w.stanford.edu

The previous approach represents each walk by its frequency of occurrence. What if, instead, we could learn embeddings zᵢ for each anonymous walk wᵢ?

  1. We aim to learn a graph embedding 𝒁𝓰 along with embeddings 𝒛ᵢ for all anonymous walks:
    - 𝒁 = {𝒛ᵢ : 𝑖 = 1 ... 𝜂}, where 𝜂 is the number of sampled anonymous walks.
  2. How do we embed walks?
    - Embed walks so that the subsequent walk can be predicted.
  3. We use a vector parameter 𝒁𝓰 for the input graph.
    - The entire graph's embedding is to be learned.
  4. Starting from node 1: Sample anonymous random walks. For example:
Stanford CS224W: Machine Learning for Graphs https://cs224w.stanford.edu

5. Learn to predict walks that co-occur within a Δ-size window (e.g., predict 𝑤₂ given 𝑤₁ and 𝑤₃ if Δ = 1).

Stanford CS224W: Machine Learning for Graphs https://cs224w.stanford.edu

6. Sum the objective across all nodes in the graph.

7. Conduct 𝑻 different random walks from 𝒖, each of length 𝒍.

Stanford CS224W: Machine Learning for Graphs https://cs224w.stanford.edu

8. Learn to predict walks that co-occur within a Δ-size window.

9. Estimate the embedding 𝑧ᵢ of the anonymous walk 𝑤ᵢ. Let 𝜂 be the number of all possible walk embeddings.

Stanford CS224W: Machine Learning for Graphs https://cs224w.stanford.edu

10. After optimization, we obtain the graph embedding 𝒛𝑮 (a learnable parameter).

11. Utilize 𝒁𝓰 to make predictions, such as graph classification:
— Option 1: Inner product kernel 𝒁𝓰₁ᵀ𝒁𝓰₂.
— Option 2: Employ a neural network that takes 𝒁𝓰 as input for classification.

4. Graph2Vec

Graph2Vec is a method for learning fixed-size embeddings (vector representations) of entire graphs that encapsulate their structural and topological properties, unlike node embeddings, which represent individual nodes in isolation.

https://arxiv.org/pdf/1707.05005

Steps to calculate Graph2Vec embeddings:

  1. Graph Representation:
    - Input: Start with a dataset consisting of multiple graphs G = {G₁, G₂, …, Gₙ}.
    - Each graph Gᵢ is represented as a structured entity with nodes, edges, and optionally node attributes and edge weights.
  2. Feature Extraction:
    - Extract meaningful features from each graph Gᵢ. This typically involves capturing global and local structural characteristics of the graph.
    - Example features include node degree distribution, graph centrality measures, subgraph statistics, or other relevant graph properties.
  3. Subgraph Sampling (Neighborhood Sampling): For each graph Gᵢ:
    - Random Walks or Neighborhood Sampling: Perform random walks or other neighborhood sampling techniques to extract subgraphs (neighborhoods) around each node within Gᵢ.
    - Context Extraction: Collect these sampled subgraphs to use as local contexts for learning embeddings.
  4. Graph Embedding Learning:
    - Embedding Model: Utilize a model inspired by Skip-gram (as used in word2vec).
    - Objective: Train the model to predict the local contexts (sampled subgraphs) given a central subgraph (target) within each graph Gᵢ.
    - Loss Function: Define a loss function (typically cross-entropy or negative-sampling based) to optimize the model parameters.
  5. Training:
    - Optimization: Use stochastic gradient descent (SGD) or other optimization techniques to minimize the loss function.
    - Iterations: Iterate over the dataset multiple times (epochs) to improve the quality of the embeddings.
    - Hyperparameters: Adjust hyperparameters such as embedding dimensions, window size (for context sampling), learning rate, and number of iterations based on validation performance.
  6. Graph2Vec Embeddings:
    - Output: After training, obtain a fixed-size vector representation (embedding) for each graph Gᵢ.
    - Each embedding vᵢ captures the structural and topological properties of the corresponding graph Gᵢ.
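
For a hands-on sketch, one open-source implementation of Graph2Vec is available in the karateclub package (assuming it and NetworkX are installed; the dataset and hyperparameters below are purely illustrative):

```python
import networkx as nx
from karateclub import Graph2Vec

# Toy dataset of small graphs (karateclub expects nodes labeled 0..n-1).
graphs = [nx.erdos_renyi_graph(n=20, p=0.2, seed=s) for s in range(5)]

model = Graph2Vec(dimensions=64, wl_iterations=2)  # illustrative hyperparameters
model.fit(graphs)
embeddings = model.get_embedding()   # array of shape (num_graphs, 64)
```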

So, that’s it for graph representational learning. There are a couple of other algorithms that I haven’t discussed, such as LINE and NetMF. Feel free to explore them and share your thoughts in the comments.
