Neighborhood Reconstruction with DeepWalk to Create Node Embeddings

AMA
Mar 25, 2024

In the previous article, we covered some foundational concepts about graph properties. I hope you are as excited as I am to explore one of the most groundbreaking works in the field of Graph Neural Networks as we dive into the building blocks of DeepWalk. But first: what is neighborhood reconstruction?

What is Neighborhood Reconstruction?

Neighborhood reconstruction methods aim to learn node embeddings by capturing the local graph structure and the relationships between nodes. These methods employ an encoder-decoder framework, where an encoder (ENC) maps nodes to low-dimensional embeddings and a decoder reconstructs information about each node’s neighborhood in the original graph. Below is a brief overview of the main families of methods before we dive into DeepWalk, which is the focus of this article; please feel free to explore the reference section for more depth.
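In the notation of Hamilton's Graph Representation Learning book (William L., 2020, in the references), this framework can be summarized in two lines of LaTeX; here S[u, v] denotes whichever neighborhood-similarity measure the method chooses to reconstruct (adjacency, overlap, or co-occurrence statistics), a symbol borrowed for illustration rather than one used elsewhere in this article:

\mathrm{ENC} : \mathcal{V} \rightarrow \mathbb{R}^{d}, \qquad \mathbf{z}_{v} = \mathrm{ENC}(v)

\mathrm{DEC}(\mathbf{z}_{u}, \mathbf{z}_{v}) \approx \mathbf{S}[u, v]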

Figure 1 — Node embedding illustration which maps nodes to a low-dimensional embedding space.

Neighborhood reconstruction methods

1. An Encoder-Decoder Perspective: This is the unifying view described above: an encoder maps nodes to low-dimensional embeddings, and a decoder reconstructs information about each node’s neighborhood from those embeddings.

2. Factorization-based approaches: These include Laplacian eigenmaps, one of the earliest factorization-based approaches, introduced by Belkin and Niyogi in 2002, and inner-product methods, which optimize embeddings to capture node similarity based on adjacency or higher-order neighborhood information. Other methods such as Graph Factorization, GraRep, and HOPE were developed more recently, with publications ranging from 2013 to 2016.

3. Random walk embeddings: These methods learn embeddings by capturing the statistics of random walks over the graph, enabling the modeling of node co-occurrence patterns. DeepWalk, introduced by Perozzi et al. in 2014, pioneered the use of random walks for learning node embeddings, and node2vec, proposed by Grover and Leskovec in 2016, extended the idea with biased walks.


Before diving into DeepWalk, I would like to motivate your learning by illustrating the problem DeepWalk aims to solve: graph representation. We will use the Zachary Karate Club dataset, which is to GNNs what “Hello, world” is to programming, or what MNIST (LeCun et al.) is to computer vision.

Figure 2 — Zachary’s Karate Club

If you’ve been studying GNNs, Zachary’s Karate Club needs no introduction. For those without prior knowledge: the Zachary Karate Club dataset is a well-known social network dataset that captures the interactions among members of a university karate club. It was originally compiled by Wayne W. Zachary, a social psychologist, in the early 1970s.

In 1977, Zachary published a paper titled “An Information Flow Model for Conflict and Fission in Small Groups,” in which he analyzed the dynamics of the karate club he had been observing. The karate club experienced a significant event during Zachary’s study: a conflict between the instructor (Mr. Hi) and the club president (John A.). This conflict ultimately led to the club splitting into two separate factions, with members aligning themselves either with the instructor or the president.

Zachary’s dataset captures the social interactions between club members before and after the split.

As seen in the illustration of the graph above, it is easy to visualize the graph with its connected nodes, but bringing it into a low-dimensional space is a complex task given the sparse nature of the graph. So how do we capture this node-connectivity information despite that sparsity? This is where DeepWalk comes in!

What is DeepWalk?

DeepWalk is a graph embedding algorithm that captures node connectivity in a lower-dimensional space for downstream machine learning tasks (such as classification) using Word2Vec and random walks.

DeepWalk was published by Perozzi et al. (2014), drawing inspiration from advances in NLP, notably Word2Vec, to perform graph modeling. Thus, the main idea of DeepWalk is to apply the techniques used in language modeling to graphs.

Word2Vec for Language Modelling

I first came across Word2Vec during my Master’s dissertation, when I was trying different embedding methods for sentiment analysis on the Beauty Product Review Dataset; surprisingly, a friend of mine was working on a language modeling problem at the time too.

Word2Vec is a language modeling technique proposed by Mikolov et al. (2013) that translates words from large text datasets into vectors known as embeddings. The input text tokens are represented as one-hot encoded vectors.

The paper proposed two model architectures: CBOW (Continuous Bag of Words) and the Skip-gram model.

  • Continuous bag-of-words model: predicts the middle word based on surrounding context words. The context consists of a few words before and after the current (middle) word. This architecture is called a bag-of-words model as the order of words in the context is not important.
  • Continuous skip-gram model: predicts words within a certain range before and after the current word in the same sentence. A worked example of this is given below.
Figure 3 — CBoW and Skip-Gram Architectures

Our focus will be on the Skip-gram model, because this is what DeepWalk uses.

How the Skip-gram model works

Skip-gram produces high-quality word embeddings by training on the task of predicting the correct context words given a target word. Thus, the training objective of the model is to maximize the probability of predicting context words given a target word. For a sequence of words w1, w2, …, wT, the objective can be written as the average log probability.
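Following Mikolov et al. (2013), this objective can be written in LaTeX as:

\frac{1}{T} \sum_{t=1}^{T} \sum_{-C \le j \le C,\; j \ne 0} \log p\left(w_{t+j} \mid w_{t}\right)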

where C is the size of the training context. The basic skip-gram model uses the softmax function to calculate the probability of a context word embedding given a target word embedding.
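Again following Mikolov et al. (2013), the softmax takes the form:

p\left(w_{O} \mid w_{I}\right) = \frac{\exp\left({v'_{w_{O}}}^{\top} v_{w_{I}}\right)}{\sum_{w=1}^{W} \exp\left({v'_{w}}^{\top} v_{w_{I}}\right)}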

where v and v′ are the target and context vector representations of words, and W is the vocabulary size.

Here’s how the Skip-Gram model works:

  • Define a window size (e.g., 2) to specify the number of context words to consider.
  • For each target word in the training corpus, select a window of context words.
  • Feed the target word as input to the Skip-Gram model, which consists of an input layer, a hidden layer, and an output layer.
  • The input layer represents the target word as a one-hot vector.
  • The hidden layer learns to represent the target word as a dense vector.
  • The output layer predicts the context words based on the target word.
  • The model is trained to minimize the difference between the predicted context words and the actual context words, using back-propagation and stochastic gradient descent.
Figure 4— Word2Vec Architecture
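Before moving to a library implementation, it may help to see how the (target, context) training pairs are actually formed. Below is a minimal sketch of the windowing step (my own illustration, not code from the Word2Vec paper), which slides a window of size 2 over a toy sentence and collects the skip-gram pairs:

# Minimal sketch: generate (target, context) skip-gram pairs with a window size of 2
sentence = "the quick brown fox jumps over the lazy dog".split()
window = 2

pairs = []
for i, target in enumerate(sentence):
    # Context = up to `window` words on each side of the target word
    for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
        if j != i:
            pairs.append((target, sentence[j]))

print(pairs[:4])
# [('the', 'quick'), ('the', 'brown'), ('quick', 'the'), ('quick', 'brown')]

Each pair becomes one training example: the model is asked to predict the context word on the right given the target word on the left.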

For a practical illustration of how Word2Vec works, we can use the gensim library, which is also used in the official implementation of DeepWalk, though you can also find implementations in TensorFlow/Keras or PyTorch. I am a big fan of Josh Starmer from StatQuest, and I hope to make this short implementation fun like his YouTube videos: simple and short.

# Step 1: Install gensim and import Word2Vec class
!pip install -qU gensim
from gensim.models.word2vec import Word2Vec

# initialize text
text = """Lorem ipsum dolor sit amet, consectetur adipiscing elit. Nunc eu sem scelerisque, dictum eros aliquam, accumsan quam. Pellentesque tempus, lorem ut semper fermentum, ante turpis accumsan ex, sit amet ultricies tortor erat quis nulla. Nunc consectetur ligula sit amet purus porttitor, vel tempus tortor scelerisque. Vestibulum ante ipsum primis in faucibus orci luctus et ultrices posuere cubilia curae; Quisque suscipit ligula nec faucibus accumsan. Duis vulputate massa sit amet viverra hendrerit. Integer maximus quis sapien id convallis. Donec elementum placerat ex laoreet gravida. Praesent quis enim facilisis, bibendum est nec, pharetra ex. Etiam pharetra congue justo, eget imperdiet diam varius non. Mauris dolor lectus, interdum in laoreet quis, faucibus vitae velit. Donec lacinia dui eget maximus cursus. Class aptent taciti sociosqu ad litora torquent per conubia nostra, per inceptos himenaeos. Vivamus tincidunt velit eget nisi ornare convallis. Pellentesque habitant morbi tristique senectus et netus et malesuada fames ac turpis egestas. Donec tristique ultrices tortor at accumsan.
""".split()

# Step 2: Initialize a skip-gram Word2Vec model
model = Word2Vec([text],
                 sg=1,            # Skip-gram
                 vector_size=10,
                 min_count=0,
                 window=2,
                 workers=2,
                 seed=0)

# Step 3: Check the shape of the weight matrix
print(f'Shape of W_embed: {model.wv.vectors.shape}')

# Step 4: Train the model for 10 epochs
model.train([text], total_examples=model.corpus_count, epochs=10)

# Step 5: Print a word embedding to see the result
print('Word embedding =')
print(model.wv[0])

# Step 6: Activate H-Softmax in gensim
model_hs = Word2Vec([text],
                    sg=1,          # Skip-gram
                    hs=1,          # Hierarchical softmax (H-Softmax)
                    vector_size=10,
                    min_count=0,
                    window=2,
                    workers=2,
                    seed=0)

Output:

Word embedding =
[ 0.07156403 0.03257632 0.00209916 -0.04374931 -0.03398107 -0.08656936
-0.09047253 -0.0955243 -0.06482638 0.0660186 ]

This implementation covers the basic steps of initializing, training, and configuring the Word2Vec model for DeepWalk using gensim. It is worth noting that the full softmax is costly to train on large vocabularies (hundreds of thousands of words), since every update touches all words in the vocabulary; hierarchical softmax (H-Softmax) was proposed to reduce this cost to roughly the logarithm of the vocabulary size by organizing the output as a binary tree. Step 6 shows how to activate H-Softmax by setting hs=1 in the Word2Vec initialization.

DeepWalk

As noted earlier, DeepWalk was proposed by Perozzi et al. in 2014 and has since become very popular. Although many advances have been made since then, it remains an easy and reliable baseline thanks to its simplicity of implementation. The inspiration comes from NLP: in language models, words are represented as sparse vectors, mostly zeros with a one for the word being represented. Similarly, in graphs, vertices and their connections are encoded sparsely, with ones for connected vertices and zeros elsewhere. However, graphs do not have a sequential structure like sentences, and DeepWalk is precisely the tool that transforms these sparse graph representations into a dense feature space.

The core idea of DeepWalk is to take context windows of vertices in the graph, akin to context windows over words in sentences. By traversing the graph via random walks, sequences similar to sentences are constructed, capturing the graph’s structure and the relationships between vertices based on the network homophily hypothesis (nodes close to each other are similar).

The algorithm slides a context window over these walks and uses the Skip-gram model from Word2Vec, illustrated previously, to encode vertices into a low-dimensional space. This transformation aims to keep nodes that are close in the graph close in the embedding space, and that proximity typically reflects community membership. Thus, in the case of graphs, our “words” are nodes, and this is the reason we use random walks to produce the “sentences”.

Figure 5 — Sentences represented as graphs, image source Maxime (2023)
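To make the walks-as-sentences analogy concrete, here is a small sketch (my own illustration, not the official DeepWalk code) showing that a random walk, once written as a list of node IDs, is consumed by the same skip-gram Word2Vec model we used for text; the two walks below are hard-coded purely for illustration:

from gensim.models.word2vec import Word2Vec

# Two hard-coded "walks" over an imaginary graph, written as sentences of node IDs
walks = [['0', '1', '0', '9', '4', '5'],
         ['3', '6', '7', '3', '6', '5']]

# The same skip-gram model used for text consumes them without modification
model = Word2Vec(walks, sg=1, vector_size=8, window=2, min_count=0, seed=0)
print(model.wv['0'])  # dense embedding learned for node '0'

The real pipeline, implemented step by step below, simply replaces the hard-coded walks with walks sampled from an actual graph.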

Step-by-step Implementation of Random Walk

  1. Graph Generation: A random graph with 10 nodes is generated using the Erdos-Renyi model, which creates edges between nodes with a probability of 0.3. This graph is visualized using matplotlib, providing a visual representation of the randomly generated structure.
  2. Random Walk Implementation: A function named random_walk is defined, which simulates a random walk on the graph. It takes two parameters: the starting node (start) and the length of the walk (length). At each step, the function randomly selects a neighboring node from the current node's neighbors using np.random.choice and appends it to the walk. This process continues until the specified length of the walk is reached.
  3. Execution of Random Walk: The random_walk function is called with a starting node of 0 and a walk length of 10. The resulting walk sequence is printed, showing the nodes visited during the random walk.
  4. Analysis: The resulting list from the random walk reveals the sequence of nodes visited, indicating how certain nodes are often found together. This observation, such as nodes 0 and 9 appearing together, suggests similarities between these nodes in the graph’s structure. This homophilic relationship between nodes is what DeepWalk aims to capture, making it suitable for graph-based representation learning tasks.

Below is the code implementation of the random walk described above:

# Import required libraries
import networkx as nx
import matplotlib.pyplot as plt
import numpy as np
import random

# Initialize random number generator for reproducibility
random.seed(0)

# Generate a random graph with 10 nodes and probability of edge creation as 0.3
G = nx.erdos_renyi_graph(10, 0.3, seed=1, directed=False)

# Plot the random graph
plt.figure(dpi=300)
plt.axis('off')
nx.draw_networkx(G,
                 pos=nx.spring_layout(G, seed=0),
                 node_size=600,
                 cmap='coolwarm',
                 font_size=14,
                 font_color='white'
                 )
plt.show()

# Function to implement random walks on the graph
def random_walk(start, length):
    # Initialize the walk with the starting node
    walk = [str(start)]

    # Perform random walk for the specified length
    for i in range(length):
        # Get neighbors of the current node
        neighbors = [node for node in G.neighbors(start)]

        # Randomly select the next node from neighbors
        next_node = np.random.choice(neighbors, 1)[0]

        # Append the next node to the walk
        walk.append(str(next_node))

        # Update the current node for the next step
        start = next_node

    return walk

# Execute random walk with starting node 0 and length 10
print(random_walk(0, 10))

Output:

['0', '1', '0', '9', '4', '5', '6', '3', '6', '7', '3']

As we can observe, nodes such as 0 and 9 appear next to each other in the walk; according to the homophily hypothesis, this suggests that they are similar.

Implementation of DeepWalk

We now have all the building blocks to implement DeepWalk and to solve the problem illustrated earlier with the Zachary Karate Club. Using random walks and Word2Vec with Skip-gram, we will create node embeddings of the Karate Club and subsequently use an ML algorithm for a classification task.

We proceed to explore Zachary’s Karate Club, a network of connections and affiliations studied by the intrepid Wayne W. Zachary in the 1970s, which offers insights into how people interact and form groups.

I. Getting Started with Random Walk:

G = nx.karate_club_graph()
labels = []
for node in G.nodes:
    label = G.nodes[node]['club']
    labels.append(1 if label == 'Officer' else 0)

Let’s proceed to visualize the network:

plt.figure(figsize=(12,12), dpi=300)
plt.axis('off')
nx.draw_networkx(G,
                 pos=nx.spring_layout(G, seed=0),
                 node_color=labels,
                 node_size=800,
                 cmap='coolwarm',
                 font_size=14,
                 font_color='white'
                 )
plt.show()

Exploring Connections: Now, let’s explore how members interact within the club. We take walks through the network, randomly moving from one member to another:

walks = []
for node in G.nodes:
    for _ in range(80):
        walks.append(random_walk(node, 10))

We can peek into one of these walks:

print(walks[0])

This walk shows the path taken by a member through the club.

II. Modeling the Network with Word2Vec

To understand the relationships better, we use Word2Vec. It helps us find patterns in the walks we took:

model = Word2Vec(walks,
                 hs=1,   # Hierarchical softmax
                 sg=1,   # Skip-gram
                 vector_size=100,
                 window=10,
                 workers=2,
                 seed=0)

We train our model to learn from the walks:

model.train(walks, total_examples=model.corpus_count, epochs=30, report_delay=1)

Finally, let’s see what our model tells us. We can find which members are most similar to a chosen one:

print('Nodes that are the most similar to node 0:')
for similarity in model.wv.most_similar(positive=['0']):
    print(f' {similarity}')

Output: Below we have the output of nodes that are most similar to node 0.

Nodes that are the most similar to node 0:
('17', 0.6424034237861633)
('4', 0.5844610929489136)
('10', 0.5792854428291321)
('3', 0.5659441351890564)
('11', 0.5636734366416931)
('5', 0.5312074422836304)
('6', 0.5308884978294373)
('7', 0.5130414962768555)
('1', 0.5081695318222046)
('16', 0.5029702186584473)

Now, let’s take our exploration further by visualizing node classes and using machine learning to classify them. To begin, let’s visualize the high-dimensional embeddings of our nodes in 2D using t-SNE. This technique helps us see how nodes are grouped based on their similarities.

III. Visualizing Node Classes with t-SNE

from sklearn.manifold import TSNE

# Create arrays for embeddings and labels
nodes_wv = np.array([model.wv.get_vector(str(i)) for i in range(len(model.wv))])
labels = np.array(labels)

# Train t-SNE model
tsne = TSNE(n_components=2,
            learning_rate='auto',
            init='pca',
            random_state=0).fit_transform(nodes_wv)

# Plot the 2D vectors with labels
plt.figure(figsize=(6, 6), dpi=300)
plt.scatter(tsne[:, 0], tsne[:, 1], s=100, c=labels, cmap="coolwarm")
plt.show()

We can observe from the plot that there are two classes with a visible separation between them, giving us valuable insight into the structure of our network.

Figure 6 — t-SNE plot of nodes in Zachary’s Karate Club

Given that we already have a lower-dimensional representation of the network, we can proceed to use ML algorithms for downstream tasks such as classification. This brings us back to the problem posed at the beginning of this article with the Karate Club, which motivated you to learn more about DeepWalk.

IV. Classifying Node Classes with Random Forest

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Define train and test data masks
train_mask = [0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28]
test_mask = [1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29, 30, 31, 32, 33]

# Train Random Forest classifier
clf = RandomForestClassifier(random_state=0)
clf.fit(nodes_wv[train_mask], labels[train_mask])

# Evaluate classifier
y_pred = clf.predict(nodes_wv[test_mask])
accuracy = accuracy_score(y_pred, labels[test_mask])

print(f"Accuracy Score: {accuracy}")

Our classifier achieves an impressive accuracy score of 94.73%, showcasing the effectiveness of DeepWalk in classifying nodes based on their embeddings. This demonstrates the power of DeepWalk in both unsupervised learning, by discovering similarities between nodes, and supervised learning, by using embeddings for classification tasks.

In the previous article, we developed our foundational knowledge of some GNN concepts, and in this article, we dove into the architecture of DeepWalk, a foundational graph representation algorithm. In the next article, we’ll leverage our understanding of DeepWalk to discover interesting ways to improve the quality of our embeddings with biased random walks, leading to Node2Vec. I guess you are as excited as I am, but it’s worth going for a walk now and later jumping to YouTube to watch a video about Node2Vec while you patiently wait for the next episode of our journey into GNNs. 🚀

Previous article: Exploring Graph Properties and Their Role in Graph Neural Networks

References:

1. Grover, A., & Leskovec, J. (2016). node2vec: Scalable Feature Learning for Networks. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ’16) (pp. 855–864). ACM.

2. LeCun, Y., et al. (1998). Gradient-based learning applied to document recognition. Source: https://ieeexplore.ieee.org/document/726791

3. Maxime L. (2023). Hands-On Graph Neural Networks Using Python, Chapter 3: Creating Node Representations with DeepWalk.

4. TensorFlow: Word2Vec tutorial. Source: https://www.tensorflow.org/text/tutorials/word2vec

5. Perozzi, B., Al-Rfou, R., & Skiena, S. (2014). DeepWalk: Online Learning of Social Representations. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ’14) (pp. 701–710). ACM.

6. William L. (2020). Graph Representation Learning. Source: https://www.cs.mcgill.ca/~wlh/grl_book/files/GRL_Book.pdf

7. Zachary’s Karate Club, Wikipedia. Source: https://en.wikipedia.org/wiki/Zachary%27s_karate_club

8. Implementing word2vec in PyTorch, Mushai W. (2021). Source: https://muhark.github.io/python/ml/nlp/2021/10/21/word2vec-from-scratch.html

Code:

Google Colab: https://colab.research.google.com/drive/1T47tz2PreecUNktRAk-TAwmeFGbDSrEy?usp=sharing

Abbreviations:

MNIST: Modified National Institute of Standards and Technology

NLP: Natural Language Processing
