Footprints in Graphs: The Enchantment of DeepWalk

mayank khulbe
The Good Food Economy

To begin, let’s set the stage with a captivating story.

In the bustling streets of a virtual city, Alex, a digital wanderer, took deliberate steps through the intricacies of interconnected pathways. Each choice at a crossroad led to new encounters and unexpected discoveries, mimicking the unpredictable nature of a random walk. This journey through the virtual alleys reflects the essence of graph theory, a mathematical framework that explores the relationships between nodes and edges within complex networks.

Enter the concept of a random walk: exploring a network by moving from node to node in a stochastic manner. This idea is fundamental to understanding the patterns hidden within these intricate systems.

Now, imagine if Alex’s exploration could be not just random but purposeful, revealing hidden structures and connections intentionally. Enter DeepWalk, a groundbreaking algorithm in graph representation learning that systematically explores networks, unveiling latent relationships and capturing the essence of structural information.

In this blog, we embark on a journey inspired by Alex’s virtual footsteps. We’ll delve into the intricacies of random walks in graph theory and explore the revolutionary DeepWalk algorithm and its transformative impact on our understanding of interconnected nodes and edges. Also, we will use Python to understand graphs and finally implement DeepWalk using Python.

Let’s commence our exploration by delving into the foundations of graph theory.

Understanding Graph Theory

Graphs in Graph Theory

At its core, graph theory unveils the secrets hidden within networks, where entities are represented by nodes/vertices and the relationships between them are defined by edges.

Imagine nodes as distinct points, each encapsulating an entity or element in our network. These nodes symbolise anything from individuals in a social network to locations in a city grid or even molecules in a chemical compound. Now, envision edges as the threads weaving connections between these nodes, representing the relationships that bind them together.

Directed and Undirected graphs

In graph theory, we distinguish between two types of graphs: directed and undirected.

A directed graph, or digraph, is a type of graph where edges have a direction or orientation. Each edge connects two nodes but has a specific starting and ending point, indicating a one-way relationship. This directionality is often represented by arrows.

On the other hand, an undirected graph is a type of graph where edges have no direction. The relationships between nodes are symmetric, meaning that if node A is connected to node B, then node B is also connected to node A. Undirected graphs are often used to represent symmetrical relationships like friendships or connections in social networks.

Directed and undirected graphs

The depicted image effectively illustrates the distinction between directed and undirected graphs. In the context of social connections, the relationship between User 1 and User 2 on Instagram, where User 1 follows User 2, is asymmetrical. This asymmetry signifies a directed graph, as the relationship is not necessarily reciprocated.

Conversely, in the case of Facebook friendships between User 1 and User 2, the bidirectional nature of the connection is apparent. If User 1 is friends with User 2, the reverse is also true. This mutual relationship characterises an undirected graph, emphasising the symmetry inherent in the connections within the network.
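
To make the distinction concrete, here is a minimal sketch (the user names are illustrative, not taken from the figure) of how both kinds of graphs can be built with networkx:

#directed graph: User 1 follows User 2 on Instagram, which is not necessarily reciprocated
import networkx as nx

follows = nx.DiGraph()
follows.add_edge("User 1", "User 2")   #one-way edge

#undirected graph: a Facebook friendship is mutual by definition
friends = nx.Graph()
friends.add_edge("User 1", "User 2")   #symmetric edge

print(follows.has_edge("User 2", "User 1"))  #False: the follow is not reciprocated
print(friends.has_edge("User 2", "User 1"))  #True: the friendship goes both ways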

Understanding Skip-Gram

You may wonder why Skip-Gram, the model at the heart of the DeepWalk paper, is relevant in the realm of graph theory. What makes the skip-gram model a crucial player in deciphering the intricate dance of interconnected nodes is that it learns an embedding space in which nodes that are closer to each other in the graph end up with more similar vectors.

Intrigued by how this transpires? You’re in for a treat as we navigate through the intricacies of this blog, leaving no stone unturned. So, sit back, relax, and let’s embark on a journey that unravels the magic within interconnected nodes.

CBOW and Skip-gram

At its core, skip-gram is a type of neural network architecture that learns to predict the context (surrounding nodes) of a given node within a graph. In the context of DeepWalk, skip-gram endeavours to capture the essence of structural information by navigating through the connections. In other words, Skip-gram represents each node with an n-dimensional embedding, learned during model training in such a way that nodes that are closer in the graph end up closer in the embedding space.
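
For intuition, here is a small illustrative sketch (not part of the original implementation) of the (centre, context) pairs a skip-gram model with a window of 2 would be trained on for a single short walk:

#illustrative only: skip-gram training pairs extracted from one walk
walk = [3, 7, 2, 9, 5]   #a hypothetical node sequence
window = 2

pairs = []
for i, center in enumerate(walk):
    for j in range(max(0, i - window), min(len(walk), i + window + 1)):
        if j != i:
            pairs.append((center, walk[j]))

print(pairs)
#for centre node 7 this yields (7, 3), (7, 2) and (7, 9): the model learns to predict context nodes from the centre node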

If you’re eager to delve deeper into the magic of Skip-Gram and are keen on crafting your own Skip-Gram model from the ground up, check out my previous blog where I guide you through the process of implementing Skip-Gram from scratch using Python.

Understanding DeepWalk

DeepWalk? Sounds very similar to random walk, right? So is the intuition.

Random walk lies at the very core of the DeepWalk algorithm as it serves as the dynamic exploration mechanism, unravelling the hidden intricacies within the network. In the realm of DeepWalk, each random walk can be envisioned as a virtual journey undertaken from one node to another capturing the local neighbourhood structure and extracting valuable information about the relationships between nodes.

Theoretically, DeepWalk initiates by selecting a starting node within the graph and embarking on a random walk, traversing a path to one of its neighbouring nodes, chosen at random for each step. This iterative process continues until the desired length of the sequence is achieved. By incorporating these sequences into its learning process, DeepWalk, inspired by skip-gram, transforms the seemingly arbitrary movements of virtual wanderers into a structured approach for generating embeddings.
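
For reference, here is a minimal sketch of this outer loop, assuming the generate_random_walk helper we implement later in this post (the DeepWalk paper performs several walks per node; the parameter names here are placeholders):

#illustrative sketch of DeepWalk's corpus generation: several walks per starting node
import numpy as np

def build_corpus(graph, walks_per_node, walk_length):
    corpus = []
    for _ in range(walks_per_node):
        nodes = list(graph.nodes())
        np.random.shuffle(nodes)          #visit the start nodes in a random order on each pass
        for node in nodes:
            corpus.append(generate_random_walk(node, walk_length, graph))
    return corpus                         #these sequences are then fed to the skip-gram model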

Karate Club Graph

The displayed image features a renowned graph derived from the Karate Club data. Observing the graph, it’s evident that nodes in close proximity within the graph also exhibit proximity in the embedding space.
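
If you want to reproduce a graph like the one above, the Zachary Karate Club graph ships with networkx; a minimal sketch:

#the Karate Club graph is bundled with networkx
import networkx as nx
import matplotlib.pyplot as plt

KG = nx.karate_club_graph()
nx.draw_networkx(KG, nx.spring_layout(KG, seed=0), node_size=300, font_size=8)
plt.show()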

Eager to experience this magic firsthand? Your anticipation is about to be fulfilled as we dive into the implementation of the DeepWalk algorithm in Python.

DeepWalk implementation in Python

Before we delve into the depths of DeepWalk, make sure that you have the networkx library installed, which is used for the creation, manipulation, and study of graphs, digraphs, and multigraphs.

Let’s start by plotting a random graph in Python.

#importing the common libraries

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from ipywidgets import interact, widgets
from sklearn.decomposition import PCA
import os
import json

#ignore the warnings
import warnings
warnings.filterwarnings('ignore')

#importing the library for graphs
import networkx as nx

# Generate an Erdos-Renyi graph
n = 30
p = 0.15
G = nx.erdos_renyi_graph(n=n, p=p, seed=10) #this graph takes in the number of nodes and the probability
nx.draw_networkx(G, nx.spring_layout(G, seed=0))
Random Graph

As we can see, we have generated a random graph using the function erdos_renyi_graph in the networkx library. This function takes two main parameters: ‘n’, the number of nodes, and ‘p’, the probability that any two nodes are connected by an edge.

The lower the value of ‘p’, the sparser the graph; the higher the value of ‘p’, the denser the graph.

Erdos-Renyi graphs for extreme values of ‘p’
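
If you want to verify this relationship between ‘p’ and connectivity yourself, a quick illustrative check of the edge density for a few values of ‘p’ looks like this:

#quick check: the edge density grows with 'p'
for p_value in [0.05, 0.15, 0.5, 0.9]:
    G_p = nx.erdos_renyi_graph(n=30, p=p_value, seed=10)
    print(p_value, nx.density(G_p))   #the density is roughly equal to the chosen 'p'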

Now that we have the graph ready, let’s write a function to generate a random walk.

#function to create a random walk of a given length from a start node
def generate_random_walk(start_node, walk_length, graph):
    walk_sequence = [start_node]
    for _ in range(walk_length):
        #collect the neighbours of the current node
        neighbors = [neighbor_node for neighbor_node in graph.neighbors(start_node)]
        if not neighbors:
            break
        #pick one neighbour uniformly at random and move to it
        next_node = np.random.choice(neighbors, 1)[0]
        walk_sequence.append(next_node)
        start_node = next_node
    return walk_sequence

The generate_random_walk function simulates a random walk on a given graph. Taking a starting node, walk length, and the graph structure as input parameters, the function generates a sequence of nodes representing the random walk.

At each step, the function identifies the neighbours of the current node in the graph and randomly selects one as the next destination for the walk. This process repeats for the specified walk length. Let’s highlight a random walk in the graph (highlighted with green).
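
The plotting code for the highlighted walk is not shown here; one possible way to draw it, assuming the graph G and the function above, is:

#one possible way to highlight a random walk in green (illustrative)
walk = generate_random_walk(start_node=0, walk_length=10, graph=G)
walk_edges = list(zip(walk, walk[1:]))   #consecutive node pairs form the walk's edges

pos = nx.spring_layout(G, seed=0)
nx.draw_networkx(G, pos)
nx.draw_networkx_nodes(G, pos, nodelist=walk, node_color='green')
nx.draw_networkx_edges(G, pos, edgelist=walk_edges, edge_color='green', width=2)
plt.show()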

Highlighted random walk

Once we have the function to create the sequences handy, it is time to use the Skip-gram implementation from the gensim library (the Word2Vec class with sg=1). The model first initialises an embedding for each node and then learns these embeddings during training, so that nodes that are closer in the above graph are mapped closer in the embedding space.

#importing the Skip-gram implementation from gensim
from gensim.models import Word2Vec

#function to train a word2vec (skip-gram) model on the walk sequences
def train_word2vec(sequences, vector_size, graph):
    model = Word2Vec(vector_size=vector_size,
                     window=2,
                     sg=1,          #sg=1 selects the skip-gram architecture
                     min_count=1,
                     seed=1)
    model.build_vocab(sequences)
    model.train(sequences, total_examples=model.corpus_count, epochs=100, report_delay=1)

    #map every node of the graph to its learned embedding vector
    node_embeddings = {node: model.wv[node] for node in list(graph.nodes())}

    return node_embeddings

The presented function follows a familiar pattern for training Skip-gram models commonly used in Natural Language Processing (NLP). In this context, the input sequences resemble tokenised text data typically processed in NLP tasks. The function employs the Word2Vec Skip-gram model, leveraging parameters such as vector size, window size, and seed for training on the given sequences. The resultant model is then used to generate embeddings for each node in the graph, forming a dictionary with nodes as keys and their corresponding embeddings as values.
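
Putting the two functions together on the graph we generated earlier, an illustrative call might look like this (the walk length and the number of walks are arbitrary choices):

#illustrative usage of the two functions defined above
sequences = [generate_random_walk(np.random.choice(n), walk_length=10, graph=G) for _ in range(100)]
node_embeddings = train_word2vec(sequences, vector_size=100, graph=G)

print(node_embeddings[0].shape)   #(100,): one 100-dimensional vector per node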

Now, as you can see, the embedding vectors created are of size 100. So, a more pragmatic approach to visualisation involves reducing the dimensions to enhance interpretability. The primary motivation behind this reduction is to create a scatter plot that captures the intrinsic structure of the embeddings in a visually comprehensible manner. The ensuing scatter plot serves as a powerful visualisation tool, allowing for a concise yet informative representation of the graph embeddings in a reduced-dimensional space.

Any idea what we have next in the picture to reduce the dimensions of the embeddings? Right, it’s Principal Component Analysis.

So, let’s reduce the embedding dimension from 100 to 2 and plot the embeddings in 2-D space using a scatter plot.

#reduce the dimension of the embeddings from 100 to 2
def dimension_reduction_using_pca(node_embeddings):
    pca = PCA(n_components=2)
    node_embeddings_2d = pca.fit_transform(list(node_embeddings.values()))

    return node_embeddings_2d
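
As a quick, optional sanity check (not part of the original post), you can also inspect how much of the variance the two principal components retain; node_embeddings here is the dictionary produced by train_word2vec:

#optional sanity check on the PCA projection (illustrative)
pca = PCA(n_components=2)
node_embeddings_2d = pca.fit_transform(list(node_embeddings.values()))
print(pca.explained_variance_ratio_)   #fraction of variance captured by each of the two components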

Now, since we have all the functions ready, let’s write a final function to create an interactive widget in Python to create the graph and visualise the embeddings of the nodes in a 2-D space.

def generate_graph_and_visualize(num_nodes, probability, walk_length, vector_size):
    # Generate an Erdos-Renyi graph
    G = nx.erdos_renyi_graph(num_nodes, probability, seed=10)

    # Generate 100 random walk sequences, each starting from a randomly chosen node
    sequences = []
    for _ in range(100):
        sequences.append(generate_random_walk(np.random.choice(num_nodes), walk_length, G))

    # Train the word2vec (skip-gram) model on the walks
    embeddings = train_word2vec(sequences, vector_size, G)

    # Plot the graphs in a subplot
    fig, axes = plt.subplots(1, 2, figsize=(15, 5))

    # Plot the original graph
    axes[0].set_title("Erdos-Renyi Graph")
    nx.draw_networkx(G, nx.spring_layout(G, seed=0), with_labels=True, ax=axes[0])

    # Plot the node embeddings after reducing them to 2D with PCA
    axes[1].set_title("Node Embeddings from SkipGram Model (2D PCA)")
    node_embeddings_2d = dimension_reduction_using_pca(embeddings)
    for node, (x, y) in zip(G.nodes(), node_embeddings_2d):
        axes[1].scatter(x, y, color='blue')
        axes[1].text(x, y, str(node), fontsize=8, ha='right', va='bottom')

    plt.show()

# Create sliders to control the graph and the embedding model interactively
num_nodes_slider = widgets.IntSlider(value=30, min=10, max=50, step=1, description='Num Nodes:')
probability_slider = widgets.FloatSlider(value=0.15, min=0, max=1, step=0.01, description='Probability:')
walk_length_slider = widgets.IntSlider(value=10, min=1, max=20, step=1, description='Walk Length:')
vector_size_slider = widgets.IntSlider(value=100, min=10, max=200, step=10, description='Vector Size:')

# Create the interactive widget
interact(generate_graph_and_visualize,
         num_nodes=num_nodes_slider,
         probability=probability_slider,
         walk_length=walk_length_slider,
         vector_size=vector_size_slider);

The provided function, generate_graph_and_visualize, seamlessly integrates the creation and visualisation of an Erdos-Renyi graph, random walk sequences, and the subsequent Skip-gram model embeddings. It enhances the interpretability of the graph structure by incorporating a dimensionality reduction step using Principal Component Analysis (PCA). Below is a concise overview of the function’s features:

1. Erdos-Renyi Graph Creation:
  • Utilises NetworkX to generate an Erdos-Renyi graph based on user-specified parameters such as the number of nodes (num_nodes) and the probability of edge creation (probability).

2. Random Walk Sequence Generation:
  • Creates 100 random walk sequences on the generated graph, with the length of each walk determined by the walk_length parameter.

3. Word2Vec Model Training:
  • Utilises the train_word2vec function to train a Skip-gram model on the generated random walk sequences, obtaining embeddings for each node in the graph.

4. Graph Visualisation:
  • Plots the original Erdos-Renyi graph in one subplot, providing an overview of the graph’s structure.
  • In a second subplot, applies PCA to reduce the dimensionality of the embeddings to 2D and plots the node embeddings. Each node is represented by a point in the scatter plot, enhancing the visual exploration of inter-node relationships.

5. Interactive Parameter Adjustment:
  • Includes slider widgets (num_nodes_slider, probability_slider, walk_length_slider, and vector_size_slider) that allow users to dynamically set the number of nodes, probability, walk length, and vector size, respectively.

Now, let’s visualise the output of the above function.

Node embeddings

The visual representation above elegantly encapsulates the core principles of DeepWalk and underscores the significance of Skip-gram within the algorithm.

Notably, it vividly illustrates how nodes, specifically those labelled as 0, 4, 23, 28, and 29, maintain a close spatial relationship in both the original graph and the resulting embedding space. This alignment signifies that nodes with proximity in the graph also exhibit closeness in the embedded feature space, highlighting the effectiveness of Skip-gram in capturing and preserving the structural relationships within the network.
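
You can also quantify this observation. Here is an illustrative sketch that compares the cosine similarity of a nearby pair of nodes with that of a more distant pair, using the node_embeddings dictionary from earlier (the node ids are taken from the plot above and will differ from run to run):

#illustrative check: cosine similarity between node embeddings
def cosine_similarity(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

print(cosine_similarity(node_embeddings[4], node_embeddings[23]))   #nodes close in the graph: expected to be high
print(cosine_similarity(node_embeddings[4], node_embeddings[15]))   #a more distant pair: typically lower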

This brings us to the end of the blog. I have tried to be as detailed as possible to make this blog easy to follow, and I hope it lived up to the expectations set at the start: we touched upon the concepts of graph theory while diving deep into the theory and implementation of DeepWalk.

If you feel any need to visit the code, feel free to click the Repo link.

Also, you can read more of my interesting articles/blogs.

I will be back with more such interesting blogs to help the community at large. Until then,

Happy Reading ;)
