# WikiNet — An Experiment in Recurrent Graph Neural Networks

## By Alexander Hurtado + Aditya Iswara as part of the Stanford CS224W course project (Fall 2021)

# Introduction

In this blog post, we discuss our application of modern graph machine learning techniques to the Wikispeedia navigation paths dataset. We tackle the task of predicting a player’s target Wikipedia page given the path of their previously visited pages.

A similar problem has been tackled before by West & Leskovec without the use of Graph Neural Networks [1], and by Cordonnier & Loukas using a Graph Convolutional Network with a non-backtracking random walk on the Wikispeedia graph [2]. Our technique is distinct from both papers and shows some promising results.

You can find our processed dataset on GitHub and run our model on Colab.

# Data + Problem Statement

Our data comes from Stanford Network Analysis Project’s (SNAP) dataset collection. The dataset captures navigation paths on the Wikipedia hyperlink network as collected by humans playing Wikispeedia. Wikispeedia, for the uninitiated, is a version of the game known more generally as Wikiracing. The rules of the game are simple — players compete in a race to navigate from a source article on Wikipedia to a set target article by traversing the hyperlinks connecting one page to another.

So what’s our task? Given the Wikipedia page paths created by thousands of users playing Wikispeedia, can we predict a player’s target article if we know the sequence of pages the user has visited so far? And can we do so using the expressiveness afforded by Graph Neural Networks?

# Data Pre-processing

Preparing our dataset for use in graph machine learning required a nontrivial amount of preprocessing. Our first goal was to represent our data as a directed graph with Wikipedia articles as our nodes and the hyperlinks connecting articles as our edges. For this step, we relied on NetworkX, a Python network analysis library, and the previous work of Cordonnier & Loukas [2]. To our benefit, Cordonnier & Loukas had graciously processed and encoded the hyperlink graph structure from the SNAP dataset in a Graph Modeling Language (GML) file which we can easily import into NetworkX.

From this point, our next goal was to process data from Cordonnier & Loukas’ repository and the original SNAP dataset to add node-level attributes to each article in our NetworkX graph. For each node, we wanted to include the page’s in-degree and out-degree and a notion of the article’s content as represented by the average of the FastText pre-trained word embeddings corresponding to each word in the article. Again, Cordonnier & Loukas had kindly processed and encoded each page’s degree and corresponding article embeddings in text files. Furthermore, we wanted to add a notion of each article’s hierarchical categorization (e.g. the “Cat” page is categorized under Science > Biology > Mammals) to its corresponding node. For this feature, we processed category data from the SNAP dataset using pandas to parse tab-separated values to yield a multi-hot vector for each article describing which categories the article falls under. We were able to add our new article attributes to each corresponding node in our NetworkX graph by using the library’s set_node_attributes method.
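The category-processing step above can be sketched as follows. This is a minimal illustration, not our actual pipeline: the toy DataFrame stands in for the parsed tab-separated category file, and the two-node graph stands in for the real hyperlink graph.

```python
import networkx as nx
import pandas as pd

# Hypothetical stand-in for the parsed SNAP category data: one
# (article, category) row per pair, as read from tab-separated values.
df = pd.DataFrame({
    "article":  ["Cat", "Cat", "Dog"],
    "category": ["subject.Science.Biology.Mammals",
                 "subject.Science.Biology",
                 "subject.Science.Biology.Mammals"],
})

# One column per category; an entry is 1 if the article falls under it.
multi_hot = pd.crosstab(df["article"], df["category"]).clip(upper=1)

# Attach each article's multi-hot vector to its node in the graph.
G = nx.DiGraph([("Cat", "Dog")])  # toy graph standing in for the real one
nx.set_node_attributes(
    G, {a: row.to_numpy() for a, row in multi_hot.iterrows()}, name="categories"
)
```

The same `set_node_attributes` call works for the degree and embedding features, using a different `name` per attribute.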

Now, our NetworkX graph is locked, loaded, and ready to go! However, we still needed to process and define our input data and labels — that is, our navigation paths and target articles. Much like before, we used pandas to parse tab-separated values of finished navigation paths from the SNAP dataset. We then processed each navigation path to remove back clicks (created by Wikispeedia players navigating from their current page back to their immediately previous page) and chopped off the last article in each path to yield our target articles (i.e., data labels). Finally, we tossed any navigation path longer than 32 hyperlink clicks and padded each remaining path to a length of 32 to yield our cleaned navigation paths (i.e., data inputs).
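The path-cleaning steps above can be sketched like this. The `"<"` marker for a back click and the helper names are illustrative assumptions about the raw format, and the -1 padding index matches the model description later in the post.

```python
MAX_LEN = 32
PAD = -1  # padding index for short paths

def resolve_back_clicks(path):
    """Remove back clicks, assumed to appear as "<" in the raw path."""
    stack = []
    for article in path:
        if article == "<":
            stack.pop()   # undo the previous click
        else:
            stack.append(article)
    return stack

def make_example(path, node_index):
    """Return (padded input path, target index), or None if too long."""
    ids = [node_index[a] for a in resolve_back_clicks(path)]
    inp, target = ids[:-1], ids[-1]        # last article is the label
    if len(inp) > MAX_LEN:
        return None                        # discard overly long paths
    return [PAD] * (MAX_LEN - len(inp)) + inp, target
```

For example, the path `Cat → Dog → back → Mammal` resolves to `Cat → Mammal`, yielding `Mammal` as the target and a padded one-step input path.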

All in all, our processed dataset yielded over 50,000 navigation paths spanning 4,000+ different Wikipedia articles.

# WikiNet Model Architecture

Our novel approach towards predicting target articles in Wikispeedia given navigation paths attempts to combine the capacity of recurrent neural networks (RNNs) to capture sequential knowledge with the capacity of graph neural networks (GNNs) to express intuitions of a graph’s network structure. We present **WikiNet — an RNN-GNN hybrid**.

As its input, WikiNet takes a batch of page navigation paths, each represented as a sequence of node indices. Each navigation path was padded to a length of 32; we padded short sequences from the start with the index -1. We then used a graph neural network to take our existing node attributes and generate node embeddings of size 64 for each Wikipedia page in the hyperlink graph. A tensor of 0s was used as the node embedding for missing nodes (e.g., the padding “nodes” represented by index -1).

After generating our batched paths as sequences of node embeddings, we feed this tensor into a batch normalization layer to stabilize training. Then, we feed the tensor into a recurrent neural network — a Long Short-Term Memory (LSTM) model in our case. Another batch normalization layer is applied to the output of the RNN before the tensor is fed to a final linear layer. This linear layer projects the RNN output into one of 4064 classes — one for each potential target article. Finally, a log softmax function is applied to the scores.
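The sequence head described above can be sketched in PyTorch. This is a simplified reconstruction, not our exact code: the hidden size of 128 is an assumed value, and the GNN that produces the 64-dimensional node embeddings is omitted (`x` below is already the batch of embedded paths, with zero vectors in padded positions).

```python
import torch
import torch.nn as nn

EMB_DIM, HIDDEN, N_CLASSES = 64, 128, 4064  # HIDDEN is an assumed size

class WikiNetHead(nn.Module):
    """BatchNorm -> LSTM -> BatchNorm -> Linear -> log-softmax."""

    def __init__(self):
        super().__init__()
        self.bn1 = nn.BatchNorm1d(EMB_DIM)
        self.lstm = nn.LSTM(EMB_DIM, HIDDEN, batch_first=True)
        self.bn2 = nn.BatchNorm1d(HIDDEN)
        self.out = nn.Linear(HIDDEN, N_CLASSES)

    def forward(self, x):                        # x: (batch, 32, 64)
        # BatchNorm1d normalizes over the feature dim, so swap (seq, feat).
        x = self.bn1(x.transpose(1, 2)).transpose(1, 2)
        _, (h_n, _) = self.lstm(x)               # final hidden: (1, batch, HIDDEN)
        h = self.bn2(h_n.squeeze(0))
        return torch.log_softmax(self.out(h), dim=-1)

model = WikiNetHead().eval()                     # eval() for a tiny demo batch
scores = model(torch.randn(4, 32, EMB_DIM))     # (4, 4064) log-probabilities
```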

# WikiNet Graph Neural Network variations

We will discuss the three graph neural network flavors we used to generate node embeddings in our WikiNet experiments — Graph Convolutional Networks, Graph Attention Networks, and GraphSAGE.

First, let’s discuss how graph neural networks function generally. In a graph neural network, the key idea is to generate an embedding for each node based on its local neighborhood. Namely, we can propagate information to each node from its neighboring nodes.

Consider the computation graph for a given input graph, where *x_u* represents the features of node *u*. A simple example has 2 layers, though GNN computation graphs can be of arbitrary depth. We call the output at layer *n* for a node *u* that node’s “layer-*n* embedding.”

Initially, at layer-0, each node’s embedding is given by its initial node features *x*. At a high level, layer-*k* embeddings are generated from layer-(*k*-1) embeddings by aggregating the layer-(*k*-1) embeddings of each node’s set of neighbors. This motivates a 2-step process for each GNN layer:

- Message Computation
- Aggregation

In message computation, we pass the layer-*k* embedding of a node through a “message function.” Then, in aggregation, we aggregate the messages from the neighbors of a node — and from the node itself — using an “aggregation function.” More concretely, we have that:
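Writing *h_v^(k)* for node *v*’s layer-*k* embedding, one standard way to express this two-step update is:

```math
h_v^{(k+1)} = \mathrm{AGG}\left( \left\{ \mathrm{MSG}\left(h_u^{(k)}\right) : u \in N(v) \cup \{v\} \right\} \right)
```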

**Graph Convolutional Networks (GCN)**

A simple and intuitive approach to message computation is to use a neural network. For aggregation, we can simply take the mean of the neighbor nodes’ messages. In the GCN, we will also use a bias term to aggregate the embedding of the node itself from the previous layer. Finally, we pass this output through an activation function (e.g. ReLU). Thus, the equation looks something like this:
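With *σ* denoting the activation function and *N(v)* the neighbors of node *v*, the standard GCN update can be written as:

```math
h_v^{(k+1)} = \sigma\left( W_k \sum_{u \in N(v)} \frac{h_u^{(k)}}{|N(v)|} + B_k\, h_v^{(k)} \right)
```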

Here, we can train the parameter matrices *W_k* and *B_k* as part of a machine learning pipeline. [3]

**GraphSAGE**

GraphSAGE differs from GCN in several ways. In this model, messages are computed within the aggregation function, which is made up of 2 stages. First, we aggregate over a node’s neighbors — using mean aggregation in our case. Then, we aggregate over the node itself by concatenating the node’s previous layer embedding. This concatenation is multiplied by a weight matrix *W_k* and passed through an activation function to get the output [4]. The overall equation for computing a layer-(*k+1*) embedding looks like:
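In the same notation, the standard GraphSAGE update with mean aggregation can be written as:

```math
h_v^{(k+1)} = \sigma\left( W_k \cdot \mathrm{CONCAT}\left( h_v^{(k)},\ \mathrm{MEAN}_{u \in N(v)}\, h_u^{(k)} \right) \right)
```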

**Graph Attention Networks (GAT)**

GATs are motivated by the fact that not all neighbor nodes have equal importance. GATs are similar to GCNs, but rather than doing simple mean aggregation, we weight each neighbor’s message using attention weights [5]. Namely, the update rule looks like:
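Written in the same notation as the GCN update (keeping the self term via *B_k*, as in the GCN form above):

```math
h_v^{(k+1)} = \sigma\left( \sum_{u \in N(v)} \alpha_{vu}\, W_k\, h_u^{(k)} + B_k\, h_v^{(k)} \right)
```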

We can see that the GCN update rule is the special case of the GAT rule where:
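That is, GCN corresponds to weighting every neighbor uniformly:

```math
\alpha_{vu} = \frac{1}{|N(v)|}
```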

Note that unlike the weight matrices, the attention weights are not standalone learned parameters; they are computed from the node embeddings at each layer. To compute these attention weights, we first compute our attention coefficients. For each node *v* and neighbor *u*, the coefficient is computed as:
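Here *a* denotes a learnable attention mechanism (a single-layer feed-forward network in [5]):

```math
e_{vu} = a\left( W_k\, h_v^{(k)},\ W_k\, h_u^{(k)} \right)
```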

We can then compute the final attention weights by using a softmax function to ensure that the weights sum up to 1:
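Normalizing over node *v*’s neighborhood:

```math
\alpha_{vu} = \frac{\exp(e_{vu})}{\sum_{w \in N(v)} \exp(e_{vw})}
```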

This allows us to train the attention weights jointly with our weight matrices.

# Training Parameters

To train our model, we randomly split our dataset into training, validation, and test sets according to a 90/5/5 split. We used the Adam algorithm as our training optimizer and negative log likelihood as our loss function.
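A minimal sketch of this training setup is below. The learning rate and batch size are illustrative assumptions rather than our actual hyperparameters, and a tiny linear model stands in for WikiNet.

```python
import torch
import torch.nn as nn

# Stand-in for WikiNet: any module ending in log-softmax over 4064 classes.
model = nn.Sequential(nn.Linear(64, 4064), nn.LogSoftmax(dim=-1))

# Adam optimizer + negative log likelihood loss, as described above.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # lr is assumed
loss_fn = nn.NLLLoss()

x = torch.randn(8, 64)             # dummy batch of input features
y = torch.randint(0, 4064, (8,))   # dummy target-article labels

loss = loss_fn(model(x), y)        # negative log likelihood of the targets
optimizer.zero_grad()
loss.backward()
optimizer.step()                   # one training step
```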

# Experiment Results

We evaluated our three GNN variations for WikiNet on target article prediction accuracy. This is analogous to the precision@1 metric used by Cordonnier & Loukas [2]. Each of our GNN-RNN hybrid models achieved between 3 and 10 percentage points higher accuracy on target article prediction than a pure LSTM baseline run on the raw node features. **We report a top accuracy of 36.7% using GraphSAGE** as our model’s GNN variant. It seems that the ability of graph neural networks to capture and encode a Wikipedia page’s local neighborhood structure leads to significantly better target article prediction than the navigation path sequence alone.

## Citations

[1] West, R. & Leskovec, J. Human Wayfinding in Information Networks (2012), WWW 2012 — Session: Web User Behavioral Analysis and Modeling

[2] Cordonnier, J.B. & Loukas, A. Extrapolating paths with graph neural networks (2019).

[3] Kipf, T.N. & Welling, M. Semi-Supervised Classification with Graph Convolutional Networks (2017).

[4] Hamilton, W.L. & Ying, R. & Leskovec, J. Inductive Representation Learning on Large Graphs (2018).

[5] Veličković, P. et al. Graph Attention Networks (2018).