A computer observing a network of data, in the style of Salvador Dalí [image courtesy of DALL·E]

How to: Link Prediction using a Knowledge Graph and PyTorch Geometric

The past, present and future of building GNNs with TypeDB

James Fletcher
Aug 3, 2022 · 9 min read

What Is Link Prediction and Why Does Context Matter?

Link prediction is the task of predicting new connections between two entities. This task pops up in a wide range of domains and applications. Here we are concerned with domains where link prediction depends heavily on the context of each entity involved, or the context that connects them. If that's not the case, you should consider recommender system methods like collaborative filtering; these focus on finding common clusters, rather than the deeper contextual analysis that link prediction emphasises.

Why is context important for link prediction, and what do we mean by context? We mean the data connected to a concept of interest. Since these connections matter, we are talking about graph data. Context could be directly connected, only one edge away, or indirectly connected, multiple edges or “hops” away from our concept.

Contextual features could be the only differentiators between our concepts of interest, and hence are paramount for link prediction. Take proteins in the life sciences as an example: we may know very little about a protein itself, but we know what other proteins, pathways and functions it is associated with. Therefore the only way to differentiate proteins is via contextual awareness.

Link prediction — hopping into context [image courtesy of DALL·E]

Knowledge graphs are the kings of context: their goal is to ingest and model knowledge and meaning properly and correctly. A knowledge graph has a schema, and often supports logical reasoning, as this goes hand-in-hand with semantic representations. TypeDB is a knowledge graph and has both of these features, though we won't be discussing them here.

Given that knowledge graphs are so good for context, it's no surprise that there's a big rush to use them for link prediction; in our experience as a knowledge graph provider, it's the most requested machine learning task.

In this article we announce a new release with a brand-new architecture for building Graph Neural Networks (GNNs) with TypeDB. Read on for that, and for how we got there!

Knowledge Graph Machine Learning Used to Be Difficult

We first started researching machine learning for knowledge graphs at Vaticle back in 2018. The space looked very different back then; only early works had laid the foundations for how to think about graph learning. The natural first focus in the literature was on homogeneous (un-typed) data. The most common example was link prediction of new protein-protein interactions over an existing dataset. A TypeDB database contains strongly typed, heterogeneous data by definition, and we wanted to make use of that. So we needed to go beyond the state of the art at the time.

We started out implementing the Stanford SNAP group's GraphSAGE ourselves, for link prediction on heterogeneous data. That worked and looked promising (watch the presentation here), but it wasn't a true native graph approach, since it uses neighbourhood sampling, so the direction seemed off. Then along came Relational Inductive Biases from DeepMind, and its accompanying library: graph_nets. This was the first proper graph-native learning that we saw, making use of omnidirectional message passing to build node and edge representations. It was the advent of native graph learning. We liked it, so we built with it.

We used graph_nets and TensorFlow to create a purpose-built model for TypeDB that could consume data directly as it's represented in TypeDB, and perform link prediction. Given the nature of the innovation we added, we named the model the Knowledge Graph Convolutional Network (KGCN) (presentation here).

The model worked well, and saw use in pharmaceuticals and robotics. The pitch you'll see in the presentations got its fair share of attention, and to this day our direction remains the same.

When someone else’s ideas power your brain [image courtesy of DALL·E]

This all sounds great: what could be the problem with having a purpose-built link prediction model for TypeDB? The issue is that this was one very specific implementation, lacking the composability and flexibility for users to build models for their own specific purposes. In essence, we didn't have the bandwidth to evolve it into a full-blown library. A lot of users wanted to experiment with techniques for Graph Neural Networks (GNNs) and Graph Convolutional Networks (GCNs) that they had read about in the latest suite of graph learning papers, be it attention or transformer architectures. We were stuck, unable to facilitate that.

The Graph Learning Space Has Evolved

Since 2018 the graph learning space has absolutely exploded as pioneers realised the potential that lay in building models that can learn from rich context. The name that came to the fore was PyTorch Geometric. We had a lot of users requesting access to a framework like this. Why? To be able to take one of the many algorithms from research papers that come implemented out-of-the-box, and use them quickly on their TypeDB data.

This makes perfect sense: why reinvent the wheel, when the raison d’être of PyTorch Geometric (PyG) is to keep up with the latest and greatest graph learning trends in the literature, and to implement these algorithms as building blocks for users to pick-and-mix and flexibly build their own flavour of model?

Using building blocks is wise [image courtesy of DALL·E]

There has also been a strong push in the field, naturally, to build models that leverage heterogeneous data. That fulfils the major directive of our KGCN: to fully utilise type labels.

Indeed, PyTorch Geometric supports a wide range of state-of-the-art neural network layers, including convolutional layers for heterogeneous data, such as:

  • HGTConv, the Heterogeneous Graph Transformer Operator
  • HEATConv, the Heterogeneous Edge-Enhanced Graph Attentional Operator
  • HANConv, the Heterogeneous Graph Attention Operator

You can plug together any network architecture you want using these and the general-purpose layers provided (think pooling, linear, convolutional, normalisation).
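
To make that concrete, here is a minimal sketch of plugging PyG building blocks together into a heterogeneous GNN, stacking HGTConv layers followed by a general-purpose Linear layer. The layer sizes and the model structure are illustrative assumptions, not code from TypeDB-ML:

```python
# A minimal sketch of composing a heterogeneous GNN from PyG building
# blocks (HGTConv plus a general-purpose Linear layer). Sizes are
# illustrative assumptions.
import torch
from torch_geometric.nn import HGTConv, Linear

class HGT(torch.nn.Module):
    def __init__(self, metadata, hidden_channels=64, out_channels=32,
                 num_heads=2, num_layers=2):
        super().__init__()
        self.convs = torch.nn.ModuleList([
            # in_channels=-1 lets PyG infer input sizes lazily per node type
            HGTConv(-1 if i == 0 else hidden_channels, hidden_channels,
                    metadata, heads=num_heads)
            for i in range(num_layers)
        ])
        self.lin = Linear(hidden_channels, out_channels)

    def forward(self, x_dict, edge_index_dict):
        for conv in self.convs:
            x_dict = conv(x_dict, edge_index_dict)
            x_dict = {ntype: x.relu() for ntype, x in x_dict.items()}
        # Project each node type's representation to the output size
        return {ntype: self.lin(x) for ntype, x in x_dict.items()}

# Usage, given a PyG HeteroData object `data`:
# model = HGT(metadata=data.metadata())
# out = model(data.x_dict, data.edge_index_dict)
```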

This composable methodology is clearly very desirable, so how can we use it with TypeDB?

TypeDB Integrations with Graph ML Frameworks

It became very clear that the most useful work that we could do for TypeDB users was to build a bridge between TypeDB and popular frameworks for graph data. This empowers TypeDB users to get up and running quickly and appreciate the value that they can get from TypeDB for machine learning applications (besides all the other reasons to use TypeDB). With that we are pleased to announce TypeDB-ML 0.3 is out now! The repo has been overhauled to provide integrations with some popular frameworks. Follow the README for install instructions.

PyTorch Geometric

We now have integrations for PyTorch Geometric! We have core components to:

  • Lazily pipe data out of TypeDB (via TypeQL queries) into PyG Data and HeteroData objects, PyG's graph representation objects (see the sketch below);
  • Encode features from the type information imported from TypeDB. We also provide encoders for attribute values (custom encoders can be added trivially).

Plenty more details in the README.
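
For a sense of the target representation, here's a minimal sketch of a PyG HeteroData object of the kind the integration populates. The node types, counts and feature sizes are hypothetical stand-ins for whatever your TypeDB schema defines:

```python
# A hand-built HeteroData object standing in for one imported from
# TypeDB. Types, counts and feature sizes are hypothetical.
import torch
from torch_geometric.data import HeteroData

data = HeteroData()
# Node features, one tensor per node type
data["protein"].x = torch.randn(100, 16)
data["pathway"].x = torch.randn(20, 8)
# Typed edges as [2, num_edges] index tensors
data["protein", "participates_in", "pathway"].edge_index = torch.stack([
    torch.randint(0, 100, (300,)),  # source protein indices
    torch.randint(0, 20, (300,)),   # destination pathway indices
])
print(data.metadata())  # (node types, edge types)
```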

Worked Example
Importantly, we've constructed a full example for link prediction using TypeDB, TypeDB-ML and PyTorch Geometric. It uses a Heterogeneous Graph Transformer network for link prediction, as per this paper, and is capable of making link predictions across all possible valid links in the data provided. The model architecture is set up to predict using the dot product of the representations of the two concepts involved in the link. This means we can efficiently predict links even in dense networks where the number of possible valid links is very large.
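
That dot-product scoring step is simple enough to sketch in isolation. Assuming node embeddings have already been produced by a heterogeneous GNN encoder (like the HGT sketch earlier), scoring candidate links looks like this; the names are illustrative, not the example's actual code:

```python
# A minimal sketch of a dot-product link decoder, assuming embeddings
# from an upstream heterogeneous GNN encoder. Names are illustrative.
import torch

def decode_links(z_src: torch.Tensor, z_dst: torch.Tensor,
                 edge_index: torch.Tensor) -> torch.Tensor:
    """Score candidate links as the dot product of endpoint embeddings."""
    src, dst = edge_index  # shape [2, num_candidate_links]
    return (z_src[src] * z_dst[dst]).sum(dim=-1)  # one score per link
```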

Modifying this example should kickstart any machine learning project with TypeDB. Even if your problem isn't link prediction, there's plenty of inspiration for how to modify our example and use PyTorch Geometric to build models for edge weight classification, node classification and many more problem statements, specifically for heterogeneous data.

NetworkX

TypeDB-ML also integrates with NetworkX for in-memory graph representation. NetworkX comes with a huge array of general-purpose graph algorithms. These algorithms are a great complement to TypeDB itself, which focuses on pattern matching with logical reasoning. It means that if there's an algorithm you really need, you have the option to export a subgraph into NetworkX and run your processing there, in-memory. Naturally this has scaling limitations and isn't a full replacement for OLAP.
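
As a quick sketch of what that looks like, here's a general-purpose NetworkX algorithm running on a small graph. The hand-built graph is a placeholder standing in for a subgraph exported from TypeDB:

```python
# A minimal sketch of running NetworkX algorithms on an exported
# subgraph. The graph below is a hand-built placeholder.
import networkx as nx

G = nx.DiGraph()
G.add_edges_from([
    ("protein-a", "pathway-1"),
    ("protein-b", "pathway-1"),
    ("protein-b", "pathway-2"),
])

# Any of NetworkX's general-purpose algorithms can now run in-memory:
print(nx.pagerank(G))
print(nx.degree_centrality(G))
```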

A new bridge to graph learning frameworks [image courtesy of DALL·E]

Sunset of KGCN and the name KGLIB

The next natural step in our machine learning support for TypeDB was to add support for PyTorch Geometric and NetworkX. We have renamed the repo from KGLIB to TypeDB-ML to make it a machine learning integrations library. We're also saying goodbye to our own KGCN model, now that it's quite trivial to build a link prediction model more capable than KGCN using the integrations (again, see the example)!

What Can We Do Now?

PyTorch Geometric integration lets us build arbitrary models for our problems. Following the PyG examples for heterogeneous data we can already see formulations for:

  • Link prediction
  • Edge multi-class classification (which can be used for discrete edge weight prediction, e.g. movie ratings)
  • Node multi-class classification

This is already a great toolbox of problem statements solvable with TypeDB, TypeDB-ML and PyTorch Geometric together. Better still, you can build a model entirely of your own design (see the training sketch below). Very useful if you've got a non-standard problem statement!
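
To show how the link prediction formulation fits together end-to-end, here's a minimal sketch of one training step, combining an encoder and dot-product decoder like those sketched earlier with random negative sampling and a binary cross-entropy loss. All names, including the node types, are illustrative assumptions:

```python
# A minimal sketch of one link prediction training step: positive
# links come from the data; negatives are sampled uniformly at random.
# All names, including node types, are illustrative assumptions.
import torch
import torch.nn.functional as F

def train_step(model, optimizer, data, pos_edge_index):
    model.train()
    optimizer.zero_grad()
    z_dict = model(data.x_dict, data.edge_index_dict)
    # Hypothetical source/destination types for the link being predicted
    z_src, z_dst = z_dict["protein"], z_dict["pathway"]
    # Negative sampling: random endpoint pairs treated as non-links
    num_pos = pos_edge_index.size(1)
    neg_edge_index = torch.stack([
        torch.randint(0, z_src.size(0), (num_pos,)),
        torch.randint(0, z_dst.size(0), (num_pos,)),
    ])
    edge_index = torch.cat([pos_edge_index, neg_edge_index], dim=1)
    labels = torch.cat([torch.ones(num_pos), torch.zeros(num_pos)])
    # Dot-product scores for positive and negative candidates
    scores = (z_src[edge_index[0]] * z_dst[edge_index[1]]).sum(dim=-1)
    loss = F.binary_cross_entropy_with_logits(scores, labels)
    loss.backward()
    optimizer.step()
    return float(loss)
```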

How to Build Your Own Link Prediction for TypeDB

We recommend following the structure of the example to get started building your own Graph Neural Network. The key steps to note are the following:

  1. Make a TypeDB database with your data. It takes time and effort to do this properly for a decent-sized database, but the community project TypeDB Loader can do all the heavy lifting if you're loading from tabular data sources. There are a lot of resources for how to create schemas and load data into TypeDB: you can find these on the website, in the examples repo, on the discussion forum and on the YouTube channel!
  2. Write queries that can extract the relevant subgraph, ready to be split into training, validation and testing datasets. Take a look at the example for how to do this; it's a bit fiddly right now, but doable!
  3. Create an encoder for each attribute type, which can include creating your own kinds of encoder specific to the meaning of your attributes' values. For example, try wrapping this BERT sentence transformer package if your TypeDB instance has freeform text in any string attributes (see the sketch after this list).
  4. Customise the link prediction model however you want, making it suitable to your task. The examples from PyTorch Geometric are helpful to see how to change the framing and manipulate the data for different learning tasks.
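
As a sketch of step 3, here's what wrapping the sentence-transformers package as a custom attribute encoder might look like. The class shape and model name are illustrative assumptions, not TypeDB-ML's actual encoder interface:

```python
# A hypothetical custom attribute encoder wrapping sentence-transformers,
# turning a string attribute's value into a fixed-size feature vector.
# The class shape and model name are illustrative assumptions.
from sentence_transformers import SentenceTransformer

class TextAttributeEncoder:
    """Encode freeform-text attribute values as sentence embeddings."""

    def __init__(self, model_name: str = "all-MiniLM-L6-v2"):
        self.model = SentenceTransformer(model_name)

    def __call__(self, value: str):
        # Returns a 1-D numpy array embedding for the attribute value
        return self.model.encode(value)

# encoder = TextAttributeEncoder()
# features = encoder("freeform text stored in a string attribute")
```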

Developer Community

If you try out TypeDB-ML, whether you're doing link prediction or any other learning task, then jump into the Vaticle community. In particular, give the Discord server a look: there's a dedicated #typedb-ml channel, plus channels to ask for help getting started with TypeDB in general!

Future Work

We would like to support more frameworks, including PyKEEN and DGL, and to expand the examples significantly, with real-world datasets to test on. If you are interested in contributing to any of these parts then please reach out on the Discord server!

Use TypeDB-ML to build a steampunk breakfast-making machine like this [image courtesy of DALL·E]


James Fletcher

Principal Scientist at Vaticle. Researching the intelligent systems of the future, enabled by Knowledge Graphs.