Link Prediction with Neo4j Part 1: An Introduction

This is the beginning of a series of posts (part 2) about link prediction with Neo4j. We’ll start the series with an overview of the problem and associated challenges, and in future posts will explore how the link prediction functions in the Neo4j Graph Algorithms Library can help us predict links on example datasets.

Update: The O’Reilly book “Graph Algorithms on Apache Spark and Neo4j Book is now available as free ebook download, from neo4j.com

Let’s start at the beginning…what is link prediction?

What is link prediction?

Link prediction has been around for ages, but as far as I can tell was popularised by a paper written by Jon Kleinberg and David Liben-Nowell in 2004, titled The Link Prediction Problem for Social Networks.

Kleinberg and Liben-Nowell approach this problem from the perspective of social networks, asking the question:

Given a snapshot of a social network, can we infer which new interactions among its members are likely to occur in the near future? We formalize this question as the link prediction problem, and develop approaches to link prediction based on measures for analyzing the “proximity” of nodes in a network.

More recently, my colleague Dr Jim Webber has been explaining how the evolution of a graph can be determined from its structure in his entertaining GraphConnect San Francisco 2015 talk where he explained World War 2 using graphs.

Apart from predicting World Wars or friendships in a social network where else might we want to predict links?

We could predict future associations between people in a terrorist network, associations between molecules in a biology network, potential co-authorships in a citation network, interest in an artist or artwork, to name just a few use cases.

In each these examples, predicting a link means that we are predicting some future behaviour. For example in a citation network, we’re actually predicting the action of two people collaborating on a paper.

Link Prediction Algorithms

Kleinberg and Liben-Nowell describe a set of methods that can be used for link prediction. We can see a list of these in the diagram below.

Algorithms described in Kleinberg and Liben-Nowell’s paper

These methods compute a score for a pair of nodes, where the score could be considered a measure of proximity or “similarity” between those nodes based on the graph topology.

The closer two nodes are, the more likely there will be a relationship between them.

Let’s have a look at a few of these measures in more detail so we can get a feel for what they do.

Common Neighbors

One of the simplest measures that we can compute is common neighbors, as described by Ahmad Sadraei:

The common-neighbors predictor captures the notion that two strangers who have a common friend may be introduced by that friend.
This introduction has the effect of “closing a triangle” in the graph [i.e. a triadic closure as described in Jim’s talk]

As the name suggests, this measure computes the number of common neighbors that a pair of nodes share.

In the graph above, nodes A and D have 2 common neighbors (nodes B and C), whereas nodes A and E only have one common neighbor (node B).

Therefore nodes A and D would be considered closer and more likely to be connected by a link in future.

Adamic Adar

This algorithm was introduced in 2003 by Lada Adamic and Eytan Adar while researching how to predict links in a social network.

This measure builds the common neighbors, but rather than just counting those neighbors, it computes the sum of the inverse log of the degree of each of the neighbors.

The degree of a node is the number of neighbors it has, and the intuition behind this algorithm is that when it comes to closing triangles, nodes of low degree are likely to be more influential.

For example, in a social network, for two people to be introduced by a common friend, the probability of that happening is related to how many other pairs of friends that person has. An unpopular person may therefore be more likely to introduce a pair of their friends.

Preferential Attachment

This is one of the most well known concepts amongst network scientists, having been popularised by Albert-László Barabási and Réka Albert through their work on scale-free networks.

The intuition is that nodes with lots of relationships will gain more relationships.

This measure is one of the easiest to compute — we take the product of the degree of each node.

Link Prediction — Neo4j Graph Algorithms Library

The Neo4j Graph Algorithms library currently contains 6 link prediction algorithms — Adamic Adar, Common Neighbors, Preferential Attachment, Resource Allocation, Same Community, and Total Neighbors.

We chose to start with these ones as they are some of the most common ones we encountered while researching this topic. These algorithms are used to compute values that will be used as part of a larger process.

Link Prediction algorithms in the Neo4j Graph Algorithms library

Let’s quickly learn how the other 5 algorithms work:

  • Adamic Adar — the sum of the inverse log of the degree of common neighbors
  • Preferential Attachment — the product of the degree of each node
  • Resource Allocation — the sum of the inverse of the degree of common neighbors
  • Same Community — checks whether two nodes are in the same community computed by one of the community detection algorithms
  • Total Neighbors — total unique neighbors of the two nodes

Now let’s have a quick look at how to use them. If you haven’t installed the Graph Algorithms library before, it’s super easy — just a one click install from the Neo4j Desktop.

You can also read Jennifer Reif’s excellent blog post for a more detailed explanation.

Now let’s see how to use the common neighbors function from the library by revisiting the example graph from the previous section.

We’ll first create the graph in Neo4j by executing the following Cypher query:

UNWIND [["A", "C"], ["A", "B"], ["B", "D"], 
["B", "C"], ["B", "E"], ["C", "D"]] AS pair
MERGE (n1:Node {name: pair[0]})
MERGE (n2:Node {name: pair[1]})
MERGE (n1)-[:FRIENDS]-(n2)

And then compute common neighbors for nodes A and D using that function:

neo4j> MATCH (a:Node {name: 'A'})
MATCH (d:Node {name: 'D'})
RETURN algo.linkprediction.commonNeighbors(a, d);
+-------------------------------------------+
| algo.linkprediction.commonNeighbors(a, d) |
+-------------------------------------------+
| 2.0 |
+-------------------------------------------+
1 row available after 97 ms, consumed after another 15 ms

These nodes have 2 common neighbors, so they receive a score of 2. Now let’s do the same for nodes A and E. We expect them to receive a score of 1 as they only have a single common neighbor.

neo4j> MATCH (a:Node {name: 'A'})
MATCH (e:Node {name: 'E'})
RETURN algo.linkprediction.commonNeighbors(a, e);
+-------------------------------------------+
| algo.linkprediction.commonNeighbors(a, e) |
+-------------------------------------------+
| 1.0 |
+-------------------------------------------+

As we expected, a score of 1 — good times!

By default the function assumes that it can compute its score using relationships of any type and in either direction. We could also choose to specify this explicitly by passing in parameters:

neo4j> WITH {direction: "BOTH", relationshipQuery: "FRIENDS"} 
AS config
MATCH (a:Node {name: 'A'})
MATCH (e:Node {name: 'E'})
RETURN algo.linkprediction.commonNeighbors(a, e, config)
AS score;
+-------+
| score |
+-------+
| 1.0 |
+-------+

Just for good measure, let’s try one of the other functions to check we’ve got the hang of this.

The preferential attachment function returns the product of the degrees of the two nodes. If we run that for nodes A and D we would expect to get a score of 2*2=4 as nodes A and D both have two neighbors. Let’s give it a try:

neo4j> MATCH (a:Node {name: 'A'})
MATCH (d:Node {name: 'D'})
RETURN algo.linkprediction.preferentialAttachment(a, d)
AS score;
+-------+
| score |
+-------+
| 4.0 |
+-------+

Great! So we know how to compute these scores for our Neo4j graphs. What now?

What should we do with the link prediction scores?

So we’ve now determined that our can be solved by link prediction and we’ve computed the relevant proximity measures described above, but we still need to decide how to use these measures to predict links.

There are two approaches that I came across in the literature:

Using the measures directly

We can use the scores from the link prediction algorithms directly. With this approach we would set a threshold value above which we would predict that a pair of nodes will have a link.

In the example above we might say that every pair of nodes that has a preferential attachment score above 3 would have a link, and any with 3 or less would not.

Supervised learning

We can take a supervised learning approach where we use the scores as features to train a binary classifier. The binary classifier then predicts whether a pair of nodes will have a link.

Guillaume Le Floch describes these two approaches in more detail in his blog post Link Prediction In Large-Scale Networks.

In this series of posts we’re going to focus on the supervised learning approach.

Building a machine learning classifier

Once we’ve decided to use a supervised learning approach, we need to make some decisions related to our machine learning workflow:

  • What machine learning model do we want to use?
  • How are we going to split our data into train and test sets?

Machine learning model

Many of the link prediction measures that we’ve covered so far are computed using similar data, and when it comes to training a machine learning model this means there is feature interaction issue that we need to deal with.

Some machine learning models assume that the features they received are independent. Providing such a model with features that don’t meet this assumption will lead us to predicts things with low accuracy. If we choose one of these models we would need to exclude features with high interaction.

Alternately, we can simplify our life by choosing a model where feature interaction isn’t so problematic.

Ensemble methods tend to work well as they don’t make this assumption on their input data.

This could be a gradient boosting classifier as described in Guillaume’s blog post or a random forest classifier as described in a paper written by Kyle Julian and Wayne Lu.

Train and test data sets

The tricky thing with the train and test set is that we can’t just randomly split the data, as this could lead to data leakage.

Data leakage can occur when data outside of your training data is inadvertently used to create your model. This can easily happen when working with graphs because pairs of nodes in our training set may be connected to those in the test set.

When we compute link prediction measures over that training set the measures computed contain information from the test set that we’ll later evaluate our model against.

Instead we need to split our graph into training and test sub graphs. If our graph has a concept of time our life is easy — we can split the graph at a point in time and the training set will be from before the time, the test set after.

This is still not a perfect solution and we’ll need to try and ensure that the general network structure in the training and test sub graphs is similar.

Once we’ve done that we’ll have pairs of nodes in our train and test set that have relationships between them. They will be the positive examples in our machine learning model.

Now for the negative examples.

The simplest approach would be to use all pair of nodes that don’t have a relationship. The problem with this approach is that there are significantly more examples of pairs of nodes that don’t have a relationship than there are pairs of nodes that do.

The maximum number of negative examples is equal to:

# negative examples = (# nodes)² - (# relationships) - (# nodes)

i.e. the number of nodes squared, minus the relationships that the graph has, minus self relationships.

If we use all of these negative examples in our training set we will have a massive class imbalance — there are many negative examples and relatively few positive ones.

A model trained using data that’s this imbalanced will achieve very high accuracy by predicting that any pair of nodes don’t have a relationship between them, which is not quite what we want!

So we need to try and reduce the number of negative examples. An approach described in several link prediction papers is to use pairs of nodes that are a specific number of hops away from each other.

This will significantly reduce the number of negative examples, although there will still be a lot more negative examples than positive.

To solve this problem we either need to down sample the negative examples or up sample the positive examples.

What next?

We’re now covered the theory behind link prediction, and we’re ready to try out this approach on a dataset. That’s exactly what we’ll be doing in the next post of the series, which will be coming out next week.

In the mean time, if you’re interested in this topic the following links may be of interest:

Free download: O’Reilly “Graph Algorithms on Apache Spark and Neo4j”