Graph Analytics: Relationship (link) prediction in graphs using Neo4j

determining future relations in a network

Published in

Data Science in your pocket

6 min readJul 3, 2022

As we come towards the end of this Graph analytics blog series, we pick up the most important use case one can cater to using graphs i.e. predicting future relationships in a network. This can be of great use for social media platforms (remember the friend suggestion feature on Instagram or FB) or even helpful in building recommender systems Or solving some sort of Murder Mystery as well (remember the last murder mystery movie you saw & how the police nab the mastermind by drawing some sort of network between suspects & finding a connection between them)

There are many base ideas that we can follow to determine potential links in a graph using the algorithms we have already discussed in the past blogs.

Note: Kindly visit the previous few blogs for a better understanding
Graph analytics basics
Pathfinding algorithms in graphs
Finding important nodes in graphs
Community detection in graphs

A few prerequisites we need to do in Neo4j (as we did in all the previous blogs)

Load dataset from CSV after putting it in the import folder

flight_network_spicejet.csv

Edit description

drive.google.com

The query

CALL apoc.load.csv("flight_network_spicejet.csv")
YIELD list as item
MERGE(n:city{name:item[0],lat:toInteger(item[3]),long:toInteger(item[4])})
MERGE (m:city{name:item[1]})
MERGE (n)-[:flight{distance:toInteger(item[2])}]->(m)

Create a graph projection

CALL gds.graph.project("airline",{city:{properties:["lat","long"]}},{flight:{properties:"distance",orientation:'UNDIRECTED'}})

If you face any issues & can’t understand the above queries, do visit my previous post for clarification on how to import data with all steps in detail

Community-based

The idea is straightforward if two node pairs belong to the same community, there is a potential for an edge between them. How is the community for a node determined? You can refer to the previous post based on the same community in our airline network.

The entire process can be divided into two parts

Determine communities
Form edges between nodes that fall in the same community but don’t have an edge as of now

Query 1 (determining query using Louvain algorithm)

CALL gds.louvain.write('airline', { writeProperty: 'community' })YIELD communityCount, modularity, modularities

In the above query, we determined the community of each node using Louvain & create a new property per node called ‘community’

Next, we will create an edge between any 2 nodes which fall in the same community but don't have an edge

MATCH (u:city)MATCH (v:city)where u<>v and u.community=v.communityOptional MATCH (u)-[r:flight]->(v) where r is nullcreate (u)-[x:predicted]->(v)

This needs some explanation

First of all, we are trying to test every node pair (u, v) such that u!=v & they are in the same community
The optional match is done to figure out whether an edge is present or not
If not, create an edge ‘predicted’ between them

The output

Yellow (predicted) and blue (original) links

The new edges (yellow) created do appear to form some clusters. This is quite obvious as we have created new edges between those node pairs where they fall in the same community, hence appearing as clusters on visualization

Distance-based

Shortest Path

As you must have assumed, this directly relates to the shortest distance between two nodes in the network. How to calculate the shortest path between nodes? we have discussed this as well. We can have a go with either of the below conditions to predict links :

Set up a threshold. If the shortest distance between any node pair is less than this threshold, we say there can be a potential node

We can sort node pairs based on the shortest distance & pick up top ’n’ pairs where we say we have a strong probability of having a link.

The query for the first type of link prediction (using a threshold) is below

match (u:city)match (v:city)where u<>vOptional MATCH (u)-[r:flight]->(v) where r is nullCALL gds.shortestPath.dijkstra.stream('airline', {sourceNode: u,targetNode: v,relationshipWeightProperty: 'distance'})YIELD index, sourceNode, targetNode, totalCost, nodeIds, costs, pathwhere totalCost<750create (u)-[x:predicted]->(v)

This is quite similar to the one we used earlier for ‘community-based’ predictions but with some subtle changes

We are calculating the shortest path between two nodes (u,v) where u!=v & there is no edge between them
The shortest path between the 2 nodes should be less than 750 (km) for a new edge to exist

The output

2. The Katz Index

Katz index takes into account all the possible paths present between any node pair & does a summation to calculate the final metric to consider a link or not. The formula goes something like this

Link prediction metric = ∑ₐβᵃ path(u, v; a)

Let’s decode it

Here, path(u, v; a) refers to total paths between node u & v of the length ‘a’
βᵃ is nothing but a weightage given to paths of each length. For example, paths of length1 or 2 should be given higher weightage & hence a higher β compared to paths of length 5 or 6

So if we have 2 paths of length=1, 3 paths of length=2 & 2 paths of length=3, then our Katz index=

β¹ x 2 + β² x 3 + β³x 2

where β¹ , β² , β³ are weights assigned for each path length

Neighborhood-based

Common connections

As the name suggests, the more common connections between 2 nodes, the more the chances of a link between the 2 nodes. So if ‘Raj’ & ‘Tanu’ has 100 mutual friends compared to ‘Raj’ & ‘Rani’ with 80 mutual friends, there are more chances of a friendship between Raj & Tanu than Raj & Rani (People you may know)

Link prediction chances = set(connections(Raj))∩set(connections(Tanu))

Below is the query for the same where we have put on a condition that minimum common neighbors should be >8

MATCH (u) MATCH (v)where u<>v and  gds.alpha.linkprediction.commonNeighbors(u, v)>8optional match (u)-[r:flight]-(v) where r is nullcreate (u)-[x:predicted]->(v)

The above query is similar to what we discussed for the above cases, where we find node pairs (u,v) such that they aren’t the same & don’t have an edge. If yes, then we calculate the common neighbor's count & if it is bigger than a given threshold, a new edge is predicted

The output

Adamic-Adar

An extension of Common connections, rather than finding just common connections, it also provides higher weights to rare common connections compared to other common connections. The formula for calculating Adamic-Adar

Link Prediction chances =Σi 1/log(Degree(i)),

i ∈ Connections(u) ∩ Connections(v)

Let’s decode this equation

What is ‘i’ first of all? its a common connection node in the intersection of Connections(u) ∩ Connections(v)
What we are doing is a summation of the inverse of the degree of each common connection

Assume we need to calculate Link Prediction chances between node U & node V in the below scenarios

Hands-On Graph Analytics with Neo4j (oreilly.com)

In the left scenario, X has degree 3 while on the right-hand side, X has degree 7. If X is the only common connection, then the final Link Prediction chance for

Left:=1/log(3)
Right = 1/log(7)

i.e. A link is more possible between U & V in the left graph than on the right graph. Similarly, if we have more common connection nodes, the inverse degree of each of them will add up to get the final Link possibilities

Total neighbors

The ideology here is Popular people usually know each other. Hence, the same can be applied to networks as well. This metric can be calculated by

Connections(u) ∪ Connections(v)

On the same grounds (that of popularity), we can have another formula as well i.e. Degree(u) x Degree(v) called Preferential attachment. Hence, the more the connections, the more the chances of a link between them as well

Let’s run an example with a preferential attachment where

degree(u) x degree(v)>1000

MATCH (u) 
MATCH (v)where u<>v and  gds.alpha.linkprediction.preferentialAttachment(u, v)>1000optional match (u)-[r:flight]-(v) where r is nullcreate (u)-[x:predicted]->(v)

The output

This brings an end to the Graph Analytics blog series !!

Graph Analytics: Relationship (link) prediction in graphs using Neo4j

determining future relations in a network

flight_network_spicejet.csv

Edit description

Community-based

Distance-based

Neighborhood-based

Total neighbors

Written by Mehul Gupta