Graph Analytics: Relationship (link) prediction in graphs using Neo4j
determining future relations in a network
As we come towards the end of this Graph analytics blog series, we pick up the most important use case one can cater to using graphs i.e. predicting future relationships in a network. This can be of great use for social media platforms (remember the friend suggestion feature on Instagram or FB) or even helpful in building recommender systems Or solving some sort of Murder Mystery as well (remember the last murder mystery movie you saw & how the police nab the mastermind by drawing some sort of network between suspects & finding a connection between them)
There are many base ideas that we can follow to determine potential links in a graph using the algorithms we have already discussed in the past blogs.
Note: Kindly visit the previous few blogs for a better understanding
Pathfinding algorithms in graphs
A few prerequisites we need to do in Neo4j (as we did in all the previous blogs)
- Load dataset from CSV after putting it in the import folder
The query
CALL apoc.load.csv("flight_network_spicejet.csv")
YIELD list as item
MERGE(n:city{name:item[0],lat:toInteger(item[3]),long:toInteger(item[4])})
MERGE (m:city{name:item[1]})
MERGE (n)-[:flight{distance:toInteger(item[2])}]->(m)
- Create a graph projection
CALL gds.graph.project("airline",{city:{properties:["lat","long"]}},{flight:{properties:"distance",orientation:'UNDIRECTED'}})
If you face any issues & can’t understand the above queries, do visit my previous post for clarification on how to import data with all steps in detail
Community-based
The idea is straightforward if two node pairs belong to the same community, there is a potential for an edge between them. How is the community for a node determined? You can refer to the previous post based on the same community in our airline network.
The entire process can be divided into two parts
- Determine communities
- Form edges between nodes that fall in the same community but don’t have an edge as of now
Query 1 (determining query using Louvain algorithm)
CALL gds.louvain.write('airline', { writeProperty: 'community' })YIELD communityCount, modularity, modularities
In the above query, we determined the community of each node using Louvain & create a new property per node called ‘community’
Next, we will create an edge between any 2 nodes which fall in the same community but don't have an edge
MATCH (u:city)MATCH (v:city)where u<>v and u.community=v.communityOptional MATCH (u)-[r:flight]->(v) where r is nullcreate (u)-[x:predicted]->(v)
This needs some explanation
- First of all, we are trying to test every node pair (u, v) such that u!=v & they are in the same community
- The optional match is done to figure out whether an edge is present or not
- If not, create an edge ‘predicted’ between them
The output
The new edges (yellow) created do appear to form some clusters. This is quite obvious as we have created new edges between those node pairs where they fall in the same community, hence appearing as clusters on visualization
Distance-based
- Shortest Path
As you must have assumed, this directly relates to the shortest distance between two nodes in the network. How to calculate the shortest path between nodes? we have discussed this as well. We can have a go with either of the below conditions to predict links :
- Set up a threshold. If the shortest distance between any node pair is less than this threshold, we say there can be a potential node
Or
- We can sort node pairs based on the shortest distance & pick up top ’n’ pairs where we say we have a strong probability of having a link.
The query for the first type of link prediction (using a threshold) is below
match (u:city)match (v:city)where u<>vOptional MATCH (u)-[r:flight]->(v) where r is nullCALL gds.shortestPath.dijkstra.stream('airline', {sourceNode: u,targetNode: v,relationshipWeightProperty: 'distance'})YIELD index, sourceNode, targetNode, totalCost, nodeIds, costs, pathwhere totalCost<750create (u)-[x:predicted]->(v)
This is quite similar to the one we used earlier for ‘community-based’ predictions but with some subtle changes
- We are calculating the shortest path between two nodes (u,v) where u!=v & there is no edge between them
- The shortest path between the 2 nodes should be less than 750 (km) for a new edge to exist
The output
2. The Katz Index
Katz index takes into account all the possible paths present between any node pair & does a summation to calculate the final metric to consider a link or not. The formula goes something like this
Link prediction metric = ∑ₐβᵃ path(u, v; a)
Let’s decode it
- Here, path(u, v; a) refers to total paths between node u & v of the length ‘a’
- βᵃ is nothing but a weightage given to paths of each length. For example, paths of length1 or 2 should be given higher weightage & hence a higher β compared to paths of length 5 or 6
So if we have 2 paths of length=1, 3 paths of length=2 & 2 paths of length=3, then our Katz index=
β¹ x 2 + β² x 3 + β³x 2
where β¹ , β² , β³ are weights assigned for each path length
Neighborhood-based
Common connections
As the name suggests, the more common connections between 2 nodes, the more the chances of a link between the 2 nodes. So if ‘Raj’ & ‘Tanu’ has 100 mutual friends compared to ‘Raj’ & ‘Rani’ with 80 mutual friends, there are more chances of a friendship between Raj & Tanu than Raj & Rani (People you may know)
Link prediction chances = set(connections(Raj))∩set(connections(Tanu))
Below is the query for the same where we have put on a condition that minimum common neighbors should be >8
MATCH (u) MATCH (v)where u<>v and gds.alpha.linkprediction.commonNeighbors(u, v)>8optional match (u)-[r:flight]-(v) where r is nullcreate (u)-[x:predicted]->(v)
The above query is similar to what we discussed for the above cases, where we find node pairs (u,v) such that they aren’t the same & don’t have an edge. If yes, then we calculate the common neighbor's count & if it is bigger than a given threshold, a new edge is predicted
The output
Adamic-Adar
An extension of Common connections, rather than finding just common connections, it also provides higher weights to rare common connections compared to other common connections. The formula for calculating Adamic-Adar
Link Prediction chances =Σi 1/log(Degree(i)),
i ∈ Connections(u) ∩ Connections(v)
Let’s decode this equation
- What is ‘i’ first of all? its a common connection node in the intersection of Connections(u) ∩ Connections(v)
- What we are doing is a summation of the inverse of the degree of each common connection
Assume we need to calculate Link Prediction chances between node U & node V in the below scenarios
In the left scenario, X has degree 3 while on the right-hand side, X has degree 7. If X is the only common connection, then the final Link Prediction chance for
- Left:=1/log(3)
- Right = 1/log(7)
i.e. A link is more possible between U & V in the left graph than on the right graph. Similarly, if we have more common connection nodes, the inverse degree of each of them will add up to get the final Link possibilities
Total neighbors
The ideology here is Popular people usually know each other. Hence, the same can be applied to networks as well. This metric can be calculated by
Connections(u) ∪ Connections(v)
On the same grounds (that of popularity), we can have another formula as well i.e. Degree(u) x Degree(v) called Preferential attachment. Hence, the more the connections, the more the chances of a link between them as well
Let’s run an example with a preferential attachment where
degree(u) x degree(v)>1000
MATCH (u)
MATCH (v)where u<>v and gds.alpha.linkprediction.preferentialAttachment(u, v)>1000optional match (u)-[r:flight]-(v) where r is nullcreate (u)-[x:predicted]->(v)
The output
This brings an end to the Graph Analytics blog series !!