Graph Analytics: Determining important nodes in a graph using Neo4j
Centrality algorithms for graphs with example
After covering graph databases basics & Pathfinding algorithms, we shift our focus to finding important nodes in the graph using neo4j. The dataset remains the same i.e. Spicejet’s airline network in India that we used in the last post
Note: kindly refer to the previous parts for a better understanding here
Going with life’s general rule, anything important to someone may be useless to others. Hence, even here, the ‘importance’ of a node is subjective & depending upon different criteria, we can have different nodes becoming ‘important’ compared to others. So let’s discuss these criteria 1st & then Neo4j Cypher queries for the same. Based on the criteria, the logic followed to rank nodes based on importance is called centrality algorithms
Do read the other parts here
Pathfinding algorithms in graphs
But before that, let’s repeat a few steps we did in the last blog i.e. Loading CSV in Neo4j & forming a graph projection. Do refer to the previous post for dataset location & Cypher queries for the same
Once done, we are all ready to go
Popularity/ Super spreader
By popularity, we mean the nodes which are connected to most of the other nodes in the graph are important. These nodes can be helpful in spreading any information at a rapid pace throughout the network. Such nodes can be very important in logistic chains.
The degree of a node is the number of connections it has in the graph system. If directed graph, we can define incoming & outgoing degrees separately while sorting nodes for the importance
Hence, sorting nodes by their degree should give us popular nodes.
Let’s find out the most important city in our Airline network
CALL gds.degree.stream('airline')YIELD nodeId, scoreRETURN gds.util.asNode(nodeId).name AS name, score AS total_trafficORDER BY total_traffic DESC, name DESC limit 5
Looking at the query
gds.degree.stream returns nodeId (assigned by system in our case) & the score(degree)
To fetch the name of the nodeId (city name), we use gds.util.asnode
Ordered the result in descending top 5
The result:
╒═══════════╤════════════════╕
│"n.name" │"outgoingDegree"│
╞═══════════╪════════════════╡
│"Mumbai" │36 │
├───────────┼────────────────┤
│"Delhi" │35 │
├───────────┼────────────────┤
│"Bengaluru"│34 │
├───────────┼────────────────┤
│"Hyderabad"│31 │
├───────────┼────────────────┤
│"Chennai" │27 │
And the least important ones?
═════════════╤════════════════╕
│"n.name" │"outgoingDegree"│
╞═════════════╪════════════════╡
│"Porbandar" │1 │
├─────────────┼────────────────┤
│"Dibrugarh" │2 │
├─────────────┼────────────────┤
│"Kandla" │2 │
├─────────────┼────────────────┤
│"Pondicherry"│2 │
├─────────────┼────────────────┤
│"Rajkot" │2 │
└─────────────┴────────────────┘
Bridge/critical
These nodes may not have a maximum number of connections but have critical connections connecting different graph components. Assume there is a road connecting 100s of villages to a Metro city. If this road gets blown off, then we will have a segmented map where Metro may not be able to get its agricultural supplies & villages misses out on important healthcare facilities. Similarly in Star topology, the entire network may go down if the central node is gone.
Hence, a node holding a critical edge can be the most important node in many scenarios as if removed, the graph may get splitted into multiple, disconnected components. One of the ways to find a bridging node is by betweenness criteria discussed ahead.
PageRank
Heard this name already? This is a popular algorithm used by search engines for ranking results. it is based on the intuition that not all edges in a graph coming to a node holds the same importance & hence, node importance is calculated using a weighted sum of the importance of all the nodes to which it is connected. Need to deep dive, follow this post where I have already discussed it thoroughly with example
Let’s run PageRank on our airline network
CALL gds.pageRank.stream('airline')YIELD nodeId, scoreRETURN gds.util.asNode(nodeId).name AS name, score as PageRank_ImportanceORDER BY score DESC limit 5
& the result?
╒═══════════╤═════════════════════╕
│"name" │"PageRank_Importance"│
╞═══════════╪═════════════════════╡
│"Mumbai" │3.545723527099885 │
├───────────┼─────────────────────┤
│"Delhi" │3.3088394494242124 │
├───────────┼─────────────────────┤
│"Bengaluru"│3.248649744452381 │
├───────────┼─────────────────────┤
│"Hyderabad"│2.9839819900485427 │
├───────────┼─────────────────────┤
│"Chennai" │2.528163472478114 │
└───────────┴─────────────────────┘
As we can observe the results for the top spots are pretty similar to the ones we observed in the popularity case.
We can have multiple variants of the PageRank algorithm according to the system we are designing. A few of them are listed below
- Weighted/Personalised PageRank: Here, weights are assigned to some critical nodes beforehand depending on domain knowledge.
The query for Personalized PageRank provides more weightage to ‘Hyderabad’
MATCH (node_city:city{name:'Hyderabad'})CALL gds.pageRank.stream('airline', {sourceNodes: [node_city]})YIELD nodeId, scoreRETURN gds.util.asNode(nodeId).name AS name, scoreORDER BY score DESC, name ASC limit 5
This might require some explanation
- First of all, we fetched the node to which we wish to provide weightage beforehand (Hyderabad)
- We passed it to the gds implementation of PageRank as sourceNode
The results have changed now
╒═══════════╤════════════════════╕
│"name" │"score" │
╞═══════════╪════════════════════╡
│"Hyderabad"│0.20618091845916342 │
├───────────┼────────────────────┤
│"Bengaluru"│0.061735807515226095│
├───────────┼────────────────────┤
│"Mumbai" │0.05732258194851915 │
├───────────┼────────────────────┤
│"Delhi" │0.05416241324803417 │
├───────────┼────────────────────┤
│"Chennai" │0.048330819832748464│
└───────────┴────────────────────┘
- ArticleRank: It moves with the idea that the edge comes from a node that has very few connections and should be provided more weightage compared to others
The query for ArticleRank is similar to that of page rank. Even the result is the same in our case as we got in the naive implementation of PageRank
CALL gds.articleRank.stream('airline')YIELD nodeId, scoreRETURN gds.util.asNode(nodeId).name AS name, scoreORDER BY score ASC limit 5
Closeness
As the name suggests, it gives importance to those nodes that are comparatively closer to all the nodes i.e. average distance from all nodes is the lowest. This can again be useful in planning warehouses for any logistic chain where the average distance for each location is critical & not just connectivity (as in popularity). The formula to calculate the closeness metric/centrality for a node ‘U’ can be something like this
closeness centrality(u) = (Total nodes — 1) / sum(shortest distance from ‘u’ to all other nodes)
Let’s find out the least important nodes given closeness centrality
CALL gds.beta.closeness.stream('airline')YIELD nodeId, scoreRETURN gds.util.asNode(nodeId).name AS city, scoreORDER BY score ASC limit 5
The results are different compared to the least important nodes according to popularity criteria
╒═══════════╤═══════════════════╕
│"city" │"score" │
╞═══════════╪═══════════════════╡
│"Dibrugarh"│0.40336134453781514│
├───────────┼───────────────────┤
│"Silchar" │0.40336134453781514│
├───────────┼───────────────────┤
│"Lilabari" │0.40336134453781514│
├───────────┼───────────────────┤
│"Tirupati" │0.4444444444444444 │
├───────────┼───────────────────┤
│"Porbandar"│0.4485981308411215 │
└───────────┴───────────────────┘
Betweenness
This time, we will count for each node how many times did it occur on the shortest path calculated between every node pair present. For example, if we have 5 nodes in a graph then we will first
- Calculate the shortest distance between every possible node pair that can be formed
- Count how many times a certain node was a part of any shortest path
- The node with the highest count is the most important.
As this task can be computationally heavy when we have a big graph, we have approximation algorithms that calculate the Single Source Shortest Path (All shortest paths from a given node as source & other nodes as destination) for a subset of nodes & not all & based on some approximation, calculate betweenness for every node. One of the algorithms that Neo4j GDS follows is Brandes’ approximation (we won't be discussing it though)
So, how to get node importance using betweenness criteria
CALL gds.betweenness.stream('airline')YIELD nodeId, scoreRETURN gds.util.asNode(nodeId).name AS name, scoreORDER BY score desc limit 5
Results?
╒═══════════╤══════════════════╕
│"name" │"score" │
╞═══════════╪══════════════════╡
│"Mumbai" │234.66790314948216│
├───────────┼──────────────────┤
│"Delhi" │177.51990567911625│
├───────────┼──────────────────┤
│"Bengaluru"│169.0852845984425 │
├───────────┼──────────────────┤
│"Hyderabad"│151.7371778805989 │
├───────────┼──────────────────┤
│"Kolkata" │110.60240855635591│
└───────────┴──────────────────┘
The results differ a bit compared to previous node importance results as Kolkata now is amongst the top 5 nodes meaning most of the shortest paths between any 2 nodes do include it.
With this, it's a wrap !!
Next, we will take a jibe at machine learning using neo4j