Graph Analytics: Determining important nodes in a graph using Neo4j

Centrality algorithms for graphs with example

Published in

Data Science in your pocket

6 min readJun 11, 2022

After covering graph databases basics & Pathfinding algorithms, we shift our focus to finding important nodes in the graph using neo4j. The dataset remains the same i.e. Spicejet’s airline network in India that we used in the last post

Note: kindly refer to the previous parts for a better understanding here

Going with life’s general rule, anything important to someone may be useless to others. Hence, even here, the ‘importance’ of a node is subjective & depending upon different criteria, we can have different nodes becoming ‘important’ compared to others. So let’s discuss these criteria 1st & then Neo4j Cypher queries for the same. Based on the criteria, the logic followed to rank nodes based on importance is called centrality algorithms

Do read the other parts here

Graph analytics basics

Pathfinding algorithms in graphs

Community detection in graphs

But before that, let’s repeat a few steps we did in the last blog i.e. Loading CSV in Neo4j & forming a graph projection. Do refer to the previous post for dataset location & Cypher queries for the same

Once done, we are all ready to go

Popularity/ Super spreader

By popularity, we mean the nodes which are connected to most of the other nodes in the graph are important. These nodes can be helpful in spreading any information at a rapid pace throughout the network. Such nodes can be very important in logistic chains.

The degree of a node is the number of connections it has in the graph system. If directed graph, we can define incoming & outgoing degrees separately while sorting nodes for the importance

Hence, sorting nodes by their degree should give us popular nodes.

Let’s find out the most important city in our Airline network

CALL gds.degree.stream('airline')YIELD nodeId, scoreRETURN gds.util.asNode(nodeId).name AS name, score AS total_trafficORDER BY total_traffic DESC, name DESC limit 5

Looking at the query

gds.degree.stream returns nodeId (assigned by system in our case) & the score(degree)
To fetch the name of the nodeId (city name), we use gds.util.asnode
Ordered the result in descending top 5

The result:

╒═══════════╤════════════════╕
│"n.name"   │"outgoingDegree"│
╞═══════════╪════════════════╡
│"Mumbai"   │36              │
├───────────┼────────────────┤
│"Delhi"    │35              │
├───────────┼────────────────┤
│"Bengaluru"│34              │
├───────────┼────────────────┤
│"Hyderabad"│31              │
├───────────┼────────────────┤
│"Chennai"  │27              │

And the least important ones?

═════════════╤════════════════╕
│"n.name"     │"outgoingDegree"│
╞═════════════╪════════════════╡
│"Porbandar"  │1               │
├─────────────┼────────────────┤
│"Dibrugarh"  │2               │
├─────────────┼────────────────┤
│"Kandla"     │2               │
├─────────────┼────────────────┤
│"Pondicherry"│2               │
├─────────────┼────────────────┤
│"Rajkot"     │2               │
└─────────────┴────────────────┘

Bridge/critical

These nodes may not have a maximum number of connections but have critical connections connecting different graph components. Assume there is a road connecting 100s of villages to a Metro city. If this road gets blown off, then we will have a segmented map where Metro may not be able to get its agricultural supplies & villages misses out on important healthcare facilities. Similarly in Star topology, the entire network may go down if the central node is gone.

Hence, a node holding a critical edge can be the most important node in many scenarios as if removed, the graph may get splitted into multiple, disconnected components. One of the ways to find a bridging node is by betweenness criteria discussed ahead.

PageRank

Heard this name already? This is a popular algorithm used by search engines for ranking results. it is based on the intuition that not all edges in a graph coming to a node holds the same importance & hence, node importance is calculated using a weighted sum of the importance of all the nodes to which it is connected. Need to deep dive, follow this post where I have already discussed it thoroughly with example

Let’s run PageRank on our airline network

CALL gds.pageRank.stream('airline')YIELD nodeId, scoreRETURN gds.util.asNode(nodeId).name AS name, score as PageRank_ImportanceORDER BY score DESC limit 5

& the result?

╒═══════════╤═════════════════════╕
│"name"     │"PageRank_Importance"│
╞═══════════╪═════════════════════╡
│"Mumbai"   │3.545723527099885    │
├───────────┼─────────────────────┤
│"Delhi"    │3.3088394494242124   │
├───────────┼─────────────────────┤
│"Bengaluru"│3.248649744452381    │
├───────────┼─────────────────────┤
│"Hyderabad"│2.9839819900485427   │
├───────────┼─────────────────────┤
│"Chennai"  │2.528163472478114    │
└───────────┴─────────────────────┘

As we can observe the results for the top spots are pretty similar to the ones we observed in the popularity case.

We can have multiple variants of the PageRank algorithm according to the system we are designing. A few of them are listed below

Weighted/Personalised PageRank: Here, weights are assigned to some critical nodes beforehand depending on domain knowledge.

The query for Personalized PageRank provides more weightage to ‘Hyderabad’

MATCH (node_city:city{name:'Hyderabad'})CALL gds.pageRank.stream('airline', {sourceNodes: [node_city]})YIELD nodeId, scoreRETURN gds.util.asNode(nodeId).name AS name, scoreORDER BY score DESC, name ASC limit 5

This might require some explanation

First of all, we fetched the node to which we wish to provide weightage beforehand (Hyderabad)
We passed it to the gds implementation of PageRank as sourceNode

The results have changed now

╒═══════════╤════════════════════╕
│"name"     │"score"             │
╞═══════════╪════════════════════╡
│"Hyderabad"│0.20618091845916342 │
├───────────┼────────────────────┤
│"Bengaluru"│0.061735807515226095│
├───────────┼────────────────────┤
│"Mumbai"   │0.05732258194851915 │
├───────────┼────────────────────┤
│"Delhi"    │0.05416241324803417 │
├───────────┼────────────────────┤
│"Chennai"  │0.048330819832748464│
└───────────┴────────────────────┘

ArticleRank: It moves with the idea that the edge comes from a node that has very few connections and should be provided more weightage compared to others

The query for ArticleRank is similar to that of page rank. Even the result is the same in our case as we got in the naive implementation of PageRank

CALL gds.articleRank.stream('airline')YIELD nodeId, scoreRETURN gds.util.asNode(nodeId).name AS name, scoreORDER BY score ASC limit 5

Closeness

As the name suggests, it gives importance to those nodes that are comparatively closer to all the nodes i.e. average distance from all nodes is the lowest. This can again be useful in planning warehouses for any logistic chain where the average distance for each location is critical & not just connectivity (as in popularity). The formula to calculate the closeness metric/centrality for a node ‘U’ can be something like this

closeness centrality(u) = (Total nodes — 1) / sum(shortest distance from ‘u’ to all other nodes)

Let’s find out the least important nodes given closeness centrality

CALL gds.beta.closeness.stream('airline')YIELD nodeId, scoreRETURN gds.util.asNode(nodeId).name AS city, scoreORDER BY score ASC limit 5

The results are different compared to the least important nodes according to popularity criteria

╒═══════════╤═══════════════════╕
│"city"     │"score"            │
╞═══════════╪═══════════════════╡
│"Dibrugarh"│0.40336134453781514│
├───────────┼───────────────────┤
│"Silchar"  │0.40336134453781514│
├───────────┼───────────────────┤
│"Lilabari" │0.40336134453781514│
├───────────┼───────────────────┤
│"Tirupati" │0.4444444444444444 │
├───────────┼───────────────────┤
│"Porbandar"│0.4485981308411215 │
└───────────┴───────────────────┘

Betweenness

This time, we will count for each node how many times did it occur on the shortest path calculated between every node pair present. For example, if we have 5 nodes in a graph then we will first

Calculate the shortest distance between every possible node pair that can be formed
Count how many times a certain node was a part of any shortest path
The node with the highest count is the most important.

As this task can be computationally heavy when we have a big graph, we have approximation algorithms that calculate the Single Source Shortest Path (All shortest paths from a given node as source & other nodes as destination) for a subset of nodes & not all & based on some approximation, calculate betweenness for every node. One of the algorithms that Neo4j GDS follows is Brandes’ approximation (we won't be discussing it though)

So, how to get node importance using betweenness criteria

CALL gds.betweenness.stream('airline')YIELD nodeId, scoreRETURN gds.util.asNode(nodeId).name AS name, scoreORDER BY score desc limit 5

Results?

╒═══════════╤══════════════════╕
│"name"     │"score"           │
╞═══════════╪══════════════════╡
│"Mumbai"   │234.66790314948216│
├───────────┼──────────────────┤
│"Delhi"    │177.51990567911625│
├───────────┼──────────────────┤
│"Bengaluru"│169.0852845984425 │
├───────────┼──────────────────┤
│"Hyderabad"│151.7371778805989 │
├───────────┼──────────────────┤
│"Kolkata"  │110.60240855635591│
└───────────┴──────────────────┘

The results differ a bit compared to previous node importance results as Kolkata now is amongst the top 5 nodes meaning most of the shortest paths between any 2 nodes do include it.

With this, it's a wrap !!

Next, we will take a jibe at machine learning using neo4j