Graph Analytics: Pathfinding algorithms using Neo4J

Analyzing the Indian airline network between major cities

Mehul Gupta
Data Science in your pocket
11 min readMay 25, 2022

--

As we have already discussed some basics around graph databases & Neo4j in my last post, It’s time to work with some real-world datasets & a huge range of graph algorithms Neo4j offers, starting with path-finding algorithms between the Nodes in the graph.

Do read the other parts here

Graph analytics basics

Finding important nodes in graphs

Community detection in graphs

Will will be trying out the below algorithms available in the GDS library

Dijkstra (Shortest path)

allShortestPath

Yen’s K shortest algorithm

A* for shortest path

Single-Source Shortest Path (SSSP)

A few pre-requisites to start with

  1. Basic understanding of Graph databases, Neo4j & Cypher (I will be explaining all the Cypher queries though)
  2. Graph Data Science (GDS) library plugin
  3. APOC plugin

How to add these plugins?

There aren’t any ‘pip’ here😅

Assuming you have set up a sample database (as in my last post), follow the below video

  1. Click on the database name
  2. On the right side pane emerged, click on plugins & install GDS (& APOC as well)
Adding GDS plugin

Do add the below line at the end of the neo4j Config file before moving ahead following the below video

dbms.security.procedures.whitelist= apoc.load.*
apoc.import.file.enabled=true

We will be analyzing a graph for the Indian Flight Network (Only SpiceJet) fetched from https://data.gov.in/catalog/airsewa.

Nodes: Cities

Relationship/Edge: Flight

Relationship weight: Aerial distance (lat long distance)

Lat & Long : Latitude & Longitude for origin city

The processed dataset can be found below

Samples from the dataset.

Before applying any algorithm, we will do 2 things

Load this CSV in Neo4j

Create a projected graph (sort of view in RDBMS)

Load CSV

For loading any CSV to the neo4j database, we first need to put it in the import folder of the database. Again click on the 3 dots as for the neo4j config. Click on ‘import’, a new folder will open. Copy-paste your CSV into that folder

Once put in the import folder, run the Cypher query to load the dataset as a graph

CALL apoc.load.csv("flight_network_spicejet.csv")YIELD list as itemMERGE(n:city{name:item[0],lat:toInteger(item[3]),long:toInteger(item[4])})MERGE (m:city{name:item[1]})MERGE (n)-[:flight{distance:toInteger(item[2])}]->(m)

Explaining the query

  • apoc.load can be used to load different types of datasets be it XML, CSV, etc
  • We have used Merge in place of Create as Merge act as Upsert (Insert if not available already, else update). Create will create duplicate nodes if already exist. So, to be on the safer side, Merge can be used
  • While reading a file using apoc, rows of the files are returned as a list of items. Hence, ‘item’ is a list of all row values in the CSV.
  • Relationship: flight, Label: city, Relationship properties: distance, Node properties: city name, lat & long for the city
  • (n)-[:relationship]->(m) creates a directed relationship between node ’n’ & ‘m’ from ’n’ →’m’
  • What does YIELD do? it just points out which variable is returned by the function we wish to have. Is it similar to return? No, will discuss the difference between the 2 later in the post

The output?

Spicejet airline network, India

Can we make undirected graphs in Neo4j?

I tried finding the answer for the same but came across this post stating relationships are necessarily directed in neo4j, though, while creating a projection (next in the post), we can make these relations undirected

We are done with the 1st step, now creating a projected graph

Graph Projection

A graph projection can be considered as a ‘view’ in RDBMS where we can alter certain properties of the data (like subsetting, altering properties, etc) without affecting the original stuff.

##CYPHER QUERYCALL gds.graph.project("airline",{city:{properties:["lat","long"]}},{flight:{properties:"distance",orientation:'UNDIRECTED'}})

The above Cypher query does the following

Creates a graph projection: ‘airline’

The 2nd param is the label for nodes to be included & the node properties to include

The 3rd parameter is the relationship to include & its properties to include alongside orientation (whether the relations are directed or undirected)

Now, let’s run our 1st algorithm to find the shortest path between 2 cities using

Dijkstra algorithm

Dijkstra is amongst the most popular shortest path algorithm helpful in finding the shortest path possible between 2 nodes of a graph. Assuming you already know how the algo works, let’s straight away jump onto the Neo4j implementation. If not, do check it out here.

Let’s find the shorted aerial route from Jammu to Darbhanga

MATCH (source:city {name: 'Jammu'}), (target:city {name: 'Darbhanga'})CALL gds.shortestPath.dijkstra.stream('airline', {sourceNode: source,targetNode: target,relationshipWeightProperty: 'distance'})YIELD index, sourceNode, targetNode, totalCost, nodeIds, costs, pathRETURNindex,gds.util.asNode(sourceNode).name AS sourceNodeName,gds.util.asNode(targetNode).name AS targetNodeName,totalCost,[nodeId IN nodeIds | gds.util.asNode(nodeId).name] AS nodeNames,costs,nodes(path) as path

And between ‘Pondicherry’ & ‘Kandla’

Note: Kindly ignore arrowheads in any visualization & consider it to be an undirected path only as this appears to be a limitation with Neo4j that you can’t visualize undirected graphs. Though, the path calculated is considering the graph as an undirected graph but visualization can be slightly misguiding.

Let’s summarize the above Cypher query

  • The source & destination are first picked from the graph as objects (notice the MATCH part of the query to find required nodes as objects)
  • GDS library’s Dijkstra implementation is called over ‘airline’ projection graph & passed onto the source, destination & weights to consider from the projection
  • By default, the implementation returns the nodeId for the corresponding nodes that are to be returned. By using gds.util.asNode(), we can get the ‘name’ property of the node
  • nodes(path) majorly help in visualizing the entire path
  • As the algorithms return a number of things, YIELD is used to select which values we wish to have, and RETURN helps to return these values to the user/output (some processing can be done over the values in the RETURN clause but not YIELD)

By switching the view from graph to tabular (on the neo4j browser, on the left-hand side we have options to choose the output format as table or graph), we can download the results as CSV or JSON

allShortestPath

Following Dijkstra, it gives the shortest path between every node possible in the graph projection passed. To make it easier to understand the output, we will be finding the shortest path to all cities from ‘Amritsar’

Cypher query

CALL gds.alpha.allShortestPaths.stream('airline', {relationshipWeightProperty: 'distance'})YIELD sourceNodeId, targetNodeId, distanceWITH sourceNodeId, targetNodeId, distanceWHERE gds.util.isFinite(distance) = trueMATCH (source:city) WHERE id(source) = sourceNodeIdMATCH (target:city) WHERE id(target) = targetNodeIdWITH source, target, distance WHERE source <> target and source.name='Amritsar'RETURN source.name AS source, target.name AS target, distance as distance_in_kms

The output is a 3 columned table

│"source"  │"target"            │"distance_in_kms"│
╞══════════╪════════════════════╪═════════════════╡
│"Amritsar"│"Ahmedabad" │981.0 │
├──────────┼────────────────────┼─────────────────┤
│"Amritsar"│"Aurangabad" │1391.0 │
├──────────┼────────────────────┼─────────────────┤
│"Amritsar"│"Bagdogra" │1517.0 │
├──────────┼────────────────────┼─────────────────┤
│"Amritsar"│"Belgaum" │1802.0 │
├──────────┼────────────────────┼─────────────────┤
│"Amritsar"│"Bengaluru" │2085.0 │
├──────────┼────────────────────┼─────────────────┤
│"Amritsar"│"Bhopal" │966.0 │
├──────────┼────────────────────┼─────────────────┤
│"Amritsar"│"Calicut" │2327.0 │
├──────────┼────────────────────┼─────────────────┤
│"Amritsar"│"Chennai" │2131.0 │
├──────────┼────────────────────┼─────────────────┤
│"Amritsar"│"Coimbatore" │2314.0 │
├──────────┼────────────────────┼─────────────────┤
│"Amritsar"│"Darbhanga" │1318.0 │
├──────────┼────────────────────┼─────────────────┤
│"Amritsar"│"Dehradun" │336.0 │
├──────────┼────────────────────┼─────────────────┤
│"Amritsar"│"Delhi" │401.0 │
├──────────┼────────────────────┼─────────────────┤
│"Amritsar"│"Dibrugarh" │2208.0 │
├──────────┼────────────────────┼─────────────────┤
│"Amritsar"│"Goa" │1805.0 │
├──────────┼────────────────────┼─────────────────┤
│"Amritsar"│"Guwahati" │1863.0 │
├──────────┼────────────────────┼─────────────────┤
│"Amritsar"│"Gwalior" │1528.0 │
├──────────┼────────────────────┼─────────────────┤
│"Amritsar"│"Hubli" │1883.0 │
├──────────┼────────────────────┼─────────────────┤
│"Amritsar"│"Hyderabad" │1624.0 │
├──────────┼────────────────────┼─────────────────┤
│"Amritsar"│"Jabalpur" │1068.0 │
├──────────┼────────────────────┼─────────────────┤
│"Amritsar"│"Jaipur" │531.0 │
├──────────┼────────────────────┼─────────────────┤
│"Amritsar"│"Jammu" │738.0 │
├──────────┼────────────────────┼─────────────────┤
│"Amritsar"│"Jodhpur" │890.0 │
├──────────┼────────────────────┼─────────────────┤
│"Amritsar"│"Kandla" │5714.0 │
├──────────┼────────────────────┼─────────────────┤
│"Amritsar"│"Kanpur" │792.0 │
├──────────┼────────────────────┼─────────────────┤
│"Amritsar"│"Kochi" │2451.0 │
├──────────┼────────────────────┼─────────────────┤
│"Amritsar"│"Kolkata" │1676.0 │
├──────────┼────────────────────┼─────────────────┤
│"Amritsar"│"Leh" │1013.0 │
├──────────┼────────────────────┼─────────────────┤
│"Amritsar"│"Madurai" │2427.0 │
├──────────┼────────────────────┼─────────────────┤
│"Amritsar"│"Mangalore" │2116.0 │
├──────────┼────────────────────┼─────────────────┤
│"Amritsar"│"Mumbai" │1406.0 │
├──────────┼────────────────────┼─────────────────┤
│"Amritsar"│"Patna" │1203.0 │
├──────────┼────────────────────┼─────────────────┤
│"Amritsar"│"Surat" │6989.0 │
├──────────┼────────────────────┼─────────────────┤
│"Amritsar"│"Tirupati" │2046.0 │
├──────────┼────────────────────┼─────────────────┤
│"Amritsar"│"Udaipur" │867.0 │
├──────────┼────────────────────┼─────────────────┤
│"Amritsar"│"Varanasi" │1083.0 │
├──────────┼────────────────────┼─────────────────┤
│"Amritsar"│"Vijayawada" │1871.0 │
├──────────┼────────────────────┼─────────────────┤
│"Amritsar"│"Visakhapatnam" │1761.0 │
├──────────┼────────────────────┼─────────────────┤
│"Amritsar"│"Pune" │1482.0 │
├──────────┼────────────────────┼─────────────────┤
│"Amritsar"│"Srinagar" │888.0 │
├──────────┼────────────────────┼─────────────────┤
│"Amritsar"│"Pondicherry" │2242.0 │
├──────────┼────────────────────┼─────────────────┤
│"Amritsar"│"Port Blair" │2879.0 │
├──────────┼────────────────────┼─────────────────┤
│"Amritsar"│"Rajahmundry" │1978.0 │
├──────────┼────────────────────┼─────────────────┤
│"Amritsar"│"Thiruvananthapuram"│2583.0 │
├──────────┼────────────────────┼─────────────────┤
│"Amritsar"│"Tuticorin" │2550.0 │
├──────────┼────────────────────┼─────────────────┤
│"Amritsar"│"Rajkot" │1355.0 │
├──────────┼────────────────────┼─────────────────┤
│"Amritsar"│"Lilabari" │2125.0 │
├──────────┼────────────────────┼─────────────────┤
│"Amritsar"│"Silchar" │2045.0 │
├──────────┼────────────────────┼─────────────────┤
│"Amritsar"│"Porbandar" │1850.0 │
└──────────┴────────────────────┴─────────────────┘

Summarizing the query

gds implementation for allShortestPath is called over the entire projection

It returns the origin, destination & total distance for every node pair possible (no path is returned, just the total cost)

We have applied 3 filters in the later parts of the query

Don’t return results where there exists no path (where distance is finite)

Where origin & destination are the same (as it calculates the path for every node pair possible, it also includes paths for self-directed origins like ‘Delhi’-’Delhi’ or ‘Pune’-’Pune’)

Where origin is ‘Amritsar’ (if skipped, returns shortest path between all node pairs)

Though, GDS has an implementation for Single Source Shortest Path (SSSP) which finds the shortest path to every possible node from a given Node (something similar we did use allShortestPath). So allShortestPath & SSSP are just a ‘where’ clause away.

Yen’s K shortest algorithm

Yen’s K shortest algorithms find K shortest path from origin to destination i.e shortest path, 2nd shortest path…kth shortest path.

How does it work?

Using Dijkstra, it finds the shortest path between 2 nodes. Then for the next shortest path, it prohibits the paths already considered as the shortest path & likewise. Let’s go through a small example to understand

https://blogs.reed.edu/compbio/2018/06/01/yens-k-shortest-paths/

Assume we need to find the k-shortest path from ‘a’ to ‘e’ where k=2

  • Following Dijkstra, we found shortest path=a →c →e
  • Next, for all the edges occurring in the shortest path, we will set their weight=∞ one by one & find the shortest path
  • So, on setting a →c=∞, we get the shortest path=a →b →e with cost=4
  • On setting c →e=∞, we get the shortest path = a →b →e with cost =4 again. hence a →b →e becomes are 2nd shortest path.

Assuming we would have got a →c →b →e as the shortest path on setting c →e=∞, then the shortest path amongst a →b →e & a →b →c →e would have been considered as the 2nd shortest path. Likewise, as we move to calculate the 3rd and 4th shortest paths, we will be setting edges involved in these already considered shortest paths as infinite one by one & find different shortest paths. Amongst all the shortest path discovered, one with the lowest total cost will be considered

Cypher query for finding 3 shortest paths between Hyderabad & Mumbai

MATCH (source:city {name: 'Mumbai'}), (target:city {name: 'Hyderabad'})CALL gds.shortestPath.yens.stream('airline', {sourceNode: source,targetNode: target,k: 3,relationshipWeightProperty: 'distance'})YIELD index, sourceNode, targetNode, totalCost, nodeIds, costs, pathRETURNindex,gds.util.asNode(sourceNode).name AS sourceNodeName,gds.util.asNode(targetNode).name AS targetNodeName,totalCost,[nodeId IN nodeIds | gds.util.asNode(nodeId).name] AS nodeNames,costs,nodes(path) as path

The query is pretty similar to previous queries. The only subtle changes are

  • We pass k=n where n is the total shortest path we wish to get
  • The implementation returns multiple rows, one each for the shortest path

Minimum Spanning Tree

A spanning tree is a path that connects all the nodes in the graph with minimum edges. It is of 2 types

  • Minimum spanning tree: A spanning tree with the lowest total cost of traversing. By cost, we mean the sum of weights of edges considered in the tree
  • Maximum spanning tree: A spanning tree with the Highest total cost of traversing. By cost, we mean the sum of weights of edges considered in the tree

Cypher query for a Minimum spanning tree starting from node ‘Delhi’

MATCH (n:city{name: 'Delhi'})CALL gds.alpha.spanningTree.minimum.write('airline', {startNodeId: id(n),relationshipWeightProperty: 'distance',writeProperty: 'MINST',weightWriteProperty: 'writeCost'})YIELD preProcessingMillis, computeMillis, writeMillis, effectiveNodeCountreturn preProcessingMillis, computeMillis, writeMillis, effectiveNodeCount

Query explanation

  • gds calls the minimum spanning tree algorithm intaking graph name, id for root node (Delhi) & weights to consider (distance in our case). If we replace ‘minimum’ with ‘maximum’, we get a maximum spanning tree
  • Unlike previous implementations, this one creates a new relationship (MINST) & adds a property ‘writeCost’ to this new relationship in the original graph
  • Returned values are nothing but more a meta-information returned around the entire operation which can be ignored

Next, we need to run the below query to see the final output

match (n) return n
The blue edges represent the MINST relationship written into the graph by the previous query representing the Minimum Spanning Tree from Delhi

A* Shortest distance

Though I assume most of you must have read about it, will give a small explanation around A* to find the shortest distance between given origin & destination nodes before jumping on the Cypher query.

As in Djikstra, the final weight considered for choosing a path is the pre-calculated weight we have already assigned in the graph, A* has a cost function for each edge comprising 2 components

  • Precalculated weights
  • Estimated distance from the current node to the final node

So, if we have the above graph with origin=A & destination=E & assume we are on node A currently, then the cost function F that would be compared to either move to C or B would be

  • Weight(A,B) + Estimated_Distance(B,E)
  • Weight(A,C) + Estimated_Distance(C,E)

Whichever has the lower value for F, would be preferred. Also, the Estimated_Distance() isn’t the same as Weight() but a heuristic function that can have no relation to the recalculated weights. Also, Estimated_Distance(X, Y) should surely be lower Total_Cost(X, Y) for the shortest distance between X &Y on the graph.

Coming back to the gds implementation of A*, the Estimated_Distance() used here is Haversine distance i.e. the distance between 2 points on a spherical surface (like Earth) given their latitude & longitude. So if you remember, we considered lat & long for the origin city in our graph for each city but never used it. It’s time we use it !!

Cypher Query for shortest distance using A* between Tirupati & Leh

MATCH (source:city {name: 'Tirupati'}), (target:city {name: 'Leh'})CALL gds.shortestPath.astar.stream('airline', {sourceNode: source,targetNode: target,latitudeProperty: 'lat',longitudeProperty: 'long',relationshipWeightProperty: 'distance'})YIELD index, sourceNode, targetNode, totalCost, nodeIds, costs, pathRETURNgds.util.asNode(sourceNode).name AS sourceNodeName,gds.util.asNode(targetNode).name AS targetNodeName,totalCost,[nodeId IN nodeIds | gds.util.asNode(nodeId).name] AS nodeNames,costs,nodes(path) as path

The query is almost similar to the one implemented for Djikstra with a slight change i.e. passing latitude & longitude property names in the config map.

Before we finish, we must know the difference between ‘stream’ & ‘write’ modes used by gds for different algorithms

Stream: It returns the results then and there on the output console for the Cypher query

Write: It writes the results back in the Neo4j Database & returns meta information like nodes affected, time is taken, etc on the output console

Hence, when we used the ‘stream’ mode for A*, Dijkstra, allShortestPath & K-Shortest, we were able to visualize the results then & there but not in the case of Minimum Spanning Tree whose results are rather written back in the DB & we ran the 2nd query to fetch the results

With this it's a wrap for now, next we will pickup analyzing the spatial dataset using Neo4j !!

--

--