A network science approach

Analysis of the Indian Railway Network

Deconstructing the third largest railway network of the world

Suryank Tiwari
CodeX

--

Photo by Alexander Popov on Unsplash

The onset of trains and railroad networks has proven to be indispensable to mankind. With their readily accessible nature, the world has become a smaller place whether you’re on your way to a cross-country skiing trip, transferring goods and commodities to far-off moors, or steaming through platform 9¾ to learn how to fend off dementors.

Almost each country has their own railway network and dissecting these real world networks help us with scaling, congestion management and overall socio-economic growth.

This article briefly glides through the makings and markings of a Network Science project and highlights the results and inferences obtained.

Objectives

  • To extract properties and inferences out of the Indian Railway Network or IRN.
  • To be able to query the network and draw out subgraphs and connections between regions.
  • Compute various centrality measures, visualizations and network parameters.

Dataset

The data used throughout our analyses comes from the Indian Railways Timetable for trains available as on 01.11.2017 published by the Indian Government.

The dataset consists of the following columns:

Columns of Indian Railways Timetable dataset. (Image by author)

Each row in the dataset tells us about a train with a specific name and number, traveling to a specific destination from a specified source spanning a deterministic distance within said timings.

For a train traveling from source A to final destination Z, there will be rows for each station the train will stop at in the journey, separated by an incrementing sequence number. Look at this snapshot from the dataset.

Snapshot from the dataset — Indian Railways Timetable for trains available as on 01.11.2017 published by the Indian Government.

In this above snapshot, we have a story about a train called ‘SWV-MAO-VLNK’ which will travel from SAWANTWADI R (SWV) at 10:25:00 to MADGAON JN. (MAO), arriving at 12:10:00 traveling a distance of 78 kilometers. In its journey, the train will stop at Thivim (THVM), and Karmali (KRMI).

We know this is the path train took since it is kindly highlighted to us by an incrementing sequence number denoted by SEQ. The SEQ resets to 1 when we start looking at a different train. Through incrementing SEQs, the values pertaining to the Source Station and the Destination Station stay constant.

This way we have the data of 7580 unique trains and 8147 different railway stations.

If you want to have a go at the dataset yourself, here is the source.

Network formation

Since we have cleared the first hurdle of securing our data and understanding what it represents, the next obvious step is to form a network, admittedly without which analyzing it becomes infinitely more complex.

We’ll need to define a set of heuristics which drive our network forming process. Following are the rules we defined for construction.

  • Each railway station is a unique node.
  • An edge exists between two nodes if there exists a train that connects them both. Hence each station in a route must be connected to each other via a direct edge, forming a complete graph.
  • Edge Weights can be considered either on the basis of distance and on the number of trains between two nodes. We’ll try out both.

With our ground rules in place, we can remove null values from the dataset, and create directed weighted graphs based on train count between two stations.

We’re making directed graphs as we can simply convert to undirected ones whenever desired by a simple call to graph.to_undirected(), thanks to networkx.

Here are some basic details about our formed network:

Number of nodes: 8147
Number of edges: 902602
Average in degree: 110.7895
Average out degree: 110.7895

As we can see above, IRN has total 8147 railway stations . With our heuristic stated above, we have a total of 902602 edges with average in and out degree as 110.7895.

Since, this graph is a directed depiction of the IRN. Converting it to an undirected graph and fetching the same properties

Number of nodes: 8147
Number of edges: 500412
Average degree: 122.8457

We notice that the number of nodes is the same while the number of edges decrease, which is to be expected. The average degree has also increased from 110.7895 to 122.8457, again to be expected.

But reading about these values isn’t very alluring. Let’s try to bring some visuals into play. We should plot the entire graph and see how the network really looks.

Spring Layout for the Indian Railway Network. (Image by author)

Fair to say that this image over here isn’t doing any justice to our railway network. The number of nodes and edges is far too large to be plotted in a meaningful way.

We need to take out a subgraph out of this network and we’ll have a better time with plotting. You can find the code used this segment here.

Computing regions between two nodes — Subgraph Querying and Visualization System

In this section we visualize a set of nodes (stations) specified, their neighbors and the shortest path between them.

We plan to query a set of nodes from the original railway network that we have created. For this set of nodes, we will highlight the shortest route from one every station to another. And for prettifying the graph, we will also plot out the direct neighbors of the nodes we have queried.

To find the code for this section, you can check out this notebook.
(Note: Edge weight code hasn’t been updated in the linked version for distance based clubbing)

The following is the subgraph querying and visualization demo for the station codes: DAA and SWV with train count as edge weights

Subgraph highlight shortest route between stations DAA and SWV (Image by author)

Here, the shortest route from DAA to SWV is highlighted in red. The original queried nodes are printed out in bold letters. There’s no labelling the neighboring stations since that would just lead to clutter.

We can see there are other paths leading from DAA to SWV as well, but the highlighted path is at the mercy of networkx library’s shortest_path function, which is doing a fine job at highlighting the shortest path.

Keep in mind since that we are using number of trains connecting a station as our edge weights this graph doesn’t necessarily gives the shortest physical distance between two railway stations, so we didn’t use the edge weight feature here. We just took the route with the minimum number of stations in between.

We get a totally different graph (duh) if we use the distance between the stations as the edge weight. Let’s query it out for the same stations.

Subgraph highlight shortest route between stations DAA and SWV, edge weight by distance (Image by author)

So we have some concept of distance between the neighbors now, and turns out the route we had originally wasn’t the shortest one since we have to go through a different station ST now.

Let’s see some more demonstrations for demonstration’s sake.

The earlier stations we saw were smaller stations. Let’s query the subgraph for New Delhi (NDLS), Dehradun (DDN) and Bangalore (BNCE). Initially we look at the train count edge weighted graph.

Subgraph highlight shortest route between stations NDLS, DDN and BNCE (Image by author)

We see the neighboring stations clubbed out in layers as the number of trains they can be connected by would a discrete number. Hence the stations connected by 1 train only sort together, neighbors connected to our queried stations connected by 2 different trains sort together and so on. It certainly makes for a visually pleasing pattern.

Here’s another:

Subgraph highlight shortest route between stations DAA, UHP and CUR (Image by author)

This is fun but we can’t drive real inferences from these subgraphs. Let’s move ahead to find some useful information.

Network Analysis and Inferences

In this section we’ll go through the basic definitions of a few types of centralities and what they mean. We’ll also list the top stations with highest values in those specific fields.

Degree Centrality

Definition: Degree centrality computes an importance score based on the number of nodes by which each node is connected.

Inference: Degree Centrality tells us what are the most popular, or well-connected nodes in the network.

These nodes have the most one-hop connections to other nodes or Stations in our case. These nodes can connect to the wider network in less number of hops. In our network, the most connected stations, holding the most information are the following:

        STATION NAME		 DEGREE CENTRALITY	 HOWRAH JN. 		 0.30211146575006137
VIJAYWADA JN 0.2704394794991407
KANPUR CENTR 0.263810459121041
VARANASI JN. 0.25767247728946724
GHAZIABAD JN 0.25202553400441935
KALYAN JN 0.24797446599558065
ITARSI 0.24441443653326786
LUCKNOW JN. 0.243677878713479
AHMEDABAD 0.23852197397495703
MATHURA JN. 0.2363123005155905

Betweenness Centrality

Definition: Betweenness centrality is a measure of the number of times a station node lies on the shortest path between other stations.

Inference: This measure shows which stations act as a “bridge” connecting other stations together. The stations having a high betweenness centrality will lie on shortest paths between two stations. These stations have a high influence in the network.

The stations which act as a bridge or have high influence on the flow of the network are the following:


STATION NAME BETWEENNESS CENTRALITY
HOWRAH JN. 0.03406351792928565
SEALDAH 0.021795409939440936
KANPUR CENTR 0.020930718809083173
VIJAYWADA JN 0.015177400281457002
AHMEDABAD 0.014301664879802252
YESVANTPUR J 0.014256569821548529
VADODARA JN. 0.012967968029171826
VARANASI JN. 0.012623744668081235
KOLKATA 0.011898350401287686
PILIBHIT JN. 0.01184539309233258

Closeness Centrality

Definition: Closeness centrality assigns score to each station node based on their ‘closeness’ to all other stations in the railway network.

Inference: Closeness Centrality is helpful in finding the stations which are placed to influence the entire railway network most quickly. This measure can help identify good “broadcasters” in a network.

The stations with the highest closeness centrality in order are the following:

       STATION NAME		 BETWEENNESS CENTRALITY	 HOWRAH JN. 		 0.5124082033874587
AHMEDABAD 0.5107660799308908
VADODARA JN. 0.5063522924820026
KANPUR CENTR 0.5062879529276847
VARANASI JN. 0.5052287059583946
NEW DELHI 0.5051966767517281
KALYAN JN 0.5036640360941574
MUGHAL SARAI 0.5035049206471066
VIJAYWADA JN 0.5021406666088064
ITARSI 0.4998101090743702

Eigenvector Centrality

Definition: This measure works like degree centrality, by measuring a station’s influence over the others based on its degree but it also takes into account things like well-connectedness of a station and how many nodes are the connections connected to, and so on, throughout the network.

Inferences: Eigen Vector Centrality is a good all rounder measure. It shows which stations have influence over the entire network, not just the stations it is connected to.

        STATION NAME		 BETWEENNESS CENTRALITY	 VARANASI JN. 		 0.07072646825994845
HOWRAH JN. 0.0696564673531263
KANPUR CENTR 0.06676069453032213
ITARSI 0.06674728443587287
NEW DELHI 0.06624320025503437
LUDHIANA JN. 0.06593208791866616
MATHURA JN. 0.06569071466478024
KALYAN JN 0.06527068019550916
PATNA JN. 0.06473312502358505
MUGHAL SARAI 0.06457963445769375

Inferences based on Geographic Location

The main stations in the Indian Railway Network based on the station(node)-degree relationship can be inferred from the degree centrality. These stations are further classified into two groups, based on their geographic locations:

  1. Stations that are near metropolitan cities in India. For example: Howrah which is near Calcutta, Ghaziabad near New Delhi and Kalyan station near Mumbai
  2. Stations that are located in the central parts of the country, or at the meetin points of railway lines connecting different zones of the country. For example: Vadodara Junction, Vijaywada Junction. These stations are potential points of congestion in the network as they do not have much resources as compared to the stations near metropolitan states.

All these nodes handle a high amount of traffic.

For example:

  1. Howrah- Near Calcutta which is a metropolitan city. (Type 1) Likely to have more resources.
  2. Itarsi- Falls in the central region of the country. (Type 2) Likely to have less resources and cause congestion.

General Inferences

Number of trains: 7580Number of stations: 8147Longest train route: 4260 km. Train =  15905 CAPE - DBRG . Starting station:  KANNIYAKUMARI . Ending Station:  DIBRUGARHShortest train route: 1 km. Train =  3308 PLJE-SZE MEM . Starting station:  PHULWARTANR . Ending Station:  SONARDIHMaximum distance between any two consecutive stations: 1301  km with train RSD-PJP BSFMinimum distance between any two consecutive stations: 1  km with train  LGL KNJ EMUAverage total train route distance: 439.7 kmAverage distance between consecutive stops 17.91 km

With these sections, some inferences have been made out about the network which can help us identify possible points of congestion and other general details. We are going to dive deeper into network based properties in the next section.

The code for this section can be found in this notebook.

Further Network Analysis

In this final section, we look at some common network properties for IRN. The topics will be loosely coupled amongst themselves but will highlight crucial information about the network.

Plotting the correlation between in-degree kᵢₙ and out-degree kₒᵤₜ of the IRN.

In-degree is defined as the number of paths that are coming to a node, whereas out-degree is defined as the number of paths going out of a node in a network. The plot of correlation of in-degree vs out-degree for IRN is:

Correlation between in-degree and out-degree of the IRN. (Image by author)
  • The strong in-degree and out-degree correlation and the high reciprocity parameter R of 0.8912 both indicate that IRN is a symmetrical network.
  • The reciprocity calculated for the directed graph is defined as the ratio of the number of edges pointing in both directions to the total number of edges in the graph.
  • The above inference shows that if there is a train from station i to station j , there is also a great possibility that there is a train from station j to station i .

Reflecting on the Connectivity of Nodes in the Railway Network

Following are the stations with maximum degree in our network:

The 4 stations having the highest degree (that are directly accessible the most) by different stations in the Indian Railway Network are:
HOWRAH JN. 1284
VIJAYWADA JN 1171
VARANASI JN. 1124
KANPUR CENTR 1119

Path Length Distribution of the Railway Network

The path length is defined as the shortest distance between two nodes. The average path length is the average of the path length averaged over all pairs of nodes.

By counting how many node pairs (source station and destination station) have a particular path length we find the path length distribution of the network.

Path length distribution of IRN (Image by author)

The above graph infers that most of the node pairs are connected having the shortest path length as 3.

The stations are connected with each other maximally at a distance of 10 and number of node pairs doing so are 18. Imagine switching 10 trains to reach your station.

The code for this final section can be found at this notebook.

There are some more experiments we performed that can be found in the Github repository for this project.

Feel free to drop any comments or any ideas that can be further performed on this topic.

Thanks for reading.

Contributions:

Dr. Ganesh Bagler — Mentor and guide throughout the project.

Suryank Tiwari (Suryank Tiwari)

Rose Verma (Rose Verma)

Harshita Pandey

This project has been done for the course Network Science 2021 at IIITD.

--

--

Suryank Tiwari
CodeX
Writer for

SDE at Amazon | Masters from IIITD in CSE with AI specialization| Bachelors in CSE from MSIT