The Tale of India’s 8495 Railway Stations…

(Done as part of my Mini Project on Data Visualization during my PhD coursework at IDC, IIT-Bombay [IITB-Monash Research Academy].)


Train journeys have always fascinated me. As I grew up and traveled across the vast subcontinent of India, I was amazed by the spread of the railway network. Indian Railways is a massively successful organization: running 11,000 trains a day is no joke! It carries almost 2.5 crore (25 million) passengers daily, which is roughly the population of Australia. If the tracks of Indian Railways were laid end to end, they would circle the earth almost 1.5 times. The station with the shortest name is “IB”, in Odisha. And there are 8495 railway stations across the country, connecting Kashmir to Kanyakumari and Gujarat to Assam. To get a better sense of this majestic network, we decided to see how the railway stations are spread across the country.

We worked on developing an Interactive Visualization for the spread of Railway Stations. The links to the GitHub repository and the Interactive Visualization are here.
Click here for the Interactive Visualization.

We started by collecting information about all the railway stations. Our search covered the website of the Ministry of Railways, open data sources, similar blog posts, newspaper articles and archives. The entire data collection process took us almost two months. We also filed an RTI (Right to Information) request with the Ministry of Railways seeking information on stations, to ensure that the data was reliable and verified. We wanted to know the coordinates of the stations, their official names, station codes, the states, zones and divisions they belong to, their category, and the date since which they have been functioning. We managed to get all of this except the coordinates and the dates of functioning. We were amazed to find that there are 8495 stations in the country. Of these, 11 are not assigned to any railway category; they are either closed for gauge conversion or under construction.

State wise count of Railway Stations
Screen Shot from Railway Board notice dated: 2012.

The stations are grouped into 7 categories (A1, A, B, C, D, E and F) based on earnings, which serve as an indicator of passenger traffic. The criteria for station categorization are shown here.

Distribution of stations into Different Categories across the states.

Organizational Structure of Indian Railways

Indian Railways is divided into 17 zones, which are further subdivided into 71 divisions. The zone-wise spread of stations is shown below.

It’s a Pareto chart showing the stations across zones in descending order of frequency, with a cumulative line on the secondary axis as a percentage of the total.
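For readers who want to reproduce the numbers behind such a chart, here is a minimal Python sketch of the Pareto calculation: sort the counts in descending order and keep a running cumulative percentage. The function name `pareto_table` and the zone labels and counts are made up for illustration; they are not the real zone-wise figures.

```python
from itertools import accumulate


def pareto_table(counts):
    """Return (label, count, cumulative_percent) rows, largest count first."""
    # Sort labels by their count in descending order, as a Pareto chart does.
    ordered = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    total = sum(count for _, count in ordered)
    # Running sum of the sorted counts, used for the cumulative line.
    running = accumulate(count for _, count in ordered)
    return [
        (label, count, round(100 * cumulative / total, 1))
        for (label, count), cumulative in zip(ordered, running)
    ]


# Hypothetical counts for three placeholder zones (not real data).
example = {"Zone B": 30, "Zone A": 50, "Zone C": 20}
for label, count, cumulative_percent in pareto_table(example):
    print(label, count, cumulative_percent)
```

The bars of the chart come from the second column and the cumulative line from the third.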

Raw Data to Interactive Visualization

All the collected data were stored in Microsoft Excel, which we also used for preliminary descriptive statistics. The data were then separated into 7 CSV UTF-8 files, one per station category. The idea was to create 7 layers of data and plot them on the map. For plotting on the map, we used Mapbox Studio.
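The split into one CSV file per category can also be scripted. The sketch below is one possible way to do it in Python, not a record of how it was done in Excel. The column name `CATEGORY`, the output file names and the `uncategorized` bucket for stations without a category are assumptions made for this example.

```python
import csv
from collections import defaultdict
from pathlib import Path


def group_by_category(rows, category_field="CATEGORY"):
    """Group station rows (dicts) by category; blank categories go to 'uncategorized'."""
    groups = defaultdict(list)
    for row in rows:
        key = row[category_field].strip() or "uncategorized"
        groups[key].append(row)
    return dict(groups)


def write_category_files(rows, out_dir, category_field="CATEGORY"):
    """Write one UTF-8 CSV per category and return {category: path}."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = {}
    for category, members in group_by_category(rows, category_field).items():
        path = out_dir / f"category_{category}.csv"
        # newline="" lets the csv module control line endings itself.
        with path.open("w", newline="", encoding="utf-8") as handle:
            writer = csv.DictWriter(handle, fieldnames=list(members[0].keys()))
            writer.writeheader()
            writer.writerows(members)
        written[category] = path
    return written


# Hypothetical rows; real rows would carry code, zone, state and so on.
rows = [
    {"STATION_NAME": "Alpha", "CATEGORY": "A1"},
    {"STATION_NAME": "Beta", "CATEGORY": "F"},
    {"STATION_NAME": "Gamma", "CATEGORY": "A1"},
]
print({category: len(members) for category, members in group_by_category(rows).items()})
```

Calling `write_category_files(rows, "layers")` would then produce files such as `layers/category_A1.csv`, each of which can be uploaded as its own layer.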

By now we had collected all the information we needed except the coordinates, and without coordinates we could not plot anything on the map. Our search took us to the GitHub page of Indian Railway Data by Sanjay and Sajjad. This repository has three JSON files: Stations, Trains and Schedules. We were interested in the station data, but it covered only roughly half of the stations. We then decided to collect the coordinates (that is, to geocode the stations) using the Awesome Table add-on for Google Sheets. We thought, Wow! This will be so easy now. We copied the entire station data into Google Sheets and started geocoding. The coordinates started to populate, and we were delighted to see them being added one by one.

Then it stopped all of a sudden and showed the error “The Script has been invoked too many times”. We realized that we had hit the usage limit. We tried copying the content to other Google Sheets, but the add-on was intelligent enough to identify that it was the same user, so we had to wait at least another 24 hours before geocoding again. In the meantime, we decided to plot the points we had. We plotted them (latitude and longitude only) on a coordinate plane (X-Y plot), expecting to see a skeletal map of India. But, NO! What is this? Apart from 119 stations, the remaining 867 were being plotted in other countries, in the ocean or on mountains!
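A quick way to catch this kind of problem without plotting is a bounding-box check. The Python sketch below flags coordinates that fall outside a rough rectangle around India. The bounds are approximate values chosen for this example, and a rectangle also covers parts of neighbouring countries and the sea, so this is only a coarse first filter. The function names and the sample points are hypothetical.

```python
# Approximate bounding box for India in degrees (an assumption for this sketch).
INDIA_BOUNDS = {"lat_min": 6.0, "lat_max": 37.5, "lon_min": 68.0, "lon_max": 97.5}


def inside_bbox(lat, lon, bounds=INDIA_BOUNDS):
    """Return True if the point lies inside the rectangular bounds."""
    return (
        bounds["lat_min"] <= lat <= bounds["lat_max"]
        and bounds["lon_min"] <= lon <= bounds["lon_max"]
    )


def partition_points(points, bounds=INDIA_BOUNDS):
    """Split (lat, lon) pairs into those inside and those outside the bounds."""
    inside, outside = [], []
    for lat, lon in points:
        target = inside if inside_bbox(lat, lon, bounds) else outside
        target.append((lat, lon))
    return inside, outside


# Made-up sample: one point in eastern India, one in Europe, one in the ocean.
sample = [(20.27, 85.84), (51.5, -0.12), (0.0, 0.0)]
inside, outside = partition_points(sample)
print(len(inside), "inside,", len(outside), "outside")
```

Points that land in the `outside` list are the ones worth geocoding again with a more specific search term.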

On investigating, we narrowed it down to one possible explanation. We had given Awesome Table only the station codes to geocode. It was searching across the globe (possibly the Google database) for a probable match for the station code and returning that match’s geocode. We decided to include the station code, station name, zone and state, and we also added the term “Station” to the search term. So Awesome Table was now looking for the geocode of “BBS Bhubaneswar Station ECoR Odisha”: the station code first, then the station name, followed by “Station”, then the zone (East Coast Railway for BBS) and finally the state. We started the geocoding again, and now we were getting geocodes within the national boundary. We again thought, this is it; in a couple of days we will have all the coordinates. We finally got the coordinates for all 8495 stations in 12 days.

It’s common practice in data science to look for anomalies in the data before visualizing it. We ran a script to check whether any geocode (pair of coordinates) occurred more than once; that would mean two or more stations share the same coordinates and overlap on the map. We found 776 stations whose geocode occurred more than once. We had to fall back on manually identifying and collecting coordinates: searching Google for a possible hit, or at least finding a village or small town with the same name in the same state, because all the affected places were category E or F stations, i.e. very small stations or just halts. We managed to get geocodes (possibly correct, or of the place closest to the station) for 543 of the 776 stations. It took us 10 more days.
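Both steps described here, composing the richer search term and checking for repeated coordinates, are easy to sketch in Python. This is a minimal illustration and not necessarily how the original script was written. The field names `code`, `lat` and `lon`, the rounding precision and the sample stations are assumptions for this example.

```python
from collections import defaultdict


def build_query(code, name, zone, state):
    """Compose the search term: code, name, the word 'Station', zone, state."""
    parts = (code, name, "Station", zone, state)
    return " ".join(part.strip() for part in parts)


def duplicated_coordinates(stations, precision=6):
    """Map each (lat, lon) shared by two or more stations to their codes."""
    by_coordinate = defaultdict(list)
    for station in stations:
        # Round so that tiny floating-point differences do not hide a duplicate.
        key = (round(station["lat"], precision), round(station["lon"], precision))
        by_coordinate[key].append(station["code"])
    return {key: codes for key, codes in by_coordinate.items() if len(codes) > 1}


print(build_query("BBS", "Bhubaneswar", "ECoR", "Odisha"))

# Hypothetical stations: AAA and BBB were geocoded to the same point.
sample = [
    {"code": "AAA", "lat": 20.0, "lon": 85.0},
    {"code": "BBB", "lat": 20.0, "lon": 85.0},
    {"code": "CCC", "lat": 21.5, "lon": 86.0},
]
print(duplicated_coordinates(sample))
```

Every station listed in the result is a candidate for manual lookup, since at most one station in each group can have the right coordinates.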

The data were then added to Mapbox to create the maps, and we hosted all the data and the Interactive Visualization on GitHub. The Visualization was built using HTML, CSS and JavaScript.

Adding Data to Mapbox

Mapbox provides elegant map styles that one can use directly. It’s like DIY maps: you import your data into Mapbox Studio and use it with the selected Mapbox style. But all the Mapbox styles come with many layers of data already included, sometimes as many as 30, covering the entire world. Our need was only India. So, instead of removing layers from a Mapbox style one by one and checking which to keep, we chose the empty style in Mapbox and created our own. To create the administrative map of India (states and districts), we collected the corresponding shape files from the website of Datameet: Community created maps of India. Now we had the shape files to create the map of India and the CSV UTF-8 files with the station information. We created 9 layers in our custom Mapbox style: 1 each for the state and district boundaries and 1 for each of the 7 station categories.

The district boundary shape file that we had was loaded only from zoom level 6 onwards, which meant we could not see the district boundaries for the entire country at a glance. For the visualization we had set the default zoom level to 4, so we needed the minimum zoom level for the district shape file data to be 4 or lower. We found this Mapbox tutorial and decided to use Mapbox Studio Classic, a desktop application for designing custom maps and datasets powered by vector tiles. We loaded our shape file, changed the minimum zoom level to 3 and uploaded the dataset to Mapbox Studio.

Find the Interactive Visualization for the spread of Railway Stations below. The links to the GitHub repository and the Visualization are here.

Using Mapbox custom Map Styles over Blank Map Style…

Our idea behind using a blank map style and including only the data we had collected was to present a clean canvas for the data. This visualization had no layers provided by Mapbox, i.e. it had no information about cities and towns, roads or any other feature. The feedback we received was that people were looking for nearby cities, towns or villages, roads (highways and national highways), rivers and airports to see where a station is located. So we decided to use Mapbox-provided layers too. We narrowed it down to two map styles: Streets and Satellite Streets. The Streets style has all of the information above, whereas Satellite Streets adds a satellite view. The satellite view combined with the other geographical information makes it possible to check whether a station’s coordinates are correct, and also gives the end user a sense of what is near the location. One can toggle between map styles by clicking on the radio buttons.

Street and Satellite Mapbox Map styles. Click here for the Interactive Visualization.

We then tried to provide a search facility in our visualization. Our intention was to let the user enter a station name, state or any other detail and fetch the matching information. While working on this feature, we observed that the query was returning null. We could not figure out the problem; the station name provided by the user and the station name in the dataset looked exactly the same to the naked eye. On manual investigation we found that the station name field in the dataset had a long sequence of “spaces” padded at the end. What looked empty or blank to the naked eye was actually a run of invisible characters (in UTF-8 or Unicode). We had taken the station names from the Excel file we received from the Ministry of Railways as the RTI reply. We then removed the unnecessary spaces after the station names in the STATION_NAME field. We are yet to provide the search feature in the visualization.
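The padding problem is easy to reproduce and fix in a few lines. The Python sketch below shows an exact match failing on a padded name and succeeding once both sides are cleaned. `STATION_NAME` is the field name from our dataset; the helper names, the `STATION_CODE` field and the sample record are hypothetical.

```python
def clean_name(raw):
    """Strip leading and trailing whitespace, including non-breaking spaces."""
    # str.strip() with no arguments removes all Unicode whitespace at both ends.
    return raw.strip()


def find_station(stations, query):
    """Return the first station whose cleaned name matches the query, else None."""
    wanted = clean_name(query).casefold()
    for station in stations:
        if clean_name(station["STATION_NAME"]).casefold() == wanted:
            return station
    return None


# Hypothetical record with trailing padding, as in the spreadsheet export.
stations = [{"STATION_NAME": "Bhubaneswar      \u00a0", "STATION_CODE": "BBS"}]

# repr() makes the invisible padding visible while debugging.
print(repr(stations[0]["STATION_NAME"]))

# A naive exact comparison fails because of the padding.
print(stations[0]["STATION_NAME"] == "Bhubaneswar")

# Comparing cleaned names finds the record.
print(find_station(stations, "bhubaneswar"))
```

Cleaning the names once when the data is loaded, rather than on every search, keeps the lookup simple and avoids the same surprise elsewhere.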


Interpreting the Interactive Visualization

One can go to the GitHub repository and the Visualization by clicking on the hyperlinks. The visualization page is shown below. One can choose the map style by clicking on the radio buttons against Blank, Street or Satellite. The layers (States, Districts and all the Categories) can be switched on or off by clicking on the layer names on the left side of the page. A layer that is currently added is shown in blue (like the States and Category-A1 layers in the picture below), and layers that are not added are shown with a white background (like the Districts and Category-D layers in the picture below).

In our initial attempt, hovering the mouse over any loaded layer showed the corresponding coordinates. On hovering over a station, one could see all of its information (coordinates, station name, station code, zone, state and district) in GeoJSON format on the right side of the screen. Similarly, hovering over a state or district boundary showed the boundary coordinates in GeoJSON format on the right side of the page. We then worked on presenting this hover information in a user-friendly way, so that non-technical people can also understand it with ease. We now show the station information as a pop-up on mouse hover. In simple English, if the viewer moves the mouse pointer over any circle (colored dot), a popup appears with the Station Name, Code, Category and State, and it vanishes when the mouse moves away. Links to the GitHub page and this Medium post are also provided at the bottom left of the visualization page.


Road Ahead…

  1. We now intend to crowdsource the coordinates of the remaining railway stations, and to verify the correctness of the geocodes and the other information we collected. For this we are exploring the Collect platform of Social Cops, which is handy for collecting primary data.
  2. We are also planning to collect details like number of platforms in each station, contact details of station manager, amenities (foot over bridge, water taps, rest rooms, etc.), railway line information and number of originating, terminating and halting trains in those stations.
  3. We further plan to evaluate the usability of the Interactive Visualization. This will help us analyze the visualization techniques used and identify other techniques that could be used. Click here to see how we tried to understand the way users interact with Interactive Data Visualizations.

Useful Links

  1. Data and Code on GitHub.
  2. Interactive Visualization.
  3. Datameet: Shape files for India map.

Acknowledgement

I thank Prof. Venkatesh Rajamanickam, IDC, IIT Bombay for guiding me throughout the project. I also thank Rasagy Sharma and Sajjad Anwar of Mapbox for technical inputs related to the Interactive Visualization.