Visualising the trains between Howrah Jn and Mumbai CST

Sep 11, 2017


This article is essentially the documentation of a data visualization assignment given to us by Professor Venkatesh Rajamanickam at IDC School of Design, IIT Bombay. For this assignment, I have tried to visualize some trains on a Marey’s chart. The objective of my visualization is to draw insights from the moving patterns of:
1. all the trains originating from Mumbai CST and reaching Howrah Jn
2. all the trains originating from Howrah Jn and reaching Mumbai CST

Just to keep you interested about how I did it, let me give you a glimpse of what I was able to visualise.

A visualization, as I understand it, is a graphical representation of a piece of information that helps in understanding that information quickly (and oftentimes better). It is meant to generate insights from data (structured or unstructured) in cases where that is not otherwise possible. Visualizing data involves telling stories with data. It helps to connect pieces of information and discover new relationships between them.

The dataset and the plan

We were given the Indian Railways timetable data for passenger trains running all over the country. The data was available in .csv format, which can be edited with any spreadsheet software. For this assignment, I chose Microsoft Excel. Head over here to access the data file I worked on, or head over here for the original data set from the source. The dataset was sourced from www.data.gov.in, which is a trusted source of data for the country. For every train, it listed the train number, the train name, the intermediate stations with their ISL (Intermediate Stoppage List) numbers, arrival times, departure times and distances from the source station, and the codes and names of the source and destination stations. After an initial scan of the attributes, I decided to visualize all the trains between two stations to see if I could come up with some interesting insights. I wanted to do more, but we had only a week, so I had to scope the assignment accordingly.

Indian Railways Time Table for Trains Available for Reservation as on August 2015 (Data Source: www.data.gov.in)

Initial plans of making an interactive visualization

I planned to make an interactive visualization (something like the diagram shown below).

Putting the plans for making the interactive visualisation on paper first

Being not too familiar with coding, I explored several tools like Tableau (does not require much coding), Dygraphs (code intensive), and was brave enough to even try D3.js (definitely code intensive). But in such a short time learning a code intensive tool was quite difficult. Also, several other problems popped up which I would like to document (in case anyone reading this might want to give it a shot).

Suppose the stations I choose are A and B. Now, apart from trains which have A and B as the source and destination stations, there are also trains which merely pass through A and B on a longer route. To visualize those, I would need to write a script that could find and extract every train that passes through both A and B from such a huge dataset. It was downright difficult for me!
Also, I found that Dygraphs cannot be used directly for such a visualization because it demands a quantifiable entity on both axes, and I had an ordinal list of stations on one axis.
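For anyone who wants to attempt the script described above, here is a minimal sketch in Python. It is not what I used for this assignment. The field names `train_no`, `station_code` and `isl_no` are hypothetical, as are the sample rows, so they need to be adapted to the real column headers of the file.

```python
from collections import defaultdict

def trains_through(rows, a, b):
    """Return the trains whose route visits station `a` before station `b`.

    `rows` is an iterable of dicts with the hypothetical keys
    train_no, station_code and isl_no (the stop's position on the route).
    """
    stops = defaultdict(dict)
    for row in rows:
        stops[row["train_no"]][row["station_code"]] = int(row["isl_no"])
    return sorted(
        train
        for train, seq in stops.items()
        if a in seq and b in seq and seq[a] < seq[b]
    )

# Made-up sample data: T1 runs A to B, T2 runs B to A,
# T3 passes through both on a longer route, T4 never reaches B.
rows = [
    {"train_no": "T1", "station_code": "A", "isl_no": "1"},
    {"train_no": "T1", "station_code": "X", "isl_no": "2"},
    {"train_no": "T1", "station_code": "B", "isl_no": "3"},
    {"train_no": "T2", "station_code": "B", "isl_no": "1"},
    {"train_no": "T2", "station_code": "A", "isl_no": "2"},
    {"train_no": "T3", "station_code": "X", "isl_no": "1"},
    {"train_no": "T3", "station_code": "A", "isl_no": "2"},
    {"train_no": "T3", "station_code": "B", "isl_no": "3"},
    {"train_no": "T3", "station_code": "Y", "isl_no": "4"},
    {"train_no": "T4", "station_code": "A", "isl_no": "1"},
    {"train_no": "T4", "station_code": "Y", "isl_no": "2"},
]

print(trains_through(rows, "A", "B"))  # ['T1', 'T3']
```

Comparing the ISL numbers is what separates the A to B direction from the B to A direction, so swapping the two arguments gives the trains for the return chart.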

Switch to a static visualization

I was very sad to switch to a static visualization, but I had no choice. The deadline would not be postponed and my coding skills would not improve overnight. Initially, I thought that static visualizations were easy to make. Guess what! I was wrong. In an interactive visualization, one can hide data in layers and present it only when it is demanded, which avoids clutter. Static visualizations, on the other hand, are (as the name suggests) static. No data can be hidden and presented later. If not thought through clearly, this type of visualization can get highly cluttered and difficult to read. Also, drawing every single detail manually, with precision, can get super monotonous and taxing.

Getting the required data from the source files

I did not need the entire .csv file for working. So, I decided to extract the data I needed. I am documenting how I extracted the data for all the trains originating from Howrah Jn and reaching Mumbai CST. The data for all the trains originating from Mumbai CST and reaching Howrah Jn can be extracted similarly from the source file.

**Step 1.** Sort the file by Source Station (in any order) and extract out the entries which have Howrah Jn as the source

**Step 2.** Paste the extracted information in a different sheet. Then proceed to sort this sheet by the Destination Station (in any order)

**Step 3.** Find the entries which have Mumbai CST as the Destination Station and keep them (you will need them). Go ahead and delete the rest.

**Step 4.** Sort the sheet by Train number (in any order)

**Step 5.** Visually group the rows of each train using different colours. Then proceed to sort the sheet by the ISL numbers (in ascending order) within each train. This will give you the route of each train.

**Step 6.** Well well well… Now you have an organised sheet with all the train routes easily readable. The colour coding helps, doesn’t it?
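Steps 1 to 6 can also be sketched in a few lines of Python instead of Excel. This is only an illustration under assumptions: the column names (`train_no`, `isl_no`, `station`, `source`, `destination`) are hypothetical, and the train numbers and the stations "Stop X" and "Stop Y" are made-up sample values, not rows from the real file.

```python
import csv
import io

# Made-up sample standing in for the real .csv file (hypothetical headers).
SAMPLE = """train_no,isl_no,station,source,destination
001,2,Stop X,Howrah Jn,Mumbai CST
001,1,Howrah Jn,Howrah Jn,Mumbai CST
002,1,Mumbai CST,Mumbai CST,Howrah Jn
003,1,Howrah Jn,Howrah Jn,Mumbai CST
004,1,Howrah Jn,Howrah Jn,Stop Y
"""

def routes_between(rows, source, destination):
    """Keep the rows of trains running source -> destination (Steps 1 to 3),
    then order them by train number and ISL number (Steps 4 and 5)."""
    kept = [
        r for r in rows
        if r["source"] == source and r["destination"] == destination
    ]
    kept.sort(key=lambda r: (r["train_no"], int(r["isl_no"])))
    return kept

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
route_rows = routes_between(rows, "Howrah Jn", "Mumbai CST")
for r in route_rows:
    print(r["train_no"], r["isl_no"], r["station"])
```

Sorting on the pair (train number, ISL number) does the job of the colour coding: all stops of one train end up together and in route order.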

**Step 7.** Now I want to calculate another attribute: the halt time of each train at each station. But I cannot simply subtract the Arrival time from the Departure time, as both are stored in ‘string’ format.

**Step 8.** To convert the Arrival and Departure time to a ‘number’ format, first select both the columns and find all single quotes (‘). Now replace all of them with nothing.

**Step 9.** Voila! Both the columns are in ‘number’ format now

**Step 10.** You can go to the advanced settings of cell format and choose the ‘time’ format you prefer.

**Step 11.** Now you can easily calculate the halt times using the formula =TEXT(Departure time - Arrival time, "h:mm")

**Step 12.** Decide the precision you want and you are set to go.

Note: At Source and Destination stations, there may be huge halt times. One has to ignore these outliers. Also, when the Arrival time is in AM and the Departure time is in PM, there may be an error (#VALUE!) message displayed. In such cases, one has to manually calculate the halt times.
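The halt calculation in Steps 7 to 12 can be sketched in Python as well. I am assuming here that the raw values look like `'HH:MM:SS'` wrapped in single quotes; that format and the function names are assumptions for illustration. The sketch also treats a negative difference as a halt that runs past midnight, which is one way to avoid calculating those cases by hand.

```python
from datetime import datetime, timedelta

def parse_time(raw):
    """Strip the wrapping single quotes and parse an HH:MM:SS string
    (the quoted HH:MM:SS format is an assumption about the data)."""
    return datetime.strptime(raw.strip("'"), "%H:%M:%S")

def halt_minutes(arrival, departure):
    """Halt in whole minutes; a negative difference is treated as a
    halt that crosses midnight, so one day is added."""
    delta = parse_time(departure) - parse_time(arrival)
    if delta < timedelta(0):
        delta += timedelta(days=1)
    return int(delta.total_seconds() // 60)

print(halt_minutes("'10:15:00'", "'10:20:00'"))  # 5
print(halt_minutes("'23:58:00'", "'00:03:00'"))  # 5
```

The very large halts at the Source and Destination stations would still need to be ignored separately, as noted above.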

Plotting of data on the chart

I decided to use an A3 size for my static chart. The visualization area was decided to be 12 inches x 10 inches. I had to plot the time scale on the X axis and the stations on the Y axis.

The time scale (1 day) on the X axis
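Placing a time on this axis is simple proportion. The sketch below assumes that the 12 inch side of the visualization area is the X axis and that it covers exactly one day from 00:00 to 24:00; the function name is only for illustration.

```python
CHART_WIDTH_IN = 12.0        # assumed: the 12 inch side is the X axis
MINUTES_PER_DAY = 24 * 60

def time_to_x(hhmm):
    """Convert a clock time such as '06:00' to inches from the left edge."""
    hours, minutes = (int(part) for part in hhmm.split(":")[:2])
    return (hours * 60 + minutes) / MINUTES_PER_DAY * CHART_WIDTH_IN

print(time_to_x("06:00"))  # 3.0
print(time_to_x("12:00"))  # 6.0
```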

To plot the stations on the Y axis, I needed a common scale of distance of the stations from the source. To do that, I proceeded as follows:

**Step 1.** Copy the columns ‘Station Name’ and ‘Distance’

**Step 2.** Paste these two columns on a third sheet

**Step 3.** Sort this sheet by the Distance (in ascending order)

**Step 4.** Now select both the columns and remove the duplicate entries

**Step 5.** Yeah, remove all those duplicates!

**Step 6.** We have a list of all stations on a common scale of distance.

**Step 7.** The maximum total distance of the route is 2176 km

I now needed to fit 2176 km onto a scale of 10 inches. Lessons on percentages from 5th class came to the rescue.

**Step 8.** To get the scaled measurements, apply the formula = Distance/ 2176 * 10

**Step 9.** To get the intermediate distances between stations on scale, apply the formula = Distance of this station from source - distance of previous station from source
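The same Y axis calculation can be sketched in Python. The 2176 km total and the 10 inch height come from the steps above; "Stop X" and its 544 km distance are made-up sample values, and the function name is hypothetical. The gap is computed between scaled positions, which is how I read Step 9.

```python
def station_positions(stops, chart_height_in=10.0):
    """Deduplicate (station, distance_km) pairs, sort them by distance and
    return (station, scaled position, gap from previous station) in inches."""
    unique = sorted(set(stops), key=lambda s: s[1])
    max_distance = unique[-1][1]
    positions = []
    previous = 0.0
    for name, distance in unique:
        scaled = distance / max_distance * chart_height_in
        positions.append((name, round(scaled, 3), round(scaled - previous, 3)))
        previous = scaled
    return positions

# Made-up sample with one duplicate row, as left behind by several trains.
stops = [
    ("Howrah Jn", 0),
    ("Stop X", 544),
    ("Mumbai CST", 2176),
    ("Stop X", 544),
]

for row in station_positions(stops):
    print(row)
```

With this sample, "Stop X" lands 2.5 inches from the source and the final gap to Mumbai CST is 7.5 inches.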

Why not a single chart?

I did want to plot the routes in both directions on a single chart, but I could not. The reason was that the returning trains did not trace the exact same route. The same station would then have to appear multiple times on the Y axis, which would add clutter and make the visualisation less effective.

The green marks show the repeating stations. Hence, I discarded the idea of a common chart and decided to make two separate charts.

Graphic elements in the chart

1. Stations

The Source and Destination stations were marked with larger dots and the intermediate stations with smaller dots, so that the two can be told apart at a glance.

2. Continuation

3. Use of colours

A colourblind friendly colour palette was chosen (Source: http://mkweb.bcgsc.ca/colorblind/)

4. Additional information

Appropriate legends document the train number, the train name, the days on which each train runs and its total travel time.

Insights from the visualisation

I wanted to visualise information like halt times, train speeds over a part of the route and a comparative analysis of the travel routes. I was quite satisfied with what I achieved. If a train runs slowly over a certain part of the route, it is often because it has to pass over a busy junction, and we can now analyse which junction is slowing it down. If a train halts for a long period, it may be because it has to fill up water or change coaches. This visualization allows one to analyse which stations facilitate such tasks.
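On a Marey chart, the speed over a stretch shows up as the slope of the line between two stations. The same number can be calculated directly, as in the hedged sketch below. The tuple layout (station, distance in km, arrival and departure in minutes since the train left its source) is an assumption for illustration, and the sample values are made up.

```python
def segment_speeds(stops):
    """Return (from, to, km/h) for each pair of consecutive stops.

    Each stop is (station, distance_km, arrival_min, departure_min), with
    times counted in minutes since the train left its source (assumed layout).
    """
    speeds = []
    for current, following in zip(stops, stops[1:]):
        km = following[1] - current[1]
        hours = (following[2] - current[3]) / 60  # arrive next minus depart here
        speeds.append((current[0], following[0], round(km / hours, 1)))
    return speeds

# Made-up sample: 100 km in 2 hours, a 10 minute halt, then 60 km in 1 hour.
sample = [
    ("A", 0, 0, 0),
    ("B", 100, 120, 130),
    ("C", 160, 190, 190),
]

print(segment_speeds(sample))  # [('A', 'B', 50.0), ('B', 'C', 60.0)]
```

Using elapsed minutes rather than clock times sidesteps the midnight problem for multi-day journeys.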

Links to the resources and visualisations

Although I have given the links to the files and the visualizations throughout this article, for the sake of redundancy I shall do so again. One can find the Dropbox folder containing the resources here.

Thanks for reading, hope you find it useful. I would be very happy if you point out mistakes and the parts where I can improve. Tips on improving the efficiency of workflow would be highly appreciated. And someone please tell me how to code this stuff into an interactive visualization! You shall have my blessings.