Visualizing Halt Times of Passenger Trains at Stations of Mumbai

Manasi Mankad
10 min read · Sep 10, 2017


Visualizing data involves telling stories with data. Visualizations engage the viewer far more than a wall of text and drive interesting insights. So when I received a data set for an academic assignment to create a visualization, I asked myself a few important questions before I began. This post documents the process I followed to meet the goals that came out of those questions and to tell a story based on the given data.
I have provided links and sources to all the tools and resources accessed within the post itself, but all of these links have been consolidated and shared at the end of this post too.

The final visualization. Published here

The Data

The data chosen for the class assignment was the Indian Railways time table for passenger trains running all over the country. The data was available in .csv format, which can be edited in any spreadsheet software; for this assignment I chose Google Sheets. (Head over here to access the data file I worked on, or head over here for the original data set from source.) It was sourced from www.data.gov.in, which is a trusted source of data for the country. The attributes were the train number, the train name, the intermediate stations of every train with their arrival times, departure times and distance from the source station, and the source and destination station codes and names for every train.

Indian Railways Time Table for Trains Available for Reservation as on August 2015 (Data Source: www.data.gov.in)

So the first thing I did was look at the data carefully to see whether any initial observations would help me think of a story to tell. One basic observation was that an arrival time of “00:00:00” (the times were initially stored as strings) implies that the train is at its source station. This is confirmed by the ‘Distance’ attribute, which is 0 at those stations. Similarly, a departure time of “00:00:00” implies that the train is at its destination station. This is further confirmed by the fact that the ‘Station Name’ and ‘Destination Station Name’ values match. One assumption, which I knew from prior knowledge, was that consecutive train numbers are essentially one physical train that plies on that route, differentiated only by direction. For example, from the screenshot, train number 01081 plies from Bhusaval Junction to Dadar, and train number 01082 is the same train going in the opposite direction, i.e. from Dadar to Bhusaval Junction, after a halt of a few hours at the end station. The train name changes with the number too, but the route is the same, differentiated only by the direction and, of course, the timings.
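The two observations about “00:00:00” can be written down as a small check. This is only a sketch in Python, not the spreadsheet workflow used in this post; the function name `classify_stop`, its parameters, and the sample times and distances are all made up for illustration.

```python
def classify_stop(arrival, departure, distance, station_name, destination_name):
    """Label one timetable row as 'source', 'destination' or 'intermediate'.

    All parameter names are hypothetical; times are strings such as "00:00:00".
    """
    # A placeholder arrival time together with zero distance marks the source station.
    if arrival == "00:00:00" and distance == 0:
        return "source"
    # A placeholder departure time at the station named as the destination marks the end.
    if departure == "00:00:00" and station_name == destination_name:
        return "destination"
    return "intermediate"


# Made-up sample rows for a train running from Bhusaval Junction to Dadar.
print(classify_stop("00:00:00", "16:30:00", 0, "BHUSAVAL JN", "DADAR"))   # source
print(classify_stop("21:05:00", "21:10:00", 390, "KALYAN JN", "DADAR"))   # intermediate
print(classify_stop("22:00:00", "00:00:00", 440, "DADAR", "DADAR"))       # destination
```

Checking the distance and the station name alongside the time keeps a train that genuinely arrives or departs at midnight from being mislabelled.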

The next thing I did was ask myself and the data some defining questions, like
1) What story can I tell with this data?
2) What interesting insights do I want the viewer to gain out of the visualization or what insights should the visualization depict?
3) What needs to be highlighted to gain attention? What should be the focus?
4) What would be of interest in this data?
5) Is there a possibility of overlapping this data set with another to create an interesting story with the mash-up?

With these questions in mind, after some pondering and exploring previously visualized railway data, I got the idea to visualize how long the trains halt at the stations and whether that is related in any way to the importance of the station or to how busy it is. I had a hunch from personal observation that trains halt longer at important stations and busier junctions, so the data could tell me whether this hunch was correct and, if so, how the variability in halt timings at the stations relates to the overall halt time of the train. Due to time constraints, I picked a subset of this data, i.e. all the trains that halt at stations within Mumbai. I picked this subset using the ‘Station Name’ attribute: I sorted the data by it and extracted the relevant tuples (rows) with their attributes (columns) into another spreadsheet. I then converted the columns holding time values from strings to a time format and picked a 12-hour format for representation.

Subset of the data extracted from the main data: Trains halting at stations within Mumbai
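For readers who prefer code to spreadsheets, the same kind of extraction could be sketched in Python. This is an illustration under assumptions, not what was done for this post: the station names are taken from the legends further down, their exact spelling in the data file is assumed, and the sample rows and the reduced column set are made up.

```python
import csv
import io

# Station names taken from the legends in this post; the exact spelling in the
# data file is an assumption.
MUMBAI_STATIONS = {
    "KALYAN JN", "THANE", "BORIVALI", "DADAR",
    "PANVEL", "ANDHERI", "BANDRA TERMINUS",
}


def mumbai_rows(csv_text):
    """Return the rows whose 'Station Name' is one of the Mumbai stations."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        row for row in reader
        if row["Station Name"].strip().upper() in MUMBAI_STATIONS
    ]


# Made-up sample with a hypothetical, reduced set of columns.
sample = """Train No,Station Name,Arrival Time,Departure Time
01081,BHUSAVAL JN,00:00:00,16:30:00
01081,KALYAN JN,21:05:00,21:10:00
01081,DADAR,22:00:00,00:00:00
"""

for row in mumbai_rows(sample):
    print(row["Train No"], row["Station Name"])
```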

The next thing that needed to be done was to calculate the halt time at each station, i.e. the departure time minus the arrival time. By default, a simple subtraction on the spreadsheet gives the wrong answer, since subtracting time values requires a different function.

Getting the wrong answer while using the simple subtract function
The function that gives the correct answer: it takes the time values and gives the output in the desired format, i.e. “h:mm” in this case
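The spreadsheet function itself appears only in the screenshot, so the snippet below does not reproduce it. It is a rough Python equivalent of the same calculation, assuming times in “HH:MM:SS” form; the function names `halt_time` and `format_h_mm` are hypothetical, and rolling over midnight is an added assumption. Rows where “00:00:00” marks a source or destination station would need to be excluded before using it.

```python
from datetime import datetime, timedelta


def halt_time(arrival, departure):
    """Return the halt at a station as a timedelta (departure minus arrival)."""
    fmt = "%H:%M:%S"
    arrived = datetime.strptime(arrival, fmt)
    departed = datetime.strptime(departure, fmt)
    # Assumption: a departure earlier than the arrival means the halt crossed midnight.
    if departed < arrived:
        departed += timedelta(days=1)
    return departed - arrived


def format_h_mm(delta):
    """Format a timedelta as 'h:mm', the format used in this post."""
    total_minutes = int(delta.total_seconds()) // 60
    return f"{total_minutes // 60}:{total_minutes % 60:02d}"


print(format_h_mm(halt_time("10:15:00", "10:17:00")))  # 0:02
print(format_h_mm(halt_time("23:58:00", "00:03:00")))  # 0:05
```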

Once the halt times were calculated, it was time to weed out the outliers. In this case, these were the entries that had halt times stretching into hours. That essentially meant that the train had ended its journey for that route and would depart in the reverse direction after those hours. I wanted to focus on the halts during which passengers can board and alight, not on the time a train spends at the end of its journey. So the obvious outliers (like those shown below) were removed. The trains that would be visualized are therefore the ones just passing through the stations, not originating or terminating at them.

Weeding out entries that have halt times that go beyond a few minutes as those are not passenger boarding/alighting times
Data collected: Trains that are going to be visualized = 732
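The outliers in this post were removed by hand in the spreadsheet. A programmatic version could look like the sketch below; the 30-minute cutoff is an assumed value chosen only for illustration (no exact threshold was used here), and the function name, the `halt` key and the sample rows are hypothetical.

```python
from datetime import timedelta


def drop_long_halts(rows, max_halt=timedelta(minutes=30)):
    """Keep only rows whose halt is at most `max_halt`.

    The 30-minute default is an assumed cutoff, not a value from the post.
    """
    return [row for row in rows if row["halt"] <= max_halt]


# Made-up rows; the "halt" key is a hypothetical column holding a timedelta.
rows = [
    {"train": "A", "station": "DADAR", "halt": timedelta(minutes=2)},
    {"train": "B", "station": "THANE", "halt": timedelta(minutes=5)},
    {"train": "C", "station": "DADAR", "halt": timedelta(hours=4, minutes=10)},
]

kept = drop_long_halts(rows)
print([row["train"] for row in kept])  # ['A', 'B']
```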

The Visualization

The next step was to think of a way to represent this data in a manner that conveys my story. I decided to go with a static visualization as the medium. My first idea was inspired by E. J. Marey’s visualization of train schedules, popularly known as the ‘Marey Diagram’.

Marey Diagram: Visualizes the schedule of trains plying between Lyon and Paris in 1885
Zooming into the diagram, one can observe that the breaks in the lines are halts at the stations. These breaks lie on a common scale. I would use this insight for my idea

Based on this inspiration, the following was my initial idea for the visualization.

Sketch of the initial idea for visualization

The good thing about this visualization was that it was easy to compare the halt times within each station as they were on a common scale, but on the other hand, it was difficult to compare the halt times between the stations.
Another thing to consider was the scalability of the visualization. If I was going to be mapping 732 trains using a particular concept, it should be applicable for the main data set too, which had approximately 2,000 trains.
So the main aspects that I was focusing on were:
1) Comparison
2) Scalability
3) Easy consumption of data (meaning it should provide insights at a glance)
4) Tool for execution (For this purpose I explored RAW Graphs, Google Charts and Datamatic. The final visualization was done using Datamatic)
The following are the trials and different versions of the visualizations that were tried out.

A static Gantt chart to visualize the time intervals

A Gantt Chart is a type of bar chart, developed by Henry Gantt in the 1910s, that is generally used to illustrate time intervals.
This exploration was made using RAW Graphs. The X-axis represents time on a 24 hour scale. The Y-axis has all the train names. The halt times are represented as bars, colour coded according to the stations.
This visualization is not apt because, for such a large amount of data, one can glean very little meaning from the bars, and since it is a static chart, it becomes very difficult to consume the data. Also, since the differences in time intervals are quite small (mostly 1, 2, 3 and 5 minutes), no significant pattern emerges from this visualization.

A cropped image of a modified bar chart visualization that compares the halt times across stations

This next exploration is a modified bar chart that compares the halt times across stations. The image shown here is cropped since the chart is too wide to accommodate on the web. The X-axis has the train numbers and the Y-axis has the time intervals, segregated by station. This makes it easier to note and compare the halt times within each station. To some extent, it also lets one observe which trains stop at more than one station and compare the halt times between stations. Where this visualization also fails is that the data cannot be consumed in one go, comparison among all 732 entries would be cumbersome, and the style of visualization isn’t scalable.

Treemap visualization of the halt times, categorized by station (Red: Kalyan JN, Dark Green: Thane, Ochre: Borivali, Blue: Dadar, Light Green: Panvel, Purple: Andheri, Pink: Bandra Terminus)

A treemap is a space-filling visualization of data hierarchies and the proportions between elements. The different hierarchical levels create visual clusters through subdivision into rectangles sized in proportion to each element’s value. Treemaps are useful for representing the proportions within nested hierarchical data structures.
In this case, this works quite well since one can instantly judge visually that the maximum number of trains pass through Kalyan Junction (Red) and that there seems to be a rather large halt time for one train at Kalyan Junction and another at Andheri (Blue). The boxes also make it easier to compare the halt times both within a station and between stations and trains.
Hence, a treemap was chosen as the appropriate output for the purpose of this visualization.

After this, I stumbled upon Google Charts, which provides code for different types of visualizations. I found the sample HTML code for a simple treemap. The code isn’t too difficult to understand, and I even took a small sample from my dataset and created an interactive hierarchical treemap.

This hierarchical treemap is basically a visual representation of a data tree, where each node can have zero or more children, and one parent (except for the root, which has no parents). Each node is displayed as a rectangle, sized and colored according to values that are assigned. Sizes and colors are valued relative to all other nodes in the graph. One can specify how many levels to display simultaneously, and optionally to display deeper levels in a hinted fashion. If a node is a leaf node, you can specify a size and color; if it is not a leaf, it will be displayed as a bounding box for leaf nodes. The default behavior is to move down the tree when a user left-clicks a node, and to move back up the tree when a user right-clicks the graph. The total size of the graph is determined by the size of the containing element that is inserted on the page. The following is the code and the screenshots of the working interactions. One can also access the working interactions under the tab ‘results’ below.

Refer to the HTML code and view ‘Result’ of the test
An interactive visualization of a small sample of the data
Tooltip appearance and highlight of the element upon hover
Upon clicking on the previously hovered ‘BORIVALI’, the child elements appear. The user can navigate to the main view upon right clicking. The sizes and colours of the elements are relative to each other. The coloured slider above shows the position of the value in relation to the other nodes
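To make the data tree a little more concrete, here is a sketch in Python of how halt data could be arranged into rows of node, parent and size. The chart in this post was written in HTML and JavaScript, so this is not that code. It only illustrates the tree shape, and the function name, the root label `Mumbai` and the train identifiers are hypothetical.

```python
def build_tree_rows(halts, root="Mumbai"):
    """Turn (station, train_number, halt_minutes) tuples into treemap rows.

    Each row is [node_id, parent_id, size]. The root has no parent, stations
    hang off the root, and each train is a leaf under its station.
    """
    rows = [[root, None, 0]]
    seen_stations = set()
    for station, train_number, halt_minutes in halts:
        if station not in seen_stations:
            seen_stations.add(station)
            rows.append([station, root, 0])
        rows.append([train_number, station, halt_minutes])
    return rows


# Made-up sample; T1, T2 and T3 stand in for real train numbers.
sample_halts = [("BORIVALI", "T1", 2), ("BORIVALI", "T2", 5), ("DADAR", "T3", 3)]
for row in build_tree_rows(sample_halts):
    print(row)
```

Only the leaves carry a size here; the station rows act as bounding boxes for the trains that stop there.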

But there was a flaw in this method for this particular data in terms of scalability. In this data, many trains stop at multiple stations, so there will be entries with the same train number but different station names. The code method used does not allow for “duplicate” entries, and hence this data cannot be coded using this method. The following is the error I received when such an instance occurred.

Error of “duplicate” data
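The cause of the error can be seen with a quick check for repeated node names before the data reaches the chart. This is a minimal sketch in Python, assuming rows shaped as node, parent and size; the function name `duplicate_ids` and the sample rows are hypothetical.

```python
from collections import Counter


def duplicate_ids(rows):
    """Return node ids that appear more than once, in first-seen order."""
    counts = Counter(row[0] for row in rows)
    return [node_id for node_id, count in counts.items() if count > 1]


# Made-up rows: train T1 halts at both Borivali and Dadar, so its id repeats.
rows = [
    ["Mumbai", None, 0],
    ["BORIVALI", "Mumbai", 0],
    ["DADAR", "Mumbai", 0],
    ["T1", "BORIVALI", 2],
    ["T1", "DADAR", 3],
    ["T2", "DADAR", 5],
]

print(duplicate_ids(rows))  # ['T1']
```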

The next step was to refine the previous visualization and/or find better tools to visualize this data. One such tool was Datamatic. This tool helped me create the treemap and also made it interactive by adding a tooltip that shows the properties of an element upon hover. The following is the process for the final visualization done using Datamatic.

Visualizing using Datamatic: Input the data, add your own categorization and tweak the properties according to the visualization required
Visualization taking shape! Adding all the values and tweaking the properties
Exporting the visualization
Final visualization (Legend- Blue: Andheri, Orange: Dadar, Pink: Bandra Terminus, Green: Borivali, Yellow: Kalyan JN, Maroon: Panvel)
Appearance of the tooltip with halt time information

This visualization has been published here https://goo.gl/9fteHo

View all links that I have included in this post here:
1) Original data set
2) Data set that I worked with (Subset of the original)
3) RAW Graphs (Tool I experimented with)
4) Google Charts Treemap (Resources I used for another exploration)
5) Datamatic (Tool I used for the final visualization)
6) Final visualization

Thanks for stopping by!
