Visualizing Halt Times of Passenger Trains at Stations of Mumbai
Visualizing data involves telling stories with data. Visualizations engage the viewer far more than a wall of text and drive interesting insights. So when I received a data set for an academic assignment to create a visualization, I asked myself a few important questions before I began. This post documents the process I followed to answer those questions and tell a story based on the given data.
I have provided links and sources to all the tools and resources accessed within the post itself, but all of these links have been consolidated and shared at the end of this post too.
The data set chosen for the class assignment was the Indian Railways timetable for passenger trains running all over the country. The data came in .csv format, which can be edited in any spreadsheet software; for this assignment I chose Google Sheets. Head over here to access the data file I worked on, or head over here for the original data set from the source. It was sourced from www.data.gov.in, which is a trusted source of data for the country. The attributes were the train number, the train name, the intermediate stations of every train with their arrival times, departure times and distance from the source station, and the source and destination station codes and names for every train.
The first thing I did was look at the data carefully to see if there were any initial observations that would help me think of a story to tell. One basic observation was that an arrival time of “00:00:00” (the times were initially stored as strings) implies that the train is at its source station. This is confirmed by the ‘Distance’ attribute, which is 0 at those stations. Similarly, a departure time of “00:00:00” implies that the train is at its destination station. This is further confirmed by the fact that the ‘Station Name’ and the ‘Destination Station Name’ values match.
One thing I already knew beforehand was that train numbers that are immediately consecutive in value are essentially one physical train plying the same route, differentiated only by direction. For example, from the screenshot, train number 01081 plies from Bhusaval Junction to Dadar, and train number 01082 is the same train going in the opposite direction, i.e. from Dadar to Bhusaval Junction, after a halt of a few hours at the end station. The train name changes with the number too, but the route is the same, differentiated only by the direction and, of course, the timings.
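The source and destination checks described above can be written down as a small rule. This is only a sketch: the column names ‘Arrival time’ and ‘Departure time’ are my assumption for illustration (‘Distance’, ‘Station Name’ and ‘Destination Station Name’ are the attributes mentioned above), and the sample row uses made-up values.

```python
def classify_stop(row):
    """Label a timetable row as 'source', 'destination' or 'intermediate'.

    Assumes `row` is a dict keyed by column name; 'Arrival time' and
    'Departure time' are hypothetical column names.
    """
    # Source station: placeholder arrival time, confirmed by zero distance.
    if row["Arrival time"] == "00:00:00" and float(row["Distance"]) == 0:
        return "source"
    # Destination station: placeholder departure time, confirmed by the
    # station name matching the destination station name.
    if (row["Departure time"] == "00:00:00"
            and row["Station Name"] == row["Destination Station Name"]):
        return "destination"
    return "intermediate"


# Made-up example row, not taken from the actual data set.
example = {
    "Arrival time": "00:00:00",
    "Departure time": "16:10:00",
    "Distance": "0",
    "Station Name": "BHUSAVAL JN",
    "Destination Station Name": "DADAR",
}
print(classify_stop(example))
```

Checking both conditions, rather than the “00:00:00” value alone, avoids mislabelling a train that genuinely arrives at or departs from an intermediate station at midnight.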
The next thing I did was ask myself and the data some defining questions, like
1) What story can I tell with this data?
2) What interesting insights do I want the viewer to gain out of the visualization or what insights should the visualization depict?
3) What needs to be highlighted to gain attention? What should be the focus?
4) What would be of interest in this data?
5) Is there a possibility of overlapping this data set with another to create an interesting story with the mash-up?
With these questions in mind, after some pondering and exploring previously visualized railway data, I got the idea to visualize how long trains halt at stations and whether that is related in any way to the importance of the station or to how busy it is. I had a hunch from personal observation that trains halt longer at important stations and busier junctions, so the data could tell me whether this hunch was correct and, if so, how the variability in halt times at individual stations relates to the overall halt time of a train. Due to time constraints, I picked a subset of the data: all the trains that halt at stations within Mumbai. I selected this subset using the ‘Station Name’ attribute, sorting the data by it and extracting the relevant tuples (rows) with their attributes (columns) into another spreadsheet. I then converted the time columns from strings to time values and picked a 12-hour format for representation.
The next thing to do was calculate the halt time at each station. A plain subtraction formula in the spreadsheet gives the wrong answer here, since subtracting time values requires a different function.
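I did this step in the spreadsheet, but the same calculation can be sketched in Python. The function name `halt_minutes` and the sample times are made up for illustration. The sketch also assumes that a departure time earlier than the arrival time means the halt crossed midnight, and it should only be applied to intermediate stations, since “00:00:00” at a source or destination station is a placeholder and not a real time.

```python
from datetime import datetime, timedelta


def halt_minutes(arrival, departure):
    """Return the halt time in minutes between two 'HH:MM:SS' strings."""
    fmt = "%H:%M:%S"
    arr = datetime.strptime(arrival, fmt)
    dep = datetime.strptime(departure, fmt)
    # Assumption: if the departure looks earlier than the arrival,
    # the train halted across midnight, so shift departure by one day.
    if dep < arr:
        dep += timedelta(days=1)
    return (dep - arr).total_seconds() / 60


print(halt_minutes("10:15:00", "10:17:00"))  # 2.0
print(halt_minutes("23:58:00", "00:03:00"))  # 5.0
```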
Once the halt times were calculated, it was time to weed out the outliers. In this case, these were the entries that had halt times stretching into hours. That essentially meant that the train had ended its journey for that route and would depart after those hours in the reverse direction. I wanted to focus on the halt times during which passengers can board and alight, not on the layovers at the end of a journey. So the obvious outliers (like the ones shown below) were removed. The trains that would be visualized are therefore the ones just passing through the stations, not originating or terminating there.
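I removed the outliers by hand in the spreadsheet, but the idea can be sketched as a simple filter. The 60-minute cutoff, the function name and the sample rows below are all assumptions for illustration; I did not use a fixed threshold, I only dropped the halts that obviously ran into hours.

```python
def passing_halts(rows, max_minutes=60):
    """Keep only halts shorter than `max_minutes` (a hypothetical cutoff)."""
    return [row for row in rows if row["halt_minutes"] < max_minutes]


# Made-up sample rows, not taken from the actual data set.
sample = [
    {"train": "A", "station": "Kalyan Junction", "halt_minutes": 3},
    {"train": "B", "station": "Dadar", "halt_minutes": 5},
    {"train": "C", "station": "Dadar", "halt_minutes": 245},  # end-of-route layover
]

for row in passing_halts(sample):
    print(row["train"], row["station"], row["halt_minutes"])
```

With these sample rows, only trains A and B survive the filter, since train C's halt is a layover at the end of its route.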
The next step was to think of a way to represent this data in a manner that conveys my story. I decided to go with a static visualization as the medium. My first idea was inspired by E. J. Marey’s visualization of train schedules, popularly known as the ‘Marey Diagram’.
Based on this inspiration, the following was my initial idea for the visualization.
The good thing about this visualization was that it was easy to compare the halt times within each station as they were on a common scale, but on the other hand, it was difficult to compare the halt times between the stations.
Another thing to consider was the scalability of the visualization. Whatever concept I used to map these 732 trains should also be applicable to the main data set, which had approximately 2,000 trains.
So the main aspects that I was focusing on were:
3) Easy consumption of data (meaning it should provide insights at a glance)
4) Tool for execution (For this purpose I explored RAW Graphs, Google Charts and Datamatic. The final visualization was done using Datamatic)
The following are the trials and different versions of the visualizations that were tried out.
A Gantt Chart is a type of bar chart, developed by Henry Gantt in the 1910s, that is generally used to illustrate time intervals.
This exploration was made using RAW Graphs. The X-axis represents time on a 24 hour scale. The Y-axis has all the train names. The halt times are represented as bars, colour coded according to the stations.
This visualization is not apt because, for such a large amount of data, one can glean very little meaning from the bars, and as a static chart it is very difficult to consume. Also, since the differences in time intervals are quite small (mostly 1, 2, 3 and 5 minutes), no significant pattern emerges from this visualization.
This next exploration is a modified bar chart that compares the halt times across stations. The image shown here is cropped since the chart is too wide to accommodate on the web. The X-axis has the train numbers and the Y-axis has the time intervals segregated by station. This makes it easier to note and compare the halt times within each station. To some extent it also lets one observe which trains stop at more than one station and compare the halt times between stations. Where this visualization fails, too, is that the data cannot be consumed in one go, comparison among all 732 entries would be cumbersome, and the style of visualization isn’t scalable.
A treemap is a space filling visualization of data hierarchies and proportion between elements. The different hierarchical levels create visual clusters through the subdivision into rectangles proportionally to each element’s value. Treemaps are useful to represent the different proportion of nested hierarchical data structures.
In this case, this works quite well since one can instantly judge visually that the maximum number of trains pass through Kalyan Junction (red) and that there seems to be a rather large halt time for one train at Kalyan Junction and for another at Andheri (blue). The boxes also make it easier to compare the halt times both within a station and between stations and trains.
Hence, a treemap was chosen as the appropriate output for the purpose of this visualization.
After this, I stumbled upon Google Charts, which provides code for different types of visualizations. I found the sample HTML code for a simple treemap. The code isn’t too difficult to understand, and I even took a small sample from my data set and created an interactive hierarchical treemap.
This hierarchical treemap is basically a visual representation of a data tree, where each node can have zero or more children, and one parent (except for the root, which has no parent). Each node is displayed as a rectangle, sized and colored according to the values assigned to it. Sizes and colors are valued relative to all other nodes in the graph. One can specify how many levels to display simultaneously, and optionally display deeper levels in a hinted fashion. If a node is a leaf node, one can specify a size and color; if it is not a leaf, it will be displayed as a bounding box for leaf nodes. The default behavior is to move down the tree when a user left-clicks a node, and to move back up the tree when a user right-clicks the graph. The total size of the graph is determined by the size of the containing element that is inserted on the page. Following is the code and the screenshots of the working interactions. One can also access the working interactions under the tab ‘results’ below.
But there was a flaw in this method for this particular data in terms of scalability. Many trains in this data stop at multiple stations, so there are entries with the same train number but different station names. The code does not allow “duplicate” entries, and hence this data cannot be coded as-is using this method. The following is the error I received when such an instance occurred.
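I did not pursue this route further, but one possible workaround is to give every leaf a unique label by combining the train number with the station name. The sketch below builds rows in the shape the Google Charts treemap expects (node, parent, size), with the root having no parent. The function name, the “@” label format, the root name and the sample halts are assumptions for illustration, not values from my data.

```python
import json


def treemap_rows(halts, root="Mumbai"):
    """Build [node, parent, size] rows with a unique label for every leaf.

    `halts` is a list of (train_number, station_name, halt_minutes) tuples.
    """
    rows = [["Node", "Parent", "Halt (minutes)"], [root, None, 0]]
    stations_added = []
    for train_no, station, minutes in halts:
        # Add each station once, as a child of the root.
        if station not in stations_added:
            stations_added.append(station)
            rows.append([station, root, 0])
        # Combining train number and station keeps the leaf label unique
        # even when the same train halts at several stations.
        rows.append([f"{train_no} @ {station}", station, minutes])
    return rows


# Made-up halts: the same (hypothetical) train stops at two stations.
sample = [
    ("T1", "Kalyan Junction", 3),
    ("T1", "Dadar", 5),
    ("T2", "Kalyan Junction", 2),
]

# json.dumps turns None into null, which can be pasted into the chart code.
print(json.dumps(treemap_rows(sample), indent=1))
```

This only stays unique as long as no train appears twice at the same station in the data; if it does, something extra such as the arrival time would have to go into the label.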
The next step was to refine the previous visualization and/or find better tools to visualize the data. One such tool was Datamatic. It helped me build the treemap and also made it interactive by adding a tooltip that shows the properties of an element on hover. Following is the process for the final visualization done using Datamatic.
This visualization has been published here https://goo.gl/9fteHo
View all links that I have included in this post here:
1) Original data set
2) Data set that I worked with (Subset of the original)
3) RAW Graphs (Tool I experimented with)
4) Google Charts: Treemap (Resource I used for another exploration)
5) Datamatic (Tool I used for the final visualization)
6) Final visualization
Thanks for stopping by!