How data can help fix NJ Transit

Mar 5, 2018

This article is the first in an ongoing collaboration between Michael Zhang and me, using data to examine the performance of NJ Transit. You can find the second article by Michael here. If you’re interested in more like this, or in getting your hands on the data, please follow me to be notified of new articles. Thanks.

I commute to New York City every day on NJ Transit. At times, negotiating the trade-off between packing into an overflowing train and possibly being late to work leads to moments of peak anxiety. But then, I step back and thank myself that I’m not in one of the infinite timelines where I’m commuting around Mumbai. Thank you, parents, for moving to New Jersey.

NJ Transit in February 2018

As far as public transportation in America goes, NJ Transit is one of the largest commuter rail networks in the country. With over 88 million annual riders, NJT sits in second place, sandwiched between the MTA’s Long Island Rail Road and Metro-North.

Despite so many constituents depending on the system, NJT has fallen into disrepair over the past decade. Years of neglect have turned NJT into an unpredictable and unreliable system for the thousands of commuters who use it every day. I’d like to use data to identify issues with NJ Transit and provide potential solutions to improve rail performance.

The only thing is… I couldn’t find any particularly helpful public data to use for my analysis. NJ Transit provides aggregate data that simply classifies trains as “on-time” or “late”. I was looking for granular (train-level), high-resolution (~1–2 min accuracy) records to dive into. So, I wrote a program that’s been scraping and parsing NJT and Amtrak data for the past month. I’ll be writing an explanatory post on how this works and releasing the scraped data openly in the coming weeks.

Every month or so, NJT publishes summary statistics about the high-level performance of the system. In January 2018, 87.5% of trains were “on time”, which seems pretty rough right off the bat:

Fig 1. Source: NJ Transit. Their goal is 94.7% on-time performance.

Sure, it sounds reasonable that about 1 in 8 trains is a bit late due to one issue or another. What you may not realize, however, is that NJT defines “on time” as trains that arrive at their final destination within 6 minutes of their scheduled time. To really gain an understanding of system performance, I’m going to use the data set I’ve been collecting for the past month, which captures train performance on a stop-by-stop level with a resolution of ~1–2 minutes.
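To see how much the tolerance window matters, here is a minimal Python sketch of an on-time share computed at a configurable cutoff. The function name, the sample delays, the 2-minute comparison cutoff, and the choice to treat “within 6 minutes” as inclusive are all assumptions for illustration; this is not the actual pipeline behind this article.

```python
def on_time_share(delays_min, tolerance_min=6.0):
    """Fraction of trains whose final-stop delay is within the tolerance."""
    if not delays_min:
        raise ValueError("need at least one train")
    # Assumption: a delay exactly equal to the tolerance still counts as on time.
    on_time = sum(1 for d in delays_min if d <= tolerance_min)
    return on_time / len(delays_min)


# Made-up delays (minutes late at the final stop), for illustration only.
sample_delays = [0, 1, 2, 4, 5, 6, 8, 12]

lenient = on_time_share(sample_delays, tolerance_min=6)  # NJT-style 6-minute window
strict = on_time_share(sample_delays, tolerance_min=2)   # tighter, hypothetical cutoff
print(f"6-minute window: {lenient:.1%}, 2-minute window: {strict:.1%}")
```

The same set of trains looks much healthier under the wider window, which is why the choice of cutoff deserves scrutiny.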

First, let’s define a more reasonable on-time metric and see how delayed trains have really been. Here’s my “actual train delay” number line:

How does NJ Transit stack up against this scale? Let’s take a look at system-wide delay this past month (Feb 5 to Mar 4). Each bar here tracks what percent of trains on a given day fall into the categories on our “actual train delay” scale:

Fig 3. Reference line at 80%. Weekends/holidays marked with *. Interactive Plotly link. Note: Excludes Princeton Shuttle.

Systemwide Performance (2/5/18–3/4/18) by category

Of the approximately 16,267 NJ Transit trips made, 74.9% of trains were on time across the entire system. In the past month, the stacked bar chart (Fig. 3) reveals that there were only six days where > 80% of trains were on time.
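A per-day tally like the one behind the stacked bars can be sketched in a few lines of Python. The function names, the category labels, and the sample trips below are hypothetical and made up for illustration; they are not taken from the scraped data set.

```python
from collections import defaultdict


def daily_on_time_share(trips):
    """Map each service date to the fraction of its trips labelled 'on_time'."""
    totals = defaultdict(int)
    on_time = defaultdict(int)
    for day, category in trips:
        totals[day] += 1
        if category == "on_time":
            on_time[day] += 1
    return {day: on_time[day] / totals[day] for day in totals}


def days_above(shares, threshold=0.80):
    """Dates whose on-time share is strictly greater than the threshold."""
    return sorted(day for day, share in shares.items() if share > threshold)


# Made-up trips: (service date, delay category). Labels are hypothetical.
sample_trips = [
    ("2018-02-05", "on_time"), ("2018-02-05", "on_time"),
    ("2018-02-05", "on_time"), ("2018-02-05", "on_time"),
    ("2018-02-05", "on_time"), ("2018-02-05", "late"),
    ("2018-02-06", "on_time"), ("2018-02-06", "late"),
    ("2018-02-06", "cancelled"), ("2018-02-06", "on_time"),
]

shares = daily_on_time_share(sample_trips)
good_days = days_above(shares)  # days that clear the 80% reference line
print(shares, good_days)
```

Counting the dates returned by `days_above` is the kind of check that yields a statement like “only six days cleared 80%”.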

The Rush Hour Commute

While this is bad enough in its own right, system-wide performance includes trains at all hours of the day across all NJ Transit lines. Most passengers aren’t taking trains in the wee hours of the morning; no, they’re commuting to and from New York City during peak commuting hours. Let’s look at the performance of trains scheduled to arrive at New York Penn Station within the morning rush hour (8 am to 10 am):

Fig 4. Reference line at 50%. Interactive Plotly link.

To NY Penn A.M. Rush Hour Performance (2/5/18–3/2/18) by category

Approximately 43% of rush hour trains into New York Penn Station were delayed in the month of February. Further, the stacked bar chart (Fig. 4) indicates that the system failed to deliver 50% of trains on-time to Penn Station on seven days. Let’s try putting these numbers into perspective, treating each day’s trip as an independent draw. In a given five-day work week, you have a 94% chance of being delayed into New York Penn at least once. The likelihood of you experiencing a delay of greater than 6 minutes in a week of commuting is 73%. Perhaps most telling of all, over 2 weeks of commuting, you have a 77% chance of experiencing at least one delay of greater than 10 minutes. Basically, if you commute on NJ Transit to NYC, you’re gonna be late… and you’re gonna be late often.
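The weekly figures can be reproduced with the usual “at least one” formula, 1 − (1 − p)^n, which assumes each trip is delayed independently with the same probability p. Here is a minimal Python sketch; the function name is hypothetical, and using 0.23 as the per-trip rate of longer delays is an assumption (it is the morning share of longer delays quoted elsewhere in this article, and it happens to reproduce the 73% figure).

```python
def chance_of_at_least_one(per_trip_rate, trips):
    """P(at least one delayed trip) = 1 - (1 - p)^n, assuming independent trips."""
    if not 0.0 <= per_trip_rate <= 1.0:
        raise ValueError("per_trip_rate must be between 0 and 1")
    return 1.0 - (1.0 - per_trip_rate) ** trips


# 43% of morning rush hour trains delayed, five inbound trips per work week.
week_any_delay = chance_of_at_least_one(0.43, 5)

# Assumed 23% per-trip rate of longer delays, five inbound trips per work week.
week_long_delay = chance_of_at_least_one(0.23, 5)

print(f"{week_any_delay:.0%} {week_long_delay:.0%}")  # prints "94% 73%"
```

Independence is a simplification: delays tend to cluster on bad days, so real weekly odds may differ somewhat from these estimates.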

Another important time for riders is the rush hour commute home after work. Is the system serving commuters going out of NYC better than it serves them going in? Here’s the performance of evening trains departing NY Penn between 4 pm and 7 pm:

Fig 5. Reference line at 50%. Interactive Plotly link.

On the bar chart (Fig. 5), we see that on-time performance dips below 50% on 11 of the 19 work days this past month. While this could indicate that performance may be worse during the evening, over 40% of trains are only slightly delayed. In fact, only 14.8% of trains are delayed for longer than 5 minutes leaving NY Penn in the evening, compared to 23% for inbound morning trains. It seems that longer delays are more prevalent among morning trains, while evening trains are being held for a few extra minutes before departure.

Then there’s that massive gray bar on March 2 (Fig. 5), when a nor’easter blew through the region and brought the system to a standstill. That evening, 18 of 32 trains out of NY Penn were cancelled and many others were severely delayed.

This is why commuters are so frustrated with NJ Transit. The majority of riders are using NJ Transit precisely when it is most prone to delays. Visualizing rush hour performance across the past month shows how vulnerable the system is to service interruptions. Next time, we’ll dive deeper into the March 2 service interruption as a case study of how the system breaks down, causing delays to pile up.

Where do we go from here?

This article is a preliminary, descriptive exploration of the performance of the NJ Transit system. My end goal is to integrate various other streams of data in an ongoing effort to identify the biggest issues plaguing the system. Here’s a preview of analyses that can further describe weaknesses in the system, possibly predict failures of the system, and hopefully prescribe fixes to improve performance:

Weather data (predictive) may reveal how different conditions affect and predict delays, and identify factors such as high winds or precipitation that lead to system failures.
Service advisories (prescriptive) can be text mined to see what types of infrastructure issues consistently lead to delays, and where. NJ Transit could use this information to improve resource allocation for repairs.
Amtrak data (descriptive, predictive) can provide a more precise understanding of how the two systems interact with each other. For example, we may find a threshold number of concurrently running trains above which delays are more likely.

Beyond this, I’m personally interested in using this data to help frustrated commuters understand how the system can be fixed. I think public transit is a great service that makes our daily lives more sustainable, and I’d like to do what I can to help fix my local transit network.

If you enjoyed this bit of analysis, please 👏 this article below. Also, follow my Medium profile, as well as Michael Zhang, to get access to this data (soon!) and read more NJ Transit analysis in the coming months! Thank you!

Written by Pranav Badami