Real-Time MTA Subway Data Collection & Delay Computation

Published in C2SMART Center, Sep 25, 2019

New York City has one of the most complex and dynamic subway networks in the world, and its delay problems have remained a challenge for decades. The first and most critical step is to gain insight into actual delay patterns, from which we can identify the underlying causes of delays and develop innovative solutions. This data-driven project developed a scalable framework to collect real-time General Transit Feed Specification (GTFS) data and to analyze and visualize MTA subway delay patterns across multiple dimensions, including date, hour, station, and route.

DATA PROCESSING

The Python package mtagtfs was created to collect real-time GTFS data from the MTA Developer API, extract actual subway arrivals, and compute delays. GTFS defines a standard format for public transportation schedules and associated geographic information[1]. The package is highly scalable: any GTFS feed can be parsed and its actual arrivals identified within this framework, including MTA bus data and NYC Ferry data. It is also user-friendly, exposed through a standard Python interface. The ETL framework has three main steps, packaged as three functions: collect, arrival, and delay.

The first step, ‘collect’, takes an MTA API key as input, continuously requests the real-time MTA subway status, and writes local GTFS files. On average the real-time status is updated every 15 seconds, with refresh intervals ranging from 5 to 30 seconds. The collection process therefore loops about every 4 seconds, shorter than the fastest refresh interval, so that no update is missed.
The second step, ‘arrival’, takes a date (e.g. 20190901) as its input parameter, structures and integrates the nested GTFS files, and outputs an arrival CSV file. The estimated arrival time is updated dozens of times for a single arrival, and ‘arrival’ takes the last update as the actual arrival time, because the most recently reported estimate is the most reliable one. A dictionary is used to extract the last estimate and to store the semi-structured data: each arrival is keyed by its trip id plus stop id, and its arrival time is overwritten whenever a newer record comes in, so that only the last update remains.
The final step, ‘delay’, computes delays as the difference between actual and scheduled arrival times. It takes a target date as input (the schedule release date is optional) and outputs a delay CSV file. By default the function automatically fetches the latest schedule; it also supports historical schedules when a schedule release date is passed as a parameter. Each actual arrival is paired with its scheduled arrival by matching weekday, trip id, route id, direction, and stop id.
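
The three steps above can be sketched in a few lines of Python. This is a minimal illustration, not the mtagtfs implementation: the function names, the record fields (`arrival`, `reported_at`), and the injected `fetch` and `store` callables are all assumptions made for the example, and times are treated as plain seconds so that actual and scheduled values can be subtracted directly.

```python
import time

ARRIVAL_KEY = ("trip_id", "stop_id")
MATCH_KEY = ("weekday", "trip_id", "route_id", "direction", "stop_id")


def key_of(record, fields):
    """Build a hashable key from the named fields of a record."""
    return tuple(record[f] for f in fields)


def collect(fetch, store, interval_s=4.0, max_polls=None):
    """Call `fetch` repeatedly and pass each snapshot to `store`.

    `fetch` and `store` are hypothetical stand-ins for the API request
    and the local file write; `max_polls=None` loops forever.
    """
    polls = 0
    while max_polls is None or polls < max_polls:
        store(fetch())
        polls += 1
        time.sleep(interval_s)
    return polls


def extract_arrivals(updates):
    """Keep the last-reported estimate for each (trip_id, stop_id)."""
    latest = {}
    for u in updates:
        key = key_of(u, ARRIVAL_KEY)
        if key not in latest or u["reported_at"] >= latest[key]["reported_at"]:
            latest[key] = u  # overwrite only with a newer (or equal) report
    return list(latest.values())


def compute_delays(actual, scheduled):
    """Return {match key: actual minus scheduled arrival, in seconds}."""
    planned = {key_of(s, MATCH_KEY): s["arrival"] for s in scheduled}
    delays = {}
    for a in actual:
        key = key_of(a, MATCH_KEY)
        if key in planned:  # arrivals without a scheduled match are skipped
            delays[key] = a["arrival"] - planned[key]
    return delays


# Toy data (made up): two estimates for the same stop, the later one wins.
base = {"weekday": 0, "trip_id": "T1", "route_id": "A",
        "direction": "N", "stop_id": "S1"}
updates = [
    {**base, "arrival": 1000, "reported_at": 900},
    {**base, "arrival": 1090, "reported_at": 960},
]
scheduled = [{**base, "arrival": 1000}]

snapshots = []
collect(lambda: updates, snapshots.append, interval_s=0.0, max_polls=2)
actual = extract_arrivals(u for snap in snapshots for u in snap)
delays = compute_delays(actual, scheduled)
```

Polling the same snapshot twice does not change the outcome, since the dictionary keeps one record per key. That is what makes a short polling interval safe: duplicates cost storage but not correctness.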

These processes have biases and limitations. First, collection is a sampling process: not all arrivals are uploaded to the MTA real-time status system, and abnormal data points (abnormal times or trip ids) exist as well. Second, because actual-arrival extraction and delay calculation run day by day with 12 AM as the cut-off between days, the results may have larger errors around 12 AM. In future work, the sampling rate will be reported, and new algorithms will be developed to calculate actual arrivals and delays around 12 AM more precisely.

VISUALIZATION APPLICATION

An interactive web visualization was developed. Users can customize the map and statistical analysis, filtering by date, hour, station, route, or location by clicking columns, clicking the map, or dragging sliders.

Fig.1 Interactive Web Visualization Demo

The visualization is built on data collected from Feb 18, 2019, to Mar 10, 2019. It makes subway delay patterns easy to detect: users can explore when and where delays occur most, and which routes or segments accumulate the most delays. For research purposes, delays can be compared between peak and off-peak hours, weekdays and weekends, or midtown and the suburbs; the influence of specific events can be quantified; and patterns of delays spreading or converging can be identified. For personal use, delay statistics for home and work stations and for commuting routes can be explored.

RESULTS

The analysis reveals several delay patterns:

Subway delays converge slightly from the suburbs toward midtown during the morning rush hour (7 am to 10 am) and spread significantly from midtown to the suburbs during the evening rush hour (4 pm to 7 pm).

The points on the maps show two distinguishable kinds of delay: stop delays and line delays. Stop delays concentrate on a single station, usually an interchange station. They are probably caused mostly by the difficulty of operating multiple trains at once, and they can affect adjacent stations. Line delays spread along a line, affecting segments of that single line. The two kinds are not mutually exclusive: each can influence and cause the other.

On weekends, the figures show a constant delay pattern throughout the day rather than discernible spreading or converging patterns. On average, the delay on weekends is about three times that on weekdays.

Fig.4 Constant Delay Pattern on Weekends

A lag of two to three hours is detected between the increase in arrivals and the increase in delays. Fig.5 shows that subway delays increase significantly 2 to 3 hours after the number of arrivals increases.

Fig.5 Hourly Average Delay & Number of Arrivals
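
One way to put a number on such a lag is to correlate the hourly arrival counts with the hourly delays shifted by 0, 1, 2, ... hours and keep the shift with the highest correlation. The sketch below illustrates that idea only; it is not necessarily how the lag in Fig.5 was measured, the function names are hypothetical, and the two series are synthetic (the delay series is simply the arrival series shifted by three hours), not the project's data.

```python
from statistics import mean


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def best_lag(arrivals, delays, max_lag=5):
    """Return (lag, scores): the lag in hours at which delays best
    track earlier arrivals, and the correlation at every lag tried."""
    scores = {}
    for lag in range(max_lag + 1):
        x = arrivals[: len(arrivals) - lag]  # arrivals at hour t
        y = delays[lag:]                     # delays at hour t + lag
        scores[lag] = pearson(x, y)
    return max(scores, key=scores.get), scores


# Synthetic hourly series for one day (24 values each).
arrivals = [2, 1, 1, 1, 2, 5, 9, 12, 10, 7, 6, 6,
            6, 7, 8, 10, 12, 11, 8, 6, 5, 4, 3, 2]
delays = [arrivals[0]] * 3 + arrivals[:-3]  # arrivals shifted by 3 hours

lag, scores = best_lag(arrivals, delays)
```

On real data the correlation would peak less sharply than in this constructed case, so it is worth inspecting the whole `scores` dictionary rather than only the best lag.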

[1] Google Transit API > Static Transit

Written by Junjie Cai