An open-source tool to help improve the quality of real-time transit data
Real-time transit information has been shown to have many benefits for transit riders, including shorter perceived wait time¹, shorter actual wait time¹, a lowered learning curve for new riders², and an increased feeling of safety (e.g., at night)³ ⁴. Transit agencies that have deployed real-time information have also benefited from increased ridership⁵ ⁶, as well as a better perception of the agency and its transit service, even if its service hasn’t actually changed⁷.
The General Transit Feed Specification (GTFS) format⁸, which has become the dominant format for open schedule data in the transit industry and is shared by over 1,500 agencies worldwide⁹, has enabled many applications to show transit schedule information. In the last few years, a real-time counterpart to GTFS, GTFS-realtime (GTFS-rt), has begun to emerge, with agencies sharing their real-time predictions, vehicle locations, and service alerts in this format. Previously, real-time transit information had only been shared in proprietary formats specific to each vendor or agency.
GTFS-rt offers the opportunity for application developers to create a mobile app that can function across a large number of cities and agencies, and for practitioners and researchers to be able to easily study and compare actual system performance across different transit systems using the same tools, without the overhead of manually transforming data into a consistent format. Having real-time transit data available in a common format is a key pillar for real-time multimodal information systems.
Quality is important!
Of equal importance to data availability, however, is data quality. In fact, accuracy of real-time information is a key concern of transit riders. A survey of riders of a mobile transit app showed that 84% rely solely on real-time information instead of using the schedule⁴. Errors in predictions create a negative perception of the mobile app providing the information as well as the transit agency. For example, 74% of surveyed Puget Sound transit riders considered a difference between actual and estimated arrival times greater than 4 minutes as an “error”. In addition, 9% of surveyed riders said that they took the bus less often due to errors they experienced⁴. Prediction errors can also lead to reduced system performance if operations is making decisions based on this data.
We need better tools
The GTFS-rt format is relatively new, and, as with any emerging data format, challenges quickly emerge. One key challenge is that while the GTFS format for schedule data has several open-source GTFS feed validators, no such open validation tool has existed for GTFS-rt.
Manually identifying and troubleshooting problems in GTFS-rt feeds can be extremely time consuming because of the scale of the data and how rapidly it changes. For example, in November 2017 the Massachusetts Bay Transportation Authority (MBTA) in the Boston, Massachusetts area had a GTFS dataset that contained 71,260 trips and 1,809,833 stop time records. MBTA’s GTFS-realtime feed contained data for 489 vehicles, with independent arrival or departure predictions for most stops on active trips, and was refreshed around every 5 seconds. Additionally, examining feeds requires significant expertise with the GTFS and GTFS-rt specifications, which limits the number of people who can evaluate a feed for potential problems.
To address these problems, our research team created an open-source GTFS-realtime Validator software tool that can monitor GTFS-rt feeds (Trip Updates, Vehicle Positions, Service Alerts) and log any encountered problems.
The user simply enters URLs for their GTFS and GTFS-rt datasets, as well as how frequently the tool should fetch GTFS-rt updates (the default is 10 seconds). After starting the monitoring session, the user is shown a log view with the types of errors logged for each iteration (i.e., fetch) of the GTFS-rt feed. The user can click on the iteration ID to see all the occurrences of the errors and warnings for that iteration (as shown above) — there can be multiple occurrences of most errors and warnings in a single feed iteration. Because GTFS-rt feeds can be updated every few seconds, the tool enables an observer to capture critical data for troubleshooting problems in a log format that can be browsed and saved for further analysis.
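To make the idea of a monitoring session concrete, here is a minimal Python sketch of a polling loop that records the problems found on each iteration. It is an illustration only, not the validator's actual code: the names `monitor_feed`, `fetch`, and `validate` are hypothetical, and the only value taken from the text above is the 10-second default interval.

```python
import time
from typing import Callable, Dict, List


def monitor_feed(
    fetch: Callable[[], bytes],
    validate: Callable[[bytes], List[str]],
    iterations: int,
    interval_seconds: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[int, List[str]]:
    """Fetch the feed `iterations` times and log the rule IDs found each time.

    `fetch` and `validate` are hypothetical stand-ins for downloading the
    GTFS-rt feed and running the rules against it.
    """
    log: Dict[int, List[str]] = {}
    for iteration_id in range(1, iterations + 1):
        payload = fetch()
        # One log entry per iteration; a single iteration may contain
        # several occurrences of the same rule ID.
        log[iteration_id] = validate(payload)
        if iteration_id < iterations:
            sleep(interval_seconds)
    return log
```

Passing `fetch`, `validate`, and `sleep` in as parameters keeps the loop easy to exercise with canned data, without a live feed or real waiting.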
Adding new rules
The GTFS-realtime Validator has a modular rule architecture that allows new errors and warnings to be easily added to the tool as the GTFS-rt specification continues to evolve and new problems are discovered in feeds. So far we’ve implemented rules to detect over 45 types of errors and 9 types of warnings that appear in feeds, many of which we’ve encountered when working with real feeds.
An error is logged when data in the feed is incorrect and would result in a transit rider seeing bad or missing real-time information as a result. A warning is logged when a feed contains data that would negatively affect some GTFS-rt consuming applications but either cannot be confirmed to be incorrect with 100% certainty based on data in the feed (e.g., a very large speed value for a vehicle) or the GTFS-rt specification does not clearly indicate that the data or behavior is incorrect (e.g., it is a best practice to refresh feed contents frequently, but the GTFS-rt specification doesn’t require a minimum update frequency). A detailed description of all rules is documented on GitHub.
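One way to picture a modular rule set with the error/warning split is a small registry of rule functions. The Python sketch below is only an illustration under our own assumptions and does not mirror the validator's real classes: the rule IDs `E_LATITUDE` and `W_SPEED`, the 40 m/s threshold, and the dictionary layout of the feed are all hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List


class Severity(Enum):
    ERROR = "error"      # data is incorrect; riders would see bad or missing info
    WARNING = "warning"  # suspicious, but not provably wrong from the feed alone


@dataclass(frozen=True)
class Occurrence:
    rule_id: str
    severity: Severity
    message: str


Rule = Callable[[dict], List[Occurrence]]
RULES: List[Rule] = []


def register(rule: Rule) -> Rule:
    """Decorator that adds a rule to the registry; new rules need no other wiring."""
    RULES.append(rule)
    return rule


MAX_PLAUSIBLE_SPEED_MPS = 40.0  # assumed threshold, for illustration only


@register
def latitude_out_of_range(feed: dict) -> List[Occurrence]:
    # A latitude outside [-90, 90] is always wrong, so this is an error.
    return [
        Occurrence("E_LATITUDE", Severity.ERROR, f"vehicle {v['id']}: latitude {v['latitude']}")
        for v in feed["vehicles"]
        if not -90.0 <= v["latitude"] <= 90.0
    ]


@register
def implausible_speed(feed: dict) -> List[Occurrence]:
    # A very large speed is suspicious but cannot be proven wrong, so it is a warning.
    return [
        Occurrence("W_SPEED", Severity.WARNING, f"vehicle {v['id']}: speed {v['speed']} m/s")
        for v in feed["vehicles"]
        if v["speed"] > MAX_PLAUSIBLE_SPEED_MPS
    ]


def run_rules(feed: dict) -> List[Occurrence]:
    """Run every registered rule and collect all occurrences."""
    found: List[Occurrence] = []
    for rule in RULES:
        found.extend(rule(feed))
    return found
```

In this sketch, adding a rule means writing one more decorated function; nothing else in the runner has to change.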
It is important to note that as of December 2018, the GTFS-realtime Validator tool does not detect errors in the arrival or departure predictions themselves (i.e., whether a vehicle actually arrived or departed when it was predicted). Current rules therefore focus on data integrity (i.e., whether the data is logically correct given the GTFS-realtime specification and GTFS schedule data). Prediction accuracy analysis, as discussed later, is a potential future area of work.
Evaluation of industry GTFS-realtime feeds
To demonstrate the utility of the GTFS-realtime Validator, we developed another tool, the transit-feed-quality-calculator, to automate the validation of a large number of feeds.
This analysis tool:
- Retrieves the URLs for GTFS-realtime feeds and corresponding GTFS data from the TransitFeeds.com GetFeeds API (a centralized directory for GTFS and GTFS-realtime feed URLs)
- Downloads a snapshot of the GTFS-realtime and GTFS data from each agency’s server into a subdirectory
- Runs the GTFS-realtime Validator on each of the subdirectories
- Produces summary statistics and graphs for all validated feeds
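The steps above can be sketched in Python roughly as follows. This is a simplified illustration rather than the transit-feed-quality-calculator's real code: `run_pipeline`, `summarize`, the `download` and `validate` callables, and the file names are hypothetical, and the sketch assumes rule IDs begin with "E" for errors and "W" for warnings.

```python
from pathlib import Path
from typing import Callable, Dict, Iterable, List, Tuple

FeedPair = Tuple[str, str]  # (GTFS URL, GTFS-rt URL), e.g. taken from a feed directory


def run_pipeline(
    feed_pairs: Iterable[FeedPair],
    download: Callable[[str, Path], None],
    validate: Callable[[Path], List[str]],
    work_dir: Path,
) -> Dict[str, List[str]]:
    """Download each feed pair into its own subdirectory and validate it."""
    results: Dict[str, List[str]] = {}
    for index, (gtfs_url, rt_url) in enumerate(feed_pairs, start=1):
        feed_dir = work_dir / f"feed-{index}"
        feed_dir.mkdir(parents=True, exist_ok=True)
        download(gtfs_url, feed_dir / "gtfs.zip")   # schedule snapshot
        download(rt_url, feed_dir / "gtfs-rt.pb")   # real-time snapshot
        results[feed_dir.name] = validate(feed_dir)  # rule IDs found
    return results


def summarize(results: Dict[str, List[str]]) -> Dict[str, int]:
    """Count how many feeds contain at least one error or warning."""
    def has(prefix: str, rule_ids: List[str]) -> bool:
        return any(rule_id.startswith(prefix) for rule_id in rule_ids)

    return {
        "feeds": len(results),
        "with_errors": sum(1 for ids in results.values() if has("E", ids)),
        "with_warnings": sum(1 for ids in results.values() if has("W", ids)),
    }
```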
While TransitFeeds.com shows a total of 130 GTFS-rt feeds that have been registered with the system, we have so far automated the validation of 78 feeds (future work will focus on improving this number by supporting feeds that require API keys or use HTTP redirects).
Out of the 78 feeds evaluated, 54 of the feeds contained errors, and 58 of the feeds contained warnings (see below).
“E011 — GTFS-rt stop_id does not exist in GTFS data” was the most common error, appearing in 16 feeds. E011 means that the GTFS schedule data has no record of a stop that the GTFS-rt data is showing a prediction for, indicating an incorrect stop_id either in the GTFS or GTFS-rt data. The 2nd most common error was “E022 — Sequential stop_time_update times are not increasing” appearing in 15 feeds (which indicates that predicted times are wrong — the vehicle would be traveling backwards in time). “E045 — GTFS-rt stop_time_update stop_sequence and stop_id do not match GTFS” appeared in 13 feeds — this means that the GTFS-rt data shows a conflicting order of arrival for stops for a trip when compared to the GTFS data.
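To show what two of these checks look for, here is a simplified Python sketch in the spirit of E011 and E022. It is not the validator's implementation: the function names are hypothetical, each prediction is reduced to a plain dictionary with a `stop_id` and a single POSIX `time` (the real format has separate arrival and departure times), and only times that go backwards are flagged, which is an assumption about how equal times are treated.

```python
from typing import List, Set


def unknown_stop_ids(gtfs_stop_ids: Set[str], stop_time_updates: List[dict]) -> List[str]:
    """E011-style check: stop_ids with predictions that the GTFS data does not contain."""
    return [
        update["stop_id"]
        for update in stop_time_updates
        if update["stop_id"] not in gtfs_stop_ids
    ]


def backwards_time_indices(stop_time_updates: List[dict]) -> List[int]:
    """E022-style check: indices whose predicted time is earlier than the previous one."""
    return [
        index
        for index in range(1, len(stop_time_updates))
        if stop_time_updates[index]["time"] < stop_time_updates[index - 1]["time"]
    ]


# Hypothetical predictions for one trip: the third stop is unknown to GTFS
# and is predicted 10 seconds before the stop that precedes it.
updates = [
    {"stop_id": "A", "time": 100},
    {"stop_id": "B", "time": 160},
    {"stop_id": "X", "time": 150},
]
print(unknown_stop_ids({"A", "B", "C"}, updates))  # ['X']
print(backwards_time_indices(updates))             # [2]
```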
Figure 4 shows the distribution of the count of error types found in feeds. For example, the feed with the worst performance had 7 different types of errors found, while 23 feeds had only one error type found. Even though the majority of feeds had 2 or fewer types of errors, as mentioned earlier, some errors can occur multiple times in the same feed iteration, as well as in multiple iterations of the feed. For example, in Feed 51 that had eight different types of errors, there were 24 occurrences of “E022 — Sequential stop_time_update times are not increasing” in a single feed iteration. Each of these occurrences can have a significant impact on the transit rider experience, as discussed in the following section.
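The difference between error types and error occurrences can be illustrated with a short Python sketch that builds this kind of distribution from per-feed results. The function name, the input layout (one list of logged rule IDs per feed), and the sample data are hypothetical, and the sketch assumes error IDs start with "E".

```python
from collections import Counter
from typing import Dict, List


def error_type_distribution(results: Dict[str, List[str]]) -> Dict[int, int]:
    """Map 'number of distinct error types' to 'number of feeds with that many'."""
    distinct_counts = (
        # A set collapses repeated occurrences of the same rule ID into one type.
        len({rule_id for rule_id in rule_ids if rule_id.startswith("E")})
        for rule_ids in results.values()
    )
    return dict(sorted(Counter(distinct_counts).items()))


# Made-up results: feed-1 logs E022 twice, but that still counts as one type.
sample = {
    "feed-1": ["E011", "E022", "E022"],
    "feed-2": ["E011"],
    "feed-3": [],
    "feed-4": ["E045", "W001"],
}
print(error_type_distribution(sample))  # {0: 1, 1: 2, 2: 1}
```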
It should be noted that all of the above analysis is for a single iteration of each of the 78 evaluated feeds. It is highly likely that if the validator were executed over several hours, additional errors and warnings would be found for each feed. Future work will focus on enhancing the analysis tool to automate data collection for a large number of feeds over an extended time period.
Based on our experience deploying a GTFS-rt feed and multimodal transit app, as well as the development and testing of the GTFS-realtime Validator tool, the transit industry must focus on real-time data quality as well as data availability. The number of errors and warnings found in industry feeds reflects significant data issues that impact riders and, based on research, lead to reduced ridership and satisfaction with the transit agency and its service. Real-time data that contains integrity issues (e.g., trips with out-of-sequence predictions or conflicts with GTFS data) is very problematic for transit apps to parse; many transit apps, including Google Maps, the Transit App, and OneBusAway, will drop all predictions for that trip, resulting in users seeing the schedule information instead of real-time information.
The good news, however, is that research shows good quality real-time data leads to increased ridership and satisfaction with the agency. Transit agencies can focus on improving data quality by getting involved with the GTFS-rt improvement process and voting for proposals that clarify how producers and consumers should interact. Agencies can also use the GTFS-realtime Validator tool when creating and maintaining GTFS and GTFS-rt feeds to ensure that no errors and warnings occur, and require that their AVL vendor (including during the Request for Proposals process) also use such a validation tool before feeds will be accepted. Feed creators such as AVL vendors can use the validator in their own product development lifecycle to shorten quality assurance testing time and improve the quality of the data.
As mentioned earlier, it should be noted that as of January 2019, the GTFS-realtime Validator tool does not detect errors in the predictions themselves (i.e., whether a vehicle actually arrived or departed when it was predicted), which is another significant source of problems encountered by riders. Future work should examine adding prediction accuracy analysis to the GTFS-realtime Validator, perhaps via integration with other tools such as TheTransitClock. Future work can also focus on enhancing the automated analysis tool to increase both the number and duration of feeds evaluated.
Feedback is welcome! Please let us know your thoughts in the comments below. Happy data wrangling!
Our work at the Center for Urban Transportation Research (CUTR) at the University of South Florida (USF) on the development of the open-source GTFS-realtime Validator has been funded by the National Institute for Transportation and Communities (NITC). The contents of this article reflect the views of the authors, who are solely responsible for the facts and the accuracy of the material and information presented herein.
This article is an abbreviated version of Transportation Research Board 2018 paper 18–05585 “Quality Control — Lessons Learned from the Deployment and Evaluation of GTFS-realtime Feeds”.
¹ Kari Edison Watkins, Brian Ferris, Alan Borning, G. Scott Rutherford, and David Layton (2011), “Where Is My Bus? Impact of mobile real-time information on the perceived and actual wait time of transit riders,” Transportation Research Part A: Policy and Practice, Vol. 45 pp. 839–848.
² C. Cluett, S. Bregman, and J. Richman (2003). “Customer Preferences for Transit ATIS,” Federal Transit Administration.
³ Brian Ferris, Kari Watkins, and Alan Borning, “OneBusAway: results from providing real-time arrival information for public transit,” presented at the Proceedings of the 28th international conference on Human factors in computing systems, Atlanta, Georgia, USA, 2010.
⁴ A. Gooze, K. Watkins, and A. Borning (2013), “Benefits of Real-Time Information and the Impacts of Data Accuracy on the Rider Experience,” in Transportation Research Board 92nd Annual Meeting, Washington, D.C., January 13, 2013.
⁵ Lei Tang and Piyushimita Thakuriah (2012), “Ridership effects of real-time bus information system: A case study in the City of Chicago,” Transportation Research Part C: Emerging Technologies, Vol. 22 pp. 146–161.
⁶ C. Brakewood, G. Macfarlane, and K. Watkins (2015), “The impact of real-time information on bus ridership in New York City,” Transportation Research Part C: Emerging Technologies, Vol. 53 pp. 59–75.
⁷ C. Brakewood, S. Barbeau, and K. Watkins (2014), “An experiment evaluating the impacts of real-time transit information on bus riders in Tampa, Florida,” Transportation Research Part A: Policy and Practice, Vol. 69 pp. 409–422.
⁸ Google, Inc. “General Transit Feed Specification Reference.” Accessed July 31, 2017 from https://github.com/google/transit/blob/master/gtfs/spec/en/reference.md
⁹ MapZen. “TransitLand — An Open Project — For Data Providers.” Accessed July 31, 2017 from https://transit.land/an-open-project/