What the GTFS is going on?!?!

This week I started working on a project using New York City subway data. When I made a request to the MTA’s API, to test my key, I got a surprise!

It turns out that the MTA’s data feeds, like most transit data available, use a data standard called GTFS.

GTFS stands for General Transit Feed Specification. It is an open data format for public transit data created by Google. There are two categories within GTFS, static and realtime, and they operate slightly differently.

Before GTFS there was no standard for transit data, so its introduction led to quick adoption by both developers and transit agencies.

GTFS Static

A static GTFS feed is a .zip file that contains between 6 and 13 CSV tables, each stored in a .txt file. Each table holds one category of information about the transit system in question. The six mandatory tables are agency, routes, trips, stop_times, stops, and calendar. The seven optional tables include things like fares, transfers, and frequencies.
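To make that concrete, here is a minimal sketch of reading one of those tables with Ruby’s standard CSV library. The column names (stop_id, stop_name, stop_lat, stop_lon) come from the GTFS spec for stops.txt, but the rows are made up, and the index_stops helper is just a name I picked for illustration.

```ruby
require "csv"

# Made-up sample standing in for the contents of stops.txt from an unzipped feed.
STOPS_TXT = <<~CSV
  stop_id,stop_name,stop_lat,stop_lon
  S1,First Street,40.7000,-73.9900
  S2,Second Street,40.7100,-73.9800
CSV

# Parse the table and index each row by stop_id for quick lookups.
# (index_stops is a hypothetical helper, not part of any GTFS library.)
def index_stops(csv_text)
  CSV.parse(csv_text, headers: true).each_with_object({}) do |row, stops|
    stops[row["stop_id"]] = {
      name: row["stop_name"],
      lat: Float(row["stop_lat"]),
      lon: Float(row["stop_lon"])
    }
  end
end

STOPS = index_stops(STOPS_TXT)
puts STOPS["S2"][:name]
```

With a real feed you would unzip it first and pass in something like File.read("stops.txt") instead of the inline sample.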

For a list of the fields in each table check out Google’s documentation.

GTFS Realtime

GTFS Realtime is an extension of the GTFS format that allows transit agencies to provide real-time data about their transit systems. This gives developers access to three kinds of information: trip updates, service alerts, and vehicle positions. The feed is returned as binary data, which is then decoded using protocol buffers.

Protocol buffers are Google’s format for serializing structured data. The structure is described in a .proto file, and that schema determines how the binary data is serialized and decoded. If that sounds complicated, don’t worry: Google has a repo of GTFS Realtime language bindings that can do most of the work for us!

Let’s write a quick little Ruby program with Sinatra to test that everything is working correctly. I have hardcoded feed_id=21 into this example, which will give us data from the B/D/F/M line.
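Here is a rough sketch of the fetch-and-decode part, leaving the Sinatra route out. It assumes the gtfs-realtime-bindings gem (Google’s Ruby bindings) is installed. The base URL is a placeholder rather than the MTA’s real endpoint, the key query parameter name is my assumption, and feed_uri and trip_updates are helper names made up for this example.

```ruby
require "net/http"
require "uri"

# Placeholder only: swap in the feed URL from the MTA's own documentation.
FEED_BASE_URL = "https://example.com/gtfs-realtime"

# Build the request URI. feed_id 21 is the B/D/F/M feed mentioned above;
# the "key" parameter name is an assumption.
def feed_uri(api_key, feed_id: 21)
  uri = URI.parse(FEED_BASE_URL)
  uri.query = URI.encode_www_form(key: api_key, feed_id: feed_id)
  uri
end

# Decode the binary feed and keep only the trip updates.
# The gem is loaded lazily so the file still loads without it installed.
def trip_updates(bytes)
  require "protobuf"
  require "google/transit/gtfs-realtime.pb"
  feed = Transit_realtime::FeedMessage.decode(bytes)
  feed.entity.select { |entity| entity.field?(:trip_update) }.map(&:trip_update)
end

# Usage (needs network access, a valid key, and a real feed URL):
#   bytes = Net::HTTP.get(feed_uri(ENV.fetch("MTA_KEY")))
#   trip_updates(bytes).each { |update| p update }
```

In a Sinatra app, the usage lines would sit inside a route block, with the result rendered however you like.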

When this is run, we receive an array of objects. Each object represents a train and contains two sub-objects. The first is called "trip" and contains a trip id, the trip’s start time, and the train’s route id. The second is "stop_time_update", which contains an array of objects with arrival and departure information for the stops on that train’s trip.

Now we have a bunch of information about train times and stops, but what can we do with it?

We can combine it with the information that we pulled from the static GTFS feed! We now have information about which stop corresponds to each element’s "stop_id", including the stop’s name and GPS coordinates. We also have access to information about which stations accept transfers, route data, and a lot more.
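As a small sketch of that join, the snippet below looks up each realtime stop_id in the static stop data. Everything in it is illustrative: the hashes are simplified stand-ins for a decoded trip update (the real objects nest the arrival time a level deeper), the values are invented, and describe_stops is a hypothetical helper.

```ruby
# Hypothetical static data, as it might look after loading stops.txt.
STOP_NAMES = { "S1" => "First Street", "S2" => "Second Street" }.freeze

# Simplified stand-in for one decoded trip update. Field names follow the
# GTFS Realtime spec, but the structure is flattened and the values made up.
TRIP_UPDATE = {
  trip: { trip_id: "T100", route_id: "D" },
  stop_time_update: [
    { stop_id: "S1", arrival: 1_700_000_000 },
    { stop_id: "S2", arrival: 1_700_000_300 }
  ]
}.freeze

# Turn each stop_time_update into a readable line by looking up the
# stop name in the static data; unknown ids fall back to a placeholder.
def describe_stops(trip_update, stop_names)
  route = trip_update[:trip][:route_id]
  trip_update[:stop_time_update].map do |update|
    name = stop_names.fetch(update[:stop_id], "unknown stop")
    time = Time.at(update[:arrival]).utc.strftime("%H:%M:%S")
    "#{route} train -> #{name} at #{time} UTC"
  end
end

puts describe_stops(TRIP_UPDATE, STOP_NAMES)
```

The same lookup pattern should work for the other static tables, for example matching route_id against routes.txt.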

By combining information from the static GTFS feed and the GTFS Realtime feed, we have a very complete set of information about the transit system… Except for what’s going to happen when they shut down the L Train!

As you can see, GTFS is a really robust system for dealing with transit data, and because it has been adopted so widely, you can access transit data in the same way for any city that uses it. After a quick look around the internet, it seems like over 1,000 cities around the world provide GTFS data about their public transit!