Taming the MTA’s Unruly Turnstile Data

Chris Whong
qri.io
Mar 31, 2020

tl;dr We have created clean, ready-to-use datasets of daily station-level turnstile counts for 2019 and 2020. Go get them!

Note: The first versions of the 2019 and 2020 daily counts referenced in this blog post contained a major error, showing massive (inaccurate) spikes of traffic on Saturdays. These errors have been corrected, but users who cloned the earlier versions are advised to make sure they have the latest. See the respective dataset readmes for more info.

The New York MTA publishes weekly turnstile data on its developer page, but it’s far from usable in its raw form. I’ve messed with it in years past, and have always wanted to build a data pipeline to produce a cleaner, more usable aggregation of the data. Read on for details about the transformation process we applied to create a dataset of usable daily entry and exit counts by station.

There are myriad uses of this data for understanding the demand on our aging and generally crowded subway system. Ben Wellington recently analyzed the decrease in subway ridership in the wake of the COVID-19 pandemic, showing that poorer neighborhoods experienced a much smaller decrease in ridership than wealthier ones. This article in The City outlines the “financial calamity” faced by a system that is experiencing dramatically lower demand.

About the Data

The MTA publishes a CSV every week, and adds the most recent download link to the top of a very long list on their website. There are no fancy open data portal frills here, just CSVs and some supporting files to explain what each field is, and how to connect a row of data to a specific subway station.

Each weekly CSV contains about 200,000 rows, but a quick glance may leave you scratching your head. Each row is an observation at a moment in time for a specific turnstile.

  • The readings occur every 4 hours
  • The 4-hour observation windows are not the same for all turnstiles
  • Some of the turnstiles are counting down, not up. ¯\_(ツ)_/¯
  • Each turnstile that’s keeping track of counts is identified by its “subunit channel position” or scp, unique within a control area.
  • Each turnstile is in a “control area”, which is the bank of turnstiles associated with a subway exit or exits, unique within a remote unit.
  • Each control area is in a “remote unit”. Remote unit generally represents a single station, but some stations and station complexes have several associated remote unit ids.
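
For orientation, here's the general shape of the raw file: the header plus one row. The column layout matches the published CSVs, but the values in this row are illustrative. C/A is the control area, UNIT is the remote unit, and SCP is the turnstile.

```
C/A,UNIT,SCP,STATION,LINENAME,DIVISION,DATE,TIME,DESC,ENTRIES,EXITS
A002,R051,02-00-00,59 ST,NQR456W,BMT,12/28/2019,07:00:00,REGULAR,7334504,2485966
```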

The most important thing to know is that the entries and exits columns are not counts of people for a given timeframe, they are more like odometer readings. To know how many entries there were at a given observation, you must know the reading from the previous observation and calculate the difference.

For example, we can see that for this turnstile there were 66 entries and 71 exits between 7:00am and 11:00am on December 28th, 2019.

While this seems like a fairly straightforward calculation, the situation gets more complex when you want to aggregate up to something useful.

One Table to Rule Them All

The first step in our data pipeline is combining all of the weekly CSVs into a single table in a database (I'm using PostgreSQL). I won't bore you with a lot of SQL here; for the gritty details, see the GitHub repository.
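
The gist is a raw staging table that each weekly CSV gets appended to. Here's a minimal sketch assuming column names that mirror the raw file; the repo's actual schema and loading script differ in the details.

```sql
-- Staging table that mirrors the raw CSV columns (names are illustrative)
CREATE TABLE IF NOT EXISTS turnstile_raw (
  controlarea text,    -- "C/A" in the raw file
  remoteunit  text,    -- "UNIT"
  scp         text,    -- "SCP", the turnstile within a control area
  station     text,
  linename    text,
  division    text,
  date        text,
  time        text,
  description text,    -- "DESC"
  entries     bigint,  -- cumulative, odometer-style
  exits       bigint   -- cumulative, odometer-style
);

-- Append one weekly file (run in psql, once per downloaded CSV)
\copy turnstile_raw FROM 'turnstile_200328.txt' WITH (FORMAT csv, HEADER true);
```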

I loaded everything from the beginning of 2019 to the latest published CSV, resulting in a table with 13.6 million observations. To make life easier, a few modifications to the raw data are made:

  • Combine the date and time columns into a proper timestamp named observed_at. This allows us to use the database's time functions to quickly group by day.
  • Concatenate remoteunit, controlarea, scp, and the timestamp into id, and enforce uniqueness. This prevents duplicate observations for the same turnstile and timestamp (these exist in the raw data for various reasons).
  • Concatenate remoteunit, controlarea, and scp into unit_id, which describes a single turnstile. This is useful for calculating the net entries and exits.
  • One big caveat: We also never allow net_entries or net_exits to be greater than 10,000. Because the turnstile readings are like odometers, they can only count so high. When they roll over, the result is astronomically large net values that need to be thrown out. 10,000 is a reasonable cutoff for excluding these errant calculations, as it represents 2,500 persons per hour passing through a turnstile, or about 41 persons per minute. (These derived columns and the net calculation are sketched below.)
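
Put together, those derived columns and the difference calculation look roughly like this. This is a sketch using the illustrative staging table above, not the repo's exact SQL, and it assumes a unique index on id.

```sql
-- Load the main observations table with derived columns (sketch)
INSERT INTO observations (id, unit_id, observed_at, entries, exits)
SELECT
  concat_ws('-', remoteunit, controlarea, scp, date, time) AS id,       -- unique per turnstile + timestamp
  concat_ws('-', remoteunit, controlarea, scp)             AS unit_id,  -- identifies one turnstile
  to_timestamp(date || ' ' || time, 'MM/DD/YYYY HH24:MI:SS') AS observed_at,
  entries,
  exits
FROM turnstile_raw
ON CONFLICT (id) DO NOTHING;  -- drop duplicate readings for the same turnstile + timestamp

-- Net entries/exits per observation: current reading minus the previous
-- reading for the same turnstile. Values that are negative or over 10,000
-- (counter resets, rollovers, countdown turnstiles) get discarded downstream.
SELECT
  unit_id,
  observed_at,
  entries - lag(entries) OVER w AS net_entries,
  exits   - lag(exits)   OVER w AS net_exits
FROM observations
WINDOW w AS (PARTITION BY unit_id ORDER BY observed_at);
```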

This approach to setting up the data was borrowed from Todd Schneider, who has created a similar database to power a live dashboard of system-wide counts. His code is on GitHub, and was extremely helpful in providing an efficient way to do the net_entries and net_exits calculations after a new CSV of data is added to the main table. Thanks, Todd!

Once we have a clean table of unique observations, including net entries and exits for each turnstile, we can start rolling it up into more useful aggregations.

Aggregating Up

The goal is to have a dataset that includes daily counts by station. The first task is to combine the entries for a given day for each individual turnstile.
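
In SQL terms that's a simple group-by on the turnstile and the calendar date. A sketch, assuming the net counts from the previous step live in a table called observation_net_counts (an illustrative name, not the repo's):

```sql
-- Daily totals per turnstile (sketch)
SELECT
  unit_id,
  observed_at::date AS date,
  sum(net_entries)  AS entries,
  sum(net_exits)    AS exits
FROM observation_net_counts
GROUP BY unit_id, observed_at::date;
```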

This consolidates our 13.6M observations into 2.2M records, each showing the entries and exits for a single turnstile on a single day. One thing to keep in mind is that these aggregates may include turnstile traffic that did not actually fall within that calendar date; rather, they are the totals for the 4-hour observation windows that ended on that date.

This means that if an observation was taken at midnight, it will include traffic from 8pm to midnight of the previous day.

I have considered shifting the date calculation by 2 hours to more accurately credit the late night/early morning observation windows, but for now I left it as a simplistic date calculation (00:00:00–23:59:59 is one day).
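
If I ever make that change, it would just mean shifting the timestamp before bucketing by date, roughly:

```sql
-- Shift each reading back two hours before grouping, so a midnight reading
-- is credited to the day most of its 4-hour window actually fell in
-- (not currently applied in the pipeline)
SELECT
  unit_id,
  (observed_at - interval '2 hours')::date AS date,
  sum(net_entries) AS entries,
  sum(net_exits)   AS exits
FROM observation_net_counts
GROUP BY unit_id, (observed_at - interval '2 hours')::date;
```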

Unfortunately, it’s not as simple as aggregating on the station column or the remoteunit column. There are situations where a single station has multiple names and multiple remotes. To get an accurate aggregation, we need to combine all of the turnstiles that are topologically connected by station platforms and corridors.

The MTA website where the turnstile data is published comes with an Excel spreadsheet called Remote-Booth-Station that should be able to help with this. Here we see Fulton Street showing up in two different places, with two different remote fields. There's actually a third cluster of Fulton Street records in this table. This is because it used to be three different stations, but it's now a "station complex".

We have no easy way of knowing from this table whether a station is part of a station complex. For Fulton Street, the station names are the same, but this isn't always true. Here we see the station complex for Lexington Ave — 59th Street. It's got two remotes, and goes by two different names.

If we were to use this as a lookup table, we could associate the station with a turnstile, but would have no clear way to aggregate all of the records to a single station complex. This table is also missing newer stations like the 2nd Ave Subway and Hudson Yards.

What we really need to move forward is a table that clearly links stations, station complexes, and remotes. The MTA publishes another table called "Station Locations" that is a much more detailed, authoritative list of stations, and it has a column for complexes! (It's also got some better-formatted station names.)

Now we’re getting somewhere. This has a unique id for each station, and indicates whether the station is part of a complex, which also has a unique id. The table has plenty of station-level goodies, including lat/lon so we can map them. But how can we associate each one of these with a remoteunit in the turnstile data?

That’s going to take elbow grease, but it’s critical to being able to aggregate the turnstile data in a way that makes sense. I added a new column complex_id to the remote-booth-station table by manually looking up individual stations in the stations list above. This was the crucial missing link between the turnstile observations and the official list of subway stations and complexes.

Now we can aggregate by complex, and join with the goodies in the station list, giving us a nice, clean count of entries and exits by station by day. Here we see Astoria-Ditmars Blvd from January 1st through 13th, 2019. You can clearly see the weekday-weekend pattern, where weekday entries are cut in half on Saturdays and drop even lower on Sundays.
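
Roughly, the final rollup joins the daily per-turnstile counts through the hand-built remoteunit-to-complex_id crosswalk and groups by complex. The table and column names below are approximations for illustration, not the repo's exact schema:

```sql
-- Daily entries and exits by station complex (sketch)
WITH by_complex AS (
  SELECT
    r.complex_id,
    d.date,
    sum(d.entries) AS entries,
    sum(d.exits)   AS exits
  FROM daily_turnstile_counts d
  JOIN remote_complex_lookup r
    ON r.remoteunit = split_part(d.unit_id, '-', 1)  -- unit_id starts with the remoteunit
  GROUP BY r.complex_id, d.date
)
SELECT c.complex_id, s.name, s.latitude, s.longitude, c.date, c.entries, c.exits
FROM by_complex c
JOIN station_complexes s ON s.complex_id = c.complex_id;  -- one row per complex: name, lat/lon
```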

A quick line chart shows the “normal” pattern of weekdays versus weekends in early 2020, followed by sharp declines in mid-March as Covid-19 countermeasures took effect. (Weekday ridership the week this post was written is about 10% of normal)

Scroll down to Publishing the Aggregated Data on Qri to learn how to clone the 2019 and 2020 daily counts from Qri.cloud.

But Wait, there’s More!

In my data munging, I discovered remoteunit values in the turnstile data that didn't align with any station. Further investigation showed that the dataset also includes PATH trains AND the Roosevelt Island Tramway, which makes sense because both of them accept MetroCards and have turnstiles!
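
Finding those strays is a quick anti-join against the crosswalk (again a sketch, with the same illustrative table names as above):

```sql
-- remoteunit values in the turnstile data with no match in the station crosswalk (sketch)
SELECT DISTINCT split_part(o.unit_id, '-', 1) AS remoteunit
FROM observations o
LEFT JOIN remote_complex_lookup r
  ON r.remoteunit = split_part(o.unit_id, '-', 1)
WHERE r.remoteunit IS NULL;
```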

Chart showing daily entries in 2020 at the Manhattan Tramway Terminal

I appended the PATH and Tramway stations to the station list to make this data pipeline complete. See below for links to the data.

Publishing the Aggregated Data on Qri

I've published the 2019 and 2020 daily counts datasets on Qri.cloud, along with metadata and a readme for each.

You can quickly clone them to your computer using the Qri CLI. To get the 2020 daily counts dataset, use qri add nyc-transit-data/daily_station_counts_2020.

If you're not comfortable with the command line, you can use Qri Desktop to clone the data to your local machine using the username/datasetname listed above. Get started with Qri Desktop.

Keeping it Up to Date

I’ve set up this workflow to be easy to update. It should take only a few minutes to import a week’s worth of data, add it to the main observations table, run the aggregations, and cut a new version of the 2020 dataset.

If the 2020 data looks stale, it just means I haven't been running the update. Give us a yell on Twitter if you're in need of fresh 2020 turnstile data!

Thanks for reading, and please let us know if this dataset is useful to you and how we can make it better!

Update — June 19th, 2020

Raw data is fun and all, but to really get a feel for what's possible with this dataset, check out https://www.subwayridership.nyc/. Transit-techie and software engineer Sunny Ng has presented the 2020 turnstile counts dataset in an interactive web map, letting users quickly see counts for any station on any date! It includes station-level time-series charts and data-driven map markers, providing several ways to analyze and compare stations.

www.subwayridership.nyc, an interactive web map showing individual daily station entries and exits for 2020

Best of all, Sunny also discovered some inconsistencies and potential improvements in our data pipeline, and was able to contribute some fixes via Pull Request! Thanks Sunny!
