Determining which subway stations people use after a night out in NYC

Jan 24, 2018

I recently started the Metis data science bootcamp in San Francisco. For our first team project in the class, we were tasked with analyzing MTA turnstile data and discovering some insights. Being a longtime resident of NYC, I was particularly excited to be working with actual MTA data. Our group of 4 wanted to avoid analyses that ultimately rank overall foot traffic since that would simply surface the busiest stations — Grand Central, Penn Station, etc. After discussing whether we’d be able to analyze the effects of the L-train shutdown or not in the 2 days or so we had to do the project, we decided to ask this:

Could we determine which train stations were popular with New Yorkers going out to bars and other nightlife spots and returning home late at night?

Backstory

Here’s the (admittedly fictional) backstory for this project. Our team name is the Pizza Rat Analytics Group 🍕🐭 (PRAG) and we are a marketing analytics firm. We are named after the venerable NYC icon of perseverance and good food, the pizza rat.

Our goal is to test our hypothesis, discover which train stations are most popular with New Yorkers returning home from a night out, and present those findings to our client, Grubhub/Seamless, so they can advertise at particular stations to reach this receptive demographic. After all, ordering late-night food delivery shortly after getting off the subway doesn’t sound bad at all.

From left to right: Ads inside a train station. A Seamless ad you might see inside a train. An exterior ad on a subway station entrance.

Why Not Just Big Stations?

Advertising in the NYC subway system is a major way to reach New Yorkers, and everyone from Dr. Zizmor to the newest startups realizes that. Of course, it is also a major expense. Rather than advertising in the most trafficked stations, it may be more cost-effective to run advertisements in the particular stations that a target demographic uses.

Data Wrangling

Now that we knew what we were looking for, we dove right into the MTA data, which looks like this:

The key things to note up there are the ENTRIES and EXITS columns, which do not contain actual entries and exits but rather cumulative counter values from the individual turnstiles. Subtracting consecutive counter values for the same turnstile gives the actual number of entries and exits between two audits. This distinction becomes important in a little bit. Each row is an “audit” of a particular turnstile in the system, and it includes the station and line the turnstile belongs to, along with other information.
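To make the counter subtraction concrete, here is a minimal pandas sketch on made-up numbers. Only ENTRIES and EXITS are column names taken from the data described above; TURNSTILE and DATETIME are assumed stand-ins for whatever identifies a turnstile and an audit time, and this is not necessarily how our project code was written.

```python
import pandas as pd

# Toy audit rows. TURNSTILE and DATETIME are assumed column names;
# ENTRIES and EXITS hold cumulative counter values, as in the MTA data.
audits = pd.DataFrame({
    "TURNSTILE": ["A", "A", "A", "B", "B"],
    "DATETIME": pd.to_datetime([
        "2017-01-06 00:00", "2017-01-06 04:00", "2017-01-06 08:00",
        "2017-01-06 00:00", "2017-01-06 04:00",
    ]),
    "ENTRIES": [1000, 1040, 1300, 500, 510],
    "EXITS": [2000, 2010, 2100, 900, 950],
})

# Sort so that consecutive rows of one turnstile are consecutive audits.
audits = audits.sort_values(["TURNSTILE", "DATETIME"])

# Subtract each counter value from the next one, separately per turnstile.
audits["ENTRIES_DIFF"] = audits.groupby("TURNSTILE")["ENTRIES"].diff()
audits["EXITS_DIFF"] = audits.groupby("TURNSTILE")["EXITS"].diff()

# The first audit of each turnstile has no previous value, so drop it.
intervals = audits.dropna(subset=["ENTRIES_DIFF", "EXITS_DIFF"])
print(intervals[["TURNSTILE", "DATETIME", "ENTRIES_DIFF", "EXITS_DIFF"]])
```

Each remaining row then describes the traffic between one audit and the previous audit of the same turnstile.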

Download several data files from the MTA website spanning all of 2017.
Use the pandas library to read the csv files into data frames.
Start examining and cleaning the data, including, but not limited to:
- Finding and removing duplicate rows (due to “recovery” audits).
- Finding and correcting turnstile data with reverse counting counters.
Get rid of audits that happened too far apart. We discovered some that were days apart. Data collected that infrequently is useless when trying to see how entries vary throughout the day.
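The cleaning steps above could be sketched roughly as follows. The column names TURNSTILE and DATETIME, the 8-hour cutoff, and the choice to fix reverse counters by taking the absolute value of the difference are all assumptions made for illustration, not a record of the exact rules we used. With the real files, the frame would first be built with `pd.read_csv` and `pd.concat`.

```python
import pandas as pd

def clean_audits(df, max_gap_hours=8):
    """Sketch of the cleaning steps; names and threshold are assumptions."""
    keys = ["TURNSTILE", "DATETIME"]
    # 1. Drop duplicate audits of the same turnstile at the same time.
    df = df.sort_values(keys).drop_duplicates(subset=keys, keep="first")
    grouped = df.groupby("TURNSTILE")
    df = df.assign(
        # Time of the previous audit of the same turnstile.
        START=grouped["DATETIME"].shift(1),
        # 2. A counter that runs backwards yields negative differences;
        #    taking the absolute value is one simple correction.
        ENTRIES_DIFF=grouped["ENTRIES"].diff().abs(),
    )
    df = df.dropna(subset=["START"])
    # 3. Drop intervals whose audits are too far apart.
    df = df.assign(HOURS=(df["DATETIME"] - df["START"]).dt.total_seconds() / 3600)
    return df[df["HOURS"] <= max_gap_hours]

# Made-up rows: A has a duplicated audit, B counts backwards and has a 3-day gap.
raw = pd.DataFrame({
    "TURNSTILE": ["A", "A", "A", "A", "B", "B", "B"],
    "DATETIME": pd.to_datetime([
        "2017-01-06 00:00", "2017-01-06 04:00", "2017-01-06 04:00",
        "2017-01-06 08:00", "2017-01-06 00:00", "2017-01-06 04:00",
        "2017-01-09 04:00",
    ]),
    "ENTRIES": [100, 150, 150, 210, 900, 880, 700],
})
cleaned = clean_audits(raw)
print(cleaned[["TURNSTILE", "START", "DATETIME", "ENTRIES_DIFF", "HOURS"]])
```

A real pipeline would likely also need a sanity cap on implausibly large differences, which this sketch leaves out.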

In the yearly dataset, this is when most of the audits were performed throughout the 24 hours of a day. Once duplicates and audits with unusably long gaps were removed, the mean sampling interval for a turnstile was 4.5 hours.

Baseline days vs “Night out” days

Now that we had (relatively) clean data for a whole year, we proceeded with the next step. We grouped the rows of the resulting dataframe by day of the week: Monday, Tuesday, etc. Then, using the start and end times of each audit interval, we distributed the entries uniformly over that time period to get an average entries-per-hour data point.
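As an illustration of spreading entries uniformly over an audit interval, the sketch below assigns the same hourly rate to every clock hour an interval covers and then averages by day of week and hour. The column names START, END, and ENTRIES_DIFF are assumptions, the numbers are made up, and the code assumes intervals that start on the hour and last a whole number of hours.

```python
import pandas as pd

def spread_hourly(intervals):
    """Turn one row per audit interval into one row per covered clock hour."""
    rows = []
    for rec in intervals.itertuples(index=False):
        hours = (rec.END - rec.START).total_seconds() / 3600
        rate = rec.ENTRIES_DIFF / hours  # uniform entries per hour
        # Assumes whole-hour intervals; a partial last hour would be dropped.
        for k in range(int(hours)):
            stamp = rec.START + pd.Timedelta(hours=k)
            rows.append({
                "DAY": stamp.day_name(),
                "HOUR": stamp.hour,
                "ENTRIES_PER_HOUR": rate,
            })
    return pd.DataFrame(rows)

# Two made-up Friday evening intervals, one week apart.
intervals = pd.DataFrame({
    "START": pd.to_datetime(["2017-01-06 20:00", "2017-01-13 20:00"]),
    "END": pd.to_datetime(["2017-01-07 00:00", "2017-01-14 00:00"]),
    "ENTRIES_DIFF": [80, 120],
})
hourly = spread_hourly(intervals)

# Average over all weeks to get an "average week" profile.
average_week = hourly.groupby(["DAY", "HOUR"])["ENTRIES_PER_HOUR"].mean()
print(average_week)
```

A row-by-row loop is slow on a full year of data, but it keeps the idea easy to follow.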

Lastly, we decided to treat Mondays through Thursdays as baseline days, when people are mostly commuting, and Fridays and Saturdays as “night out” days, when nighttime subway entries at certain stations should be above normal because of the crowds heading home from bars.
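A small sketch of that split, on made-up hourly averages, might look like this. The column names and the "other" label for Sunday are assumptions, and hours after midnight are attributed to their calendar day here, which is a simplification.

```python
import pandas as pd

BASELINE = {"Monday", "Tuesday", "Wednesday", "Thursday"}
NIGHT_OUT = {"Friday", "Saturday"}

def label_day_type(day_name):
    """Map a weekday name to the group it belongs to."""
    if day_name in BASELINE:
        return "baseline"
    if day_name in NIGHT_OUT:
        return "night_out"
    return "other"  # Sunday is in neither group

# Made-up hourly averages for a single station at 11 PM.
hourly = pd.DataFrame({
    "DAY": ["Monday", "Thursday", "Friday", "Saturday", "Sunday"],
    "HOUR": [23, 23, 23, 23, 23],
    "ENTRIES_PER_HOUR": [10.0, 14.0, 30.0, 34.0, 12.0],
})
hourly["DAY_TYPE"] = hourly["DAY"].map(label_day_type)

# One row per hour, one column per day type, averaged within each group.
profile = hourly[hourly["DAY_TYPE"] != "other"].pivot_table(
    index="HOUR", columns="DAY_TYPE", values="ENTRIES_PER_HOUR", aggfunc="mean"
)
print(profile)
```

Plotting the two columns of `profile` against the hour gives the kind of baseline versus night-out comparison discussed in the results.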

Results

Let’s first look at 3 weeks of data for Christopher Street (1) in the West Village normalized into an “average week” and plotted hourly:

Note: the entries on the y-axis are averaged per turnstile. The total entries for a station in this and the following plots are therefore larger.

Between 9 PM and 4 AM on the weekend days, this station sees more entries than on the baseline days, which suggests that our hypothesized difference exists. Let’s look at another station that we know is in a popular area of the city. This time, we are plotting a full year of data.

Once again, there’s that increase in entrances at night.

To see if this increase truly was due to the type of neighborhoods these stations are in, we plotted 2 stops that we know are different:

Wall Street (4, 5) — heart of the financial district. Relatively dead at night.
79 Street (D) — a residential stop in Brooklyn.

We do not see that increase in nighttime entries here, which further supports our hypothesis. In summary: you can use turnstile data to determine which stations are more popular with New Yorkers returning home from a night out.

Next Steps

So far, we have posed a hypothesis and shown that it might be correct by examining particular stations. But that’s not the full potential of this project. Given more time, this is what we would have liked to do next:

Quantify this nighttime increase in entries for all stations in the system and rank them in descending order, uncovering which stations see a really large boost. Those would make for good advertising spots.
Perform similar analysis on turnstile exits to see if there are particular neighborhoods that this demographic is traveling to. These stops would also be good for advertising.
Compare geolocation data of train stations with business listings nearby to categorize the demographic of each station.
Create custom ads (with unique coupon codes) for areas/stations to measure effectiveness of our advertising approach.
Create interactive, geographic visuals to better present all this information.
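As a rough idea of how the first of these steps, quantifying and ranking the nighttime increase, could work, here is a hypothetical sketch. We did not implement this. The station names and numbers are placeholders, the column names are assumptions, and using the ratio of night-out to baseline entries as the "boost" is just one possible measure.

```python
import pandas as pd

# The 9 PM to 4 AM window discussed in the results.
NIGHT_HOURS = [21, 22, 23, 0, 1, 2, 3]

def rank_night_boost(hourly):
    """Rank stations by night-out entries relative to baseline entries."""
    night = hourly[hourly["HOUR"].isin(NIGHT_HOURS)]
    means = night.pivot_table(
        index="STATION", columns="DAY_TYPE",
        values="ENTRIES_PER_HOUR", aggfunc="mean",
    )
    # Assumed measure: ratio of night-out to baseline average entries.
    means["BOOST"] = means["night_out"] / means["baseline"]
    return means.sort_values("BOOST", ascending=False)

# Placeholder stations with made-up averages.
hourly = pd.DataFrame({
    "STATION": ["STATION_A", "STATION_A", "STATION_B", "STATION_B",
                "STATION_A", "STATION_B"],
    "DAY_TYPE": ["baseline", "night_out", "baseline", "night_out",
                 "night_out", "night_out"],
    "HOUR": [23, 23, 23, 23, 12, 12],  # the noon rows fall outside the window
    "ENTRIES_PER_HOUR": [10.0, 30.0, 8.0, 4.0, 500.0, 500.0],
})
ranked = rank_night_boost(hourly)
print(ranked)
```

Stations at the top of such a ranking would be the first candidates for the targeted ads described in the backstory.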