Promoting Your Event in NYC? Use EDA to Find the Top 5 High-Traffic Subway Stations

A step-by-step guide to Exploratory Data Analysis (EDA): here’s how you can find the top 5 high-traffic subway stations in NYC.

Published in The Startup · 4 min read · Jun 24, 2020

Suppose a client reaches out to us, hoping for advice on getting more signups for their event. The client is WomenTechWomenYes (WTWY), whose annual gala takes place during the summer in New York City. To drive attendance, their Street team goes to subway stations in NYC to get people to sign up for the gala. How should they make the most of their resources and time?

The very first step of any data science project is exploratory data analysis (EDA), a critical step for understanding your data. This includes gathering insights from summary statistics and using visualizations to discover patterns, spot anomalies (missing values, outliers, exceptions, etc.), and test assumptions and hypotheses.

For this client, an EDA is actually enough. So let’s get started!

Goal

Use Exploratory Data Analysis (EDA) to recommend where WTWY should place their Street teams in NYC subway stations so that they gather the most signups.

Methodologies

1. Obtaining data: We downloaded the NYC MTA turnstile data for June 2019 to obtain the daily entries of people passing through the turnstiles at all stations in NYC. We chose June 2019 as it’s right before the summer gala.

First, import all the necessary libraries in Python.

Then, download turnstile entries data from MTA for June 2019.
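The original import and download cells are not shown here, so the following is only a minimal sketch of what they might look like with pandas. The URL pattern is our assumption about where the weekly files were hosted at the time and may no longer be live, and the helper names `saturday_codes` and `load_turnstile_weeks` are hypothetical.

```python
import pandas as pd

# Assumed historical URL pattern for the weekly MTA turnstile files.
# Treat it as a placeholder: the hosting location may have changed.
URL_TEMPLATE = "http://web.mta.info/developers/data/nyct/turnstile/turnstile_{}.txt"


def saturday_codes(start, end):
    """Return YYMMDD codes for every Saturday between start and end (inclusive)."""
    saturdays = pd.date_range(start=start, end=end, freq="W-SAT")
    return [d.strftime("%y%m%d") for d in saturdays]


def load_turnstile_weeks(week_codes, url_template=URL_TEMPLATE):
    """Download each weekly file and stack them into one DataFrame."""
    frames = [pd.read_csv(url_template.format(code)) for code in week_codes]
    df = pd.concat(frames, ignore_index=True)
    # Strip stray whitespace from column names, in case the raw header has any.
    df.columns = df.columns.str.strip()
    return df


# Assumption: files are named after Saturdays, so list the Saturdays in June 2019.
weeks = saturday_codes("2019-06-01", "2019-06-30")
print(weeks)

# Requires network access, so it is left commented out in this sketch:
# df = load_turnstile_weeks(weeks)
# df.head(10)
```

Which weekly files actually cover June 2019 depends on how the MTA dates each file, so the list of weeks is worth checking against the data itself.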

This is what the raw data looks like now.

First 10 rows of data downloaded from MTA

2. Cleaning data: The data contains running totals of entries and exits for the turnstiles at all subway stations in NYC. The data needs scrubbing: some turnstiles log entries every 4 hours while others do not, and some count backwards.

Fixing errors: We corrected the turnstile entries’ values where the entries were logging backwards.
Removing outliers: Then, we removed entries that are more than 3 standard deviations from the mean.
Creating the desired data fields: Finally, we scrubbed the data and calculated the entries through each turnstile at all stations, broken down by day, day of the week, and four-hour interval during the day.
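As a rough illustration of the first two cleaning steps, here is a sketch on made-up counter readings. It assumes a single hypothetical `TURNSTILE` column (in the raw files a turnstile is usually identified by several columns together), treats backward counters by taking the absolute difference, and applies the 3-standard-deviation rule across all turnstiles at once. The original notebook may have handled each of these differently.

```python
import pandas as pd

# Made-up cumulative readings: turnstile A ends with a wild jump,
# turnstile B counts backwards the whole time.
a = pd.Series([100] + [50] * 10 + [5000]).cumsum()
b = pd.Series([9000] + [-50] * 11).cumsum()
df = pd.DataFrame({
    "TURNSTILE": ["A"] * len(a) + ["B"] * len(b),
    "ENTRIES": pd.concat([a, b], ignore_index=True),
})

# Entries per interval = difference between consecutive cumulative readings.
# abs() turns backward-running counters into positive counts (one possible fix).
df["ENTRIES_DIFF"] = df.groupby("TURNSTILE")["ENTRIES"].diff().abs()

# The first reading of each turnstile has no previous value to diff against.
df = df.dropna(subset=["ENTRIES_DIFF"])

# Keep only values within 3 standard deviations of the mean.
mean = df["ENTRIES_DIFF"].mean()
std = df["ENTRIES_DIFF"].std()
clean = df[(df["ENTRIES_DIFF"] - mean).abs() <= 3 * std]

print(len(df), "intervals before,", len(clean), "after outlier removal")
```

One thing to keep in mind: with only a handful of rows, no value can ever be 3 standard deviations from the mean, so this rule only becomes meaningful on a reasonably large sample.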

Since some stations share a name yet serve different lines, we have to create a unique identifier to tell them apart: STATION_ID. As we want to break down the traffic incrementally, we’ll create variables to retrieve the turnstile entries by day, day of week, and time of day (four-hour interval). For example, ENTRIES_DIFF captures the entries logged in each 4-hour interval. Then, we can conduct some good old data cleaning.
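The fields described above could be derived roughly as follows. This is a sketch on a few made-up rows: combining `STATION` with `LINENAME` to build STATION_ID is our assumption, the `DAY_OF_WEEK` and `TIME_OF_DAY` column names are hypothetical, and the time slot is taken from the timestamp of the reading itself.

```python
import pandas as pd

# A few made-up rows: the same station name served by different lines.
df = pd.DataFrame({
    "STATION": ["86 ST", "86 ST", "86 ST", "86 ST"],
    "LINENAME": ["456", "456", "1", "1"],
    "DATE": ["06/03/2019", "06/03/2019", "06/08/2019", "06/08/2019"],
    "TIME": ["08:00:00", "20:00:00", "04:00:00", "16:00:00"],
})

# Same name, different lines -> distinct identifiers.
df["STATION_ID"] = df["STATION"] + " (" + df["LINENAME"] + ")"

# Parse the timestamp once, then derive the breakdown fields from it.
df["DATETIME"] = pd.to_datetime(df["DATE"] + " " + df["TIME"],
                                format="%m/%d/%Y %H:%M:%S")
df["DAY"] = df["DATETIME"].dt.date
df["DAY_OF_WEEK"] = df["DATETIME"].dt.day_name()

# Bucket the hour of each reading into four-hour slots: [0, 4), [4, 8), ...
bins = [0, 4, 8, 12, 16, 20, 24]
labels = ["12-4AM", "4-8AM", "8AM-12PM", "12-4PM", "4-8PM", "8PM-12AM"]
df["TIME_OF_DAY"] = pd.cut(df["DATETIME"].dt.hour, bins=bins,
                           labels=labels, right=False)

print(df[["STATION_ID", "DAY", "DAY_OF_WEEK", "TIME_OF_DAY"]])
```

Because each reading reports the count accumulated since the previous one, you may prefer to shift the timestamp back by one interval before bucketing. Which convention the original analysis used is not stated.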

3. Exploring data: We created 3 charts of top stations with the most foot traffic, broken down by day, day of week, and 4-hour intervals.

For the first chart, first set up a table to show the top 7 stations.
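A table like this can be built with a group-by and `nlargest`. The sketch below uses placeholder station IDs and made-up counts, and the `TOTAL_ENTRIES` column name is hypothetical.

```python
import pandas as pd

# Made-up cleaned data: one row per station per interval.
df = pd.DataFrame({
    "STATION_ID": ["S1", "S1", "S2", "S2", "S3", "S3",
                   "S4", "S5", "S6", "S7", "S8"],
    "ENTRIES_DIFF": [500, 700, 300, 200, 900, 50,
                     100, 80, 60, 40, 20],
})

# Total entries per station, keeping the seven busiest.
top7 = (
    df.groupby("STATION_ID")["ENTRIES_DIFF"]
      .sum()
      .nlargest(7)
      .reset_index(name="TOTAL_ENTRIES")
)

print(top7)
```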

Then, create a bar chart.
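One way to draw the bar chart is with Matplotlib's horizontal bars. This sketch assumes a `top7` table with hypothetical `STATION_ID` and `TOTAL_ENTRIES` columns and placeholder values. The original chart may have been styled differently.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Placeholder table standing in for the real top-7 result.
top7 = pd.DataFrame({
    "STATION_ID": ["S1", "S2", "S3", "S4", "S5", "S6", "S7"],
    "TOTAL_ENTRIES": [1200, 950, 500, 100, 80, 60, 40],
})

# Sort ascending so the busiest station ends up at the top of the chart.
ordered = top7.sort_values("TOTAL_ENTRIES")

fig, ax = plt.subplots(figsize=(8, 4))
ax.barh(ordered["STATION_ID"], ordered["TOTAL_ENTRIES"])
ax.set_xlabel("Total entries")
ax.set_title("Top 7 stations by total entries")
fig.tight_layout()
# In an interactive session, call plt.show() to display the figure.
```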

For the second chart, drill down to see the entries by day of week for the top 7 stations.

Here’s the line chart!
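The day-of-week breakdown and its line chart could be sketched as below. The data is made up, the station names are placeholders, and averaging daily totals per weekday is our assumption about how the days were aggregated.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up interval-level data for two placeholder stations.
df = pd.DataFrame({
    "STATION_ID": ["Station A"] * 4 + ["Station B"] * 2,
    "DAY": pd.to_datetime(["2019-06-03", "2019-06-03", "2019-06-10",
                           "2019-06-04", "2019-06-03", "2019-06-04"]),
    "ENTRIES_DIFF": [100, 200, 500, 250, 50, 70],
})
df["DAY_OF_WEEK"] = df["DAY"].dt.day_name()

# Step 1: total entries per station per calendar day.
daily = df.groupby(["STATION_ID", "DAY", "DAY_OF_WEEK"],
                   as_index=False)["ENTRIES_DIFF"].sum()

# Step 2: average those daily totals for each weekday, one column per station.
weekday_order = ["Monday", "Tuesday", "Wednesday", "Thursday",
                 "Friday", "Saturday", "Sunday"]
by_dow = (
    daily.pivot_table(index="DAY_OF_WEEK", columns="STATION_ID",
                      values="ENTRIES_DIFF", aggfunc="mean")
         .reindex(weekday_order)   # calendar order instead of alphabetical
         .dropna(how="all")        # drop weekdays missing from the toy data
)

# Step 3: one line per station.
fig, ax = plt.subplots(figsize=(8, 4))
for station in by_dow.columns:
    ax.plot(by_dow.index, by_dow[station], marker="o", label=station)
ax.set_ylabel("Average daily entries")
ax.legend()
fig.tight_layout()
```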

On to our last chart: dive even deeper to see the traffic for each 4-hour interval throughout the day.

Lastly, the heatmap!
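A heatmap of stations against four-hour slots can be drawn with plain Matplotlib, as in this sketch. The numbers are made up, the `TIME_OF_DAY` labels are hypothetical, and the original chart may well have used a different plotting library.

```python
import matplotlib.pyplot as plt
import pandas as pd

slots = ["12-4AM", "4-8AM", "8AM-12PM", "12-4PM", "4-8PM", "8PM-12AM"]

# Made-up entries per station and four-hour slot.
df = pd.DataFrame({
    "STATION_ID": ["Station A"] * 6 + ["Station B"] * 6,
    "TIME_OF_DAY": slots * 2,
    "ENTRIES_DIFF": [10, 80, 150, 120, 300, 90,
                     5, 60, 110, 100, 220, 70],
})

# Rows = stations, columns = time slots in chronological order.
heat = df.pivot_table(index="STATION_ID", columns="TIME_OF_DAY",
                      values="ENTRIES_DIFF", aggfunc="sum")[slots]

fig, ax = plt.subplots(figsize=(8, 3))
im = ax.imshow(heat.to_numpy(), aspect="auto", cmap="viridis")
ax.set_xticks(range(len(slots)))
ax.set_xticklabels(slots)
ax.set_yticks(range(len(heat.index)))
ax.set_yticklabels(heat.index)
fig.colorbar(im, ax=ax, label="Entries")
fig.tight_layout()
```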

Recommendations

Based on the charts, we recommend placing the WTWY Street teams at the following top 5 subway stations, targeting weekdays between 4 and 8 PM:

Grand Central at 42nd St
Herald Sq at 34th St
Union Square and 14th St
Fulton St
Port Authority at 42nd St

Future Work

Given more time, we would love to extend the analysis in the following ways:

Find out the tourist-heavy time of day to avoid placing Street teams during those times, as tourists most likely will not be able to attend the gala locally.
Find out the location of tech companies in NYC that are also close to the stations, to attract people with tech backgrounds.
Create a weighting scheme that assigns a weight to each attribute and combines them into a final score for ranking each station. Attributes could include general foot traffic, proximity to tech companies, demographic data, etc.
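To make the weighting idea concrete, here is a small sketch of a weighted score. Everything in it is hypothetical: the station names, the attribute values (assumed to be already normalized to the 0 to 1 range), and the weights are placeholders rather than results.

```python
import pandas as pd

# Placeholder attribute scores, assumed to be normalized to [0, 1].
stations = pd.DataFrame(
    {
        "foot_traffic": [0.9, 0.6, 0.4],
        "tech_proximity": [0.2, 0.8, 0.5],
        "demographic_fit": [0.5, 0.7, 0.9],
    },
    index=["Station A", "Station B", "Station C"],
)

# Hypothetical weights that sum to 1.
weights = pd.Series({
    "foot_traffic": 0.5,
    "tech_proximity": 0.3,
    "demographic_fit": 0.2,
})

# Final score = weighted sum of the attributes for each station.
stations["score"] = stations[weights.index].dot(weights)
ranking = stations.sort_values("score", ascending=False)

print(ranking)
```

Choosing the weights is the real work here. A scheme like this only ranks stations sensibly if the attributes are put on a comparable scale first.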