Exploratory Data Analysis (EDA) of NYC MTA Turnstile Data with Visualization Using Folium Map

Sep 30, 2019

Preliminary results & recommendations for the next summer gala for WomenTechWomenYes (WTWY)

As my first major step into the mystical world of data science, I enrolled in a full-time data science bootcamp at Metis in Seattle. Getting here was no small task. After completing three challenges to demonstrate my math, Python, and project design skills, plus a technical interview, I was fortunate to be selected to join the Seattle Fall 2019 cohort. I packed two checked bags and uprooted myself from my family, friends, and industry. Armed with completed pre-work assignments and enthusiasm, I arrived to meet my cohort of brilliant students from all over the country, who share my passion for learning and my dream of becoming a data scientist. Here is what I did in week 1 of my journey.

Problem Statement

WTWY has an annual gala at the beginning of the summer, as a fundraiser to support their mission to increase participation of women in technology. They wish to optimize their deployment of street teams — using MTA subway data — to increase awareness, increase sign ups and attendance for their gala event, and increase contributions to WTWY. My team was asked to help create a street team deployment plan that will optimize WTWY’s resources and achieve their goals. With less than a week to present our results, we prepared a summary of preliminary findings and demonstration of how we can proceed with future work.

Methods

The scope of this project required only the first three steps of the OSEMN model: Obtain, Scrub, and Explore.

Obtain

As per our client’s request, we started with MTA subway usage data.

Assumption Alert: We used Spring 2019 MTA turnstile data — under the assumptions that WTWY would likely be advertising the gala during spring and that spring 2019 data would best predict spring 2020 travel patterns.
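As a rough sketch of the Obtain step: the MTA publishes turnstile data as weekly files dated by the Saturday that ends each week. The snippet below builds the list of weekly file URLs for a date range. The URL pattern and the March to May definition of "spring" are assumptions for illustration, not details from our project, so check them against the MTA site before relying on them.

```python
from datetime import date, timedelta

# Assumed URL pattern for the weekly turnstile files; verify before use.
BASE_URL = "http://web.mta.info/developers/data/nyct/turnstile/turnstile_{:%y%m%d}.txt"


def weekly_file_urls(start, end):
    """Return one URL per Saturday between start and end (inclusive)."""
    # Step forward to the first Saturday on or after `start` (Monday=0, Saturday=5).
    day = start + timedelta(days=(5 - start.weekday()) % 7)
    urls = []
    while day <= end:
        urls.append(BASE_URL.format(day))
        day += timedelta(days=7)
    return urls


# Assumption: "spring 2019" means March 1 through May 31, 2019.
urls = weekly_file_urls(date(2019, 3, 1), date(2019, 5, 31))
print(len(urls), urls[0])
```

Each URL can then be read into a table (for example with pandas `read_csv`) and the weekly tables concatenated into one data set.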

Scrub, Explore, Repeat

We then scrubbed the data and performed aggregations to find average daily entrances and exits by station name. Below are the steps we took to achieve this:

Better understand the data collection methodology: We found additional information about how the MTA turnstile data was collected via the MTA Field Descriptions, NY.Gov MTA Data Overview, and NY.gov MTA Data Dictionary.
Convert cumulative turnstile counter readings to actual entry/exit counts: For each turnstile, we subtracted the previous reading from the current reading. Each resulting count reflects traffic during the interval between the two readings, and we summed these counts in our aggregations.
Handle backward counts: Sometimes the count at one audit reading would be followed by progressively lower numbers. Subtracting the previous reading from the current reading then produced negative “actual counts”. To resolve this, we took the absolute values of our actual counts.
Identify and filter out turnstile counter resets: According to the NY.gov Data Dictionary, when counter devices are replaced, the memory device containing the current count is erased (i.e., counts are reset). Because these resets produced abrupt jumps or drops in the readings, with no way to reconcile the counts during those time periods, we removed them.
Filter out “Recovery readings”: Sometimes a scheduled audit to record the current count was missed, and then the subsequent audit would be labeled as “RECOVR AUD”. These resulted in large and unpredictable deltas per time period, and therefore we filtered these counts from our data set.
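The scrubbing steps above can be sketched in plain Python as follows. This is a minimal illustration, not our actual notebook code. The column names (C/A, UNIT, SCP, STATION, DATE, TIME, DESC, ENTRIES) follow the MTA field descriptions, but the sample rows are made up, and the 10,000 cutoff used to flag resets is an assumed threshold. Only entries are shown; exits would be handled the same way.

```python
from datetime import datetime
from itertools import groupby

MAX_PER_INTERVAL = 10_000  # assumed cutoff; larger jumps are treated as resets


def turnstile_id(row):
    return (row["C/A"], row["UNIT"], row["SCP"], row["STATION"])


def read_time(row):
    return datetime.strptime(f'{row["DATE"]} {row["TIME"]}', "%m/%d/%Y %H:%M:%S")


def interval_entries(rows, max_per_interval=MAX_PER_INTERVAL):
    """Convert cumulative ENTRIES readings into per-interval counts."""
    ordered = sorted(rows, key=lambda r: (turnstile_id(r), read_time(r)))
    counts = []
    for _, group in groupby(ordered, key=turnstile_id):
        readings = list(group)
        for prev, cur in zip(readings, readings[1:]):
            if cur["DESC"] == "RECOVR AUD":
                continue  # skip deltas that end on a recovery audit
            delta = abs(cur["ENTRIES"] - prev["ENTRIES"])  # handles backward counters
            if delta > max_per_interval:
                continue  # implausibly large change: treat as a counter reset
            counts.append({"STATION": cur["STATION"], "DATE": cur["DATE"], "entries": delta})
    return counts


def reading(scp, time, entries, desc="REGULAR"):
    """Build one hypothetical audit row for the example."""
    return {"C/A": "A001", "UNIT": "R001", "SCP": scp, "STATION": "EXAMPLE ST",
            "DATE": "05/01/2019", "TIME": time, "DESC": desc, "ENTRIES": entries}


sample = [
    reading("00-00-00", "00:00:00", 100),
    reading("00-00-00", "04:00:00", 130),                # +30
    reading("00-00-00", "09:13:00", 400, "RECOVR AUD"),  # recovery audit, skipped
    reading("00-00-00", "12:00:00", 420),                # +20
    reading("00-00-01", "00:00:00", 2_000_000),
    reading("00-00-01", "04:00:00", 1_999_990),          # counts backward: 10
    reading("00-00-01", "08:00:00", 5),                  # reset, skipped
]

counts = interval_entries(sample)
print([c["entries"] for c in counts])  # [30, 20, 10]
```

In pandas, the same idea is usually written by sorting the rows, grouping by turnstile, and calling `diff()` on the counter columns.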

Given the limited information about WTWY’s available resources for their street teams, our primary focus for this first deliverable was the “where” (i.e., optimal locations to deploy their teams). We identified the subway stations with the highest traffic, but one key assumption of targeting subway stations by volume is that all subway riders are equally likely to be interested in attending WTWY’s gala and contributing to their cause.
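To make the ranking step concrete, here is a small sketch of how per-interval counts could be rolled up into average daily traffic per station and sorted. The station names and numbers are invented for illustration, and "traffic" here is assumed to mean entries plus exits.

```python
from collections import defaultdict
from statistics import mean


def top_stations(counts, n=3):
    """Rank stations by average daily traffic, highest first."""
    # Sum all interval counts for each (station, day) pair.
    daily = defaultdict(int)
    for station, day, traffic in counts:
        daily[(station, day)] += traffic
    # Collect the daily totals per station, then average them.
    per_station = defaultdict(list)
    for (station, _), total in daily.items():
        per_station[station].append(total)
    averages = {station: mean(totals) for station, totals in per_station.items()}
    return sorted(averages.items(), key=lambda item: item[1], reverse=True)[:n]


# Hypothetical (station, date, entries + exits) records.
sample_counts = [
    ("STATION A", "2019-05-01", 500),
    ("STATION A", "2019-05-01", 700),
    ("STATION A", "2019-05-02", 800),
    ("STATION B", "2019-05-01", 300),
    ("STATION B", "2019-05-02", 500),
    ("STATION C", "2019-05-01", 2000),
    ("STATION C", "2019-05-02", 1000),
]

ranking = top_stations(sample_counts, n=2)
print(ranking)  # STATION C (1500 per day), then STATION A (1000 per day)
```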

To overcome the limitation of using only the MTA data, we performed additional research about the surrounding areas of these locations to narrow our results. We were specifically interested in the locations of tech companies, as we believe that people who work for tech companies are more likely to understand the benefits of advancing women in technology and support WTWY’s work.
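Our tech hub research was done by hand, but one way to make "proximity" measurable in future work would be to compute the distance between each station and each tech office. The sketch below uses the haversine formula for this. The coordinates are made-up points for illustration, and the 0.5 km walking radius is an assumed cutoff.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points in kilometers."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def stations_near_hubs(stations, hubs, max_km=0.5):
    """Keep stations within max_km of at least one tech hub."""
    return [
        name
        for name, (lat, lon) in stations.items()
        if any(haversine_km(lat, lon, hub_lat, hub_lon) <= max_km for hub_lat, hub_lon in hubs)
    ]


# Hypothetical coordinates, for illustration only.
stations = {"STATION A": (40.7500, -73.9900), "STATION B": (40.8000, -73.9500)}
tech_hubs = [(40.7520, -73.9880)]

near = stations_near_hubs(stations, tech_hubs)
print(near)  # ['STATION A']
```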

Results

Recommendations

Based on station traffic volume and proximity to locations identified as tech hubs, we recommend that WTWY deploy street teams to the following 6 stations:

1. 34th St - Penn Station
2. 42nd St - Grand Central Station
3. 42nd St
4. 34th St - Herald Square
5. 23rd St
6. 14th St - Union Sq

Future Work

We would gladly welcome the opportunity to continue to work with WTWY, given more time and details, to propose a more detailed plan that best fits their needs. Some of the additional offerings we can provide include:

A schedule by day, time (e.g., morning and afternoon), and team size for selected subway stations.
Optimized target list of subway stations based on additional factors, such as demographic information of travelers and more geographic data of areas surrounding the stations. Additionally, if available, we can incorporate data captured from WTWY’s historical records from previous galas: demographic information of past sign ups and/or contributors, past records of successful street team locations.