First Week At Metis — NYC Commuter Data Analysis with Python

Natasha Borders
@natashaborders
5 min read · Apr 7, 2019
Photo by Asael Peña on Unsplash

The first week of my data science bootcamp at Metis just came to an end, and it seems like my fellow aspiring data scientists and I have been in it together for years. Starting on day one, we dove head-on into exploratory data analysis, with our instructors not pulling any punches and giving us a lot of freedom in analyzing our very first dataset.

This was a challenging proposition for me. When my husband, Casey, cheerfully informed me only a few short months earlier that we were moving to California, I was busy happily running two businesses, Bart & Gabriel Pet Sitting and Hanging Lantern, in Columbus, Ohio. But when Google calls, a software developer answers, and so we packed up our life and moved to California. As soon as I determined that data science was going to be my new career, based on my passion for solving problems and my endless curiosity, I applied to the data science bootcamp with the most meticulous admission process, learned Python, Pandas, and other required tools, and got accepted to Metis for the Spring 2019 cohort.

Which brings me back to Metis on week one and our first team project, and how surreal it seemed that on day one we would begin being data scientists in earnest. For this project, our imaginary client, Women Tech Women Yes, tasked us with determining the best spots in New York City to send street teams to promote their upcoming summer Gala, with two goals: collecting emails for future outreach efforts and sending free Gala tickets to those interested.

Our Approach

We decided to focus on two main objectives: finding the busiest locations to generate the highest volume of sign-ups for the breadth of outreach, and targeting the most affluent neighborhoods to find those most likely to donate larger amounts to the Gala when attending.

For our dataset, we downloaded three months' worth of MTA turnstile data from the MTA website, covering March through May 2018, since the Gala would be scheduled for early summer. Our main assumption was that the street teams would only be working during the week, so we focused on providing actionable insights on where and when they should be stationed for maximum marketing payoff.
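As a rough illustration of this step, here is a minimal sketch of how several weekly files could be stacked into one DataFrame. It is not necessarily the exact code from our project: the helper name `load_turnstile_files`, the column layout beyond ENTRIES and EXITS, and the two tiny in-memory "files" are all assumptions made for the example.

```python
import io

import pandas as pd


def load_turnstile_files(files):
    """Read weekly turnstile CSV files and stack them into one DataFrame."""
    frames = [pd.read_csv(f) for f in files]
    df = pd.concat(frames, ignore_index=True)
    # Defensively strip stray whitespace from the header names.
    df.columns = df.columns.str.strip()
    # Combine the separate date and time columns into one timestamp.
    df["DATETIME"] = pd.to_datetime(
        df["DATE"] + " " + df["TIME"], format="%m/%d/%Y %H:%M:%S"
    )
    return df


# Two in-memory stand-ins for downloaded weekly files (made-up rows,
# assumed column layout; note the padded last header name).
week1 = io.StringIO(
    "C/A,UNIT,SCP,STATION,DATE,TIME,ENTRIES,EXITS   \n"
    "A002,R051,02-00-00,59 ST,03/03/2018,03:00:00,100,50\n"
)
week2 = io.StringIO(
    "C/A,UNIT,SCP,STATION,DATE,TIME,ENTRIES,EXITS   \n"
    "A002,R051,02-00-00,59 ST,03/10/2018,03:00:00,900,450\n"
)

turnstiles = load_turnstile_files([week1, week2])
print(turnstiles[["STATION", "DATETIME", "ENTRIES", "EXITS"]])
```

With real data, the list passed to the helper would hold the paths of the downloaded weekly files instead of `StringIO` objects.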

Cleaning and Organizing the Data

When we got a first glimpse of our data, the terminology did not make much sense to us:

Raw data sample from our MTA dataset.

We noticed that the ENTRIES and EXITS columns held running cumulative turnstile counts rather than the number of people passing through in each interval, so they did not directly portray the traffic flow through the stations. As a result, we took the differences between consecutive readings of entries and of exits, and combined them into a joint Traffic column, where each data point represents the combined number of entries and exits through that turnstile for the four-hour period ending at the time listed.
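The differencing described above can be sketched as follows. This is a minimal illustration under stated assumptions, not necessarily our exact project code: the single `TURNSTILE` identifier column and all of the readings are hypothetical.

```python
import pandas as pd

# Hypothetical cumulative readings for two turnstiles (not real MTA data).
df = pd.DataFrame({
    "TURNSTILE": ["A", "A", "A", "B", "B", "B"],
    "DATETIME": pd.to_datetime([
        "2018-03-05 04:00", "2018-03-05 08:00", "2018-03-05 12:00",
        "2018-03-05 04:00", "2018-03-05 08:00", "2018-03-05 12:00",
    ]),
    "ENTRIES": [1000, 1040, 1300, 500, 510, 590],
    "EXITS": [800, 820, 900, 300, 330, 400],
})

df = df.sort_values(["TURNSTILE", "DATETIME"])
# Difference each counter within its own turnstile, so readings from
# different devices are never subtracted from one another.
deltas = df.groupby("TURNSTILE")[["ENTRIES", "EXITS"]].diff()
df["Traffic"] = deltas["ENTRIES"] + deltas["EXITS"]
# The first reading of each turnstile has no predecessor, so drop it.
df = df.dropna(subset=["Traffic"])
print(df[["TURNSTILE", "DATETIME", "Traffic"]])
```

Grouping before differencing matters: a plain `diff()` over the whole table would subtract the last reading of one turnstile from the first reading of the next.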

Then we removed the outliers: negative numbers generated by errors in turnstile operation, and the top 1% of the Traffic entries, which were too high to be feasible and most likely occurred when a turnstile reset itself and restarted its count at 0. Using Python with the Pandas, Matplotlib, and Seaborn libraries, we examined our data and created some visualizations to help us understand the patterns of commuter traffic through the city.
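One way to express this filtering in Pandas is sketched below. The numbers are invented, and the order of operations (dropping negatives before computing the 99th percentile) is an assumption for the example rather than a record of what we ran.

```python
import pandas as pd

# Hypothetical per-interval traffic values, including one negative reading
# and one implausibly large jump (illustrative numbers only).
traffic = pd.DataFrame({"Traffic": [-5] + list(range(1, 200)) + [5_000_000]})

# Drop negative values first so they do not influence the cutoff.
cleaned = traffic[traffic["Traffic"] >= 0]
# Keep rows at or below the 99th percentile, i.e. discard the top 1%.
cutoff = cleaned["Traffic"].quantile(0.99)
cleaned = cleaned[cleaned["Traffic"] <= cutoff]

print(len(traffic), "rows before,", len(cleaned), "rows after")
```

A percentile cutoff is a blunt tool: it always removes the top 1% of rows, even in a dataset with no bad readings, so it is worth checking what the discarded values look like.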

Exploratory Data Analysis

The first result we looked at was the list of the busiest stations in NYC overall:

Overall Highest Traffic by Volume (March — May 2018)
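A ranking like this one comes down to a group-and-sum. The sketch below uses placeholder station names and made-up counts, and assumes a cleaned table with one row per interval and a `STATION` column.

```python
import pandas as pd

# Hypothetical cleaned interval traffic (station names and counts are made up).
traffic = pd.DataFrame({
    "STATION": ["STATION A", "STATION A", "STATION B", "STATION B", "STATION C"],
    "Traffic": [1200, 1500, 900, 700, 300],
})

# Sum every interval per station, then keep the largest totals.
top_stations = traffic.groupby("STATION")["Traffic"].sum().nlargest(2)
print(top_stations)
```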

Focusing on traffic by station across the weekdays, we observed that traffic fluctuated only slightly for the commuter stations of Grand Central and Penn Station, and was higher on Fridays for the tourist-heavy Times Square.

Cumulative Traffic by Weekday (March — May 2018) for Top Five Stations by Overall Volume
Cumulative Traffic at Penn Station on Tuesdays (March — May 2018)
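The weekday comparison behind these charts can be sketched as a pivot of station against day of the week. The station names and counts below are placeholders, and the column names are assumptions for the example.

```python
import pandas as pd

# Hypothetical cleaned interval traffic (names and counts are made up).
traffic = pd.DataFrame({
    "STATION": ["STATION A", "STATION A", "STATION A", "STATION B", "STATION B"],
    "DATETIME": pd.to_datetime([
        "2018-03-05 08:00",  # Monday
        "2018-03-06 08:00",  # Tuesday
        "2018-03-13 08:00",  # Tuesday
        "2018-03-09 16:00",  # Friday
        "2018-03-10 16:00",  # Saturday
    ]),
    "Traffic": [100, 200, 250, 400, 999],
})

# dayofweek counts Monday as 0 and Sunday as 6, so keep 0 through 4.
weekdays = traffic[traffic["DATETIME"].dt.dayofweek < 5].copy()
weekdays["WEEKDAY"] = weekdays["DATETIME"].dt.day_name()

# One row per weekday, one column per station, summed over all weeks.
by_weekday = weekdays.pivot_table(
    index="WEEKDAY", columns="STATION", values="Traffic",
    aggfunc="sum", fill_value=0,
)
# Put the rows in calendar order instead of alphabetical order.
order = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday"]
by_weekday = by_weekday.reindex(order, fill_value=0)
print(by_weekday)
```

A table in this shape can be handed directly to a plotting call such as `by_weekday.plot(kind="bar")`.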

After looking in a bit more detail at the Tuesday and Thursday traffic at Grand Central and Penn Station, we recommended the street teams target those stations during the morning and evening commuter hours.

Cumulative Traffic at Times Square on Fridays (March — May 2018)

Times Square was busier during the afternoon and evening hours on Fridays, making it a good destination for the street teams during that period.
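A time-of-day profile for one station on one weekday can be sketched as follows. Again, the station name, the counts, and the column names are hypothetical, and this is an illustration rather than our exact code.

```python
import pandas as pd

# Hypothetical interval traffic for one station on two Fridays.
traffic = pd.DataFrame({
    "STATION": ["STATION A"] * 6,
    "DATETIME": pd.to_datetime([
        "2018-03-09 08:00", "2018-03-09 16:00", "2018-03-09 20:00",
        "2018-03-16 08:00", "2018-03-16 16:00", "2018-03-16 20:00",
    ]),
    "Traffic": [300, 700, 900, 350, 800, 1000],
})

is_friday = traffic["DATETIME"].dt.day_name() == "Friday"
fridays = traffic[(traffic["STATION"] == "STATION A") & is_friday]

# Each timestamp marks the end of a four-hour window, so grouping by hour
# compares the same window across every Friday in the sample.
profile = fridays.groupby(fridays["DATETIME"].dt.hour)["Traffic"].sum()
busiest_hour = profile.idxmax()
print(profile)
print("Busiest window ends at hour", busiest_hour)
```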

After examining the data purely by volume, and recommending that Women Tech Women Yes focus on the commuter traffic on Tuesdays and Thursdays and on the evening leisure and commuter crowd at Times Square on Friday nights, we shifted our focus to the prospective donors who might be residing in the more affluent neighborhoods of New York. When we examined the Upper East Side (cited as one of the highest median income neighborhoods in the city) and focused on the subway stations in the area, the 86th Street station emerged as the one to visit.

Cumulative Traffic by Day of the Week (March — May 2018) for Upper East Side Stations

Since we had not yet suggested a Wednesday target, the 86th Street station appeared to be a good place to be during the commuter hours and perhaps over the lunch hours as well.

Conclusions

Based on the exploratory data analysis we conducted, the highest traffic commuter stations would be best targeted during the morning and evening hours, and the tourist leisure centers would prove productive during the Friday afternoon and evening hours. Wealthier commuters could be found at their neighborhood subway stations during the same busy hours.

Further Exploration Ideas

One short week was definitely not enough for us to dissect all the potential insights that could be collected from this vast dataset. Given more time, it would have been interesting to consider where tech-heavy neighborhoods intersect with the populations of women in tech and women-owned businesses, to target those commuters more directly, and to look further into the traffic flow throughout Manhattan and the other boroughs.

Our analysis can be found on GitHub, and the source data can be found at the MTA website.

Reflecting on everything I learned this week, it is amazing to see how quickly one enters the mindset of a data scientist. I am looking forward to the future projects at Metis and beyond, curious to see what further insights they will bring!

Natasha Borders
@natashaborders

Data Scientist | Analyst seeking roles in the San Francisco Bay area. LinkedIn: in/natashaborders/ | GitHub: natashaborders | Me: natashaborders.com