Analysis of New York MTA Turnstile Data
Using the turnstile data to find the top ten subway stations with the most traffic
For our first project at Metis, we worked with New York City MTA turnstile data to figure out which stations had the most traffic over a given period of time. For this project we used Pandas, Seaborn, and Matplotlib to perform exploratory data analysis and visualize our results. In this article, I will go through our goal, the data cleaning, the analysis, and the conclusion.
The Challenge:
WomenTechWomenYes (WTWY) holds an annual gala at the beginning of every summer, with the goal of increasing the participation of women in technology while building awareness and reach. To support this goal, WTWY plans to deploy street teams at New York City subway stations. The street teams will collect email addresses, and those who sign up are sent free tickets to the gala. WTWY wants us to analyze the MTA data set and recommend the subway stations where their street teams are likely to collect the most sign-ups.
Data Cleaning:
The MTA makes its turnstile data available to the public, and it can be downloaded for free here. These are the variables found in the dataset:
C/A = Control Area
UNIT = Remote Unit for a station
SCP = Subunit Channel Position, which represents a specific address for a device
STATION = Represents the station name the device is located at
LINENAME = Represents all train lines that can be boarded at this station
DIVISION = Represents the line the station originally belonged to: BMT, IRT, or IND
DATE = Represents the date (MM-DD-YY)
TIME = Represents the time (hh:mm:ss) for a scheduled audit event
DESC = Represents the “REGULAR” scheduled audit event (normally occurs every 4 hours)
ENTRIES = The cumulative entry register value for a device
EXITS = The cumulative exit register value for a device
For this project we used the turnstile data from March 1st to June 15th of each year from 2016 to 2018. Note that the entries and exits columns are cumulative counters that are recorded every four hours. To calculate the number of people who went through a specific station in a given four-hour interval, we subtracted consecutive rows of the cumulative entries and exits. We then added the resulting entry and exit differences together to form a column for “total number of traffic”.
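The differencing step described above can be sketched in Pandas roughly as follows. This is a minimal illustration and not our exact project code: the sample rows are made up, and the helper column names (DATETIME, ENTRY_DIFF, EXIT_DIFF, TRAFFIC) and the date format are assumptions chosen for the example. It also assumes the differences should be taken per turnstile, identified here by C/A, UNIT, SCP, and STATION.

```python
import pandas as pd

# Hand-made sample in the shape of the turnstile file (illustrative values only).
df = pd.DataFrame({
    "C/A": ["A002"] * 6,
    "UNIT": ["R051"] * 6,
    "SCP": ["02-00-00"] * 3 + ["02-00-01"] * 3,
    "STATION": ["59 ST"] * 6,
    "DATE": ["03/01/2016"] * 6,
    "TIME": ["00:00:00", "04:00:00", "08:00:00"] * 2,
    "ENTRIES": [1000, 1010, 1050, 500, 505, 530],
    "EXITS": [2000, 2004, 2020, 300, 302, 310],
})

# Combine date and time so the rows can be ordered chronologically.
# The date format here is an assumption for this sample.
df["DATETIME"] = pd.to_datetime(
    df["DATE"] + " " + df["TIME"], format="%m/%d/%Y %H:%M:%S"
)

# Columns assumed to identify one physical turnstile.
turnstile = ["C/A", "UNIT", "SCP", "STATION"]
df = df.sort_values(turnstile + ["DATETIME"]).reset_index(drop=True)

# Subtract each cumulative reading from the next one, within each turnstile.
df["ENTRY_DIFF"] = df.groupby(turnstile)["ENTRIES"].diff()
df["EXIT_DIFF"] = df.groupby(turnstile)["EXITS"].diff()

# The first reading of every turnstile has no previous row, so drop it.
df = df.dropna(subset=["ENTRY_DIFF", "EXIT_DIFF"])

# Total traffic per interval = entries + exits in that interval.
df["TRAFFIC"] = df["ENTRY_DIFF"] + df["EXIT_DIFF"]

print(df[["SCP", "DATETIME", "ENTRY_DIFF", "EXIT_DIFF", "TRAFFIC"]])
```

Grouping before calling diff keeps the subtraction from running across two different devices. Cumulative counters in real data can also reset or jump, which would show up as negative or implausibly large differences, so a filtering step may be needed there. This sketch does not include one.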
Data Analysis:
After cleaning the data and creating the “total number of traffic” column, we were able to start analyzing the data. Below are some graphs that can be used to give recommendations to WomenTechWomenYes (WTWY) on where to deploy their street teams.
Above is a bar graph of the ten stations with the most traffic for March through June, from 2016 to 2018.
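A ranking like the one in the bar graph can be produced with a groupby and a sort. The sketch below is a hedged example rather than our actual notebook: the data frame `traffic`, its TRAFFIC column, the sample numbers, and the helper `plot_top_stations` are all hypothetical, and `TOP_N` is set to 3 only because the sample is tiny (the article uses ten).

```python
import pandas as pd

# Hypothetical per-interval traffic, one row per turnstile reading (made-up values).
traffic = pd.DataFrame({
    "STATION": ["34 ST-PENN STA", "23 ST", "34 ST-PENN STA",
                "86 ST", "23 ST", "FULTON ST"],
    "TRAFFIC": [120, 80, 200, 50, 40, 30],
})

TOP_N = 3  # the article uses ten; three is enough for this small sample

# Sum traffic per station, then keep the busiest stations in descending order.
station_totals = traffic.groupby("STATION")["TRAFFIC"].sum()
top = station_totals.nlargest(TOP_N)
print(top)


def plot_top_stations(top_stations):
    """Draw a horizontal bar chart of the given station totals (hypothetical helper)."""
    # Imported here so the aggregation above runs without the plotting libraries.
    import matplotlib.pyplot as plt
    import seaborn as sns

    data = top_stations.reset_index()  # columns: STATION, TRAFFIC
    ax = sns.barplot(data=data, x="TRAFFIC", y="STATION")
    ax.set_xlabel("Total traffic (entries + exits)")
    ax.set_ylabel("Station")
    plt.tight_layout()
    plt.show()
```

Calling `plot_top_stations(top)` would then draw the chart. The station name goes on the y-axis so that long names stay readable.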
The results above show which stations would be the most effective places to deploy the street teams on each day of the week. For WTWY to get the most subscribers and possible donations, it makes sense to target the stations with the most traffic.
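One way to get a per-day view like this is to label each reading with its weekday and aggregate by station and day. The following is a small sketch under assumptions: the `traffic` frame, its DATETIME and TRAFFIC columns, and the sample values are invented for illustration. It sums traffic per weekday, which may differ from how the original chart was aggregated.

```python
import pandas as pd

# Hypothetical per-interval traffic with timestamps (made-up values).
traffic = pd.DataFrame({
    "STATION": ["34 ST-PENN STA", "34 ST-PENN STA", "23 ST",
                "34 ST-PENN STA", "23 ST"],
    "DATETIME": pd.to_datetime([
        "2016-03-01 08:00", "2016-03-01 12:00", "2016-03-01 08:00",
        "2016-03-05 12:00", "2016-03-05 12:00",
    ]),
    "TRAFFIC": [100, 150, 90, 40, 60],
})

# Label every reading with its weekday name.
traffic["DAY"] = traffic["DATETIME"].dt.day_name()

# Rows are stations, columns are weekdays, cells are summed traffic.
daily = (
    traffic.groupby(["STATION", "DAY"])["TRAFFIC"]
    .sum()
    .unstack("DAY", fill_value=0)
)

# For each weekday, the station with the highest total.
busiest = daily.idxmax()
print(daily)
print(busiest)
```

With a table shaped like `daily`, weekday and weekend columns can be compared directly when deciding where a smaller weekend team should go.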
As the map above shows, most of our top ten stations are also located in the major tech hubs of Manhattan, which gives a higher chance of reaching people who work in the tech industry.
Conclusion:
For WTWY’s (WomenTechWomenYes) annual gala, we would recommend deploying street teams on weekdays at 34 St-Penn Station, Grand Central-42 St, 34 St-Herald Sq, 23 St, Times Sq-42 St, 14 St-Union Sq, 86 St, Fulton St, 42 St-Port Auth, and 59 St Columbus. On weekends, since there will be fewer street teams available, we would recommend focusing on 34 St-Penn Station, Grand Central-42 St, 34 St-Herald Sq, 23 St, and Times Sq-42 St.
Further Steps:
There are a few things I want to implement to improve the project:
· Take tourist and non-tourist areas into account in order to focus more on local commuters.
· Investigate tech-related events that occur in Manhattan around the time of the gala.
· Gather demographic data to identify the areas with the most tech-related residents.
Technical Note:
Pandas DataFrames were mainly used to clean and organize the data. Various methods, including groupby, loc, and iloc, were used to arrange the data into a comparable format. Finally, Matplotlib and Seaborn were used to visualize the data and create the graphs.