Data Analysis Project: The Analysis of Subway Turnstile Data in New York City

5 min readJan 3, 2023

In this article, I would like to talk about my exploratory data analysis project about MTA New York City subway turnstile data. I did data analysis because there was a case study which was about optimization of street teams. There will be an event in NYC in the last week of June 2023. Street teams will be in subway stations, and they will get e-mail address from people to sell the event tickets. My aim was to optimize distribution of street teams and make them get as many email addresses as possible.

Firstly, I benefited from the turnstile data at MTA (Metropolitan Transportation Authority) which provides public transportation service in New York. Weekly subway turnstile records can be found from MTA website. I decided to consider data from the last week of February to the first week June in 2022. I wanted to observe these data in this time interval because I thought 3 months before June shows behavior of people traffic in subway stations. Also, there were about 3 million records in this interval, and these were enough to understand people traffic flow.

After deciding which data to use, I started to explore data and clean it in Jupyter Notebook. I checked whether there were null values and repeated values or not and in the data frame. There were no null values, but the data frame included repeated values. One of the examples is shown below.

As can be seen, there were two records for same turnstile, date and time. These duplicated values were dropped. Thus, initial cleaning data was completed.

“ENTRIES” and “EXITS” columns included cumulative values so differences between two consecutives were calculated by grouping turnstile and using transform and diff functions in python. “ENTRIES_COUNT” and “EXIT_COUNT” were created and differences between records were assigned to these two columns.

After determining entry and exits, I checked whether there were negative values or not. I find some negative values by filtering data so I created a new data frame by taking only positive values. In this way, negative ones were eliminated.

Negative Values for ENTRIES_COUNT & EXIT_COUNT

The Data Frame including only Positive Values

When the data frame was analyzed by using describe function, I realized that there were high outlier values. These outliers could change actual behavior of the data frame and they wouldn’t have let me analyze data efficiently. Therefore, I decided to set the upper limit to eliminate outliers. Assumption is that if a person can pass a turnstile in 3 seconds, 20 people can pass a turnstile in a minute, 1200 people can pass a turnstile in an hour and 4800 people can pass a turnstile in four hours. Thus, values higher than 4800 were removed the data frame.

The Description of the Data Before Upper Limit

The Description of the Data After Upper Limit

After all operations, the data frame was ready for analysis. I added entries count and exit counts to determine the total people and I found the busiest 10 stations by grouping stations and summing total people.

After finding top 10 stations in general, I wanted to analyze the busiest 10 stations by dividing it into the days.

Daily Total People Traffic at Top 10 Stations

As can be seen, it was observed that the busiest days were weekdays and there was a considerable difference between weekends and weekdays. Therefore, I analyzed weekdays and weekends separately. The stations were the same except for 2 stations which were CANAL ST for weekends and FULTON ST for weekdays.

Top 10 stations and day analysis were not enough to solve the problem, so I modified the data by creating a new column about time interval. I divided a day into 6-time intervals which were 0–4, 4–8, 8–12, 12–16, 16–20, 20–24. Thus, I observed that how the total people traffic changed at the time intervals at the busiest 10 stations for weekends and weekdays.

Time Interval Analysis on Weekdays at Top 10 Stations

Time Interval Analysis at Weekends at Top 10 Stations

As can be seen, 8–12 and 16–20 were the busiest intervals on weekdays because these were the start and end of the working hours. However, 12–16 and 16–20 were more crowded than the other intervals at the weekends.

After the analysis was completed, I showed the 10 the busiest stations for weekends and weekdays on the map by folium and ArcGIS module in python. It can be seen that 9 stations are the same except for CANAL ST at weekends and FULTON ST on weekdays.

Locations of the Top 10 Stations at the Weekends

Locations of the Top 10 Stations on the Weekdays

Finally, I had to draw a conclusion from the analysis and find a solution to the problem. The results are shown below:

· Street teams do not need to work at CANAL ST on weekdays and at FULTON ST at the weekends. They can work at the other 9 stations for weekends and weekdays.

· Street teams can work at 8–12, 12–16, 16–20-time intervals for both weekends and weekdays. They don’t need to work at 0–8-time intervals.

· While more teams can work at 8–12 and 16–20 intervals than 12–16 interval on weekdays, more teams can work at 12–16 and 16–20 than 8–12 at the weekends.

Also, the code can be reached from the GitHub link: https://github.com/LatifTukuc/DataAnalysisProject-SubwayTurnstile-NYC

Data Analysis Project: The Analysis of Subway Turnstile Data in New York City

Written by LatifTukuc