Exploring MTA Turnstile Data in NYC

Ceren Alkan
İstanbul Data Science Academy
5 min readNov 20, 2021

MTA Turnstile Data Analysis is the first project in my data science journey, and I worked on exploratory data analysis (EDA) specifically for this project. I analyzed the Metropolitan Transportation Authority (MTA) Turnstile Data of New York City at the request of the WTWY (Women Tech Women Yes) for their upcoming gala. I used pandas, numpy, matplotlib and seaborn libraries in this project. Before jumping to the conclusion, I had to understand the data well. I will try to explain each step that I made in this project.

Photo by Paulo Silva on Unsplash

The Data

Firstly, we have to understand what the problem is and what might be our solution. WTWY wants us to find the most crowded stations across the New York City for their upcoming event. And this event will take place in the summer. Therefore, let us take the beginning of the summer as our dates for the data. We can use this site https://new.mta.info/ to download the required data, and for this purpose, I used the code below:

As you can see in the code, we take 2021–06–05 as the first date, and after that, we take 4 weeks. Because the gala will take place in the summer. We have to concatenate these 4 files and make one data frame. Also, sometimes the data’s columns might be named unexpectedly like in this project, so we can rename the columns just to be sure.

Each row of the dataset represents a singular turnstile. Individual turnstiles have unique identifiers built from a combination of its Station, Remote Unit, Control Area (C/A), and Subunit Channel Position.

If we continue to examine the data, we can conclude that the entry and the exit values are cumulative and the data is recorded in 4 hours period. So, if we want to find actual entry and exit values, we have to take the difference between two rows. With this way, we can get 4 hours of entry and exit values correctly, and we will use pandas built in function diff() for this purpose. While doing this, we might want to group the turnstiles to eliminate the confusion. While taking the difference of the two rows, the first two values of entries and the exits become NaN, due to the nature of the diff() function. I filled them with 0 later in this code (line 11–12). We have negative values and the outliers as exit and entry values. To eliminate them, I used the below method. Taking standard deviation and the mean of the entries and exits, and applying the below method is one way to eliminate the outliers from data. (line 17–18) The total traffic can be found with summing the entry and exit data values. After the summation, we can apply the same filter just to make sure.

Now, our data will look like this:

This time, we want to know what days are these for further analysis. For this step, we will use datetime library like in the code below:

And after that we should see our data as below.

For this step, if we want to eliminate the tourists, as they would not want to participate the gala, we can simply think that taking the weekdays is the true way. However, by doing this, we ignore the fact that the tourists can be in these stations in weekdays too. If we assume that these tourists can be near the stations, both in the weekdays and the weekends, one possible solution might be subtracting the weekend traffic from the weekday traffic. But we should be careful about the behaviour of the traffic by looking at the related graphics. In the last step, we will sum the whole traffic values by the stations. That’s why, if we take the negative of the weekend values, we can perform the subtraction process easily. In order to take the negative values of the weekend traffic, we can use lambda function. Of course, this is not the ideal solution, but we can eliminate the tourist effect a bit more.

After taking sample from our data, the data should be seen like below.

If we look closer, we can see the awkward hours in our data. To eliminate the wrong consequences, we can round these values around 4 hour frequencies. Also, if we want to eliminate the tourist effect more, we might just take the traffic data within work hours.

Now, let us find the most crowded five stations, and see their graphic with the help of the seaborn library.

To improve our suggestions, we can look at the tech companies around these top five stations. The map below can help us with this.

Conclusion

If we assume that the WTWY wants the people who work in tech companies or the people that can make the contribution about their mission, we can take the stations near the technology companies. As we can see, 86 St is further from tech companies than others. Therefore, we can change the station order according to the request of WTWY at the end. In conclusion, our top five stations are 34 St — Herald Station, 34 St — Penn Station, 59 St, 42 St — Times Square and 86 St. However, the order can be changed according to the request of the WTWY.

Future Work

Tourists may mislead the results, and we might want to check the tourist numbers across the stations. If we compare peak season and off-season, we can normalize our data according to tourist attractions.

Thank you for reading.

Ceren Alkan

--

--