EDA with New York MTA Data

Y.Kaan Baycan
7 min read · Oct 13, 2022


Here is my first project from my data science bootcamp at Istanbul Data Science Academy.

Aim

WomenTechWomenYes, a fictional non-profit organization, is organizing a gala dinner to raise donations and awareness for women in business. They ask us to find the ideal date and time to place their staff at a metro station to collect email addresses and send free tickets to their gala. The gala dinner will take place at the beginning of June. Using MTA turnstile data, we will identify the busiest stations and time intervals.

So, let’s start.

This project consists of three steps:

1-Specifying the Time Interval and Data Cleaning → Python (Pandas, NumPy, Matplotlib)

2-Analysis → Python (Pandas, NumPy, Matplotlib, Seaborn)

3-Visualization → Tableau

1-Specifying the Time Interval & Data Cleaning

Since we have access to the dataset, we should first specify the time interval of turnstile data to be analyzed. Since the gala dinner will take place at the beginning of June, I chose May 2022 to be explored. You can access the turnstile data from the link in the Further Links section below. Since it is published in weekly files, I used the five weekly files ending May 07 through June 04 in my analysis.

After specifying and finding the data, it is time to code :)

# Importing the libraries which will be used later
import pandas as pd
import numpy as np
from datetime import date, time
from datetime import datetime as dt

The pandas library will be our main tool to clean and analyze the data. After importing the libraries, we read the turnstile data into pandas dataframes.

# Getting the turnstile data from MTA
df1 = pd.read_csv("http://web.mta.info/developers/data/nyct/turnstile/turnstile_220507.txt")
df2 = pd.read_csv("http://web.mta.info/developers/data/nyct/turnstile/turnstile_220514.txt")
df3 = pd.read_csv("http://web.mta.info/developers/data/nyct/turnstile/turnstile_220521.txt")
df4 = pd.read_csv("http://web.mta.info/developers/data/nyct/turnstile/turnstile_220528.txt")
df5 = pd.read_csv("http://web.mta.info/developers/data/nyct/turnstile/turnstile_220604.txt")

We also merge these five data frames into one, since we want to analyze the whole month combined, using the concat function built into pandas.

# After importing the weekly data, we have to combine it into one table, so we use the pd.concat function
dfmay = pd.concat([df1, df2, df3, df4, df5], ignore_index=True)

After creating the data frame, the first thing we should do is check it with the .info() function.

# Checking for null values as well as the data types
dfmay.info()

Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1055980 entries, 0 to 1055979
Data columns (total 11 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 C/A 1055980 non-null object
1 UNIT 1055980 non-null object
2 SCP 1055980 non-null object
3 STATION 1055980 non-null object
4 LINENAME 1055980 non-null object
5 DIVISION 1055980 non-null object
6 DATE 1055980 non-null object
7 TIME 1055980 non-null object
8 DESC 1055980 non-null object
9 ENTRIES 1055980 non-null int64
10 EXITS 1055980 non-null int64
dtypes: int64(2), object(9)
memory usage: 88.6+ MB

Merging date and time into a single column will make our job easier when analyzing the relationship between day and time.

# Combine DATE and TIME into a single datetime column
datetimes = pd.to_datetime(dfmay["DATE"] + " " + dfmay["TIME"])
dfmay.insert(loc = 11, column = "datetime", value = datetimes)

Also, the first three columns together identify a unique turnstile; we can merge them for later use.

# Combine C/A, UNIT and SCP into a single turnstile identifier
turnstiles = dfmay["C/A"] + dfmay["UNIT"] + dfmay["SCP"]
dfmay.insert(loc = 12, column = "turnstile", value = turnstiles)

We have merged five columns into two new ones, so we can drop the old columns along with "DIVISION", "DESC", and "LINENAME", since we will not use them.

# Dropping irrelevant columns from the dataframe to enhance efficiency
dfmay.drop(columns = ["C/A", "UNIT", "SCP", "DATE", "DIVISION", "DESC", "LINENAME"], inplace = True)

Before analyzing the dataset, one last thing to do: always check for stray spaces in column names, values, etc. You may suffer later if you skip this part, so one line of code may save you hours.

dfmay.columns = dfmay.columns.str.strip()
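The same caution can extend to the values themselves. Here is a small optional sketch (not from the original notebook) that strips stray whitespace from the text columns as well:

# Optional: strip whitespace from every text column (e.g. STATION names)
for col in dfmay.select_dtypes(include="object").columns:
    dfmay[col] = dfmay[col].str.strip()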

2-Data Cleaning and Analysis

After creating a more manageable and smaller dataframe, we can start to analyze the dataset.

First 5 rows of the dataframe
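The preview above comes from a quick head() call:

# Peek at the first rows of the merged and trimmed dataframe
dfmay.head()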

Here we can see that the ENTRIES and EXITS columns increase cumulatively. Hence, two new columns are necessary to see the actual number of passengers entering and exiting the station. I can say this was the hardest part. Taking the difference between two consecutive rows is not hard, but since all of the turnstiles are in the same dataframe, differences must not be taken across different turnstiles. Here, the .groupby() operation comes in and saves the day.

# Since the turnstile counters increase cumulatively, take the difference between consecutive rows, grouping by turnstile
dfmay['net_entry'] = dfmay.sort_values(['turnstile', 'datetime'], ascending = (False, True)).groupby(['turnstile'])['ENTRIES'].diff()
dfmay['net_exits'] = dfmay.sort_values(['turnstile', 'datetime'], ascending = (False, True)).groupby(['turnstile'])['EXITS'].diff()
# Don't forget to drop the old columns
dfmay.drop(columns = ["ENTRIES", "EXITS"], inplace = True)

After creating the net_entry and net_exits columns, it is time to check them with the .describe() function.
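A minimal sketch of that check:

# Summary statistics for the new columns; the negative minimums are the red flag here
dfmay[["net_entry", "net_exits"]].describe()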

How can the number of entries and exits be negative? Let's explore the dataset further using a boxplot.

This plot made me work much more than the rest. A boxplot is drawn, but where is the box?! The values are so extreme that the box gets squashed flat. After realizing I would not find any solution under these circumstances, I started cleaning.
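A minimal matplotlib sketch of such a boxplot (default settings assumed, NaNs from the diff dropped first):

import matplotlib.pyplot as plt

# Boxplot of net_entry; the extreme outliers squash the box completely flat
plt.figure(figsize=(8, 4))
plt.boxplot(dfmay["net_entry"].dropna())
plt.title("net_entry before outlier cleaning")
plt.show()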

def extract_whiskers(data, iqr_multiplier=1.5):
    median_value = np.median(data)             # Median
    upper_quartile = np.percentile(data, 90)   # 90th percentile
    lower_quartile = np.percentile(data, 25)   # 25th percentile
    iqr = upper_quartile - lower_quartile      # Interquartile range

    print("Upper Whisker:", data[data <= upper_quartile + (iqr_multiplier * iqr)].max())
    print("Lower Whisker:", data[data >= lower_quartile - (iqr_multiplier * iqr)].min())

I used a function from the bootcamp's notebook to find suitable upper and lower whiskers. I chose to compute the interquartile range between the 25th and 90th percentiles rather than the usual 25th-75th.
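Here is how the function can be called on the net_entry column (the NaNs come from the first diff() of each turnstile group, so they are dropped first):

# Find the whiskers for net_entry; in my run this gave the 479 / -281 cutoffs used below
extract_whiskers(dfmay["net_entry"].dropna())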

I will deal with the negative values later.

Let’s create the boxplot again with the new dataset.

# Keep only the values between the whiskers found above; .copy() avoids SettingWithCopyWarning later
dfnew = dfmay[(dfmay["net_entry"] < 479) & (dfmay["net_entry"] > -281)].copy()

First, I defined the new dataset with the whisker thresholds I found.

Now we have a better dataset with trustworthy values. Time to deal with the negative values. Let's first check them out.
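The rows in question can be pulled up with a simple filter, something like:

# Inspect the rows where the difference came out negative
dfnew[dfnew["net_entry"] < 0].head()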

When I look at this dataframe, I think that if these values were positive they would be good to go. I could either delete the negative values or flip them to positive and use them like that. I would like to challenge myself more, so I will try the second way. I start by comparing the mean of the net entries at different time intervals.

# Mean of the negative net entries at the midnight audit vs. the positive ones at the same time
dfnew[(dfnew["net_entry"] < 0) & (dfnew["TIME"] == "00:00:00")].mean()
dfnew[(dfnew["TIME"] == "00:00:00") & (dfnew["net_entry"] > 0)].mean()

After checking different time intervals, I chose to go with flipping the negative values to positive, since the comparison didn't show any radical difference between the negative and positive values. In the notebook you can find further .describe() comparisons of the negative and positive subsets.
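That comparison can be sketched roughly like this (not the exact notebook cells, just the idea):

# Compare the distribution of the negative values (flipped to positive) with the positive ones
print(dfnew.loc[dfnew["net_entry"] < 0, "net_entry"].abs().describe())
print(dfnew.loc[dfnew["net_entry"] > 0, "net_entry"].describe())

With that settled, flipping the sign is a one-liner: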

dfnew.net_entry = dfnew.net_entry.abs()

Finally, we have a clean, well-handled dataset. Of course, I repeat the same process for net_exits as well.

# Same treatment for the exits column
dfnew.net_exits = dfnew.net_exits.abs()

Then, let's find the total traffic at a station by adding net_entry and net_exits into a new column, "total."

dfnew.insert(loc = 6, column = "total", value = dfnew["net_entry"] + dfnew["net_exits"])

Our dataset is ready to be transferred into Tableau to create visualizations. Since 1,045,399 rows would be a problem for an Excel file (Excel handles little more than 1 million rows and exporting takes time), we export it as a CSV file instead.

dfnew.to_csv("New_York.csv")

3-Visualization

Now, we should define the path of analysis. Will it be from top stations to time intervals, days to stations, time intervals to stations to days? I have chosen to start with finding the most crowded hours.

Ideal Hours

As can be seen, the most crowded hours are 20:00, 16:00, and 12:00.
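In Tableau this is a bar chart of totals per audit hour; a rough pandas equivalent of the same aggregation, grouping by the hour of the audit timestamp, would be:

# Total traffic per audit hour, busiest first
hourly = dfnew.groupby(dfnew["datetime"].dt.hour)["total"].sum().sort_values(ascending=False)
print(hourly.head())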

Ideal Days

By totaling the number of passengers for each weekday, we conclude that the most crowded days are, in order: Thursday, Wednesday, Friday, and Tuesday.
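The weekday ranking can be cross-checked in pandas the same way:

# Total traffic per weekday, busiest first
daily = dfnew.groupby(dfnew["datetime"].dt.day_name())["total"].sum().sort_values(ascending=False)
print(daily)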

Ideal Station

After finding the ideal time interval and weekday, we can filter the stations by these values to find the busiest station. The chart below shows the 10 most used stations.
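Outside Tableau, that filter can be sketched in pandas roughly like this (keeping Tuesdays and the 20:00 audit):

# Busiest stations restricted to Tuesdays at the 20:00 audit
mask = (dfnew["datetime"].dt.day_name() == "Tuesday") & (dfnew["datetime"].dt.hour == 20)
print(dfnew[mask].groupby("STATION")["total"].sum().sort_values(ascending=False).head(10))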

As I found, the best option, with 74,290 passengers, is 34 ST-PENN on Tuesday at 20:00.

Dashboard

After creating all the visuals, I prepared an interactive dashboard to present. With this dashboard, you can filter the stations by choosing a day and an hour from the charts above to see the most crowded station at that time interval.

I would be thrilled to get feedback, questions, or claps :)

Further Links

Here is the GitHub repository with the Jupyter notebook file as well as the Tableau workbook: New-York-MTA-EDA

Dataset: mta.info | Turnstile Data

LinkedIn: Yakup Kaan Baycan | LinkedIn
