Metropolitan Transportation Authority (MTA): My 1st Data Science project-WomenTechWomenYes

Shabnam Hamidi Moghaddam
İstanbul Data Science Academy
4 min read · Nov 30, 2021

My team and I just finished our first project at İstanbul Data Science Academy. I’m glad to have completed a real-world project specifically designed to showcase the data science skills we have learned.

WomenTechWomenYes (WTWY) is a fictional organization that promotes the participation of women in technology. WTWY holds an annual gala in early summer, and it wants to place street teams at subway entrances in New York City to collect email addresses and send out tickets to the free gala. The purpose of this project is therefore to determine the busiest subway stations, the busiest day, and the busiest hours.

Photo by Joseph Ngabo on Unsplash

The Method:

The dataset we worked with is the MTA Turnstile Data, which is publicly available online. Because the gala takes place in early summer, we decided to use the data from 1 March 2021 to 31 May 2021.

The following Python libraries are essential for exploratory data analysis: pandas, NumPy, Seaborn, and Matplotlib.

We first import the packages with Python's import statement.
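A minimal sketch of those imports is below. The aliases (np, pd, plt, sns) are the common community conventions and are an assumption here, not necessarily the ones used in the original notebook.

```python
# Conventional aliases for the four libraries named above (assumed, not from the original notebook).
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
```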

We then load the data and display its first 5 rows.
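Roughly, loading and previewing the data could look like the sketch below. To keep it self-contained, it reads a few made-up rows laid out like a turnstile file instead of the real weekly files. Stripping the column names is a defensive step, on the assumption that the raw header may carry trailing spaces.

```python
import io
import pandas as pd

# Hypothetical rows in the turnstile file layout; the values are made up.
raw = io.StringIO(
    "C/A,UNIT,SCP,STATION,LINENAME,DIVISION,DATE,TIME,DESC,ENTRIES,EXITS   \n"
    "A002,R051,02-00-00,59 ST,NQR456W,BMT,03/01/2021,03:00:00,REGULAR,100,50\n"
    "A002,R051,02-00-00,59 ST,NQR456W,BMT,03/01/2021,07:00:00,REGULAR,130,60\n"
    "A002,R051,02-00-00,59 ST,NQR456W,BMT,03/01/2021,11:00:00,REGULAR,190,90\n"
    "A002,R051,02-00-00,59 ST,NQR456W,BMT,03/01/2021,15:00:00,REGULAR,260,140\n"
    "A002,R051,02-00-00,59 ST,NQR456W,BMT,03/01/2021,19:00:00,REGULAR,340,200\n"
    "A002,R051,02-00-00,59 ST,NQR456W,BMT,03/01/2021,23:00:00,REGULAR,370,230\n"
)

df = pd.read_csv(raw)

# Remove stray whitespace around column names (e.g. "EXITS   " becomes "EXITS").
df.columns = df.columns.str.strip()

# head() returns the first 5 rows by default.
first_rows = df.head()
print(first_rows)
```

With the real data, the same read_csv call would be pointed at each downloaded weekly file, and the resulting frames combined with pd.concat.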

The descriptions of the column features are given here.

Before analyzing the data, it’s important to make sure the data is clean. Analyzing bad or dirty data can lead to wrong conclusions and ineffective changes.

As you can see in the TIME column, the readings occur every 4 hours.

We create a new column called TURNSTILE by combining the C/A, UNIT, and SCP columns.
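One way to build this identifier is plain string concatenation, sketched below on a tiny made-up frame. The "-" separator is an assumption; any separator that keeps the three parts distinguishable would do.

```python
import pandas as pd

# Hypothetical sample of the three identifying columns.
df = pd.DataFrame({
    "C/A": ["A002", "A002", "R138"],
    "UNIT": ["R051", "R051", "R293"],
    "SCP": ["02-00-00", "02-00-01", "00-00-00"],
})

# Join the three string columns into one key per physical turnstile.
df["TURNSTILE"] = df["C/A"] + "-" + df["UNIT"] + "-" + df["SCP"]
print(df)
```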

The ENTRIES and EXITS columns hold cumulative counter values, so for each turnstile we subtract the previous reading from the current one to know how many entries and exits there were in a given observation interval.
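A sketch of this differencing step is below, on made-up counter values and assuming the rows are already sorted by time within each turnstile. The d_Entries and d_Exits names follow the ones used later in this post. Treating negative differences as missing (for example after a counter reset) is an assumption, not necessarily how the original notebook handled them.

```python
import pandas as pd

# Hypothetical cumulative counters for two turnstiles, sorted by time.
df = pd.DataFrame({
    "TURNSTILE": ["T1", "T1", "T1", "T1", "T2", "T2", "T2"],
    "ENTRIES": [100, 130, 190, 200, 5000, 5040, 10],
    "EXITS": [50, 60, 90, 95, 700, 720, 3],
})

for source, target in [("ENTRIES", "d_Entries"), ("EXITS", "d_Exits")]:
    # Difference between consecutive readings of the same turnstile.
    diff = df.groupby("TURNSTILE")[source].diff()
    # Keep only non-negative differences; a negative value suggests a reset
    # and is turned into NaN (assumption).
    df[target] = diff.where(diff >= 0)

# The first reading of each turnstile has no previous value, so it is dropped
# together with the rows flagged above.
df = df.dropna(subset=["d_Entries", "d_Exits"])
print(df)
```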

We combine the DATE and TIME fields into a single string and convert it into a datetime object, which we store in a DATETIME column.
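The conversion can be sketched with pd.to_datetime as follows. The explicit format string assumes dates written as MM/DD/YYYY and times as HH:MM:SS, which matches the made-up sample rows used here.

```python
import pandas as pd

# Hypothetical DATE and TIME strings.
df = pd.DataFrame({
    "DATE": ["03/01/2021", "03/01/2021"],
    "TIME": ["03:00:00", "07:00:00"],
})

# Concatenate the two strings, then parse them with an explicit format.
df["DATETIME"] = pd.to_datetime(
    df["DATE"] + " " + df["TIME"], format="%m/%d/%Y %H:%M:%S"
)
print(df.dtypes)
```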

To find which station has the most traffic, we add the d_Entries and d_Exits columns together to create the TRAFFIC field.
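As a rough illustration, the TRAFFIC field and a per-station ranking could be computed like this. The numbers are made up and only show the mechanics.

```python
import pandas as pd

# Hypothetical per-interval entry and exit counts.
df = pd.DataFrame({
    "STATION": ["34 ST-PENN STA", "34 ST-PENN STA", "59 ST", "59 ST"],
    "d_Entries": [300, 250, 120, 80],
    "d_Exits": [200, 150, 60, 40],
})

# Total traffic per observation is entries plus exits.
df["TRAFFIC"] = df["d_Entries"] + df["d_Exits"]

# Sum traffic per station and rank from busiest to quietest.
station_traffic = (
    df.groupby("STATION")["TRAFFIC"].sum().sort_values(ascending=False)
)
print(station_traffic.head(10))
```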

After dropping the columns that we don’t need, we create a new column, WEEK, to represent the day of the week.
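A possible version of this step is sketched below. Which columns get dropped is an assumption (DESC is used as a stand-in), and the sample values are made up.

```python
import pandas as pd

# Hypothetical frame with a parsed DATETIME column and a column to discard.
df = pd.DataFrame({
    "DATETIME": pd.to_datetime(
        ["2021-03-05 12:00", "2021-03-06 12:00", "2021-03-08 08:00"]
    ),
    "DESC": ["REGULAR", "REGULAR", "REGULAR"],
    "TRAFFIC": [900, 300, 500],
})

# Drop a column that is no longer needed (the choice of DESC is an assumption).
df = df.drop(columns=["DESC"])

# Derive the day name from the timestamp.
df["WEEK"] = df["DATETIME"].dt.day_name()

# Total traffic per day of the week.
day_traffic = df.groupby("WEEK")["TRAFFIC"].sum().sort_values(ascending=False)
print(day_traffic)
```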

It’s important to make sure our data is clean so that our eventual analysis will be correct.

The Result:

The busiest station is 34 ST-PENN STA and the busiest day is Friday, as shown below.

To show the busiest time we use a pie chart, and with the heatmap we can see that the best option according to the MTA data is the 34 ST-PENN STA station on Friday at 12:00–16:00.
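The table behind such a heatmap can be sketched as a day-by-hour pivot of TRAFFIC for a single station. The values below are made up, and the HOUR column is a hypothetical helper. Since each reading closes a 4-hour interval, a reading at hour 16 stands for the 12:00–16:00 block.

```python
import pandas as pd

# Hypothetical traffic readings for one station.
df = pd.DataFrame({
    "DATETIME": pd.to_datetime([
        "2021-03-05 12:00", "2021-03-05 16:00",
        "2021-03-08 12:00", "2021-03-08 16:00",
        "2021-03-12 16:00",
    ]),
    "TRAFFIC": [400, 900, 350, 600, 850],
})

df["WEEK"] = df["DATETIME"].dt.day_name()
df["HOUR"] = df["DATETIME"].dt.hour  # hypothetical helper column

# Rows are days, columns are reading hours, cells are summed traffic.
pivot = df.pivot_table(
    index="WEEK", columns="HOUR", values="TRAFFIC",
    aggfunc="sum", fill_value=0,
)

# Busiest day overall, then the busiest reading hour within that day.
busiest_day = pivot.sum(axis=1).idxmax()
busiest_hour = pivot.loc[busiest_day].idxmax()
print(pivot)
print(busiest_day, busiest_hour)
```

A table like pivot can then be passed to Seaborn's heatmap function to draw the figure.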

Busiest days and times for 34 ST-PENN STA station

We also use New York census data as an additional source, and we find that those with higher incomes are more likely to donate and that women may be more interested in the event than men.

The best options according to the number of women in the NYC census data are Brooklyn and Queens.
The best option according to the average per capita income in the NYC census data is Manhattan.

The Conclusion:

According to the analysis, the busiest station is 34 ST-PENN STA and the busiest day is Friday.

The busiest time is 12:00–16:00, so the best option is Friday at the 34 ST-PENN STA station.

The best options according to the number of women in the NYC census data are Brooklyn and Queens. The best option according to the average per capita income is Manhattan.
