Project: WomenTechWomenYes Gala

Muhammed Fatih Gulturk
İstanbul Data Science Academy
10 min readDec 31, 2021

Exploratory Data Analysis on the New York City MTA(Metropolitan Transportation Authority) Turnstile Data and New York City Census Data

New York City Subway

Hello everyone. In this article, I share with you the Exploratory Data Analysis (EDA) that I have done with my teammates on the New York City MTA Turnstile Data and New York City Census Data for the annual gala of the WomenTechWomenYes(WTWY) organization, the first project of the Data Science Bootcamp, which I really enjoyed being in, of the İstanbul Data Science Academy with the contributions of Hepsiburada.

0. Information

As the name suggests, the WomenTechWomenYes(WTWY) organization aims to organize a gala and raise donations in New York City at the beginning of the summer in order to increase the effectiveness and competences of women in technology.

1. Problem Statement

WTWY Organization aims to place its street crews at the entrances of subway stations in New York City and distribute tickets for the gala to the people passing by. Thus, the main goal is to distribute a lot of tickets and to distribute tickets to people who have the potential to donate to the organization at the gala.

For that, we used MTA Turnstile Data, because they want to know the busiest day and time slot of the busiest station to deploy their street crews. In addition, we also used New York City Census Data, because we thought that women might be more prone to collect donations for this organization, which aims to increase the effectiveness of women in technology, and we wanted to concentrate on boroughs with a higher women population in New York City. Apart from this, since the main purpose is to collect donations, we wanted to identify the boroughs with the highest average per capita income in New York City.

2. Methodology

While doing this project, we followed the following roadmap in order as a method:

2.1. MTA Turnstile Data

2.1.1. Receiving the data

2.1.2. Data Cleaning and Tranformation

2.1.3. Exploratory Data Analysis (EDA)

2.2. NYC Census Data

2.2.1. Receiving the data

2.2.2. Data Cleaning and Tranformation

2.2.3. Exploratory Data Analysis (EDA)

2.1. MTA Turnstile Data

2.1.1. Receiving the data

We got the MTA Turnstile Data we used from the following site:

The site linked above

Since the gala will be held at the beginning of summer, we decided to use the data of March, April and May in order to make more accurate analyzes. In this result, we collected the data from 6 March 2021 to 29 May 2021 in the site.

2.1.2. Data Cleaning and Tranformation

We performed Data Cleaning/Transformation and Exploratory Data Analysis in Python and Jupyter Notebook environment using Numpy, Pandas, Matplotlib, Seaborn and Datetime libraries.

After loading the libraries and running our data in dataframe format, we got the following outputs:

Using the TIME and DATE columns in our dataset and the Pandas library, we produced a column named TIMESTAMP and in datetime64[ns] datatype as you can see in the code block below. The purpose of doing this is to be able to sort our dataset in the following stages according to the adjacent day and time.

Then, using Pandas and Datetime libraries, we created a WEEKDAYS column that shows which date corresponds to which day from the DATE column.

Then, we created our TURNSTILE column by combining the C/A UNIT and SCP columns in our dataset. Because the combination of the values ​​of these columns corresponds to a kind of identity of each turnstile.

Then, we sorted our dataset according to STATION TURNSTILE and TIMESTAMP columns so that if there are duplicate dates, stations or turnstiles, we can drop them.

Then, now that we have sorted our dataset, we can find the numbers NET_ENTRIES and NET_EXITS. We should do this by subtracting the previous ENTRIES and EXITS data from each ENTRIES and EXITS data. Then let’s drop the columns that we will not use.

Next, we subtract the Outlier values ​​from our dataset. These were the data in NET_ENTRIES and NET_EXITS other than 0–10000. Because the presence of more than 10000 data corresponds to approximately 10–11 people passing through the same turnstile in one second.

Next, we created the TRAFFIC column corresponding to the sum of the NET_ENTRIES and NET_EXITS data. Because we were going to show the density over this T column. And also, we created the WEEKDAYS_INDEX column using the Pandas and Datetime libraries to show the days in order in the visualizations we will make. This column shows which day corresponds to which day of the week in terms of numbers.

2.1.3. Exploratory Data Analysis (EDA)

In order to find the busiest stations, which is the first step of our Exploratory Data Analysis, we grouped our dataset by STATION with the sum function that making the summation of all numerical data and asked it to bring us the top 10 stations with the highest traffic in terms of TRAFFIC .

TOP 10 BUSIEST STATIONS

And then we wanted to see it in a horizontal bar plot:

Next, we wanted to find the busiest days on the basis of traffic by grouping our dataset by the WEEKDAYS column and by summing all numerical data with the sum function and sorting it by the WEEKDAYS_INDEX.

TRAFFIC DENSITY ACCORDING TO WEEKDAYS

And then we wanted to see it in a horizontal bar plot:

Now it’s time to analyze our dataset according to the busiest hours. However, the time intervals given in our dataset vary from data to data. In other words, while some data go as 03:00–07:00, some data go as 04:00–08:00, and even some data have much more complex time zones. That’s why we wrote the following code to keep the time zones in a fixed range.

Now that we have adjusted the time intervals, we can now group our data set according to the TIME column, take the sum of the numerical data with the sum function, and sort them according to the data in the TRAFFIC column from the most to the least. In this way, we can see the time intervals from the most intense to the least according to the data in our MTA Turnstile dataset.

BUSIEST TIMES INTERVALS

With the code below, we can draw a pie chart showing our busiest time intervals.

Since the purpose of the gala is to collect donations and participation, this gala is not for tourists at all. Therefore, while showing this subway traffic, we wanted to focus on places where tourists will be less and local people will be more. To understand this, we followed the following path: “If the decrease between weekday and weekend traffic in metro stations is very high, there are more local people there and therefore less tourists.” Because tourists usually visit on weekends and local-working people usually work during the week and rest on the weekends, we developed an idea in this way.

In the code line below, you will see station-based traffic density divided by days of the week.

Then, based on the busiest station graph we found above, let’s find the traffic density of the 5 busiest stations one by one, divided by the days of the week:

1ST BUSIEST STATION
2ND BUSIEST STATION
3RD BUSIEST STATION
4TH BUSIEST STATION
5TH BUSIEST STATION

Then we combine these 5 busiest stations:

And finally, we draw a line plot showing the traffic density on a daily basis for these 5 stations. Thus, we can see how much of a decrease in which station there is in the transitions from weekdays to weekends.

In the graph above, we estimate that the decrease in 34 ST-PENN STA station is much more severe and that this station is generally used by local-working people, and it suits our purpose more. And also, we see that the drop at station 14 ST-UNION SQ was slightly lighter than the others, so we thought that the density at this station was mostly caused by tourists.

And finally, let’s determine the busiest day and hour range of station 34 ST-PENN STA, which we think is the busiest and least touristic. For this, first, let’s pull the data of only 34 ST-PENN STA station from our main dataset (df):

Now let’s edit the time intervals in the TIME column as we did above:

Now, to see the traffic density on the basis of time and weekdays in our dataset, let’s group them as TIME, WEEKDAYS and WEEKDAYS_INDEX and see the sums of the TRAFFIC column with the sum function:

And finally, let’s draw a heatmap showing the traffic density of our busiest station, 34 ST-PENN STA, on the basis of time and weekdays:

Thus, from the heatmap above, we see that the busiest day and time period of 34 ST-PENN STA station is Friday, 12:00–16:00 to deploy our street crews to distribute tickets for the gala.

2.2. NYC Census Data

2.2.1. Receiving the data

We got the NYC Census Data we used from the following site:

2.2.2. Data Cleaning and Tranformation

Let’s take an overview of our NYC Census data:

First, let’s make our dataset simpler by droping the columns that we will not use because there are too many columns.

Let’s check if there are null values ​​in our dataset:

As seen above, we have null values ​​in our dataset, so let’s drop these values:

And now our dataset is ready for Exploratory Data Analysis.

2.2.3. Exploratory Data Analysis (EDA)

Since our aim was to find the boroughs with the highest number of women in New York City, we wrote the sum function that takes the sum of the numerical values ​​by grouping our dataset over the Borough column and sorted it from the most to the least over the Women column, so we saw the borough-based number of women sorted from the most to the least.

And we wanted to see this in a bar plot:

As seen in the plot, Brooklyn and Queens are the boroughs of New York with the highest number of women.

In order to find the boroughs with the highest average per capita income in New York City, which is our second goal, we grouped our dataset over the Borough column, took the mean of the numerical values, and sorted them from the highest to the lowest based on the IncomePerCap column:

And we wanted to see this in a bar plot:

As seen in the plot above, Manhattan has the highest average per capita income in New York City.

3. Conclusion

As a result, after Exploratory Data Analysis on MTA Turnstile Data, we found the busiest station to be 34 ST-PENN STA station, and the busiest day and time interval of this station as 12:00–16:00 on Friday. In addition, after the Exploratory Data Analysis on the New York City Census Data, we saw the borough with the highest average per capita income as Manhattan, and our analyzes supported each other since our busiest station, 34 ST-PENN STA station, is in the Manhattan borough. Apart from that, after our Exploratory Data Analysis on New York City Census Data, Brooklyn and Queens emerged as the boroughs with the highest women population in New York City, and if we take this idea as a basis, we recommend the FULTON ST station, which is our busiest station in Brooklyn and FLUSHING-MAIN station, which is our busiest station in Queens.

And we have come to the end of the article. In this article, I tried to explain our first project in İstanbul Data Science Academy’s Data Science Bootcamp in as much detail as I could. You can also access all the codes of the project by going to my Github repository from here. Also, if you want to reach me or ask a question, you can go to my LinkedIn profile from here and contact me. If you wish, you can follow me on Medium, I will try to explain my next projects and works in detail on Medium. Hope to see you in my next post. Have a good work everyone.

--

--