Exploratory Data Analysis: Maximizing Audience Reach for Women in Tech

Da Guo
Analytics Vidhya
Published in
5 min readApr 6, 2020

Women Tech Women Yes (WTWY) is a fictional organization that is seeking to host a summer gala in New York city to “increase the participation of women in technology, and to concurrently build awareness and reach.”

Primary Goal

As a data science team, we are asked to give recommendations on where to send the survey teams in order to maximize event effectiveness (# of attendees & donations).

Data

In order to achieve this goal, we will look at two data sets:

  1. New York city Metropolitan Transportation Authority (MTA) subway turnstile data (March to June, 2017 to 2019).
    For measuring how busy a location is, in order to reach the largest number of potential attendees.
  2. American Community Survey (ACS) 2015 demographic characteristics and economic characteristics.
    To find the specific audience group that is most likely to attend the gala and donate to the cause.

And we will reach a conclusion in three steps:

Step 1: Data Cleaning

As a general rule, we always want to format the columns, drop duplicate entries, and identify outliers. Here is a great article for reference if you would like more details.

Using the MTA turnstile data set as an example, we will first format the columns to make sure that there are no hidden white spaces.

Clearing white space from column titles.

Then we want to find the total foot traffic at each turnstile, but we noticed that the “Entries” and “Exits” columns in the data set is a cumulative count of the traffic. So in order to clean this part of the data, we need to calculate the difference between rows in both “Entries” column and “Exits” column then add them together to get the total foot traffic:

Code we used to calculate the total foot traffic.
An illustration showing how the “Delta” column is calculated and created.
The “DELTA” column calculates the total foot traffic between audits, in this case it measures the total foot traffic of the turnstiles every 4 hours.

We also noticed that for those turnstiles, the cumulative count would reset after the count hits a certain threshold. This indicates that the numerical difference between the count before and after the reset can be quite large, sometimes in the billions.

To remove the outliers generated by the resets, we will be using the concept of the Interquartile Range (IQR). Credits to my teammate Darien Mitchell-Tontar for coming up with this idea:

How we defined the IQR in our analysis.
An illustration demonstrating the general concept of IQR.

Step 2: Data Analysis

While there are many ways to analyze the data, the process remains the same, here I will share one of the example on how we group the data to find the females working as “Professional” in NYC per borough.

How we found the females working as “Professional” in NYC per borough.
The result would look like this.

Step 3: Data Visualization

Here I will share the definition compiled from here and some of the charts we have made:

“C/A = Control Area

UNIT = Remote Unit for a station

SCP = Subunit Channel Position represents an specific address for a device

STATION = Represents the station name the device is located at

LINENAME = Represents all train lines that can be boarded at this station

DIVISION = Represents the Line originally the station belonged to BMT, IRT, or IND

DATE = Represents the date (MM-DD-YY)

TIME = Represents the time (hh:mm:ss) for a scheduled audit event

DESc = Represent the “REGULAR” scheduled audit event (Normally occurs every 4 hours)

ENTRIES = The cumulative entry register value for a device

EXIT = The cumulative exit register value for a device”

TOP 10 Busiest stations by daily average.
Average foot traffic of TOP 4 busiest stations by weekday.
Busiest hours at TOP 4 stations.
Female income per capita across all 5 NYC boroughs.

We assumed that females working as “Professionals” (which includes Management, Business, Science, & Art) in the census data are the most likely ones to attend and donate to the cause since this is a gala focusing mostly on the technology industry.

“Professional” in the census data is defined as “Management, Business, Science, & Art”

Conclusion

With all of the analysis we have done so far, we have come to the conclusion that we should focus primarily on the Manhattan borough for the best result per man power.

Here we summarized the TOP 3 locations to send survey teams to at the 2 busiest time frames. A clarification for the top choice, choice #1, is that the location contain the third busiest station: Penn station, and second busiest station: Herald SQ station, all within 1 block of walking distance. That’s why you are seeing 3 locations but 4 different stations.

An illustration highlighting the best time frames to go to the top 4 stations with respective potential foot traffic.

Future Improvement

There are many areas that we can improve on the work we have done so far. For one, is that we have obtained the traffic for each turnstile:

C/A + Unit + SCP identifies an unique turnstile, like a serial number

And if we can obtain up-to-date architectural floor plans (specifically the “A sheets”) for each station, we can find out where exactly those TOP 10 turnstiles are on the map and recommend them to the survey teams to even further optimize the operation in terms of best result per man power.

Illustration of how further work can be done. Background map source.

Thank you for reading, and I hope this would bring helpful insight into whatever you do.

--

--

Da Guo
Analytics Vidhya

Architecture, Software Development, Data Science. Always learning.