Exploratory Data Analysis: Maximizing Audience Reach for Women in Tech
Women Tech Women Yes (WTWY) is a fictional organization seeking to host a summer gala in New York City to “increase the participation of women in technology, and to concurrently build awareness and reach.”
Primary Goal
As a data science team, we are asked to recommend where to send the survey teams in order to maximize event effectiveness (number of attendees and donations).
Data
In order to achieve this goal, we will look at two data sets:
- New York City Metropolitan Transportation Authority (MTA) subway turnstile data (March to June, 2017 to 2019).
For measuring how busy a location is, in order to reach the largest number of potential attendees.
- American Community Survey (ACS) 2015 demographic characteristics and economic characteristics.
To find the specific audience group that is most likely to attend the gala and donate to the cause.
And we will reach a conclusion in three steps:
Step 1: Data Cleaning
As a general rule, we always want to format the columns, drop duplicate entries, and identify outliers. Here is a great article for reference if you would like more details.
Using the MTA turnstile data set as an example, we will first format the columns to make sure that there are no hidden white spaces.
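As a minimal sketch of this column clean-up, assuming the data is loaded into a pandas DataFrame: the toy frame below stands in for the real turnstile file, and its values are made up for illustration. The padded "EXITS" header mimics the kind of hidden whitespace we are guarding against.

```python
import pandas as pd

# Toy frame standing in for the turnstile data (values are made up).
# The last column name carries trailing spaces on purpose.
df = pd.DataFrame({
    "C/A": ["A002"],
    "UNIT": ["R051"],
    "ENTRIES": [7000000],
    "EXITS          ": [2400000],
})

# Strip leading and trailing whitespace from every column name.
df.columns = df.columns.str.strip()

print(list(df.columns))
```

After this step, a lookup such as df["EXITS"] works without having to remember the padding.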
Then we want to find the total foot traffic at each turnstile, but we noticed that the “Entries” and “Exits” columns in the data set are cumulative counts. To clean this part of the data, we calculate the difference between consecutive rows in both the “Entries” column and the “Exits” column, then add the two differences together to get the total foot traffic:
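The row-to-row difference can be sketched as follows. This is an illustration under assumptions, not our exact notebook code: the toy readings are made up, the DATETIME column is assumed to have been built from DATE and TIME beforehand, and a turnstile is assumed to be identified by the combination of C/A, UNIT, SCP and STATION.

```python
import pandas as pd

# Toy cumulative readings for two turnstiles (values are made up).
df = pd.DataFrame({
    "C/A": ["A002"] * 3 + ["A006"] * 3,
    "UNIT": ["R051"] * 3 + ["R079"] * 3,
    "SCP": ["02-00-00"] * 6,
    "STATION": ["59 ST"] * 3 + ["5 AV/59 ST"] * 3,
    "DATETIME": pd.to_datetime([
        "2019-03-01 00:00", "2019-03-01 04:00", "2019-03-01 08:00",
        "2019-03-01 00:00", "2019-03-01 04:00", "2019-03-01 08:00",
    ]),
    "ENTRIES": [100, 150, 230, 5000, 5010, 5040],
    "EXITS": [50, 70, 100, 900, 905, 925],
})

# Assumption: these four columns together identify one physical turnstile.
turnstile_keys = ["C/A", "UNIT", "SCP", "STATION"]

# Sort so that consecutive rows of a turnstile are consecutive in time.
df = df.sort_values(turnstile_keys + ["DATETIME"])

# Difference between consecutive readings, computed within each turnstile
# so that one device's counter is never subtracted from another's.
diffs = df.groupby(turnstile_keys)[["ENTRIES", "EXITS"]].diff()

# Total foot traffic per audit interval = new entries + new exits.
df["TRAFFIC"] = diffs["ENTRIES"] + diffs["EXITS"]

# The first reading of each turnstile has no predecessor, so drop it.
df = df.dropna(subset=["TRAFFIC"])

print(df[["STATION", "DATETIME", "TRAFFIC"]])
```

Grouping before taking the difference is the important design choice here: a plain diff over the whole column would produce meaningless values at every boundary between two turnstiles.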
We also noticed that a turnstile's cumulative count resets after it hits a certain threshold. This means that the numerical difference between the counts before and after a reset can be very large, sometimes in the billions.
To remove the outliers generated by the resets, we use the Interquartile Range (IQR). Credit to my teammate Darien Mitchell-Tontar for coming up with this idea:
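A minimal sketch of an IQR filter is shown below. The traffic values are made up, and the 1.5 × IQR fence is the conventional choice rather than a setting taken from our notebook, so treat the multiplier as an assumption to tune.

```python
import pandas as pd

# Toy per-interval traffic with two reset artifacts (values are made up).
traffic = pd.Series(
    [70, 110, 15, 50, 95, 80, 3_000_000_000, -2_999_999_000],
    name="TRAFFIC",
)

# First and third quartiles, and the spread between them.
q1 = traffic.quantile(0.25)
q3 = traffic.quantile(0.75)
iqr = q3 - q1

# Assumption: the conventional 1.5 * IQR fences on both sides.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Keep only the readings that fall inside the fences.
clean = traffic[traffic.between(lower, upper)]

print(clean.tolist())
```

Because the fences are derived from the quartiles rather than from the mean, the two extreme reset values barely influence where the cut-offs land.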
Step 2: Data Analysis
While there are many ways to analyze the data, the process remains the same. Here I will share one example of how we grouped the data to find the number of females working as “Professionals” in NYC per borough.
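The grouping step might look like the sketch below. The column names Borough and FemaleProfessional are hypothetical placeholders, not the actual ACS field names, and the counts are made up purely to show the mechanics.

```python
import pandas as pd

# Toy census rows; column names are hypothetical and values are made up.
acs = pd.DataFrame({
    "Borough": ["Manhattan", "Manhattan", "Brooklyn", "Brooklyn", "Queens"],
    "FemaleProfessional": [1200, 800, 700, 500, 600],
})

# Sum the count within each borough, then rank boroughs from high to low.
per_borough = (
    acs.groupby("Borough")["FemaleProfessional"]
    .sum()
    .sort_values(ascending=False)
)

print(per_borough)
```

The same groupby, sum, sort pattern applies to any other audience segment once the relevant column has been identified.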
Step 3: Data Visualization
Here I will share the field definitions, compiled from here, and some of the charts we made:
“C/A = Control Area
UNIT = Remote Unit for a station
SCP = Subunit Channel Position represents an specific address for a device
STATION = Represents the station name the device is located at
LINENAME = Represents all train lines that can be boarded at this station
DIVISION = Represents the Line originally the station belonged to BMT, IRT, or IND
DATE = Represents the date (MM-DD-YY)
TIME = Represents the time (hh:mm:ss) for a scheduled audit event
DESc = Represent the “REGULAR” scheduled audit event (Normally occurs every 4 hours)
ENTRIES = The cumulative entry register value for a device
EXIT = The cumulative exit register value for a device”
We assumed that females working as “Professionals” (which includes Management, Business, Science, & Art) in the census data are the most likely to attend and donate to the cause, since this is a gala focusing mostly on the technology industry.
Conclusion
With all of the analysis we have done so far, we have come to the conclusion that we should focus primarily on the Manhattan borough for the best results relative to the manpower required.
Here we summarize the top 3 locations to send survey teams to during the 2 busiest time frames. A clarification on choice #1: this location contains both the third busiest station (Penn Station) and the second busiest station (Herald Sq), all within 1 block of walking distance. That is why you are seeing 3 locations but 4 different stations.
Future Improvement
There are many areas where we can improve on the work we have done so far. For one, we have obtained the traffic for each turnstile:
If we can also obtain up-to-date architectural floor plans (specifically the “A sheets”) for each station, we can find out exactly where the top 10 turnstiles are on the map and recommend those spots to the survey teams, further optimizing the operation in terms of results relative to manpower.
Thank you for reading, and I hope this brings helpful insight into whatever you do.
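As a rough sketch of how per-turnstile totals can be ranked, assuming a cleaned frame with one TRAFFIC value per audit interval: the rows below are made up, and the four key columns are assumed to identify a turnstile. The toy example keeps the top 2 rather than the top 10 only because it has three turnstiles.

```python
import pandas as pd

# Toy cleaned traffic per audit interval (values are made up).
traffic = pd.DataFrame({
    "C/A": ["A002", "A002", "A002", "A002", "R238", "R238"],
    "UNIT": ["R051", "R051", "R051", "R051", "R046", "R046"],
    "SCP": ["02-00-00", "02-00-00", "02-00-01", "02-00-01",
            "00-03-00", "00-03-00"],
    "STATION": ["59 ST"] * 4 + ["GRD CNTRL-42 ST"] * 2,
    "TRAFFIC": [70, 110, 40, 60, 300, 250],
})

# Assumption: these four columns together identify one physical turnstile.
turnstile_keys = ["C/A", "UNIT", "SCP", "STATION"]

# Total traffic per turnstile across all intervals.
per_turnstile = traffic.groupby(turnstile_keys)["TRAFFIC"].sum()

# Keep the busiest turnstiles (use 10 on the real data).
top_n = per_turnstile.nlargest(2)

print(top_n)
```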