Women Tech Women Yes — Summer Gala 2019 Case Study

As students at the Metis Data Science Bootcamp in Chicago, we were challenged to optimise engagement for the Women Tech Women Yes Summer Gala.

Mission Statement:

Women Tech Women Yes (WTWY) is a fictitious organisation based in New York City, looking to optimise the placement of their street teams in subway stations in anticipation of their Summer 2019 Gala. Their goal is to raise awareness and promote women in tech. The street teams will collect email addresses to send invitations to the Summer Gala. WTWY’s priority is to maximise the number of emails collected and potential donations to their cause … this is where we come in …

Grand Central Station, New York City


We wanted to analyse individual turnstile data from subway stations across New York City in order to identify the busiest stations. From this, our goal was to isolate the peak days and times at which to place street teams at these stations.

Our second goal was to assess Census Data to target those more likely to show interest and potentially donate to WTWY’s cause.

(N.B. Due to the 4-day deadline on this project, we used 2017 turnstile data as opposed to 2018 data, as the 2017 data had been collated into a single .csv file and importing it was considerably faster.)

MTA Dataset Process:

We made the following assumptions when assessing this dataset:

  1. The busiest stations provide more potential signatures.
  2. By combining turnstile entrance and exit data, we included those leaving the station alongside those entering.

When using a dataset with 8.88M rows (yep, 8.88 million!) it was most important to understand the dataset, assess the statistics of the data and clean any rows that may cause problems later on in the analysis. In order to do this, we standardised the column names, removed any duplicate entries and ordered the data according to each individual turnstile.
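The cleaning steps described above can be sketched in pandas roughly as follows. This is a minimal illustration rather than our exact code: the raw header names (including the trailing whitespace on the exits column) and the date format are assumptions about the MTA file layout, and `clean_turnstile_data` is a hypothetical helper name.

```python
import pandas as pd

# Assumed column names after cleaning; the four-part turnstile key is
# ("C/A", "Unit", "SCP", "Station").
TURNSTILE = ["ca", "unit", "scp", "station"]


def clean_turnstile_data(df: pd.DataFrame) -> pd.DataFrame:
    """Standardise column names, drop duplicate readings, order by turnstile."""
    out = df.copy()
    # Strip stray whitespace, lower-case, and turn "C/A" into "ca".
    out.columns = [c.strip().lower().replace("/", "") for c in out.columns]
    # Assumed format: separate date and time strings such as "01/01/2017", "04:00:00".
    out["date_time"] = pd.to_datetime(
        out["date"] + " " + out["time"], format="%m/%d/%Y %H:%M:%S"
    )
    # A turnstile should have at most one reading per timestamp.
    out = out.drop_duplicates(subset=TURNSTILE + ["date_time"])
    return out.sort_values(TURNSTILE + ["date_time"]).reset_index(drop=True)


# Tiny made-up sample: one turnstile, with one duplicated reading.
raw = pd.DataFrame({
    "C/A": ["A002"] * 3,
    "UNIT": ["R051"] * 3,
    "SCP": ["02-00-00"] * 3,
    "STATION": ["59 ST"] * 3,
    "DATE": ["01/01/2017"] * 3,
    "TIME": ["04:00:00", "00:00:00", "04:00:00"],
    "ENTRIES": [120, 100, 120],
    "EXITS   ": [60, 50, 60],
})
cleaned = clean_turnstile_data(raw)
print(cleaned[["station", "date_time", "entries", "exits"]])
```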

Each turnstile is defined by 4 components (“C/A”, “Unit”, “SCP”, “Station”), represented as 4 separate columns in the dataset. We therefore had to group the rest of the data by these four components in order to analyse each individual turnstile.

(Documentation on individual components can be found here)

‘entries_index’ defined each individual turnstile

From here we calculated the daily entries and exits per turnstile, making one assumption along the way:

If any turnstile recorded over 1 million people in a day, there may be an error. Therefore, we reset the counter to zero (as demonstrated below). To put this into context, 1 million people per day through an individual turnstile would mean more than 11 people passing through it every second (1,000,000 / 86,400 ≈ 11.6). In hindsight we could have lowered this boundary even further.
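A rough sketch of this step is shown here, assuming the turnstile counters are cumulative, so that a day's traffic is the difference between consecutive days' first readings. The column names and the `daily_traffic` helper are hypothetical. Treating negative differences as errors is an extra assumption in this sketch; the only rule we actually applied was the 1 million cap.

```python
import pandas as pd

TURNSTILE = ["ca", "unit", "scp", "station"]  # assumed cleaned column names
MAX_DAILY_COUNT = 1_000_000  # about 11.6 people per second, all day long


def daily_traffic(df: pd.DataFrame) -> pd.DataFrame:
    """Daily entries, exits and combined traffic for each turnstile."""
    out = df.sort_values(TURNSTILE + ["date_time"]).copy()
    out["date"] = out["date_time"].dt.normalize()
    # First cumulative reading of each day for each turnstile.
    daily = out.groupby(TURNSTILE + ["date"], as_index=False)[["entries", "exits"]].first()
    # Next day's first reading minus today's first reading = today's count.
    nxt = daily.groupby(TURNSTILE)[["entries", "exits"]].shift(-1)
    delta = nxt - daily[["entries", "exits"]]
    # Implausible values (over the cap, or negative) are reset to zero.
    delta = delta.mask((delta < 0) | (delta > MAX_DAILY_COUNT), 0)
    daily["daily_entries"] = delta["entries"]
    daily["daily_exits"] = delta["exits"]
    daily["daily_traffic"] = daily["daily_entries"] + daily["daily_exits"]
    # The last day of each turnstile has no following reading, so it is dropped.
    return daily.dropna(subset=["daily_traffic"]).reset_index(drop=True)


# Made-up readings for one turnstile; the third day's entries counter jumps.
readings = pd.DataFrame({
    "ca": ["A002"] * 4,
    "unit": ["R051"] * 4,
    "scp": ["02-00-00"] * 4,
    "station": ["59 ST"] * 4,
    "date_time": pd.to_datetime([
        "2017-01-01 00:00", "2017-01-01 04:00",
        "2017-01-02 00:00", "2017-01-03 00:00",
    ]),
    "entries": [100, 150, 1_100, 5_001_100],
    "exits": [50, 70, 850, 1_650],
})
per_day = daily_traffic(readings)
print(per_day[["date", "daily_entries", "daily_exits", "daily_traffic"]])
```

In this sample the first day yields 1,000 entries plus 800 exits, while the second day's entries exceed the cap and are zeroed, leaving only its 800 exits.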

So what did we find …

We decided to focus on the top 5 busiest stations across the city (highlighted in light blue) and to find the best time of year, day of the week and time of day to place street teams, in order to maximise the efficiency of smaller teams.
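Ranking stations by total traffic could look something like the sketch below. The `daily` frame, its column names and the `busiest_stations` helper are illustrative assumptions, and the numbers are made up.

```python
import pandas as pd


def busiest_stations(daily: pd.DataFrame, n: int = 5) -> pd.Series:
    """Total traffic per station, largest first, limited to the top n."""
    totals = daily.groupby("station")["daily_traffic"].sum()
    return totals.nlargest(n)


# Made-up per-turnstile daily totals.
daily = pd.DataFrame({
    "station": ["34 ST-PENN STA", "34 ST-PENN STA", "23 ST", "59 ST"],
    "daily_traffic": [900, 1100, 700, 300],
})
top = busiest_stations(daily, n=2)
print(top)
```

One caveat with this approach: grouping on the station name alone merges any distinct stations that share a name, so a name that appears on several lines may rank higher than any single physical station would.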

For each station we analysed the change in traffic over the first six months of 2017. Assuming the Gala was in July 2019, we decided that data from July onwards was unnecessary for the analysis.

During this analysis, we found that foot traffic at Penn Station peaks at 350,000 on an approximately 4-weekly cycle, suggesting that weekdays are busier than weekends. The same pattern held for the remaining 4 stations, albeit at lower volumes, as expected. As each station showed a fairly consistent cycle from January to July, it was not appropriate to suggest a specific month or time period in which to deploy the street teams. Having recognised the 4-weekly cycle of peak traffic, our next step was to investigate the data on a weekly basis.

Using the individual turnstile data, we grouped each date by its corresponding day of the week. This allowed us to visually analyse which days over the year are busiest across the 5 stations. Further work would apply this analysis to each station individually.
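As an illustration of the day-of-week grouping, the sketch below sums traffic across stations for each date and then averages by weekday. Whether to sum or average is a choice made for this example, and the frame layout and `traffic_by_weekday` name are assumptions.

```python
import pandas as pd

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def traffic_by_weekday(daily: pd.DataFrame) -> pd.Series:
    """Average combined traffic for each day of the week."""
    # Combine all stations into one total per calendar date.
    per_date = daily.groupby("date", as_index=False)["daily_traffic"].sum()
    per_date["weekday"] = per_date["date"].dt.day_name()
    # Average over all dates that fall on the same weekday, in calendar order.
    return per_date.groupby("weekday")["daily_traffic"].mean().reindex(WEEKDAYS)


# Made-up daily totals; 2 January 2017 was a Monday.
daily = pd.DataFrame({
    "station": ["A", "B", "A", "A"],
    "date": pd.to_datetime(["2017-01-02", "2017-01-02", "2017-01-09", "2017-01-07"]),
    "daily_traffic": [100, 50, 200, 300],
})
by_weekday = traffic_by_weekday(daily)
print(by_weekday)
```

Adding `"station"` to both `groupby` calls would give the per-station breakdown suggested as further work.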

Interestingly, Saturday was the busiest day of the week across the top 5 stations, whilst traffic throughout the weekdays remained fairly consistent. As all 5 stations are located in Manhattan, we believe that tourists account for much of the traffic through these stations. This highlights the importance of considering demographic data when targeting those most likely to attend the gala.

Each subway turnstile records a count approximately every 4 hours, so we could analyse traffic per day in 4-hour increments. By combining weekday and weekend foot traffic, we can assess the busiest time of day across the 5 stations.
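One possible way to bin the readings into 4-hour blocks is sketched below. It assigns each interval to the block in which it starts and splits weekdays from weekends. The column names, the `traffic_by_time_block` helper and the filtering of negative or very large differences are assumptions made for this example.

```python
import pandas as pd

TURNSTILE = ["ca", "unit", "scp", "station"]  # assumed cleaned column names
MAX_COUNT = 1_000_000  # same error threshold as the daily counts


def traffic_by_time_block(df: pd.DataFrame) -> pd.Series:
    """Total entries plus exits per 4-hour block, split by weekday/weekend."""
    out = df.sort_values(TURNSTILE + ["date_time"]).copy()
    out["total"] = out["entries"] + out["exits"]
    # Previous reading of the same turnstile marks the start of the interval.
    prev = out.groupby(TURNSTILE)[["date_time", "total"]].shift(1)
    out["interval_start"] = prev["date_time"]
    out["riders"] = out["total"] - prev["total"]
    # Drop each turnstile's first reading and any implausible differences.
    out = out.dropna(subset=["riders"])
    out = out[(out["riders"] >= 0) & (out["riders"] <= MAX_COUNT)].copy()
    out["block_start_hour"] = out["interval_start"].dt.hour // 4 * 4
    out["day_type"] = out["interval_start"].dt.dayofweek.map(
        lambda d: "weekend" if d >= 5 else "weekday"
    )
    return out.groupby(["day_type", "block_start_hour"])["riders"].sum()


# Made-up readings: one turnstile on a Monday, another on a Saturday.
readings = pd.DataFrame({
    "ca": ["A002"] * 5,
    "unit": ["R051"] * 5,
    "scp": ["02-00-00"] * 3 + ["02-00-01"] * 2,
    "station": ["59 ST"] * 5,
    "date_time": pd.to_datetime([
        "2017-01-02 00:00", "2017-01-02 04:00", "2017-01-02 08:00",
        "2017-01-07 16:00", "2017-01-07 20:00",
    ]),
    "entries": [0, 10, 110, 1000, 1400],
    "exits": [0, 5, 25, 500, 700],
})
blocks = traffic_by_time_block(readings)
print(blocks)
```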

Throughout the week, the optimal placement of the street teams would be in the late afternoon and evening. Although traffic is moderately high from 8am to 12pm, we expect commuters to be more likely to stop on their way home from work than in the morning.

As expected, the distribution of traffic differs at the weekend and total numbers are much lower, potentially because of the decrease in traffic on Sundays. However, the late afternoon and evening remain the busiest times of day.

NYC Census Data Process:

Firstly, we made two assumptions regarding the gala and the demographic data.

  1. Those with higher income are more likely to donate
  2. Females may be more interested in the event than Males.

With these in mind, we focused on the gender distribution and income per capita across each borough in New York City. As with the MTA data, we checked for consistency and outliers in the data to prevent a skewed analysis.

As our top 5 stations are located in Manhattan, we wanted to evaluate the demographics in Manhattan in order to determine whether the residents are people we wish to target for the Gala.

Calculating average income per capita across each New York Borough
Calculating percentage of women living in each New York City Borough

Fortunately, across New York City, those living in Manhattan have the highest income per capita. Locating street teams in Manhattan therefore targets those more likely to donate to WTWY.

Secondly, we analysed the distribution of females across the city. As shown here, the highest proportions of females live in Brooklyn and Queens. Therefore, with more time, we may go back to our initial set of stations and consider including some in Brooklyn and Queens.
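The two borough-level figures can be sketched as follows, assuming tract-level census rows with hypothetical columns `Borough`, `TotalPop`, `Women` and `IncomePerCap`. Weighting income by tract population is a choice made for this example; a simple mean of tract values is the other obvious option and will give different numbers.

```python
import pandas as pd


def borough_summary(tracts: pd.DataFrame) -> pd.DataFrame:
    """Population-weighted income per capita and percentage of women per borough."""
    t = tracts.dropna(subset=["IncomePerCap"]).copy()
    # Total income in a tract = income per capita x population.
    t["income_total"] = t["IncomePerCap"] * t["TotalPop"]
    g = t.groupby("Borough")[["TotalPop", "Women", "income_total"]].sum()
    return pd.DataFrame({
        "income_per_capita": g["income_total"] / g["TotalPop"],
        "pct_women": 100 * g["Women"] / g["TotalPop"],
    })


# Made-up tract-level rows for illustration only.
tracts = pd.DataFrame({
    "Borough": ["Manhattan", "Manhattan", "Brooklyn"],
    "TotalPop": [1000, 3000, 2000],
    "Women": [520, 1500, 1100],
    "IncomePerCap": [60000.0, 80000.0, 30000.0],
})
summary = borough_summary(tracts)
print(summary)
```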

Conclusion & Recommendations

We would advise WTWY to place their street teams at the following stations:

  1. 34 St - Penn Station
  2. Grand Central - 42nd Street
  3. 23rd Street
  4. 34 St - Herald Square
  5. Times Square - 42 St

To make the most of peak traffic, we would suggest deploying the street teams Monday to Saturday from 4pm, at any time between January and July 2019.

Further Study:

This project can be extended by analysing the MTA data by day and time for each individual station, so that more specific conclusions can be drawn about which stations should be targeted at which times of day.

Further analysis using the census data could give insight into subway usage among the resident population, reducing the influence of tourists using the subway. Considering the location of business and tech hubs throughout the city could also inform where to place street teams. Integrating social factors and weighting their importance may give a more rounded analysis alongside the MTA subway data.