Gender Representation in Hackathon Hackers*

*Now with data


Hackathon Hackers (HH) has become the biggest Facebook group for hackathon attendees. Currently, it has over 18k members. It is a place to discuss hackathons, tech news, college, high school, and dank memes. In order to cover everyone's hackathon content needs, the original creator of HH, Dave Fontenot, has encouraged independent members to create a HH group for a common interest. There, a subsection of members can discuss content related to the topic. Once the HH group has a certain amount of activity, Dave and the admins promote it to a HH subgroup. Popular HH subgroups are: HH Design and HH What are You Working On?

HHers are a part of the tech job pipeline

Hackathon Hackers covers a specific subsection of the tech community: hackathon attendees. Generally, hackathon attendees are high school to college aged students preparing to get a tech job after graduation (or dropping out). Companies are interested in hiring attendees who show initiative and are learning unique skills outside of the classroom. Bigger hackathons, like HackMIT, have consistently raised hundreds of thousands of dollars for their weekend event through company sponsorship. Hackathon attendees are hired by companies for the highest paying internships in the country. In short, HHers are a part of the tech job pipeline.

Data source

HH is over a year old now. During that time, its members have held many discussions. Their discussions ranged from announcing hackathons and showcasing members’ own work to discussing tech news and diversity in tech issues. Alex Kern, a HH admin, collected all of the interactions on HH and the affiliated groups, and made the results public. The time range for this data is 7/1/2014–8/20/2015. Alex only collected information from HH affiliated and public groups. The content wasn’t collected in real time, so it does not include deleted posts. It includes information about members’ likes, comments, and posts. There are so many ways to slice this data.

Investigation inspiration

My inspiration for investigating the HH data was Tess Rinearson, a prolific hackathon attendee and host, who publicly quit hackathons after a HH post. Losing someone that big was bad. And, it left me wondering if there were other women silently quitting hackathons because of representation issues in the group. So, today, I focused on how much women were represented in the groups.

Further focus

I had 3 questions related to diversity in HH:

  • Are HH gender representation percentages better than industry?
  • Has gender representation changed over time in HH?
  • Are certain HH groups better than others in representing?

These three questions focus on female representation in HH and its public subgroups. I focused on representation because I assumed higher female representation would indicate HH is more welcoming to women. It would be more welcoming in two ways:

  • Women would self identify with more discussions.
  • It would normalize women in tech for everyone in the group.

To answer the above questions, I dove into the data.

Method

In order to measure how much women were represented in the HH ecosystem, I counted how much of the content contained male or female words. I did not pair the mentions of gender with sentiment analysis; that would be interesting to investigate further.

To determine gender representation in HH content, I analyzed the text in posts and comments. First, I normalized the text by lowercasing it and removing punctuation. Then, I gave each piece of content two binary “contains this gender” variables, one set when it contained a “male word” and one set when it contained a “female word”. Some examples of a “male word” are he, him, bro, and father. Some examples of a “female word” are she, her, sister, and mother.
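A minimal sketch of this labeling step in Python. It is an illustration under assumptions, not the code used for the analysis: the word sets hold only the example words listed above (the full lists are not reproduced here), and the function names `normalize` and `label` are hypothetical.

```python
import string

# Illustrative subsets only: these are the example words from the text,
# not the complete "gendered words" lists used in the analysis.
MALE_WORDS = {"he", "him", "bro", "father"}
FEMALE_WORDS = {"she", "her", "sister", "mother"}

# Translation table that deletes ASCII punctuation characters.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def normalize(text):
    """Lowercase the text and remove ASCII punctuation."""
    return text.lower().translate(_PUNCT_TABLE)


def label(text):
    """Return (contains_female, contains_male) for one post or comment."""
    tokens = set(normalize(text).split())
    return bool(tokens & FEMALE_WORDS), bool(tokens & MALE_WORDS)


contains_female, contains_male = label("My sister made this awesome hack! Hope you like it!")
```

One consequence of deleting punctuation before matching whole words is that a contraction such as "she's" becomes "shes" and no longer matches "she". Whether the original analysis handled that case is not stated.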

I chose to look at gendered words because posts are what new members would see first when deciding if a group is for them. For example, “My sister made this awesome hack! Hope you like it!” and “My nephew made this awesome hack! Hope you like it!” have no difference in the actual content of the hack. But, as a woman in tech, if I don’t see posts similar to the former at a rate reflecting at least industry numbers, I won’t think the group is for people like me.

All gendered words are words used in English or colloquially on the internet. Also, they are words that are generally used for one specific gender*. The complete collection of “gendered words” is found here. From these variables, I could count four kinds of content:

  • Content contains a female word => CF
  • Content contains a male word => CM
  • Content contains a female and a male word => CFM
  • Content contains no female or male words => CN

From there, I calculated a couple of key metrics:

  • Number gendered content => CG = CF + CM - CFM
  • Percentage gendered content => %CG = CG / (CG + CN) x 100%
  • Percentage female content => %CF = CF / CG x 100%
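The counts and metrics above can be sketched in Python as follows. This assumes each piece of content has already been reduced to a `(contains_female, contains_male)` pair, and that CF and CM each include content containing both genders (which is what subtracting CFM in the CG formula implies). The function name `summarize` is hypothetical.

```python
def summarize(labels):
    """Compute CF, CM, CFM, CN, CG, %CG and %CF.

    labels: iterable of (contains_female, contains_male) boolean pairs,
    one pair per post or comment.
    """
    labels = list(labels)
    cf = sum(1 for f, m in labels if f)                   # contains a female word
    cm = sum(1 for f, m in labels if m)                   # contains a male word
    cfm = sum(1 for f, m in labels if f and m)            # contains both
    cn = sum(1 for f, m in labels if not f and not m)     # contains neither
    cg = cf + cm - cfm                                    # gendered content, counted once
    pct_cg = 100.0 * cg / (cg + cn) if (cg + cn) else 0.0
    pct_cf = 100.0 * cf / cg if cg else 0.0
    return {"CF": cf, "CM": cm, "CFM": cfm, "CN": cn,
            "CG": cg, "%CG": pct_cg, "%CF": pct_cf}


# Toy input, not data from the analysis.
toy = [(True, False), (False, True), (True, True), (False, False), (False, False)]
stats = summarize(toy)
```

Subtracting CFM keeps content that mentions both genders from being counted twice in CG.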

I focused on 2 different types of member content: posts and comments. Posts are content a member submits directly to a group. Comments are content a member submits to a post. Posts are higher profile because everyone sees the content when it shows on their newsfeed.

The results of my investigation are below.

Results

The general metrics for the complete data set (7/1/2014–8/20/2015) are below. These results are used in the answers to all three of the questions I investigated.

  • Total content for all time: 330,516 (100.00% of total)
  • Total gendered content: 22,671 (6.89% of total)
  • Total female content: 4,430 (19.54% of gendered)
  • Total posts only for all time: 36,283 (10.50% of total)
  • Total gendered posts only: 4,517 (13.07% of posts)
  • Total female posts only: 620 (13.71% of gendered)

These numbers also give you an idea of the difference between total content and posts. Every %CF is the amount of female content divided by the amount of gendered content in that population. Posts, which are more visible than comments, have a statistically significantly lower percentage of female representation than the complete set. This is an interesting difference: relative to women, men are talked about more in posts than in comments. This could indicate a bias in what members deem worthy to post in HH, and in the gender split between people who comment on HH.

Earlier, I wrote a general summary. For more information about magnitude of content, visit my summary.

Comparison to industry standards

Many people talk about the pipeline as a fix to the tech gender gap. The pipeline theory generally assumes the way to close the gender gap is to wait on the people currently studying to move into industry. It assumes the younger generation is more diverse than current industry standards. Many of HH’s members are in either high school or college. Therefore, for the theory to be shown as true in HH, the percentage of female representation would have to be much higher than industry.

In 2013, Tracy Chou asked, “Where are the numbers?” Since then, multiple companies have released their diversity information. There is a lot of company diversity data to choose from, so I picked 3 comparison categories: the highest paying companies in the HH data set, the big 3 Silicon Valley companies, and universities. Of the top 10 highest paying HH employers, only 3 have released their tech worker diversity data: Dropbox, Groupon, and Yelp. The next group is the big Silicon Valley tech companies: Google, Facebook, and Apple. Lastly, I included NCWIT’s percentage of women graduating with CS degrees and Harvey Mudd’s numbers as a comparison of the average and above average CS department percentages. The percentages of women’s representation for these four groups (HH, the highest salary HH employers, the big 3 SV tech companies, and universities) are graphed below.

Color denotes category membership

Although the pipeline theory suggests that high school and college aged people will fill in the gender gap, these numbers only reflect the industry average. In fact, the percentage of posts (not comments) containing a female word is lower than industry representation. If this data is any indication, we probably shouldn’t rely on the pipeline to fill in the gender gap.

Representation over time

Tess’ post announcing her retirement from hackathons was in August 2014, the very beginning of HH. Since then, HH has had many discussions about gender issues in tech. So, I expected the percentage of female gendered posts to go up over time.

First, I looked at the total content’s percentages.

blue = month’s percentage, red = average percentage over all time

The overall trend was a very noisy signal centered around the mean of 19.50%. In further investigation, I took the 3 month moving average, and it did not have a trend up or down.
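For reference, a 3 month moving average over a monthly series can be sketched in Python like this. The text does not say whether the window was trailing or centered, so the trailing window here is an assumption, and the monthly values are made up for illustration.

```python
def moving_average(values, window=3):
    """Average each run of `window` consecutive values.

    Returns len(values) - window + 1 points, or an empty list when the
    series is shorter than the window.
    """
    if window < 1 or window > len(values):
        return []
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]


# Hypothetical monthly %CF values, not the real HH series.
monthly_pct_cf = [18.0, 21.0, 19.5, 20.5, 19.0]
smoothed = moving_average(monthly_pct_cf, window=3)
```

Smoothing like this damps month-to-month noise, which makes it easier to see whether the series drifts up or down.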

Next, I examined the percentage of female posts over time, assuming this category had more room for improvement. Across all content, female representation was 19.50%, but when only posts were included, the percentage dropped to 13.71%. The graph below visualizes the changing percent representation over time.

blue = month’s percentage, red = average percentage over all time

This view also didn’t show a trend over time. But, it seems to have cycles with a peak to peak period of about 4–5 months. I’m not sure if this is noise, or an indication of a latent source I haven’t found yet. It would be interesting to investigate this further, especially in a few months with more data.

The percentage of female mentions in content has not changed over time. But, the graph below shows the percentage of all posts and comments mentioning at least one gender has increased over the past year.

post trend slope = .15, comment trend slope = .06

The percentage of gendered comments has not changed significantly over time. But, the percentage of gendered posts has. Perhaps the conversation about gender in tech has elevated the percentage of posts that include a gender, but has done little to close the gender gap.
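A trend slope like the ones quoted for the graph can be estimated with an ordinary least-squares line fit of the monthly percentages against the month index. The sketch below assumes that approach. The text does not state how the slopes were computed or what their units are, and the sample values are invented.

```python
def trend_slope(values):
    """Ordinary least-squares slope of values against their index 0, 1, 2, ..."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two points to fit a slope")
    x_mean = (n - 1) / 2
    y_mean = sum(values) / n
    numerator = sum((i - x_mean) * (y - y_mean) for i, y in enumerate(values))
    denominator = sum((i - x_mean) ** 2 for i in range(n))
    return numerator / denominator


# Hypothetical monthly %CG values rising by 0.15 points per month.
monthly_pct_gendered = [10.0, 10.15, 10.30, 10.45]
slope = trend_slope(monthly_pct_gendered)
```

Under this reading, a slope of .15 would mean the percentage of gendered posts grows by about 0.15 percentage points per month.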

Comparison between groups

There could be one or two groups that are incredibly active, but have a very low percentage of female mentions in gendered posts. If this is the case, there should be a few groups under representing, and a lot of groups over representing to compensate for the under performers. So, to determine whether a group’s percentage of female content is statistically significantly different from the average, given how much gendered content the group has, I used 1-sample binomial hypothesis testing. Only groups with p < 0.05 were considered notably different from the mean.
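A sketch of a 1-sample binomial test in Python, using only the standard library. It is one common formulation (an exact two-sided test that sums the probabilities of every outcome no more likely than the observed one) and may differ from the exact procedure used here. The group counts are hypothetical, and 0.195 stands in for the overall mean mentioned in the text.

```python
from math import exp, lgamma, log


def binomial_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p), with 0 < p < 1, computed in log space."""
    log_pmf = (lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
               + k * log(p) + (n - k) * log(1 - p))
    return exp(log_pmf)


def binomial_test_two_sided(k, n, p0):
    """Exact two-sided p-value for observing k successes in n trials under p0."""
    pmfs = [binomial_pmf(i, n, p0) for i in range(n + 1)]
    # Small relative tolerance so ties with the observed outcome are included.
    threshold = pmfs[k] * (1 + 1e-7)
    return min(1.0, sum(pm for pm in pmfs if pm <= threshold))


# Hypothetical group: 40 pieces of gendered content, none with a female word.
p_value = binomial_test_two_sided(k=0, n=40, p0=0.195)
is_notably_different = p_value < 0.05
```

Here `k` is a group's female content count (CF), `n` is its gendered content count (CG), and `p0` is the overall proportion the group is compared against. Working in log space keeps the calculation stable for groups with thousands of gendered posts and comments.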

There are 53 HH public subgroups and Hackathon Hackers makes the total group count 54. As a reminder, the total average for all content was 19.50% female representation.

Out performers:

The 5 groups with a statistically significantly higher %CF than the mean were: HH Constructive Debates, Hackathon Hackers, Latin@ Hackers, HH: Snackathon Snackers, and HH Hacker Problems.

Ordered by descending statistical significance

Shout out to HH: Snackathon Snackers, HH Constructive Debates, and Latin@ Hackers for having over 30% of all gendered content including female words. That is amazing!

The total amount of gendered content across these 5 groups was 17,972 posts and comments. Hackathon Hackers had 17,245. The rest of the groups had a combined 725. Hackathon Hackers’ %CF of 21.05% shows the main group has slightly better than average representation. But, 21% representation is nothing groundbreaking, and does not beat Apple’s tech worker percentage (22%).

Under representing

The 18 HH groups that have a statistically significantly lower %CF than the mean are: HH iOS, HH: What Are You Working On?, HH Websites and Resumes, HH Data Hackers, HH: Javascript, Stackathon Stackers, HH λ, HH MHacks, HH Hardware Hackers, HH: VR, HH Canada Eh?, HH Webdev, HH Skillshare, Hackathon Hackers EU, HH Ruby, HH Python, HH Growthhacking, and HH Design. Their respective %CF is shown below in the graph.

Ordered by descending statistical significance

The total amount of gendered content over these 18 groups was 3,918. This list also contains 4 of the 5 most active subgroups. 4 groups have lower than 5% female representation in their gendered posts. These groups that under represent women in their content are not negligible. Much of their content is performing way under the baseline of the industry standard. I wonder how many women have self selected out of these groups because the group didn’t post about people like them. Some groups do have good representation. But, when the average %CF is so low over all groups, the sheer number of groups under the average is discouraging.

Conclusion

In this post, I explored three questions using the HH data.

Are HH gender representation percentages better than industry?

No. They are about average. From this data, we cannot expect the pipeline to noticeably bridge the gender gap in the coming years.

Has gender representation changed over time in HH?

No, the percentage of female mentions in gendered content did not go up over time. But, there was a slight increase in the percentage of posts containing a gendered word over time.

Are certain HH groups better than others in representing?

Yes, some are statistically significantly better than average, but 1/3 of HH subgroups are statistically significantly worse than average.

These results were far from ideal. Ideally, they would have shown that %CF was growing over time and was consistently better than industry representation, but they didn’t.

Take away

HH is becoming a big step on the technology pipeline, yet the numbers show we aren’t at an ideal level of female representation. Thankfully, this is easy to change. We can all help by not hesitating to post about or converse with women in tech groups. Including women in discussions is one of the main ways to create an inclusive culture in a group. With this small change, we can keep HH and its subgroups from becoming a bottleneck for women in tech and hackathons. Happy posting y’all!


Thanks to everyone who helped me on this!

*Note for methods: This analysis focuses on men and women as expressed in the text content of posts. It does not include non-binary gendered words. It also does not focus on the members’ genders, or the interactions of members based on their gender. Racial diversity is not a focus of this analysis either. These focuses are interesting and should be explored later; they are just not the ones I used in this article.