Our analysis of Hackathon Projects: Introduction

Over the past few years, hackathons have become more than just programming competitions. Apart from showcasing talent, they have become places to network. With sponsors getting involved, hackathons (especially at colleges) are like the new career fairs. We decided to look at various trends in projects submitted at hackathons. Hackathon project and user data can help find people with the right skills, especially for companies looking to hire interns.

Source: The hackathon data was obtained from Devpost through their API, in JSON format. The data used for analysis covers submissions up to November 2015.

Data: The data fell into two categories: hackathon project data and user data. The project data consisted of attributes such as the project description, members, project tags, like counts, and whether the project won at a hackathon. The tags were either the type of project (mobile, hardware, or web) or the technologies and tools used in developing it. The user data consisted of the user's name, location, number of hackathons, and number of projects. Overall, we had around 35,000 projects and 45,000 users. We wrote Python scripts for data preprocessing and formatting. The complete source is here.
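The preprocessing scripts boiled the raw API responses down to the attributes listed above. A minimal sketch of that step, assuming hypothetical field names (the actual Devpost JSON schema may differ):

```python
import json

def parse_projects(raw_json):
    """Convert a JSON string of Devpost projects into a list of plain dicts.

    The field names ("members", "tags", "like_count", "winner") are
    assumptions; the real Devpost schema may use different keys.
    """
    projects = []
    for p in json.loads(raw_json):
        projects.append({
            "description": p.get("description", ""),
            "members": p.get("members", []),
            # normalize tags to lowercase so "Swift" and "swift" match
            "tags": [t.lower() for t in p.get("tags", [])],
            "likes": p.get("like_count", 0),
            "winner": bool(p.get("winner", False)),
        })
    return projects
```

The lowercase normalization matters later, when tags are counted and thresholded across tens of thousands of projects.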

Preprocessing: Before doing any analysis, we performed some preprocessing on the project data. First, we sampled the complete data, selecting the following major hackathons: PennApps, MHacks, Hack the North, HackGT, HackSC and YHacks. Combined, this gave a subsample of around 10,000 projects. Since we were analyzing projects based on their tags, tags that occurred in fewer projects than a specified threshold were removed, and projects without any tags were not considered. The filtering gave the following results:

On the complete data: for a threshold of at least 100 projects, we obtained 126 unique tags. The tag count decreases as the threshold increases.

On the sampled data (10,000 projects): for a threshold of 55 projects, 113 unique tags were found.

This preprocessing helped filter out noisy data. For example, when looking for projects created using the Swift language, one of the tags found was “taylor-swift”. Such tags had to be filtered out. In all cases, 1–5% of tags covered 90–99% of all projects.
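The threshold filtering described above can be sketched as follows, assuming each project is a dict with a "tags" list (a simplification of whatever the original scripts used):

```python
from collections import Counter

def filter_tags(projects, threshold):
    """Keep only tags appearing in at least `threshold` projects,
    then drop projects left with no tags at all."""
    # count in how many distinct projects each tag occurs
    counts = Counter(tag for p in projects for tag in set(p["tags"]))
    keep = {tag for tag, c in counts.items() if c >= threshold}
    filtered = []
    for p in projects:
        tags = [t for t in p["tags"] if t in keep]
        if tags:  # projects without any surviving tags are not considered
            filtered.append({**p, "tags": tags})
    return filtered, keep
```

With a threshold of 2, a tag like “taylor-swift” that appears in a single project is dropped, while a common tag like “swift” survives.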

Feature Vector: After data cleaning, we created a feature vector for each project consisting of its tags, the number of members (n) on the team, and the experience of the team members. Experience was the cumulative total of the number of projects the members had participated in.

We used binning when creating the feature vector, capping numeric attributes like member count and experience at a threshold. If a team had more than 4 members, we kept the member count at 4. If the combined project count of all team members was greater than 10, we capped it at 10.
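Putting the two paragraphs above together, a feature vector is one-hot tag indicators plus the two capped numeric attributes. A minimal sketch, assuming members are dicts with a "project_count" field (a made-up name for illustration):

```python
MAX_MEMBERS = 4      # teams larger than 4 are binned as 4
MAX_EXPERIENCE = 10  # cumulative project counts above 10 are binned as 10

def feature_vector(project, vocab):
    """Build a feature vector: one-hot tags + capped member count
    + capped cumulative team experience.

    `vocab` is the sorted list of tags kept after threshold filtering.
    """
    tag_set = set(project["tags"])
    tags = [1 if t in tag_set else 0 for t in vocab]
    members = min(len(project["members"]), MAX_MEMBERS)
    # experience = total number of projects across all team members, capped
    experience = min(
        sum(m.get("project_count", 0) for m in project["members"]),
        MAX_EXPERIENCE,
    )
    return tags + [members, experience]
```

Capping keeps a handful of very large or very experienced teams from dominating the numeric features.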

What we were looking for: We looked at the following patterns and trends in the data:

  1. The tags occurring in the most projects and the most winning projects.
  2. Tags with highest win percentage.
  3. Winning trends based on team size and team experience.
  4. Using a classifier to predict a future winner based on team composition and project tags. This was done on the sampled data.
  5. Participant distribution across the US and its states. Of all 45,000 users, around 26,000 had specified a location in their profile, of which around 17,000 were in the US.
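For item 4, a classifier over the feature vectors can predict whether a team composition plus a set of tags looks like a winner. As a stand-in for whatever algorithm the original analysis used, here is a tiny naive-Bayes sketch over tags alone; the function names and data shape are assumptions:

```python
import math
from collections import defaultdict

def train_win_model(projects):
    """Count tag occurrences per class (won / did not win)."""
    counts = {True: defaultdict(int), False: defaultdict(int)}
    totals = {True: 0, False: 0}
    for p in projects:
        label = p["winner"]
        totals[label] += 1
        for t in p["tags"]:
            counts[label][t] += 1
    return counts, totals

def predict_win(model, tags):
    """Naive Bayes with add-one smoothing: does `tags` look like a winner?"""
    counts, totals = model
    n = totals[True] + totals[False]
    scores = {}
    for label in (True, False):
        # log prior + sum of log likelihoods for each tag
        score = math.log((totals[label] + 1) / (n + 2))
        for t in tags:
            score += math.log((counts[label][t] + 1) / (totals[label] + 2))
        scores[label] = score
    return scores[True] > scores[False]
```

Trained on the sampled ~10,000 projects, a model like this would surface which tag combinations historically correlate with winning.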

General results: The following section shows which tags occurred in the most projects and which tags had the highest win percentage. Here we're using the complete data.

Tag cloud of all projects (larger text means more projects)
Tag cloud for winning projects
Win-to-project ratio (minimum 100 projects)
Win-to-project ratio (minimum 500 projects)
Scatter plot of projects vs. wins

The next post in the series will discuss participants and their locations across the US.