Our analysis of Hackathon Projects: Part 3

Kanishk Tripathi
5 min read · Jan 6, 2016


In this post, we present more results from our analysis of Hackathon projects. Some basic trends and results were discussed in Parts 1 and 2. This post covers the trends observed in winning projects based on team size and experience. We also try to fit a classifier on the projects to predict a future winning entry. This analysis was done on a sub-sample of 10,000 projects.

Feature vector: Our feature vector consisted of filtered tags, team experience and the number of members in the team. The details of the feature vector and how we preprocessed the data are given in Part 1. For consistency, the threshold for team members was set to 4: if a team had more than 4 members, we capped the count at 4. The same principle was applied to team experience, with the threshold set to 10.
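As a rough illustration, a feature vector along these lines could be built as follows (the tag vocabulary and field names below are made up for the example; only the caps of 4 members and 10 experience come from our setup):

```python
# Minimal sketch of the feature-vector construction described above.
# The project dictionary keys and the tiny tag vocabulary are hypothetical;
# only the caps (4 members, 10 experience) come from the post.

TAG_VOCAB = ["hardware", "node.js", "ios"]  # in practice: the full set of filtered tags

def build_feature_vector(project):
    # One binary feature per tag in the filtered vocabulary
    tags = [1 if t in project["tags"] else 0 for t in TAG_VOCAB]
    # Cap team size at 4 and team experience at 10
    members = min(project["num_members"], 4)
    experience = min(project["avg_experience"], 10)
    return tags + [members, experience]

example = {"tags": {"hardware", "ios"}, "num_members": 6, "avg_experience": 12}
print(build_feature_vector(example))  # -> [1, 0, 1, 4, 10]
```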

Team member and experience trends: We looked for more trends based on user experience. We found that the probability of winning a hackathon is greater with more members in the team. Even an experienced person has lower chances of winning when working with fewer people, while a team of 3 or more people, even with less experience, has a higher chance of winning.

Winning project distribution w.r.t number of team members.

From the chart above, teams with an average experience of around 2 or 3 projects have a higher chance of winning, whereas teams with an average experience of around 9 or 10 have a lower chance. This suggests that a team of 3–4 members with average experience does better than a team with only 1 member who has an experience of 10 projects.

Discriminating tags: The tags that are most frequent in the projects are not necessarily discriminative of the classes. To find the tags that differentiate winning teams from non-winning teams, we used a simple entropy-based measure called information gain. The higher the information gain of a tag, the purer the partitions are if a decision tree is split on it.

About information gain: Suppose we were to partition the tuples in partition D on some attribute A. Then the information gain for attribute A is given by the formula,
Gain(A) = Info(D) − Info_A(D)
Info(D) is intuitively just the average amount of information needed to identify the class label of a tuple in partition D. It is also known as the entropy of D, and is given by,

Info(D) = −Σ_i p_i log2(p_i)

where p_i is the probability that an arbitrary tuple in D belongs to class C_i.

Now, suppose we wish to split partition D on attribute A having v distinct values. Ideally we would want the partitions to represent only one of the classes, but it is quite likely that we end up with impure partitions. Thus for each attribute we calculate how much more information we would still need after partitioning on that attribute to arrive at pure partitions. It is quantified as,

Info_A(D) = Σ_j (|D_j| / |D|) × Info(D_j)

where D_j is the subset of tuples in D taking the j-th value of A.
The smaller the information still required, the purer the partition. The attribute A with the highest information gain, Gain(A), is chosen as the splitting attribute. This is equivalent to saying that we want to partition on the attribute A that would do the “best classification,” so that the amount of information still required to finish classifying the tuples is minimal. One disadvantage of this measure is that it tends to be biased towards attributes with many distinct values, although this doesn’t affect performance in our case since the attributes are binary.
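For concreteness, here is a minimal sketch of computing Info(D), Info_A(D) and Gain(A) for a single binary tag (the toy data is made up purely for illustration):

```python
import math

def entropy(labels):
    # Info(D): -sum over classes of p_i * log2(p_i)
    n = len(labels)
    probs = [labels.count(c) / n for c in set(labels)]
    return -sum(p * math.log2(p) for p in probs if p > 0)

def info_gain(tag_values, labels):
    # Gain(A) = Info(D) - Info_A(D) for a binary tag A (present / absent)
    n = len(labels)
    info_a = 0.0
    for v in set(tag_values):
        subset = [l for t, l in zip(tag_values, labels) if t == v]
        info_a += len(subset) / n * entropy(subset)
    return entropy(labels) - info_a

# Hypothetical toy data: 1 = project uses the tag, labels are win / lose
tag = [1, 1, 0, 0, 1, 0]
labels = ["win", "win", "lose", "lose", "win", "lose"]
print(info_gain(tag, labels))  # 1.0: this tag perfectly separates the classes
```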

In our data, Hardware, Node.JS and iOS were among the tags with the highest information gain.

Classifier:

Given the trends observed in the tags and team compositions, we tried to fit a classifier model to the data to predict future entries. We used three different classifiers, namely:
1. Naïve Bayes
2. C4.5 [8]
3. Random Forest (Ensemble)

The classifiers were run using Weka. 10-fold cross validation was performed to assess the performance of each classifier. The following table shows the performance of each classifier.
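We ran the classifiers in Weka, but roughly the same experiment could be sketched in scikit-learn as follows, with a CART decision tree standing in for C4.5 and random placeholders for the real feature vectors and labels:

```python
# Rough scikit-learn analogue of the Weka experiment described above.
# X, y are placeholders for the real tag/team feature matrix and win/lose labels.
import numpy as np
from sklearn.naive_bayes import BernoulliNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X = np.random.randint(0, 2, size=(1000, 112))   # placeholder feature vectors
y = np.random.randint(0, 2, size=1000)          # placeholder win/lose labels

classifiers = {
    "Naive Bayes": BernoulliNB(),
    "Decision tree (C4.5 stand-in)": DecisionTreeClassifier(),
    "Random Forest": RandomForestClassifier(n_estimators=100),
}

for name, clf in classifiers.items():
    scores = cross_val_score(clf, X, y, cv=10)  # 10-fold cross validation
    print(f"{name}: {scores.mean():.3f} accuracy")
```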

Random Forest Accuracy vs. Number of Tag Features:
Each feature vector in our training set has 110 tags. The more features there are, the higher the computational complexity and the time required for training. In the following experiment we studied the effect of reducing the number of features on the performance of the Random Forest classifier. From the chart below we can see that as the number of features (tags) is reduced, the accuracy of the classifier drops.
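A hedged sketch of this experiment: rank the tag features (for example by a mutual-information score; the exact ranking we used is not detailed here) and retrain the Random Forest on the top k tags:

```python
# Sketch of the feature-reduction experiment: score the tag features and
# retrain the Random Forest on the top k. X, y are random placeholders
# standing in for the real feature vectors and labels.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.model_selection import cross_val_score

X = np.random.randint(0, 2, size=(1000, 110))   # placeholder: 110 binary tag features
y = np.random.randint(0, 2, size=1000)          # placeholder win/lose labels

for k in (110, 80, 50, 20, 5):
    X_k = SelectKBest(mutual_info_classif, k=k).fit_transform(X, y)
    acc = cross_val_score(RandomForestClassifier(), X_k, y, cv=10).mean()
    print(f"top {k} tags: accuracy {acc:.3f}")
```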

Addressing the class imbalance problem: Class-imbalanced data is data in which the main class of interest is rare. In our data, winning projects make up only about 10% of the total.

Class imbalance

We found that the Random Forest ensemble classifier gave good results on this dataset. We also tried techniques like oversampling, undersampling and SMOTE, but they led to reduced precision.
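As an illustrative sketch (not our exact Weka setup), the resampling comparison could be run with the imbalanced-learn library:

```python
# Sketch of the resampling comparison using imbalanced-learn.
# The actual experiments were run in Weka; X, y below are placeholders
# with roughly the 10% winning / 90% non-winning split seen in our data.
import numpy as np
from collections import Counter
from imblearn.over_sampling import SMOTE, RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler

X = np.random.randint(0, 2, size=(1000, 112))
y = np.array([1] * 100 + [0] * 900)   # ~10% winning projects

for name, sampler in [("SMOTE", SMOTE()),
                      ("oversampling", RandomOverSampler()),
                      ("undersampling", RandomUnderSampler())]:
    X_res, y_res = sampler.fit_resample(X, y)
    print(name, Counter(y_res))
```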

Conclusion: This is only the initial phase of our study of the data and of learning the concepts of data mining. Any comments (even ones pointing out mistakes) are most welcome, as are suggestions for improvements. We did not have any time-based (temporal) data, otherwise we could have looked for more patterns. Thanks to the people at Devpost for helping with the data and API.
