Designing a Better Judging System

Competition judging is hard. Designing a fair way to rank entries is difficult in general, and it’s especially difficult for large-scale events.

As an attendee of many hackathons, and as an organizer of HackMIT, I wanted to figure out a way to improve the quality of judging at hackathons and large-scale competitions in general. Sometimes, poor judging is a result of neglect or lack of organization. However, most judging methods themselves are inherently flawed. It’s possible to do a lot better.

Large-scale events pose a particular challenge for anyone designing a judging system. Because the events are so large, no single judge can judge all the entries. In fact, at HackMIT, the average judge looked at only 5% of the projects. The challenge is to figure out what data to collect and how to combine it into good overall results.

Given the size of the event, the number of judges, and the time available for judging, there is a fundamental limit on how much data it’s possible to collect. The question is: given these constraints, how do we produce the highest-quality judging results possible?

It turns out that we can radically change the judging model: instead of having judges produce absolute scores, we can ask them to perform pairwise comparisons. Using statistical inference methods, it’s possible to take data of this form and produce high-quality judging results.
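
To make the idea concrete, here is a minimal sketch of turning pairwise comparisons into a ranking. It assumes a Bradley-Terry model fitted with a simple iterative update, which is one common choice for this kind of data and not necessarily the method described in the full post. The function name `rank_entries`, the `prior` smoothing parameter, and the example data are all hypothetical, made up for illustration.

```python
def rank_entries(comparisons, n_iters=200, prior=0.5):
    """Rank entries from (winner, loser) pairs.

    Assumption: a Bradley-Terry model, where entry i beats entry j with
    probability s_i / (s_i + s_j). `prior` adds a fractional win and loss
    against a virtual opponent of strength 1, so that an entry with no
    wins (or no losses) still gets a finite, positive strength.
    """
    entries = sorted({e for pair in comparisons for e in pair})
    wins = {e: 0.0 for e in entries}
    counts = {}  # counts[(i, j)] = how many times i and j were compared
    for winner, loser in comparisons:
        wins[winner] += 1.0
        counts[(winner, loser)] = counts.get((winner, loser), 0.0) + 1.0
        counts[(loser, winner)] = counts.get((loser, winner), 0.0) + 1.0

    strength = {e: 1.0 for e in entries}
    for _ in range(n_iters):
        updated = {}
        for i in entries:
            # Two virtual comparisons (one win, one loss), each weighted by prior.
            denom = 2.0 * prior / (strength[i] + 1.0)
            for j in entries:
                n_ij = counts.get((i, j), 0.0)
                if j != i and n_ij > 0:
                    denom += n_ij / (strength[i] + strength[j])
            updated[i] = (wins[i] + prior) / denom
        strength = updated

    ranking = sorted(entries, key=lambda e: strength[e], reverse=True)
    return ranking, strength


# Hypothetical data: each tuple is one judge's decision, (preferred, other).
example = [("A", "B"), ("B", "C"), ("A", "C"), ("A", "B"), ("C", "B")]
ranking, strength = rank_entries(example)
print(ranking)
```

Note that no judge has to see every entry here: each one only reports which of two projects they preferred, and the model combines those local decisions into a single global ordering.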

The method of pairwise comparisons actually works pretty well in practice! We relied on it for judging at Blueprint, HackMIT’s high school hackathon, and we’re thinking of using some form of it starting with HackMIT 2015 as well.

Math doesn’t show up that well on Medium, so the rest of the post is here: