Assuring Quality in Data Labeling

Our experience delivering quality labels for machine learning with crowdsourced workers

SUPA
Supa Blog
Oct 31, 2023


Acquiring high-quality labeled data has been a long-standing challenge for the data labeling industry. Challenges include:

  1. Diverse Needs: Every ML project has its own specific requirements.
  2. Rising Standards: As models get better, they need better-quality data.
  3. Identifying Proficient Labelers: Limited visibility into the traits that make an effective labeler.

Given the above challenges, a key question we had was: how might we improve quality in data labeling while accounting for the different needs of projects, rising standards, and the need to identify proficient labelers?

It Starts With Measuring

The starting point for improving quality was figuring out how to measure it. We couldn’t move a needle that didn’t exist yet.

Initially, we approached quality by identifying what it isn’t: inaccurate labels. The goal was to minimize labeling errors as much as possible based on specific use cases.

However, how do we define an inaccurate label? On what basis? The answer was quite simple in hindsight: every data labeling project came with a set of instructions from the client, and those instructions are the source of truth for labelers working on the project.

The rules for what makes a label wrong therefore came from the client’s instructions, which act as the definitive guide for the people doing the labeling. By comparing labels against the instructions and spotting any mistakes, we could measure how well an individual or a whole project was doing, which told us where to make improvements.

Example: Categorizing Mistakes for Image Annotation

Image annotation involves assigning labels to pixels or regions of an image, and the exact form varies by use case. Consider a simple project where a labeler has to draw bounding boxes around cats and dogs.

Sample Image for our Project

For image annotation, mistakes can be divided into the following general categories, based on prior research:

  • Misdrawn Annotations: Annotations with bad boundaries, e.g. drawn too tight or too loose.
  • Mislabeled Annotations: Annotations with the wrong label, e.g. a cat labeled as a dog.
  • Extra Annotations: Unnecessary or additional annotations that don’t fit the project instructions.
  • Missing Annotations: Annotations that should have been drawn but were not.
Illustration of potential mistakes in our example project
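
To make these categories concrete, here is a minimal Python sketch of how a labeler’s boxes could be checked against a set of reference boxes and sorted into the four buckets. The Box format, IoU thresholds, and matching rules are illustrative assumptions, not a description of our production tooling.

```python
# A minimal sketch (assumed box format and thresholds) of sorting a labeler's
# bounding boxes into the four mistake categories described above.

from dataclasses import dataclass

@dataclass
class Box:
    label: str   # e.g. "cat" or "dog"
    x1: float
    y1: float
    x2: float
    y2: float

def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two boxes."""
    ix1, iy1 = max(a.x1, b.x1), max(a.y1, b.y1)
    ix2, iy2 = min(a.x2, b.x2), min(a.y2, b.y2)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a.x2 - a.x1) * (a.y2 - a.y1)
    area_b = (b.x2 - b.x1) * (b.y2 - b.y1)
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def categorize_mistakes(reference: list[Box], submitted: list[Box],
                        good_iou: float = 0.9, match_iou: float = 0.5) -> dict:
    """Compare a labeler's boxes to reference boxes and count each mistake type."""
    counts = {"misdrawn": 0, "mislabeled": 0, "extra": 0, "missing": 0, "correct": 0}
    unmatched_refs = list(reference)

    for box in submitted:
        # Find the best-overlapping reference box that is still unmatched.
        best = max(unmatched_refs, key=lambda r: iou(r, box), default=None)
        overlap = iou(best, box) if best else 0.0

        if best is None or overlap < match_iou:
            counts["extra"] += 1          # no object here per the instructions
        elif box.label != best.label:
            counts["mislabeled"] += 1     # e.g. a cat labeled as a dog
            unmatched_refs.remove(best)
        elif overlap < good_iou:
            counts["misdrawn"] += 1       # right object, boundary too loose or tight
            unmatched_refs.remove(best)
        else:
            counts["correct"] += 1
            unmatched_refs.remove(best)

    counts["missing"] = len(unmatched_refs)  # objects the labeler never drew
    return counts

# Example: the reference has one cat and one dog; the labeler drew a good dog
# box plus a stray box on empty background.
ref = [Box("cat", 10, 10, 50, 50), Box("dog", 60, 20, 120, 90)]
sub = [Box("dog", 61, 21, 119, 89), Box("cat", 200, 200, 220, 220)]
print(categorize_mistakes(ref, sub))
# {'misdrawn': 0, 'mislabeled': 0, 'extra': 1, 'missing': 1, 'correct': 1}
```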

Introducing the Accuracy Scorecard

This led to the conception of the Accuracy Scorecard. Think of it as a precise ledger where we recorded the mistakes made in an image annotation project, broken down by the categories above. We then used a simple formula to roll those mistakes up into a score.
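
As an illustration, here is a minimal sketch of one possible scoring formula, assuming accuracy is simply the share of correct annotations out of everything the labeler was expected to produce (correct annotations plus every category of mistake). This is an assumption for the example; the exact formula can be tuned per project.

```python
# A sketched accuracy formula: correct annotations divided by correct
# annotations plus all recorded mistakes. The equal weighting of mistake
# types is an assumption and can be adapted per project.

def accuracy_score(correct: int, misdrawn: int, mislabeled: int,
                   extra: int, missing: int) -> float:
    total = correct + misdrawn + mislabeled + extra + missing
    return correct / total if total else 1.0

# Example: 42 correct boxes, 3 misdrawn, 1 mislabeled, 2 extra, 2 missing.
print(f"{accuracy_score(42, 3, 1, 2, 2):.2f}")  # 0.84
```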

Application of the scorecard gave us a clear view of performance at both the project and individual level.

Example of Annotator Scores in a Given Project

The Result

Implementing the scorecard across multiple projects proved a resounding success. Projects started to improve slowly but steadily over time due to the following key factors:

  1. Feedback for Labelers: We were able to create individualized feedback for data labelers using the data provided by the scorecard. This extended to correcting gaps in their understanding of the project instructions, e.g. many extra or missing annotations typically signaled that a labeler did not fully understand the project requirements.
  2. Culture Shift: Labelers engaged with SUPA as a community to solicit feedback and advice on how to improve their scores. They even collaborated in small teams to validate one another’s work and understanding of the project.
  3. Root Cause Identification: Visibility of scores at the project level helped us trace quality challenges to their root cause and fix them, e.g. unclear instructions or gaps in understanding of the tooling.

Curious about how we approach data labeling? Visit supa.so to find out more about our perspective on data labeling in 2023!
