6 Tactics to Maximize the Quality of your Data Annotation

Learn techniques to make sure your data annotation tasks produce quality results

Michael
DataTorch
6 min read · Mar 2, 2021


Annotating MNIST in DataTorch. If it were always this easy, you wouldn’t be reading this article.

1. Provide Example Sets

Example sets, also known as “golden”, “benchmark”, or “ground-truth” datasets, are the most straightforward quality assurance tool you can employ. By presenting your annotators with example datasets containing perfectly annotated instances of each class you wish to label, you create both a useful training tool and a baseline that lets you calculate agreement metrics to evaluate each annotator’s accuracy. Seeing just a few examples can intuitively teach annotators each class’s defining characteristics, and including edge cases gives them clarity in situations where the label might be uncertain.

If you intend to calculate accuracy metrics with these example datasets, the composition of the classes within the dataset will be important; each class should either be represented equally, or have the same distribution as the data to be annotated.
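As a quick sanity check, you can compare the class composition of your example set against the pool you plan to annotate. Below is a minimal sketch in Python; the label lists are hypothetical placeholders for your own data.

```python
from collections import Counter

def class_distribution(labels):
    """Return each class's share of a list of labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {cls: n / total for cls, n in counts.items()}

# Hypothetical labels for the example (gold) set and the pool to be annotated.
gold_labels = ["cat", "dog", "cat", "bird", "dog", "cat"]
pool_labels = ["cat"] * 500 + ["dog"] * 350 + ["bird"] * 150

gold_dist = class_distribution(gold_labels)
pool_dist = class_distribution(pool_labels)

for cls in sorted(set(gold_dist) | set(pool_dist)):
    print(f"{cls:>5}: example set {gold_dist.get(cls, 0):.2f}  pool {pool_dist.get(cls, 0):.2f}")
```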

2. Calculate Normalized Agreement Metrics from Example Sets

Once you have a good example dataset, you can use it to calculate metrics to determine a particular annotator’s performance.

The simplest accuracy metric is the proportion of annotations that agree with the example dataset. However, this value is typically adjusted to account for the effect of random guessing. The normalization formula is as follows:

Normalized Accuracy = (Raw Accuracy - Probability of Guessing Correctly) / (1 - Probability of Guessing Correctly)

The “Probability of Guessing Correctly” depends on the baseline you want to normalize against. Take a hypothetical task where annotators mark whether people are left-handed or right-handed from photos of people writing with their dominant hand. You could use 0.5 if you want the baseline to be a totally random guess, or 0.9 if you want it to be always guessing the most common class, i.e. guessing right-handedness every time (roughly 90% of people are right-handed). You could also use the frequency of each class as it occurs in the dataset.

It is also important to know how to interpret these agreement statistics: the value ranges from -1 to 1. A value of 0 means agreement is equivalent to random chance, 1 indicates perfect agreement, and -1 perfect disagreement. Any negative value means agreement is lower than random chance, which usually indicates the annotator has a fundamental misunderstanding of the annotation task.
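To make this concrete, here is a minimal Python sketch of the handedness example above. The label lists are invented, and the 0.5 and 0.9 baselines are the same ones discussed in the text.

```python
def normalized_agreement(annotator_labels, gold_labels, p_chance):
    """Chance-corrected accuracy against an example (gold) dataset:
    (raw accuracy - probability of guessing correctly) / (1 - probability of guessing correctly)
    """
    matches = sum(a == g for a, g in zip(annotator_labels, gold_labels))
    raw_accuracy = matches / len(gold_labels)
    return (raw_accuracy - p_chance) / (1 - p_chance)

# Hypothetical handedness task: the gold set has 18 right-handers and 2 left-handers,
# and the annotator gets 19 of the 20 examples right.
gold      = ["right"] * 18 + ["left"] * 2
annotator = ["right"] * 18 + ["left", "right"]

print(normalized_agreement(annotator, gold, p_chance=0.5))  # vs. a coin flip -> 0.9
print(normalized_agreement(annotator, gold, p_chance=0.9))  # vs. always guessing "right" -> ~0.5
```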

3. Random Sampling QA

Once your data is annotated, you can randomly sample subsets of your dataset and re-inspect them for accuracy, allowing you to gauge the general quality of your data. How much sampling you employ depends on your needs and bandwidth, but this technique can go much deeper than simply assessing the quality of the dataset as a whole.

For example, randomly sampling the same number of data points from two different annotators lets you compare their performance. You can also sample different classes to determine which are harder to annotate, or sample subsets of data collected in different ways to see whether the collection method affected quality.
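If your annotations can be exported as simple records, pulling these review samples takes only a few lines. The sketch below assumes hypothetical field names (`file`, `annotator`, `label`); adapt them to your own export format.

```python
import random

# Hypothetical annotation records; adapt the field names to your own export format.
annotations = [
    {"file": f"img_{i:04d}.png",
     "annotator": random.choice(["alice", "bob"]),
     "label": random.choice(["cat", "dog", "bird"])}
    for i in range(1000)
]

def sample_for_review(records, n, field=None, value=None):
    """Randomly sample up to n records, optionally filtered by a field value."""
    pool = [r for r in records if field is None or r.get(field) == value]
    return random.sample(pool, min(n, len(pool)))

# Compare two annotators on equally sized samples.
alice_sample = sample_for_review(annotations, 25, field="annotator", value="alice")
bob_sample   = sample_for_review(annotations, 25, field="annotator", value="bob")

# Or sample a single class to see whether it is harder to annotate.
bird_sample = sample_for_review(annotations, 25, field="label", value="bird")
print(len(alice_sample), len(bob_sample), len(bird_sample))
```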

Annotation pipeline design is an often overlooked yet vital part of the data annotation process, as the composition of your pipeline can significantly impact the cost and quality of your dataset. Incorporating random sampling into your pipeline gives you a foundation of quality control without the upfront work that example datasets require, though it does add time and effort to the overall annotation task.

4. Perform Multiple Passthroughs

Consider having each image (or file) annotated more than once by separate annotators. This may not be necessary for simple tasks which require minimal training. However, as the complexity of the task increases, the effectiveness of having multiple passthroughs will increase as well.

Automated identification of diabetic retinopathy from images is a prototypical example of deep learning applied to medicine, and a perfect instance of where multiple passthroughs are necessary for complex annotations: for trained doctors grading the severity of diabetic retinopathy from photographs, the intra-rater agreement was only around 60% to 70% — with no correlation between agreement and the amount of experience the ophthalmologists had!

In cases where it is impractical to annotate each file multiple times, you can instead perform multiple passthroughs on a small portion of the original data, which still gives you an indication of the overall health of the dataset. Some data annotation platforms, such as DataTorch, let you set the proportion of the dataset you want repeated and will distribute the work amongst your annotators automatically.
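If your tooling does not handle this for you, the idea is easy to approximate yourself: pick a random fraction of files and assign each one to a different annotator than the one who labeled it first. The sketch below is illustrative rather than how any particular platform implements it; the 10% fraction and the assignment scheme are assumptions.

```python
import random

def schedule_second_pass(first_pass, fraction=0.1, seed=0):
    """Queue a random fraction of files for a second pass by a different annotator.

    `first_pass` maps file name -> annotator who did the initial pass.
    Returns a dict mapping file name -> annotator assigned to the repeat pass.
    """
    rng = random.Random(seed)
    annotators = sorted(set(first_pass.values()))
    n_repeat = max(1, int(len(first_pass) * fraction))
    files = rng.sample(sorted(first_pass), k=n_repeat)
    return {f: rng.choice([a for a in annotators if a != first_pass[f]]) for f in files}

# Hypothetical first-pass assignments across three annotators.
first_pass = {f"scan_{i:03d}.jpg": ["alice", "bob", "carol"][i % 3] for i in range(200)}
repeats = schedule_second_pass(first_pass, fraction=0.1)
print(len(repeats), "files queued for a second pass")
```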

5. Calculate Inter-Rater Agreement Metrics from Multiple Passthroughs

Once you have done multiple passthroughs, you can measure inter-rater agreement, which quantifies the degree to which your annotators agree with each other.

This type of metric applies to qualitative annotations that are mutually exclusive (for example, tagging an entire image with metadata, or choosing one of several grades within a single label) and can be calculated with several different formulas, such as Cohen’s Kappa (for two annotators), Fleiss’ Kappa (for more than two annotators), and Krippendorff’s alpha (a more general agreement metric that also handles missing annotations).

We won’t go into the details of calculating these metrics here, as there are plenty of resources online and in the statistical literature. The formulas are closely related to the chance-corrected agreement we calculated above against the example dataset, and can be interpreted in the same way. Generally speaking, a value above 0.8 for a multi-annotator agreement metric indicates high agreement and a healthy dataset for model training.
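For reference, both Cohen’s Kappa and Fleiss’ Kappa are available in common Python libraries (scikit-learn and statsmodels respectively). The ratings below are invented purely for illustration.

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Two annotators labeling the same ten images (invented labels).
rater_a = ["cat", "cat", "dog", "bird", "cat", "dog", "dog", "cat", "bird", "cat"]
rater_b = ["cat", "dog", "dog", "bird", "cat", "dog", "cat", "cat", "bird", "cat"]
print("Cohen's kappa:", cohen_kappa_score(rater_a, rater_b))

# Three annotators: encode labels as integers, build a (subjects x raters) matrix,
# then aggregate it into per-subject category counts for Fleiss' kappa.
rater_c = ["cat", "cat", "dog", "bird", "dog", "dog", "dog", "cat", "bird", "cat"]
to_int = {"cat": 0, "dog": 1, "bird": 2}
ratings = np.array([[to_int[label] for label in r] for r in (rater_a, rater_b, rater_c)]).T
table, _ = aggregate_raters(ratings)
print("Fleiss' kappa:", fleiss_kappa(table))
```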

6. Mark Edge Cases for Review

When setting up your data annotation pipeline, there are various architectures you can implement involving any sequence or combination of in-house annotation teams, crowdsourced/outsourced annotations, trained experts, or automated annotations.

One issue with employing trained experts instead of crowdsourced or outsourced annotators is the increased cost and time they require. One way to reduce this cost while also increasing the quality of your dataset is to design your pipeline so that outsourced annotators perform the initial pass over your dataset and mark edge cases for further inspection by an expert.

Determining what counts as an “edge case” can be done either by thresholding the inter-rater metrics described above or by letting individual annotators flag files. Files marked as edge cases can then either be sent to expert annotators for the final say, or pushed into a queue for additional passthroughs and more rigorous inspection.
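One simple way to express this routing in code is sketched below. The 0.6 agreement threshold and the record fields are assumptions to illustrate the idea, not a prescription.

```python
KAPPA_THRESHOLD = 0.6  # assumed cutoff; tune it to your task

def route_files(files, threshold=KAPPA_THRESHOLD):
    """Split files into an expert-review queue and an accepted set.

    Each record is assumed to carry a per-file agreement score (e.g. derived
    from the multiple-passthrough metrics above) and an optional flag raised
    by an individual annotator.
    """
    expert_queue, accepted = [], []
    for f in files:
        is_edge_case = f["agreement"] < threshold or f.get("flagged", False)
        (expert_queue if is_edge_case else accepted).append(f)
    return expert_queue, accepted

files = [
    {"name": "img_001.png", "agreement": 0.92},
    {"name": "img_002.png", "agreement": 0.41},                   # low agreement
    {"name": "img_003.png", "agreement": 0.88, "flagged": True},  # flagged by an annotator
]
to_expert, accepted = route_files(files)
print([f["name"] for f in to_expert])  # ['img_002.png', 'img_003.png']
```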
