Measuring Labelling Quality with IOU and F1 Score

Isaac Tan
Published in Supa Blog · 5 min read · Mar 24, 2020

Intersection over Union (IOU)

Intersection over Union is an evaluation metric used to measure the accuracy of an annotation on a particular task.

What do we need?

In order to apply IOU to evaluate an annotation done by our fully managed remote workforce of SupaAgents, we need:

  1. Ground Truth bounding boxes, labelled and verified by our BizOps Managers
  2. Annotation data from our SupaAgents

IOU can be computed as the Area of Intersection over the Area of Union.

In the numerator, we compute the area of intersection as the space shared between the SupaAgent’s annotation and the ground-truth bounding box.

The denominator is the area of union, the total area encompassed by both the agent’s annotation and the ground-truth bounding box.

As you can see, Intersection over Union (IOU) is simply a ratio.
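
To make this concrete, here is a minimal sketch of how IOU can be computed for two axis-aligned boxes, assuming each box is given as (x_min, y_min, x_max, y_max). This is only an illustration, not our production code.

```python
def iou(box_a, box_b):
    """Compute Intersection over Union for two boxes given as
    (x_min, y_min, x_max, y_max)."""
    # Coordinates of the intersection rectangle
    x_left = max(box_a[0], box_b[0])
    y_top = max(box_a[1], box_b[1])
    x_right = min(box_a[2], box_b[2])
    y_bottom = min(box_a[3], box_b[3])

    # Boxes that do not overlap have an IOU of 0
    if x_right <= x_left or y_bottom <= y_top:
        return 0.0

    intersection = (x_right - x_left) * (y_bottom - y_top)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union


# Example: a SupaAgent's box vs. a Ground Truth box
print(iou((10, 10, 50, 50), (12, 12, 52, 48)))  # roughly 0.82
```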

How we evaluate the IOU score

In general, if an IOU score is >0.5, it is considered a good annotation. However, this can vary from project to project. Our BizOps team can make adjustments to this threshold during the execution phase of a project, depending on your annotation requirements.

An example of how we interpret IOUs.

As it is almost impossible for our SupaAgents to get the exact (x, y) coordinates that match our Ground Truth annotations, we consider an annotation with an IOU score above 0.8 to be a good annotation.

IOU ensures that our SupaAgents’ annotations match our Ground Truth annotations as closely as possible.

Measuring the Quality of a Task

We have looked at evaluating the quality of a single annotation. Now let’s talk about how we evaluate the quality of a task.

Before evaluating the quality of a task, let’s get some terminologies out of the way:

True Positive (TP): Correctly drawn annotations that have an IOU score of >0.5.

True Negative (TN): The SupaAgent correctly chooses not to draw an annotation where none is required. There is no value here; since no annotation is drawn, there is no way to count true negatives.

False Negative (FN): Annotations that exist in the Ground Truth but were not drawn (missing annotations).

False Positive (FP): These are incorrectly drawn annotations that have an IOU score of <0.5.
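
As a rough sketch of how these counts can be derived, assume agent boxes are greedily matched one-to-one against Ground Truth boxes at an IOU threshold of 0.5, using the iou() helper sketched above. This is illustrative only; the actual matching strategy can differ per project.

```python
def count_tp_fp_fn(agent_boxes, gt_boxes, threshold=0.5):
    """Greedy one-to-one matching of agent boxes against ground truth.
    Returns (true positives, false positives, false negatives)."""
    unmatched_gt = list(gt_boxes)
    tp = fp = 0
    for box in agent_boxes:
        # Find the best still-unmatched ground truth box for this annotation
        best = max(unmatched_gt, key=lambda gt: iou(box, gt), default=None)
        if best is not None and iou(box, best) > threshold:
            tp += 1
            unmatched_gt.remove(best)  # each GT box can only be matched once
        else:
            fp += 1
    fn = len(unmatched_gt)  # GT boxes with no matching annotation
    return tp, fp, fn
```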

We used to measure the performance of our SupaAgents’ work using the accuracy method for its simplicity. Accuracy is the most intuitive way to measure the performance of a task because it is simply a ratio of correctly drawn annotations to the total expected annotations (ground truth).

While it is an incredibly straightforward measurement, it is also the least insightful when it comes to measuring the performance of an annotation task. Most real-life annotation tasks have a severe class imbalance, and accuracy does not take FNs and FPs into account, which can lead to bias or an incorrect conclusion about the quality of a task. There is also no way to calculate true negatives for image annotation.
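
To see why, consider a hypothetical illustration (the numbers are made up): two agents with the same number of correct annotations get the same accuracy, no matter how many extra boxes one of them draws.

```python
total_gt = 25  # total expected (ground truth) annotations

careful_agent = {"tp": 10, "fp": 0}   # draws 10 boxes, all correct
sloppy_agent = {"tp": 10, "fp": 40}   # draws 50 boxes, only 10 correct

for name, agent in [("careful", careful_agent), ("sloppy", sloppy_agent)]:
    accuracy = agent["tp"] / total_gt
    print(name, accuracy)  # both print 0.4 -- accuracy ignores the FPs
```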

Precision and Recall

Precision is the ratio of correctly drawn annotations to the total number of drawn annotations.

For example, suppose a task contains 10 cars, 5 buses, and 10 humans based on Ground Truth, and an agent correctly draws 5 car annotations and 5 human annotations with IOU > 0.8, but incorrectly draws 5 car annotations and 5 human annotations with IOU < 0.5. The precision of that agent will be 0.5.

True Positive = 10

False Positive = 10

Precision = 10/(10+10) = 0.5

Recall is the ratio of correctly drawn annotations to the total number of ground truth annotations.

Using the same example, the recall of that agent will be 0.4.

True Positive = 10

False Negative = 15

Recall = 10/(10+15) = 0.4
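
The same arithmetic as a small snippet, using the counts from the example above:

```python
tp, fp, fn = 10, 10, 15

precision = tp / (tp + fp)  # 10 / 20 = 0.5
recall = tp / (tp + fn)     # 10 / 25 = 0.4
```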

F1 score

Precision and Recall each optimise for very different things. Hence, an F1 score is needed when we want to strike a balance between Precision and Recall.

F1 is the harmonic mean of Precision and Recall and gives a better measure of the incorrect annotation cases.

The F1 score takes both FP and FN into account when measuring the quality of our annotation work. By evaluating a task with F1, we account for the data distribution as well as FPs and FNs.
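
Continuing the example above, the F1 score is the harmonic mean of Precision and Recall:

```python
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 3))  # 2 * 0.5 * 0.4 / 0.9 ≈ 0.444
```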

Ensuring Quality

When a SupaAgent begins a work session, a set of tasks with known answers (Ground Truth tasks) is assigned to them and mixed in with the pool of tasks they are meant to do for our clients.

When a SupaAgent submits a task, our system begins to compute their F1 score.

If a SupaAgent’s F1 score falls below the minimum F1 threshold, the following steps will take place:

  1. The system will stop the SupaAgent from working immediately.
  2. The SupaAgent will get redirected to our SupaTutorial platform for retraining.
  3. The task that they completed during the session will be flagged for inspection by our BizOps Manager(s).

We keep track of our SupaAgents’ F1 scores over the lifetime of a project, so if a SupaAgent consistently makes mistakes, they will be removed from the project.
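
Conceptually, the gate is just a threshold check on that F1 score. The sketch below is purely illustrative; the threshold value and the steps are placeholders for the actual workflow described above.

```python
MIN_F1_THRESHOLD = 0.8  # hypothetical value; tuned per project by BizOps

if f1 < MIN_F1_THRESHOLD:  # f1 computed from the Ground Truth tasks
    # 1. pause the SupaAgent's work session
    # 2. redirect them to SupaTutorial for retraining
    # 3. flag the session's tasks for BizOps review
    ...
```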

Conclusion: Why we do this

We built this accuracy system to help us deliver the highest quality annotation work for our partners. Why? At Supa, we understand the impact of data quality on the effectiveness of machine learning models.

IOU allows us to evaluate the quality of our SupaAgents’ annotations by comparing them to our Ground Truth data sets. The F1 score, in turn, enables us to compute a value that reflects the quality of each project.

This helps us to make swift adjustments and continuously improve the quality of our labelling standards.

Start a test project for free today and discover new ways of improving your labeled data quality. First $50 on us.
