How I found nearly 300,000 errors in MS COCO

Inaccurate labels are a silent tax on computer vision models

Jamie Murdoch
7 min read · Jul 26, 2022

TL;DR — Bad labels are a major problem in AI. I’m building a company to solve this, and have a new approach that finds 10x more label errors than existing work. Click here to contact me. Send me your dataset and I’ll tell you what’s mislabeled! (Version that doesn’t require sharing data coming soon)

About me: I am a UC Berkeley PhD in explainable AI who has spent time at Facebook AI and Google Brain, has been cited 1,200+ times, and founded and sold Clientelligent, an AI startup.

COCO label (solid line) and FIXER correction (dotted) in MS COCO. The COCO label cut off the baseball player’s legs

You’ve gathered a large dataset and trained the latest deep learning architecture to a high accuracy. But your model still isn’t as accurate as you need it to be. What do you do next?

In my conversations with over 50 ML teams, this was a regular scenario. The most effective next step: fixing your dataset. Strong teams know that inaccurate labels lead to decreased model accuracy (“garbage in, garbage out”) and confusion when evaluating models (“was the model wrong, or the data?”). These problems persist regardless of how big the dataset is or how fancy the model.

Unfortunately, finding and fixing label errors is no easy task. AI engineers can spend days manually looking through countless images trying to find and fix bad labels. Needless to say, this is an expensive process that no one enjoys.

To solve this problem, I developed FIXER, a new approach for finding errors in datasets. Rather than manually searching through labels, FIXER uses novel explainable AI techniques to flag potential errors for manual review. On MS COCO object detection, I estimate¹ that FIXER is able to find 273,834 errors, equal to 37% of total annotations, and that 46% of all COCO images contain at least one error. To the best of my knowledge, this is the most errors found in any publicly available machine learning dataset, by a large margin (previous works have estimated an average error rate of 3%).

How I can help

  • If you would like to use FIXER on your computer vision dataset, please contact me. I provide a consulting service: send me your dataset and I’ll send you a cleaned version back.
  • I am also developing Breakpoint, a no-code UI for exploring and improving computer vision datasets using FIXER (without the need to share data). If you would like to be a design partner, or be placed on the waitlist, please sign up here.

How accurate is MS COCO?

MS COCO is one of the most widely used datasets in AI, with over 25,000 citations, 700,000 annotated objects and 118,287 images. It took a considerable amount of time (over 70,000 hours) to create, and the creators had eight labelers examine each image.

You might expect such a carefully built, ubiquitous dataset to have reasonably accurate labels. So it was surprising when FIXER found that almost half of COCO images contain a label error.

FIXER can find multiple error types

The table below summarizes the different types of errors found in COCO.

Different types of errors found in COCO

In detail, the different errors FIXER uncovers include the following (a rough categorization sketch follows the examples):

1 — Background errors: Missing labels that have no overlap with existing labels

Missing annotation in COCO: a car waiting at a train crossing

2 — Overlapping object errors: Missing labels that overlap with existing labels

Missing chair (dotted) overlaps with correctly annotated chair (solid)

3 — Localization errors: Labels with incorrectly drawn bounding boxes

COCO label (solid line) and FIXER correction (dotted). The COCO label cut off the woman’s legs
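FIXER’s underlying method isn’t described in this post, but to make the three categories concrete, here is a minimal sketch of how a flagged box could be bucketed by comparing it against the existing ground-truth boxes. The [x1, y1, x2, y2] box format, the helper names, and the 0.5 IoU threshold are illustrative assumptions of mine, not part of FIXER.

```python
# Illustrative bucketing of a flagged box into the three error types above.
# This is NOT FIXER's method, just a sketch of the taxonomy itself.

def iou(a, b):
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    if inter == 0.0:
        return 0.0
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def categorize(flagged_box, flagged_class, ground_truth):
    """ground_truth is a list of (box, class_name) pairs from the dataset."""
    overlaps = [(iou(flagged_box, box), cls) for box, cls in ground_truth]
    best_iou, best_cls = max(overlaps, default=(0.0, None))

    if best_iou == 0.0:
        return "background error"         # missing label, no overlap at all
    if best_iou >= 0.5 and best_cls == flagged_class:
        return "localization error"       # same object, badly drawn box
    return "overlapping object error"     # missing label on top of another object
```

In this sketch, a flagged box that touches nothing is a background error, one that closely matches an existing box of the same class is a localization correction, and everything in between is a missing object that overlaps an existing one.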

Haven’t other people done this before?

If you’re still reading, you may have thought about this problem before, and even tried other approaches to solve it, such as the popular “confident errors” heuristic. This approach looks for predictions where a very high probability is given to a label other than the provided one, i.e. where the model is “confidently wrong”.
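For concreteness, here’s a minimal sketch of that heuristic in a classification setting. The function name and the toy inputs are my own; the 0.9 threshold mirrors the confidence cut-off in the figure below.

```python
import numpy as np

def confident_errors(probs, given_labels, threshold=0.9):
    """Flag samples where the model assigns very high probability to a class
    other than the provided label, i.e. where it is "confidently wrong"."""
    predicted = probs.argmax(axis=1)
    confidence = probs.max(axis=1)
    return np.where((predicted != given_labels) & (confidence >= threshold))[0]

# Toy example: only the second sample is flagged as a likely label error.
probs = np.array([[0.60, 0.40],
                  [0.02, 0.98],
                  [0.55, 0.45]])
given = np.array([0, 0, 1])
print(confident_errors(probs, given))  # -> [1]
```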

If you apply this popular existing method to COCO, you will certainly find some label errors. But FIXER finds 16 times more errors².

Confidence scores for the confirmed FIXER errors on COCO. Most of the verified FIXER errors have low confidence, and 94% of them have confidence below 0.9

Conclusion

Low-quality labels are a big problem that is hard to solve. In this post, I introduced FIXER, a new methodology for finding label errors in AI datasets, which found 273,834 errors in MS COCO, equal to 37% of the total annotations. While I focused on FIXER’s outputs in this post, I intend to present the underlying methodology in future work.

In ongoing work, I am developing results for other prediction types such as semantic segmentation and image classification. I am also expanding FIXER to help with active learning by searching for hard edge cases. This will cut labeling costs by annotating only the most informative images, and produce more accurate models.

If you’d like to hear about future posts, consider following me on Medium, Twitter or LinkedIn. If this problem intrigues you, I’d love to chat: hello@setbreakpoint.com. We’re also actively looking for design partners and waitlist signups for Breakpoint (our no-code UI for improving models), consulting clients (you share your data, we send back a cleaned version), and founding engineers.

Appendix — nearly 200,000 additional missing labels

It is worth noting that FIXER found an additional 194,582 errors, for a total of 468,416, which I omitted from the headline total due to a quirk of the COCO dataset. COCO contains some images with large crowds of objects (people, sheep, and so on), such as the example below, where a few instances are labelled individually and the rest are covered by a single “crowd” annotation.

A large group of sheep, where 14 are annotated and the rest captured by a larger (dotted) “crowd” annotation.

In instances like this, FIXER is easily able to add the remaining annotations (e.g. the un-annotated sheep above). While these are valid annotations, they aren’t, strictly speaking, errors, so I omitted them from the headline total. Counting crowd regions like ordinary annotations would bring the total to 468,416 errors, equal to 54.5% of the dataset size.

Types of “crowd” errors found in COCO
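For readers who want to poke at this quirk themselves: crowd regions are marked with the iscrowd flag in the COCO annotation files, so they are easy to count with pycocotools. A minimal sketch, assuming a local copy of the train2017 annotation file (the path is a placeholder):

```python
from pycocotools.coco import COCO

# Placeholder path; point it at your local copy of the COCO annotations.
coco = COCO("annotations/instances_train2017.json")

crowd_ids = coco.getAnnIds(iscrowd=True)       # "crowd" regions
instance_ids = coco.getAnnIds(iscrowd=False)   # individually boxed objects
print(f"{len(crowd_ids)} crowd annotations, {len(instance_ids)} instance annotations")

# Images that contain at least one crowd region, like the sheep example above.
crowd_images = {ann["image_id"] for ann in coco.loadAnns(crowd_ids)}
print(f"{len(crowd_images)} images contain a crowd annotation")
```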

Technical notes

[1]: I estimated the error numbers by randomly choosing 200 training images and manually checking each flagged error by hand. This yielded 463 verified errors. Scaling this estimate from 200 images to the full 118,287 images in the training set produces the totals given above.
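Spelled out, the extrapolation in note [1] is a simple per-image scaling (numbers taken from the note above; this is only a point estimate and ignores sampling error):

```python
verified_errors = 463   # errors confirmed by hand in the 200-image sample
sample_images = 200
train_images = 118_287

estimate = verified_errors * train_images / sample_images
print(round(estimate))  # -> 273834
```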

[2]: Of the verified FIXER errors, 94% have confidence below 0.9 and would therefore be missed by a confidence-based heuristic; only 6% are high confidence. 94 divided by 6 is roughly 16.

Additional examples of background errors


Additional examples of overlapping object errors


Additional examples of localization errors



Jamie Murdoch

Founder/CEO of Breakpoint AI. Previously co-founder of Clientelligent (acquired by Canoe), Berkeley PhD in explainable AI, ex-Facebook AI, Google Brain