Trash Bins as a Cheap Alternative to Data Cleaning

Dan Erez
Taranis Tech
Jun 30, 2024

Data

In the world of AI, you quickly learn — Data is King.

Most people interpret that to mean — a lot of data is king.

But data is defined by much more than quantity. We typically consider:

  1. Quantity
  2. Labeler consistency (do the labels draw the line between classes consistently, use the same bounding-box sizing for similar objects, etc.)
  3. Label unambiguity (well-defined classes / subclasses)
  4. Information density (bounding boxes are more informative than classifications for example)
  5. Class balance
  6. Variety / robust representation of the problem space
  7. Other things I’m not thinking of right now 🙈

Generally speaking, a smaller but well-curated dataset will often outperform a huge dataset in accuracy, and it will also allow for much faster and cheaper training and experimentation.

Mislabels

One of the biggest problems datasets face is hidden mislabels: missing labels, or labels with the wrong class or location.

The obvious go-to here, the basis of the data-centric approach, is to search for and fix these issues in the data.

This is super effective and consistently improves model quality — sometimes by significant leaps.

However — it’s expensive, time-consuming and frankly it’s annoying sometimes.

A common alternative (or supplement) to this is to look for data the model had a hard time with (hard example mining), in order to reduce the influence of these bad examples and refocus the model.
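As a sketch of what hard example mining looks like in practice, the snippet below ranks training samples by their per-sample loss and surfaces the hardest ones for review or re-weighting. The function and variable names here are illustrative, not from the article's actual codebase:

```python
# Minimal hard example mining: rank samples by per-sample loss,
# return the hardest ones first. In a real pipeline the losses would
# come from a forward pass with per-sample (unreduced) loss.

def mine_hard_examples(losses, sample_ids, top_k=3):
    """Return the top_k sample ids with the highest loss, hardest first."""
    ranked = sorted(zip(losses, sample_ids), key=lambda pair: pair[0], reverse=True)
    return [sid for _, sid in ranked[:top_k]]

losses = [0.02, 1.7, 0.4, 2.3, 0.1]
ids = ["img_0", "img_1", "img_2", "img_3", "img_4"]
print(mine_hard_examples(losses, ids))  # ['img_3', 'img_1', 'img_2']
```

The hard examples found this way are a mix of genuinely difficult datapoints and mislabeled ones, which is exactly why reviewing them pays off twice.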

Trash Bins

What we have found is that trash bins supercharge this tedious effort.

Trash bins are explicit classes for the mistake prototypes the model produces most often.

By explicitly annotating common mistakes, even in datapoints where the model didn't actually err, we can leverage the core characteristics listed above to drastically improve the model with minimal effort. Remember: it's not just data quantity, it's the quality.

So by explicitly teaching the model (hopefully) well-defined mistake classes, we can:

  1. Give it far more examples from our existing data without much work or additional inference (almost every image has the potential to be confused or mis-inferred).
  2. Provide the model with a much denser information stream (rather than having it implicitly learn what potential mistakes are lurking).
  3. Counter the very small (but very problematic) signal that bad GT labels produce, since we create a much stronger alternative signal for the model to generalize to in those cases.

Our implementation was super simple.

  1. Train a baseline model.
  2. Review the model results.
  3. Group the model’s false predictions into 5–10 common issues (trash bins).
  4. Run a small subset of the images through annotation with these trash bin labels.
  5. Retrain the model while adding these new classes.
  • In our case we were using segmentation maps, so we could add these classes directly to our segmentation model (with some simple logic to deal with overlaps, etc.).
  • If you are using classification, you can use a multi-label approach or a second classifier, for example.
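The steps above leave one detail open: how trash-bin classes coexist with real labels in a segmentation map. Below is a minimal sketch of one plausible overlap rule, where real-object labels win and trash-bin labels only fill background pixels. The class ids, offset, and helper name are assumptions for illustration, not the author's actual implementation:

```python
# Merge trash-bin annotations into a segmentation label map.
# Overlap rule (an assumption): a real object label always wins;
# trash-bin labels only occupy pixels that were background.

BACKGROUND = 0
TRASH_BIN_OFFSET = 100  # trash-bin class ids live above the real classes

def merge_trash_bins(label_map, trash_map):
    """Overlay trash-bin labels onto a label map, pixel by pixel."""
    merged = []
    for label_row, trash_row in zip(label_map, trash_map):
        merged_row = []
        for cls, trash in zip(label_row, trash_row):
            if cls == BACKGROUND and trash != BACKGROUND:
                merged_row.append(TRASH_BIN_OFFSET + trash)
            else:
                merged_row.append(cls)  # real label takes precedence
        merged.append(merged_row)
    return merged

labels = [[0, 1], [0, 0]]   # 1 = a real object class
trash  = [[3, 3], [0, 3]]   # 3 = a trash bin, e.g. "shadow confused as object"
print(merge_trash_bins(labels, trash))  # [[103, 1], [0, 103]]
```

Offsetting the trash-bin ids keeps them trivially separable at evaluation time, so they can be dropped or remapped to background before computing the real-class metrics.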

Result: the model's false-alarm rate dropped drastically.

Naturally, once your model has learned what these mistakes may look like, you have a much more direct way to do hard example mining, as well as to identify your incorrect GT labels.
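One way this plays out in practice: if the model confidently predicts a trash-bin class where the ground truth claims a real object, that label is a strong candidate for re-review. The sketch below assumes (prediction, confidence) pairs and a confidence threshold; all names and the threshold are illustrative:

```python
# Flag suspect GT labels: samples where the model confidently predicts
# a trash-bin class while the ground truth says "real object".
# The 0.8 threshold is an arbitrary assumption for illustration.

def flag_suspect_labels(predictions, gt_labels, trash_bin_classes, min_conf=0.8):
    """Return indices of samples whose GT disagrees with a confident
    trash-bin prediction, i.e. likely label errors."""
    suspects = []
    for i, (pred_cls, conf) in enumerate(predictions):
        if (pred_cls in trash_bin_classes
                and conf >= min_conf
                and gt_labels[i] not in trash_bin_classes):
            suspects.append(i)
    return suspects

preds = [("weed", 0.9), ("shadow_bin", 0.95), ("crop", 0.7), ("glare_bin", 0.5)]
gt = ["weed", "weed", "crop", "crop"]
print(flag_suspect_labels(preds, gt, {"shadow_bin", "glare_bin"}))  # [1]
```

Sample 1 is flagged because the model is highly confident it is looking at the "shadow" mistake prototype while the GT says "weed"; sample 3 is spared only by its low confidence.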
