Cleaning Image Classification Datasets With fastdup and Renumics Spotlight

Daniel Klitzke
6 min read · Sep 4, 2023


tl;dr

Apply our data-cleaning recipe to your own image dataset using these resources:

  1. Data visualization tool Spotlight: https://github.com/Renumics/spotlight
  2. Data curation tool fastdup: https://github.com/visual-layer/fastdup
  3. Interactive demo of data issues detected by fastdup and loaded into Spotlight: https://huggingface.co/spaces/renumics/license-plate-image-classification
  4. Repository/notebook with all necessary code (clone the folder): https://github.com/Renumics/spotlight/blob/main/playbook/stories/cleaning_classification_dataset/clean_classification_dataset.ipynb

Motivation

While common benchmark datasets are often already thoroughly curated, this is mostly not the case in real-world use cases. Typically, you will be facing plenty of issues such as outliers, errors, duplicates, and many more.

Real-world datasets contain many different issues such as blurry or dark images. (Image created by author using Spotlight)

In this tutorial, we want to show you how to clean your image classification dataset efficiently. We will address the following points:

  1. Common image-specific issues like bad exposure or blur.
  2. Outliers in the data that can mess up your training.
  3. Duplicates that can distort your evaluation results.
  4. Inconsistent labels that add noise to your training.
  5. Finding clusters that give you additional insights.

The Tools

For our tutorial, we will be using two open-source libraries, fastdup and Spotlight. But what are these libraries, and why do they work so well together?

fastdup is an open-source library for scalable data curation, offering high-quality detection algorithms for uncovering the most common data problems. It is great for quickly surfacing the most severe issues in your image dataset, and compared to other tools it is very fast at doing so.

fastdup will give you a list of common data issues — fast. (Image created by author using fastdup)

Spotlight is a tool for interactively exploring (unstructured) datasets and evaluating machine learning models. It is great at working with multimodal datasets and uncovering (failure) patterns very efficiently.

Spotlight lets you explore detected issues in the context of the whole dataset, uncovering complex patterns through its interactive visualization capabilities. (Image created by author using Spotlight)

Together, these tools bring you the best of both worlds. fastdup brings you automatic and fast detection of common dataset issues. Spotlight complements this by allowing for an interactive exploration of these detection results, uncovering failure patterns and challenging scenarios.

Performing the Data Cleaning

To follow along or apply our data-cleaning recipe to your own image dataset, check out this folder of the Spotlight GitHub repository. The notebook contains everything you need to adapt this tutorial to your own data in minutes.

The dataset you need to follow along, containing license plates from different US states, is available via Kaggle.

Detecting Image-specific Issues

Image datasets are subject to image-specific issues that can occur, for example, when taking pictures under challenging environmental conditions. Dark or blurry images are typical examples.

To detect those and also the other issues automatically using fastdup, we only need a few lines of code:

import fastdup

fd = fastdup.create(input_dir=INPUT_DIR) # INPUT_DIR points to the image folder
fd.run(annotations=df) # detect data issues using fastdup
_, embeddings = fd.embeddings() # save the generated embeddings to a variable

fastdup will do two important things here:

  1. Scan the data for all kinds of issues such as outliers, duplicates, and many more.
  2. Calculate embeddings that can be used by Spotlight for in-depth analysis of the detected issues.
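If you want to adapt this to your own dataset, the annotations dataframe passed to fd.run above is simply a pandas DataFrame mapping image paths to labels. Here is a minimal sketch of how you could build it, assuming a folder layout with one subfolder per class; the 'filename' and 'label' column names follow the convention from the fastdup documentation, everything else (paths, layout) is illustrative:

# Sketch: build the annotations dataframe for fd.run, assuming one subfolder
# per class, e.g. INPUT_DIR/<state>/<image>.jpg. Adjust paths to your dataset;
# 'filename' and 'label' follow the fastdup annotation convention.
import os
import pandas as pd

INPUT_DIR = "data/license-plates"  # hypothetical path

records = []
for label in os.listdir(INPUT_DIR):
    class_dir = os.path.join(INPUT_DIR, label)
    if not os.path.isdir(class_dir):
        continue
    for image_name in os.listdir(class_dir):
        records.append({"filename": os.path.join(class_dir, image_name), "label": label})

df = pd.DataFrame(records)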

fastdup also offers the option to generate static HTML reports with a one-liner:

fd.vis.stats_gallery(metric="blur")

This leads to the following result:

Static fastdup HTML reports already give you an overview of the most severe issues in the dataset.
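Besides blur, the same one-liner can, if I read the fastdup documentation correctly, rank images by brightness-related statistics as well; treat the metric names below as something to verify against your fastdup version:

fd.vis.stats_gallery(metric="dark")   # darkest images first (metric name to be verified)
fd.vis.stats_gallery(metric="bright") # brightest images first (metric name to be verified)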

But what can we do additionally using Spotlight? Basically, Spotlight helps you uncover patterns that are not visible when you go through the detected issues in a list view. It lets you view the issues in the context of the whole dataset with a few lines of code:

from renumics import spotlight

spotlight.show(
    dataframe,
    dtype={"embedding": spotlight.Embedding, "filename": spotlight.Image},
    issues=issues,
    layout="layout.json",
)
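For the call above to work, the fastdup embeddings need to end up as a column of the dataframe handed to Spotlight, and the issues list is built from the fastdup results as shown in the tutorial notebook. A minimal sketch for the embedding column using plain pandas; it assumes the embedding rows are in the same order as the rows of the annotations dataframe:

# Sketch: attach the fastdup embeddings to the annotations dataframe so that
# Spotlight can use them for its similarity map. Assumes the embedding rows
# are aligned with the rows of the annotations dataframe df.
dataframe = df.copy()
dataframe["embedding"] = list(embeddings)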

You can then interactively investigate the patterns present using visualizations such as histograms and dimensionality reduction plots:

Spotlight will let you uncover patterns behind the individual detected images by putting them into a bigger context.

For example, in this case, Spotlight can help you answer the following questions:

  1. Where are clusters of images taken under challenging conditions?
  2. Are the challenging conditions associated with specific classes?
  3. Are there any conditions the scalar features do not sufficiently capture?
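If you prefer to query such cases programmatically, the per-image statistics fastdup computes for the galleries are also available as a dataframe. A rough sketch; the column names and thresholds are assumptions and should be checked against the actual fd.img_stats() output on your data:

# Sketch: pull out suspicious images from fastdup's per-image statistics.
# Column names ('blur', 'mean') and thresholds are assumptions.
stats = fd.img_stats()
blurry = stats[stats["blur"] < 50]  # a low blur score means a blurrier image
dark = stats[stats["mean"] < 40]    # a low mean brightness means a dark image
print(f"{len(blurry)} blurry and {len(dark)} dark images found")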

Detecting Outliers

Another typical data problem is outliers. These can either be errors, e.g., caused by a broken camera setup, or underrepresented but important edge cases that you want to include in your training.

With fastdup, getting all the outliers is again as easy as calling:

outliers_df = fd.outliers() # for getting a dataframe (kept separate from the annotations dataframe)

fd.vis.outliers_gallery() # for getting a static report

Visualizing the data in Spotlight will provide you with the following view:

Outliers visualized in an Interactive Spotlight view. (Image created by author)

Spotlight can help you investigate questions such as:

  1. How are the outliers distributed across classes?
  2. Where are clusters of outliers that share similar properties?
  3. Are the outliers fastdup detects in the image data explainable via metadata you might have?
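To dig into these questions in Spotlight, it helps to flag the detected outliers directly in the dataframe you pass to spotlight.show. A minimal sketch, assuming the outlier dataframe exposes the affected image paths in a 'filename_outlier' column (check the columns fd.outliers() actually returns):

# Sketch: add a boolean outlier flag to the Spotlight dataframe. The
# 'filename_outlier' column name is an assumption about the outliers output.
outlier_files = set(outliers_df["filename_outlier"])
dataframe["is_outlier"] = dataframe["filename"].isin(outlier_files)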

Detecting Duplicates

Duplicates are also something you will find frequently in real-world datasets. They can cause a number of problems, including skewing your data distribution if they are too frequent and distorting your evaluation results towards being too optimistic. The latter happens when duplicated samples end up in both your train and test split. Note that there are two types of duplicates you might deal with:

  1. Exact duplicates, which you could catch, e.g., by using hash functions (see the sketch after this list).
  2. Near duplicates that are not identical but very similar to other images, which you can catch using embeddings.
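For exact duplicates, a plain file hash is already enough and does not even need fastdup. A minimal, self-contained sketch using only the standard library:

# Sketch: find byte-identical images by hashing the file contents.
import hashlib
from collections import defaultdict

def find_exact_duplicates(filenames):
    groups = defaultdict(list)
    for path in filenames:
        with open(path, "rb") as f:
            digest = hashlib.md5(f.read()).hexdigest()
        groups[digest].append(path)
    return [paths for paths in groups.values() if len(paths) > 1]

duplicate_groups = find_exact_duplicates(df["filename"])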

fastdup can show you potential duplicates as follows:

fd.similarity() # for getting a dataframe of similar image pairs

fd.vis.duplicates_gallery() # for getting a static HTML report

Visualizing the results in Spotlight will look as follows:

Visualization of exact and near duplicates using Spotlight.

Spotlight will help you answer questions such as:

  1. Are there data slices containing a large number of duplicates?
  2. Can you manually identify larger clusters of near duplicates?
  3. Are certain metadata attributes explanatory for certain types of duplicates?
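The most damaging case mentioned above, duplicates that span your train and test split, can also be checked programmatically once you have the similarity pairs and a split assignment. A sketch; the 'filename_from'/'filename_to' column names of the similarity dataframe and the 'split' column in the Spotlight dataframe are assumptions to adapt to your data:

# Sketch: find near-duplicate pairs that cross the train/test boundary.
# Column names ('filename_from', 'filename_to', 'split') are assumptions.
similarity_df = fd.similarity()
split_of = dict(zip(dataframe["filename"], dataframe["split"]))
leaking_pairs = similarity_df[
    similarity_df["filename_from"].map(split_of)
    != similarity_df["filename_to"].map(split_of)
]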

Detecting Label Inconsistencies

Label inconsistencies are also very frequent in real-world use cases and can be a real pain, messing up your training and evaluation. With fastdup, you can try to detect some of them by looking at very similar images that have different labels. See the tutorial notebook for how to do this.
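The idea boils down to joining the similarity pairs with the labels and keeping the pairs whose labels disagree. A rough sketch; as before, the column names and the similarity threshold are assumptions to verify against the actual fastdup output:

# Sketch: flag pairs of very similar images that carry different labels.
# Column names and the 0.9 threshold are assumptions.
label_of = dict(zip(dataframe["filename"], dataframe["label"]))
similarity_df = fd.similarity()  # same pairs as in the duplicates section
very_similar = similarity_df[similarity_df["distance"] > 0.9]
suspicious = very_similar[
    very_similar["filename_from"].map(label_of)
    != very_similar["filename_to"].map(label_of)
]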

Again, after automatically detecting these issues with fastdup, you can review them in Spotlight:

Visualization of potential label inconsistencies using Spotlight.

Spotlight will help you answer questions such as:

  1. Are the detected label inconsistencies true inconsistencies?
  2. Are the label inconsistencies especially present in certain clusters or classes?
  3. Are there ways to filter or correct inconsistencies automatically?

Detecting Clusters

This section is not necessarily about detecting issues in your data. However, finding clusters and thereby getting a feel for the structure of your dataset can help you make informed decisions during your machine learning experimentation process and beyond. There are also certain issues that you can uncover using cluster analysis, for example an unwanted bias towards a certain type of license plate.

fastdup will show you clusters it found in the data as follows:

fd.connected_components() # for getting a list of clusters

fd.vis.component_gallery() # for visualizing clusters
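To look at the clusters in Spotlight alongside all other metadata, you can merge the component assignment into the main dataframe. A sketch, assuming the connected-components output is a dataframe with 'filename' and 'component_id' columns (verify against your fastdup version):

# Sketch: add fastdup's cluster assignment as a column of the Spotlight
# dataframe. The 'filename'/'component_id' column names are assumptions.
components_df = fd.connected_components()
dataframe = dataframe.merge(
    components_df[["filename", "component_id"]], on="filename", how="left"
)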

Interactively analyzing clusters in Spotlight looks as shown below:

Visualization of clusters in the data using Spotlight.

Here, Spotlight can simply help you build an understanding of the dataset in a really intuitive way, using structured as well as unstructured data.

Conclusion

fastdup and Spotlight can be a powerful combination. Especially the fast, automatic detection of issues without much manual effort, combined with the possibility to generate insights into failure patterns when needed, is something you want to leverage in your machine learning training process.

If you want to learn more, definitely check out the Spotlight Repository and the fastdup Repository on GitHub.

