How to Automatically Find and Remove Issues in Your Image, Audio, and Text Classification Datasets

Daniel Klitzke
7 min read · Sep 18, 2023
Finding issues in unstructured data such as images, audio, or text can be difficult and require a great amount of ramp-up and manual effort. However, there are tools that can help you optimize your search for issue clusters using a combination of automatic issue detection and interactive visualization. (Image created by Author using Spotlight)

tl;dr

Combining automatic data issue detection tooling with interactive machine learning data visualization tools can significantly reduce the time you spend identifying problems in your training and evaluation data.

Introduction

I’m sure you know the struggle of training an image classification model whose accuracy metric is stuck and simply won’t improve anymore. At least in my first practical projects, this was a common experience, and back then I would spend weeks trying to improve my model by tweaking its architecture, hyperparameters, and training scheme. I can tell you that this was often not successful because I really should have focused on the data.

Especially if you are working on real-world industry use cases, your data will contain a bunch of issues that:

  1. Will confuse your model in training
  2. Will skew your evaluation to being less meaningful

But what are those data issues we are talking about here? In practice, there can be a lot of use case-specific issues, e.g., a camera malfunction or data being duplicated when copying it to a hard drive. In data science terms, those issues boil down to the following problems we can try to detect in the data:

  1. Outliers, which might be errors that confuse your model or informative edge cases you should take care of (a rough way to flag outliers and duplicates is sketched after this list).
  2. Label Inconsistencies, which will either confuse your model or mess up your evaluation.
  3. Unwanted Biases, which will cause your model to perform unexpectedly in evaluation or production.
  4. Duplicates, which can potentially make your evaluation look better than it actually is.
  5. … and many more.
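
Many of these checks can be run directly on embeddings of your data. As a rough illustration (not the implementation of the tools introduced below), here is a minimal sketch that flags near-duplicate and outlier candidates from a hypothetical embedding matrix using scikit-learn:

import numpy as np
from sklearn.neighbors import NearestNeighbors

# Hypothetical embedding matrix: one row per sample, e.g., computed with an image model.
embeddings = np.random.rand(1000, 512)

# Distance of every sample to its nearest other sample.
nn = NearestNeighbors(n_neighbors=2).fit(embeddings)
distances, _ = nn.kneighbors(embeddings)
nearest_dist = distances[:, 1]  # column 0 is the distance of each sample to itself (0.0)

# Heuristic: very close pairs are duplicate candidates, very isolated samples outlier candidates.
duplicate_candidates = np.where(nearest_dist < np.quantile(nearest_dist, 0.01))[0]
outlier_candidates = np.where(nearest_dist > np.quantile(nearest_dist, 0.99))[0]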

How to solve these issues?

So you may now ask yourself, what am I supposed to do to get rid of these issues? My first really important piece of advice here is:

Look At Your Data!

But what does this actually mean in practice? For me, this usually means being able to:

  1. Visualize the raw data, as well as the data going in and out of the model (features and predictions).
  2. Leverage mechanisms to identify patterns in the correlations between model inputs and model predictions.

Doing this for unstructured data can be a lot more bothersome than for structured data. One reason is that some data types require special visualizations, such as spectrograms or audio players for audio data. Another is that the data can be large and cannot all be kept in memory, requiring mechanisms such as lazy loading for analyzing and visualizing it. Also, the unstructured nature of the data makes analyzing and filtering it challenging, often requiring a transformation into representations that allow for filtering, clustering, and comparing samples in the first place.
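
For example, a minimal lazy-loading pattern in Python keeps only one image in memory at a time (the directory path and file pattern here are placeholders):

from pathlib import Path
from PIL import Image

def iter_images(image_dir):
    """Yield (path, image) pairs one at a time instead of loading the whole dataset into memory."""
    for path in sorted(Path(image_dir).glob("*.png")):
        with Image.open(path) as img:
            yield path, img.copy()  # copy so the underlying file handle can be closed

# Iterate lazily, e.g., to compute features or thumbnails on the fly.
for path, img in iter_images("data/images"):
    print(path, img.size)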

So how do we achieve this as quickly as possible? In my experience, it is best to perform two steps in sequence (a sketch of the first step follows after the list):

  1. Enrich the data with embeddings to make it navigable and automatically detect data clusters where a model performs badly.
  2. Visualize and review the detected data issues while trying to recognize patterns in the data. Those can then be used for further data cleaning.
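
The tooling introduced below automates both steps, but to make the first one tangible: one common option (among many) for enriching image data with embeddings is a pretrained CLIP model via the sentence-transformers library. The file paths below are placeholders:

from PIL import Image
from sentence_transformers import SentenceTransformer

# Pretrained CLIP model that maps images into a vector space suitable for clustering.
model = SentenceTransformer("clip-ViT-B-32")

image_paths = ["img1.png", "img2.png"]  # placeholder paths
images = [Image.open(p) for p in image_paths]

# One embedding vector per image; these make the data navigable and comparable.
embeddings = model.encode(images)
print(embeddings.shape)  # e.g., (2, 512)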

But how do we actually achieve this in practice? For our tutorial, we will use the two Python libraries sliceguard and Spotlight. sliceguard is a library for detecting, with only a few lines of code, data clusters that machine learning models struggle with. Spotlight can be used to interactively visualize unstructured data such as images, audio, or text and to recognize patterns using its rich visualizations.

You will find concrete code examples for all of these modalities in the next section.

Concrete Examples for Images, Audio and Text

Note: You can find an overview of these and even more code snippets in the Renumics Spotlight repository. Those are more likely to be up to date if any interfaces change. If you don’t need much explanation, just copy them from there or check the tl;dr.

To follow along with the examples, first install all the necessary dependencies by running the following command:

pip install renumics-spotlight sliceguard[all] scikit-learn

Now create a Python script or a new Jupyter notebook and add all the necessary imports:

from renumics import spotlight
from sliceguard import SliceGuard
from sliceguard.data import from_huggingface
from sklearn.metrics import accuracy_score

You now have everything you need to detect problematic data slices in your dataset, no matter whether it contains images, audio files, or text. In the sections below, I added an example for each of the modalities.

Image Example

Load the image classification dataset by running the following code:

# Load an Example Dataset as DataFrame
df = from_huggingface("Matthijs/snacks")

# DataFrame Format:
# +-------------------+---------+
# | image             | label   |
# +-------------------+---------+
# | /path/to/img1.png | popcorn |
# | /path/to/img2.png | muffin  |
# | /path/to/img3.png | cake    |
# | ...               |         |
# +-------------------+---------+

Now, run sliceguard’s automatic issue detection algorithm and visualize the results in Spotlight by executing the following:

# Detect Issues Using sliceguard
sg = SliceGuard()
issues = sg.find_issues(df, features=["image"], y="label", metric=accuracy_score)
report_df, spotlight_data_issues, spotlight_dtypes, spotlight_layout = sg.report(
    no_browser=True
)

# Visualize Detected Issues in Spotlight:
spotlight.show(
    report_df,
    dtype=spotlight_dtypes,
    issues=spotlight_data_issues,
    layout=spotlight_layout,
)

This will generate the following visualization of problematic data clusters, which you can explore interactively:

You will be able to review potential issues cluster-wise. In this case, the cluster shows two pictures of waffles in which the model is confused by the rich decorations. A measure could now be to collect more images of this kind to make the model more robust. (Image created by Author using Spotlight)

But what happens behind the scenes? Basically, this code does the following:

  1. Sliceguard will first calculate embeddings for the image column to generate meaningful representations for comparing your images.
  2. It will then train a model on those embeddings and the provided labels, subsequently generating predictions for the whole dataset.
  3. Sliceguard will then run a hierarchical clustering algorithm on the embeddings to find groups of images that share similar characteristics, e.g., all images containing not only food but also a person or all images that appear relatively dark.
  4. After that, it will calculate the provided metric (accuracy) for all the found clusters, labeling those as potential issues that are significantly worse than the overall accuracy.

If everything goes well, this should usually result in the identification of clusters that share similar properties and that many machine learning models will struggle with. As mentioned before, this can have several causes, such as the data being underrepresented or labeled inconsistently, among others.
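
To make the four steps above concrete, here is a condensed sketch of the general pattern. This is a simplified stand-in using scikit-learn, not sliceguard’s actual code, and it assumes an embedding matrix and labels are already available:

import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_predict

# Hypothetical inputs: embeddings (step 1) and ground-truth labels.
embeddings = np.random.rand(500, 512)
labels = np.random.randint(0, 5, size=500)

# Step 2: train a simple model on the embeddings and generate predictions for every sample.
predictions = cross_val_predict(LogisticRegression(max_iter=1000), embeddings, labels, cv=5)

# Step 3: cluster the embeddings into candidate groups.
clusters = AgglomerativeClustering(n_clusters=20).fit_predict(embeddings)

# Step 4: compute the metric per cluster and flag clusters far below the overall score.
overall = accuracy_score(labels, predictions)
for c in np.unique(clusters):
    mask = clusters == c
    score = accuracy_score(labels[mask], predictions[mask])
    if score < overall - 0.1:  # arbitrary threshold, for illustration only
        print(f"Potential issue: cluster {c} with accuracy {score:.2f} (overall {overall:.2f})")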

The review of those clusters, as mentioned before, can then be performed in Renumics Spotlight and usually contains the following steps:

  1. Decide whether a detected issue is really an issue you should care about, e.g., is there an inconsistency or error in the data, or is the cluster simply hard to learn but still within the bounds of what to expect in a production setting?
  2. Try to identify a pattern of why this issue is occurring, e.g., the images are all dark because they were taken in a specific shooting setting that had challenging conditions.
  3. Decide on an action to mitigate the issue, e.g., removing an outlier or collecting more data for an edge case (a minimal removal sketch follows below).
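
If the decision is to remove samples, the removal itself is just a pandas operation. The indices below are placeholders for whatever you flagged during review:

# Hypothetical row indices of samples flagged for removal during review.
bad_indices = [12, 87, 345]

# Drop the flagged rows from the tutorial DataFrame and keep a clean copy.
df_clean = df.drop(index=bad_indices).reset_index(drop=True)
print(f"Removed {len(df) - len(df_clean)} samples.")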

Audio Example

Luckily, to adapt the code to audio data, we basically don’t have to change anything. For our example, download the dataset as follows:

# Load an Example Dataset as DataFrame
df = from_huggingface("renumics/emodb")

# DataFrame Format:
# +---------------------+---------+
# | audio               | emotion |
# +---------------------+---------+
# | /path/to/audio1.wav | joy     |
# | /path/to/audio2.wav | anger   |
# | /path/to/audio3.wav | joy     |
# | ...                 |         |
# +---------------------+---------+

For issue detection and review, simply run the following:

# Detect Issues Using sliceguard
sg = SliceGuard()
issues = sg.find_issues(df, features=["audio"], y="emotion", metric=accuracy_score)
report_df, spotlight_data_issues, spotlight_dtypes, spotlight_layout = sg.report(
    no_browser=True
)

# Visualize Detected Issues in Spotlight:
spotlight.show(
    report_df,
    dtype=spotlight_dtypes,
    issues=spotlight_data_issues,
    layout=spotlight_layout,
)

This will leave you with the following view, in which you can again review the detected issues and find patterns that inform concrete actions:

For reviewing Issues in audio data, you can leverage visualizations such as an audio player and spectrogram. In this case, you can, for example, detect that distinguishing emotions such as fear and anger can be quite hard for the model. (Image created by Author using Spotlight)

Text Example

The same goes for text. Almost no change is needed; just adapt the names of the data and label columns. For the example, download the dataset as follows:

# Load an Example Dataset as DataFrame
df = from_huggingface("dair-ai/emotion")

# DataFrame Format:
# +-------+-------+
# | text  | label |
# +-------+-------+
# | text1 | joy   |
# | text2 | anger |
# | text3 | joy   |
# | ...   |       |
# +-------+-------+

Then run the following code for detecting and reviewing issues:

# Detect Issues Using sliceguard
sg = SliceGuard()
issues = sg.find_issues(df, features=["text"], y="label", metric=accuracy_score)
report_df, spotlight_data_issues, spotlight_dtypes, spotlight_layout = sg.report(
    no_browser=True
)

# Visualize Detected Issues in Spotlight:
spotlight.show(
    report_df,
    dtype=spotlight_dtypes,
    issues=spotlight_data_issues,
    layout=spotlight_layout,
)

The generated view will look as follows:

For text data, you will be able to review the text along with the labels and model predictions. Note that Spotlight can also render HTML in case your use case requires it, e.g., marking named entities or other visualizations. (Image created by Author using Spotlight)

Conclusion

Going beyond global evaluation metrics such as accuracy is a valuable way to gain additional insights into the issues present in your dataset, enabling you to derive concrete actions for iterating on your data and model. A workflow of automatic detection followed by interactive manual review can significantly speed up this process. Tools such as sliceguard and Spotlight can be used without much setup and come in handy when you want to get started quickly and it is not worth developing your own tooling.
