Introducing VL Datasets

Open and Free Clean Image Datasets - A Game-Changer for Generative AI and Beyond.

Published in Visual Layer · 5 min read · Jun 19, 2023

AI-generated image by chandlervid85 on Freepik.

TL;DR: Generative AI models rely heavily on clean, large-scale image datasets. However, issues such as duplicates and mislabels plague existing datasets like LAION-1B. To combat this, we introduce VL Datasets, a collection of clean datasets for Visual AI applications, created using Visual Layer’s newly introduced VL Profiler. We are releasing cleaned-up versions of some of the most popular visual datasets, such as LAION-1B and ImageNet-21K. These datasets, which we dub VL Datasets, are accessible for free via the visuallayer Python SDK or the VL Profiler UI, supporting more robust and reliable AI model development.

Introduction

Generative AI, a technology that has revolutionized art and design, among other fields, has gained significant popularity in recent years. This technology, however, relies heavily on high-quality, large-scale image datasets to train robust models. Whether you’re a researcher, student, data scientist, or AI enthusiast, you’ve likely encountered challenges with messy image datasets in Generative AI or other AI applications involving visual data.

These challenges, such as duplicated images, mislabeled images, or outliers, can significantly reduce the reliability of your models, waste storage and compute resources, and take valuable time and effort to clean up manually ¹.

Revisiting The Problem

Handling large-scale image datasets is complicated. Common issues such as mislabels, duplicates, and outliers can compromise up to 46% of a dataset, leading to suboptimal models and wasted resources ¹.

Despite the critical role of these datasets in diverse applications, the tools for managing visual data often lack the necessary features, and teams are compelled to write their own custom tools ¹.

We analyzed several widely used open-source academic datasets, such as ImageNet-1K, ImageNet-21K, and LAION-1B, and found that they are plagued with various quality issues.

In LAION-1B, we found that nearly 105M images have quality issues. For instance, more than 90M images have duplicate copies, more than 7M images are blurry or low quality, and more than 6M images are considered outliers.

LAION-1B issues: There are 2.4M duplicated broken images.

LAION-1B: There are 72.5K almost exact duplicates of striped images.

LAION-1B: There are approximately 235K images that are completely dark, completely bright, or only contain a single color.

LAION-1B: There are approximately 900K outliers found showing strange objects and scenes.

VL Datasets: Redefining Image Datasets

To address these challenges, we introduce VL Datasets — a collection of clean computer vision datasets.

VL Datasets are designed to minimize, if not eliminate, issues such as duplicates, mislabels, outliers, data leakage, and more, transforming the way you engage with your datasets and, ultimately, improving your model’s performance ².

Among the VL Datasets we’re proud to announce today is a cleaned-up version of LAION-1B, a dataset that is widely used for training generative AI models such as Stable Diffusion and Midjourney.

The following table summarizes the issues found by analyzing various commonly used datasets.


You can view a detailed breakdown of issues for each dataset on our documentation page and download the CSV files listing all problematic images for free.
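As a rough illustration of how such a CSV could be used, the sketch below drops every listed image from a list of file paths before training. The column names (`filename`, `reason`) and the sample rows are assumptions made for this example; check the header of the file you download and adjust accordingly.

```python
import csv
import io


def load_flagged(csv_file, filename_col="filename"):
    """Return the set of filenames listed in an issues CSV.

    `filename_col` is a hypothetical column name; adapt it to the real file.
    """
    reader = csv.DictReader(csv_file)
    return {row[filename_col] for row in reader}


def drop_flagged(paths, flagged):
    """Keep only the paths that are not listed as problematic."""
    return [p for p in paths if p not in flagged]


# Tiny in-memory stand-in for a downloaded CSV (illustrative rows only).
sample_csv = io.StringIO(
    "filename,reason\n"
    "train/img_002.jpg,Duplicate\n"
    "train/img_004.jpg,blur\n"
)

all_images = [f"train/img_{i:03d}.jpg" for i in range(1, 6)]
flagged = load_flagged(sample_csv)
clean_images = drop_flagged(all_images, flagged)
print(clean_images)
```

With a real file, replace the `io.StringIO` object with `open(path, newline="")`.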

How were VL Datasets Created?

VL Datasets were created with Visual Layer’s flagship tool, VL Profiler. This tool, based on usage from 220K users worldwide, uses a graph-based engine to identify and cluster data anomalies and quality issues, eliminating the need to handle problematic images one at a time ³.
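To give a feel for what graph-based clustering of issues can look like, here is a minimal sketch. It is not VL Profiler's implementation, which this post does not describe; it only assumes that some earlier step has already produced pairs of near-identical images (for example, by comparing embeddings). Images are treated as nodes, similar pairs as edges, and connected components as duplicate clusters.

```python
from collections import defaultdict


def cluster_by_similarity(n_items, similar_pairs):
    """Group items into connected components of the similarity graph.

    Uses a union-find structure; returns only clusters with 2+ members.
    """
    parent = list(range(n_items))

    def find(x):
        # Follow parent links to the root, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in similar_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = defaultdict(list)
    for i in range(n_items):
        groups[find(i)].append(i)
    return sorted(g for g in groups.values() if len(g) > 1)


# Hypothetical input: 7 images, with pairs flagged as near-duplicates.
pairs = [(0, 1), (1, 2), (4, 5)]
clusters = cluster_by_similarity(7, pairs)
print(clusters)
```

Working on clusters rather than individual images means one decision (keep one copy, drop the rest) can be applied to a whole group at once.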

Dataset issues displayed in VL Profiler. The quality score is shown at the top right; a score of 100 indicates perfect quality.

If you’d like to try VL Profiler on your own dataset and uncover potential issues, sign up for free or visit our documentation to learn more.

Getting Started with VL Datasets

You can access VL Datasets using the open-source visuallayer Python SDK or using VL Profiler, a user interface that runs in the browser.

To install the SDK, run:

pip install visuallayer

To load a dataset, you only need two lines of code:

import visuallayer as vl
dataset = vl.datasets.zoo.load('vl-imagenet-1k')

Once a dataset is loaded you can view the metadata with:

dataset.info

Metadata:
--> Name - vl-imagenet-1k
--> Description - A modified version of the original ImageNet-1k dataset removing dataset issues.
--> Homepage URL - https://www.image-net.org/
--> Number of Images - 1431167
--> Number of Images with Issues - 15997

Get a report showing a breakdown of the issues:

dataset.report

| reason     | count | pct   |
|------------|-------|-------|
| Duplicate  | 7522  | 0.565 |
| dark       | 3174  | 0.238 |
| blur       | 2478  | 0.186 |
| Mislabeled | 1480  | 0.111 |
| Outlier    | 1199  | 0.090 |
| leakage    | 869   | 0.065 |
| bright     | 770   | 0.058 |
| Total      | 17492 | 1.313 |
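The pct column appears to be each count expressed as a percentage of the number of images. The sketch below recomputes it from the counts above. The denominator is an assumption: the published percentages are reproduced by a total of 1,331,167 images (the ImageNet-1K train and validation splits combined), not by the 1,431,167 shown in the metadata, so treat that constant as a guess about how the report was computed. The Total row (17,492) is the sum of the per-reason counts; it is larger than the 15,997 images with issues in the metadata, which would be consistent with some images carrying more than one issue.

```python
# Counts copied from the report table above.
issue_counts = {
    "Duplicate": 7522,
    "dark": 3174,
    "blur": 2478,
    "Mislabeled": 1480,
    "Outlier": 1199,
    "leakage": 869,
    "bright": 770,
}

# Assumption: 1,281,167 training images + 50,000 validation images.
ASSUMED_TOTAL_IMAGES = 1_331_167


def issue_percentages(counts, total_images):
    """Return each count as a percentage of total_images, rounded to 3 places."""
    return {reason: round(100 * n / total_images, 3) for reason, n in counts.items()}


pct = issue_percentages(issue_counts, ASSUMED_TOTAL_IMAGES)
total_flags = sum(issue_counts.values())

for reason, n in issue_counts.items():
    print(f"{reason:<12}{n:>7}{pct[reason]:>8.3f}")
print(f"{'Total':<12}{total_flags:>7}{sum(pct.values()):>8.3f}")
```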

You have two options to explore the issues in the dataset.

Option 1: Explore using VL Profiler, which offers extensive visualizations of your dataset and its issues.

VL Profiler: Issues found in the MS-COCO dataset.

Option 2: Explore in a Jupyter notebook using the SDK.

To explore the issues in a Jupyter notebook, run:

dataset.explore()

This outputs an interactive table that lets you view, search, and sort the issues as shown below.

Exploring the issues in Jupyter notebook.

You can also export the dataset for training in PyTorch:

# Export to PyTorch
train_dataset = dataset.export(output_format='pytorch', split='train')

# PyTorch training loop
...

View a complete example from loading a dataset to training in PyTorch and fast.ai.

We created VL Datasets to be free and open-source as a way of giving back to the community. VL Datasets can be used as a starting point to train your machine learning model on clean datasets.

Conclusion

In the rapidly evolving field of AI, we believe that resources like VL Datasets are critical for driving advancements in Generative AI and other applications. By making the management of large-scale image datasets more efficient and effective, we hope to pave the way for more robust and reliable AI models.

We invite you to join us in our journey to enhance the quality of large image datasets and unlock the full potential of your data. To get started, visit our GitHub repository or join our Slack channel with any questions.