Data issues you need to solve in most available computer vision datasets

Anyverse™
Jan 24, 2024

Most available computer vision datasets come with data issues you need to address if you want to generate data that actually improves your visual perception model’s performance.

As we introduced in the previous article of this series on datasets for computer vision, some of these issues apply regardless of the nature of the dataset, others apply specifically to datasets built from real-world data, and others to synthetic datasets.

Moreover, the source and technology with which the data was captured or generated, and how it was subsequently processed, largely determine which of the problems discussed below will appear.

Common data issues in datasets for computer vision

1. Limited size and diversity

Many datasets are limited in size and diversity. This limitation can affect the ability to train models that generalize well to real-world scenarios.
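A quick first diagnostic is to quantify how diverse a dataset actually is before training on it. Below is a minimal sketch, assuming a COCO-style annotation file (the file name is a placeholder), that computes per-class counts and a normalized-entropy balance score:

```python
# Minimal sketch: audit class balance in a COCO-style annotation file.
# "annotations.json" is a placeholder path, not a fixed convention.
import json
from collections import Counter
from math import log

with open("annotations.json") as f:
    coco = json.load(f)

id_to_name = {c["id"]: c["name"] for c in coco["categories"]}
counts = Counter(id_to_name[a["category_id"]] for a in coco["annotations"])

total = sum(counts.values())
# Normalized entropy: 1.0 = perfectly balanced classes, near 0 = one class dominates.
entropy = -sum((n / total) * log(n / total) for n in counts.values())
balance = entropy / log(len(counts)) if len(counts) > 1 else 0.0

for name, n in counts.most_common():
    print(f"{name:20s} {n:8d} ({100 * n / total:.1f}%)")
print(f"class balance score: {balance:.2f}")
```

A low balance score, or a long tail of classes with only a handful of instances, is an early warning that the trained model will not generalize evenly.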

2. Perception domain gap

One significant challenge in dataset creation is addressing the domain gap for perception tasks. This refers to the disparity between the characteristics of the sensors used to capture the data in the dataset and those of the sensors the perception system will use in production, which can differ greatly. Because of these differing sensor characteristics, the system’s deep learning model may struggle to extract the features it learned during training.

Bridging this domain gap is essential to ensure that AI systems can handle the full spectrum of scenarios they might encounter.
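How large is that gap in practice? A crude but useful proxy, sketched below under the assumption that frames are available as normalized arrays (the random tensors here are stand-ins for actual footage), is to compare simple per-channel statistics between the training sensor and the production sensor; more rigorous distances such as MMD or FID follow the same pattern:

```python
# Minimal sketch: quantify a sensor-level domain gap by comparing per-channel
# image statistics between two sets of frames. A crude proxy, not a full
# distribution-distance measure.
import numpy as np

def channel_stats(images: np.ndarray) -> np.ndarray:
    # images: (N, H, W, 3) float array in [0, 1]
    means = images.mean(axis=(0, 1, 2))
    stds = images.std(axis=(0, 1, 2))
    return np.concatenate([means, stds])

rng = np.random.default_rng(0)
train_sensor = rng.uniform(0.0, 1.0, (16, 64, 64, 3))   # stand-in for dataset frames
deploy_sensor = rng.uniform(0.2, 0.9, (16, 64, 64, 3))  # stand-in for production frames

gap = np.abs(channel_stats(train_sensor) - channel_stats(deploy_sensor))
print("per-channel mean/std gap:", np.round(gap, 3))
```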

3. Content domain gap

The content domain gap pertains to the diversity of content represented in a dataset: the disparity between the conditions in which the data is collected and the conditions in which the perception system is expected to operate. For example, a facial recognition model trained on a dataset composed primarily of one ethnic group may perform poorly on faces from underrepresented groups, and an autonomous vehicle trained on data from a well-maintained city may struggle in rural areas or extreme weather conditions because of this gap.

Addressing this gap involves creating more inclusive and diverse datasets that better reflect the real-world content AI systems will encounter.
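If scene-level metadata is available, this gap can be made measurable. The sketch below compares condition frequencies between a training set and a target deployment domain using total variation distance; all labels and counts are invented for illustration:

```python
# Minimal sketch: compare scene metadata distributions between the training
# set and the target deployment domain. Labels and frequencies are invented.
from collections import Counter

train_conditions = Counter({"urban": 9000, "highway": 800, "rural": 150, "rain": 50})
target_conditions = Counter({"rural": 6000, "rain": 2500, "urban": 1000, "highway": 500})

def to_probs(counter):
    total = sum(counter.values())
    return {k: v / total for k, v in counter.items()}

p, q = to_probs(train_conditions), to_probs(target_conditions)
# Total variation distance: 0 = identical coverage, 1 = completely disjoint.
keys = set(p) | set(q)
tvd = 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)
print(f"content gap (total variation distance): {tvd:.2f}")
```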

4. Lack of corner cases

Detecting corner cases is directly connected to the previous point. Corner cases are critical situations or environmental conditions that occur with low probability but account for a disproportionate share of fatal accidents, system malfunctions, and false positives. Training a model with insufficient data and variability will leave the final autonomous system neither robust nor trustworthy in situations it has not been properly prepared for.

RGB Color image generated synthetically by Anyverse
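One common mitigation, once corner cases have been identified, is to oversample them during training. A minimal PyTorch sketch, where the rarity flags and the 10x weight are illustrative choices rather than a prescribed recipe:

```python
# Minimal sketch: oversample rare "corner case" frames during training with
# PyTorch's WeightedRandomSampler. Rarity flags and the 10x weight are
# illustrative, not a prescribed recipe.
import torch
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

images = torch.randn(1000, 3, 32, 32)                # stand-in for dataset frames
is_corner_case = torch.zeros(1000, dtype=torch.bool)
is_corner_case[:20] = True                           # pretend 2% are rare critical scenarios

# Give corner cases 10x the sampling probability of ordinary frames.
weights = 1.0 + 9.0 * is_corner_case.float()
sampler = WeightedRandomSampler(weights, num_samples=len(weights), replacement=True)

loader = DataLoader(TensorDataset(images, is_corner_case), batch_size=64, sampler=sampler)
_, batch_flags = next(iter(loader))
print(f"corner cases in first batch: {int(batch_flags.sum())}/64")
```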

5. Data annotation quality

Crowdsourced annotation, while cost-effective, does not always yield high-quality labels, often necessitating significant post-processing and cleanup efforts.
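A standard quality check is to have a fraction of items labeled twice and measure inter-annotator agreement. A minimal sketch with bounding boxes and an illustrative 0.75 review threshold:

```python
# Minimal sketch: flag low-quality crowdsourced boxes by measuring agreement
# (IoU) between two annotators on the same object. The threshold is illustrative.
def iou(a, b):
    # boxes as (x1, y1, x2, y2)
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

annotator_1 = (100, 100, 200, 220)
annotator_2 = (110, 105, 205, 230)
agreement = iou(annotator_1, annotator_2)
print(f"IoU = {agreement:.2f}" + ("  -> review needed" if agreement < 0.75 else ""))
```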

6. Realism of some synthetic datasets

While synthetic data offers an alternative, it often struggles to capture the full realism and variety of real-world data, especially when generated with game engines. This limitation can hurt the generalization of models trained on synthetic data to real-world inputs.
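The size of this realism gap can be estimated with distribution-level metrics such as the Fréchet Inception Distance. A minimal sketch using torchmetrics (assumes `torchmetrics` and its image dependency `torch-fidelity` are installed; random tensors stand in for real and synthetic frames):

```python
# Minimal sketch: estimate the realism gap between synthetic and real frames
# with FID via torchmetrics. Downloads Inception weights on first use.
import torch
from torchmetrics.image.fid import FrechetInceptionDistance

fid = FrechetInceptionDistance(feature=64)  # smaller feature dim for speed

real_frames = torch.randint(0, 256, (64, 3, 299, 299), dtype=torch.uint8)
synthetic_frames = torch.randint(0, 256, (64, 3, 299, 299), dtype=torch.uint8)

fid.update(real_frames, real=True)
fid.update(synthetic_frames, real=False)
print(f"FID: {fid.compute().item():.2f}  (lower = closer to real-world statistics)")
```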

7. Challenging environmental conditions

Datasets may not always represent the full range of challenging environmental conditions encountered in real-world scenarios. For example, issues like direct lighting, reflections from specular surfaces, fog, or rain may not be adequately simulated in some datasets.

RGB Color images generated synthetically by Anyverse
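When such conditions are missing, simple augmentations can partially compensate. The sketch below approximates fog by alpha-blending toward a haze color; it is a crude stand-in for physically based simulation (libraries such as albumentations offer similar ready-made transforms):

```python
# Minimal sketch: approximate fog by alpha-blending a frame toward a haze
# color. A crude stand-in for physically based simulation of participating media.
import numpy as np

def add_fog(image: np.ndarray, density: float = 0.5) -> np.ndarray:
    # image: (H, W, 3) float array in [0, 1]; density 0 = clear, 1 = whiteout.
    haze = np.full_like(image, 0.9)  # light grey haze color
    return (1.0 - density) * image + density * haze

frame = np.random.default_rng(0).uniform(0.0, 1.0, (64, 64, 3))
foggy = add_fog(frame, density=0.6)
print("mean brightness before/after:", frame.mean().round(2), foggy.mean().round(2))
```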

8. Generalization to rare situations

Rare situations are difficult to capture in datasets and may only be recorded by a large fleet of vehicles logging data during real-world driving. This poses challenges for developing and testing AI systems that must handle such scenarios.

9. Difficulty in acquiring ground truth

Acquiring accurate ground truth data, especially for tasks like optical flow, can be challenging. Some methods involve time-consuming procedures or yield only sparse pixel-level annotations, while others are limited to controlled lab environments.
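Where ground truth does exist, it is worth validating. One sanity check for dense optical-flow labels, sketched below with synthetic arrays, is to warp the second frame back with the flow and measure the photometric error against the first frame:

```python
# Minimal sketch: sanity-check dense optical-flow ground truth by warping the
# second frame back with the flow and comparing against the first frame.
import numpy as np
from scipy.ndimage import map_coordinates

def warp_back(frame2: np.ndarray, flow: np.ndarray) -> np.ndarray:
    # frame2: (H, W) grayscale; flow: (H, W, 2) with (dx, dy) from frame1 to frame2.
    h, w = frame2.shape
    ys, xs = np.mgrid[0:h, 0:w].astype(np.float64)
    coords = np.stack([ys + flow[..., 1], xs + flow[..., 0]])
    return map_coordinates(frame2, coords, order=1, mode="nearest")

rng = np.random.default_rng(0)
frame1 = rng.uniform(size=(32, 32))
flow = np.zeros((32, 32, 2))          # zero flow => frames should match exactly
frame2 = frame1.copy()

error = np.abs(warp_back(frame2, flow) - frame1).mean()
print(f"mean photometric error: {error:.4f}  (high values flag suspect flow labels)")
```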

10. Limited real-world complexity

Some datasets may lack the complexity found in real-world scenes, such as complex structures, lighting variations, or shadows.

11. Pre-training on synthetic data

While synthetic datasets are useful for pre-training models, they may not fully prepare models for real-world scenarios. Fine-tuning on smaller, more realistic datasets is often necessary.
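A minimal sketch of that pre-train-then-fine-tune recipe in PyTorch, with a placeholder backbone, a hypothetical checkpoint path, and random tensors standing in for the small real dataset:

```python
# Minimal sketch: fine-tune a synthetically pre-trained backbone on a small
# real dataset by freezing the backbone and training only the head.
import torch
import torch.nn as nn

backbone = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                         nn.AdaptiveAvgPool2d(1), nn.Flatten())
head = nn.Linear(16, 5)
# backbone.load_state_dict(torch.load("synthetic_pretrained.pt"))  # hypothetical checkpoint

for p in backbone.parameters():          # freeze synthetically learned features
    p.requires_grad = False

optimizer = torch.optim.Adam(head.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

real_images = torch.randn(8, 3, 64, 64)  # stand-in for the small real dataset
real_labels = torch.randint(0, 5, (8,))

for _ in range(5):                        # a few fine-tuning steps
    logits = head(backbone(real_images))
    loss = loss_fn(logits, real_labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
print(f"final fine-tuning loss: {loss.item():.3f}")
```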

How do you fix data issues in your datasets for computer vision?

Addressing data issues within an organization is a complex task; there is no instant remedy. To tackle data quality problems at the root, data quality must be made a priority in the organization’s data strategy. From there, engaging and empowering all stakeholders to actively contribute to maintaining data quality becomes a crucial step.

Equally critical in this process is the selection of data generation tools. Opt for tools equipped with intelligent technologies that not only enhance data quality but also unlock the full potential and value inherent in the data.

Stay tuned to explore the most popular datasets in computer vision and autonomous driving

In the next article, we will give an overview of the most popular computer vision and autonomous driving datasets available in the market. Don’t miss it!

Anyverse™
The hyperspectral synthetic data platform for advanced perception