Don’t let poor data become your perception system’s kryptonite

Anyverse™ · Feb 23, 2022

Poor data… the most dangerous villain that advanced perception system developers need to face and defeat if they want to build an accurate deep learning model.

Surely you have had to deal with it at some point, because let's be honest: you can collect all the real-world data you want and still not have all the data you need to train your deep learning models accurately. So what can you do to minimize data gaps? What can you do to avoid ineffective or biased training data?

Little hint: synthetic data may help… a lot.

If your data is poor, your deep learning model is useless

Poor data quality is enemy number one for deep learning in advanced perception development and many other use cases. But why? The answer is common sense: deep learning models demand high levels of quality and accuracy from very early stages, first in the data used to train the predictive model, and then in the new data that model uses to make future decisions.

To properly train a DL model, the data must meet exceptionally high standards of accuracy, coverage, and quality.

The data must be right:

  • Physically correct
  • Properly labeled
  • Deduplicated

But it must be the right data for the specific model too:

  • Unbiased
  • Sensor-specific
  • With enough scene variability

Most data gatherers focus on one criterion or the other, but for computer perception deep learning models, you have to consider all of them simultaneously.
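To make the "deduplicated" criterion a bit more concrete, here is a minimal sketch of exact-duplicate detection by content hashing (assuming Python 3.9+ and PNG images in a hypothetical data/train folder; near-duplicate detection with perceptual hashes is a separate topic):

```python
import hashlib
from pathlib import Path

def find_exact_duplicates(image_dir: str) -> dict[str, list[Path]]:
    """Group image files by content hash; groups with more than one file are exact duplicates."""
    groups: dict[str, list[Path]] = {}
    for path in Path(image_dir).glob("*.png"):  # PNG only, for simplicity
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        groups.setdefault(digest, []).append(path)
    return {h: paths for h, paths in groups.items() if len(paths) > 1}

if __name__ == "__main__":
    duplicates = find_exact_duplicates("data/train")  # hypothetical folder
    for digest, paths in duplicates.items():
        print(f"{len(paths)} identical files: {[p.name for p in paths]}")
```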

Why? Let's take the most commonly discussed case right now, self-driving cars. If your self-driving car kills someone on the road, whose fault is it? You can blame the poor quality of your model's training data, but that won't help, will it? We are talking about advanced perception applications where reliability is a must, and therefore the data must be too.

As you can already presume, poor data leads to poor results and inadequate outcomes. And not only that: have you thought about the cost of "bad" data?

The cost of poor data

Bad data can prove to be quite expensive for companies, and not only in economic terms, as we saw in the previous section of this article.

As Rongala A. shares in his research study Cost of Bad Data for Organizations, attempts to quantify the financial impact of poor data have led to some pretty shocking numbers, among others:

  • Gartner states that organizations lose $13.3 million per year, on average, to poor data
  • Cio.com states that 80% of companies believe they have lost revenue due to data challenges
  • CrowdFlower states that data scientists spend 60% of their time cleaning and organizing data
  • Pragmaticworks states that 20 to 30% of operating expenses are due to bad data

In addition to these, you should consider other, non-financial costs: manual labeling, the creation of customized data, time-consuming processes, and a lack of scalability and adjustability, all of which ultimately make you less efficient and overload your team.

How to deal with poor data and data gaps in machine learning

Now that you know that poor data has the power to ruin your deep learning model, eat up your team's time, and waste a huge part of your budget, what are you going to do?

There are several approaches you may follow. Their effectiveness and applicability will depend on each case and each specific model.

  • Model complexity

You can build a simpler model with fewer parameters. Such a model is less susceptible to overfitting, and this approach is often used to improve classification and prediction.

The problem with this method is that it only gets you so far. It may work for simpler models and applications, but technologists seem to be moving in the opposite direction.

The real world is complex, and DL models have no choice but to evolve to understand and interpret more complex scenes.
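As a rough illustration of what "a simpler model with fewer parameters" can look like in practice, here is a minimal sketch assuming PyTorch and a generic image classification task (the class name, layer sizes, and input resolution are placeholders, not anyone's production architecture):

```python
import torch
import torch.nn as nn

class CompactClassifier(nn.Module):
    """A deliberately small CNN: two conv blocks and one linear head."""
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.head = nn.Linear(32 * 8 * 8, num_classes)  # assumes 32x32 inputs

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.features(x)
        return self.head(x.flatten(1))

model = CompactClassifier()
n_params = sum(p.numel() for p in model.parameters())
print(f"Trainable parameters: {n_params:,}")  # tens of thousands, not millions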

  • Transfer learning

Transfer learning is applied to DL and neural networks: you take a pre-trained model and fine-tune it on the small dataset that you have.

You can also reuse already trained neural networks that solve a problem similar to yours, usually leaving the network architecture unchanged and reusing some of the model weights.

This is useful when the new dataset is small and not sufficient to train the model from scratch. But as you can imagine, even if the model you have chosen has already been tested and trained, the poor new data you introduce to it (and that it uses to make future decisions) will still hurt its performance and results.
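A minimal sketch of this reuse pattern, assuming PyTorch/torchvision (0.13+ weights API) and an ImageNet-pretrained ResNet-18; the number of classes is a placeholder:

```python
import torch.nn as nn
import torch.optim as optim
from torchvision import models

# Load a network pre-trained on ImageNet.
backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

# Freeze the pre-trained weights so only the new head is updated.
for param in backbone.parameters():
    param.requires_grad = False

# Replace the final layer to match our (hypothetical) number of classes.
num_classes = 5
backbone.fc = nn.Linear(backbone.fc.in_features, num_classes)

# Only the new head's parameters are passed to the optimizer.
optimizer = optim.Adam(backbone.fc.parameters(), lr=1e-3)
```

Freezing the backbone keeps the pre-trained weights intact while only the new head learns from the small dataset.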

  • Data augmentation

Data augmentation helps you generate new images by slightly modifying the ones you already have. It takes pre-existing samples and transforms them to create new ones, increasing the number of training samples. Some data augmentation techniques are scaling, rotation, and affine transforms.

These image processing options are often used as pre-processing techniques to make image classification models built with CNNs more robust and to minimize the effect of insufficient data. Still, how much they can alleviate the problem of poor data is not clear.
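For reference, a minimal sketch of such a pipeline using torchvision transforms; the specific parameter values are illustrative, not a recommendation:

```python
from torchvision import transforms

# Augmentation pipeline covering the techniques mentioned above:
# scaling (RandomResizedCrop), rotation, and affine transforms.
train_transforms = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),        # random scaling/cropping
    transforms.RandomRotation(degrees=10),                       # small rotations
    transforms.RandomAffine(degrees=0, translate=(0.1, 0.1)),    # small shifts
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])

# Typically plugged into a dataset, e.g.:
# dataset = torchvision.datasets.ImageFolder("data/train", transform=train_transforms)
```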

  • Synthetic data

Synthetic data empowers you to artificially generate samples that mimic real-world data and to complement your datasets with data that is difficult (sometimes impossible) to get in the real world, for example by adding custom scene variability or corner cases. And not only that: synthetic data on its own is already being tested to train object detection algorithms with promising results, as you can read in the paper RarePlanes: Synthetic Data Takes Flight.
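As a simple illustration of how synthetic samples can complement a real dataset at training time, here is a minimal sketch assuming PyTorch and two image folders whose paths and layout are hypothetical:

```python
from torch.utils.data import ConcatDataset, DataLoader
from torchvision import datasets, transforms

to_tensor = transforms.Compose([transforms.Resize((224, 224)), transforms.ToTensor()])

# Real captures and synthetically generated frames, kept in separate folders
# (paths and folder structure are assumptions for this example).
real_data = datasets.ImageFolder("data/real", transform=to_tensor)
synthetic_data = datasets.ImageFolder("data/synthetic", transform=to_tensor)

# Train on the union of both sources; corner cases missing from the real set
# can be covered by the synthetic one.
combined = ConcatDataset([real_data, synthetic_data])
loader = DataLoader(combined, batch_size=32, shuffle=True)
```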

After analyzing these four paths, synthetic data seems to be the best-positioned alternative to fight bad data, right?

Can synthetic data bridge the AI training data gap?

Advanced perception systems' deep learning models don't just learn by themselves, at least not yet. They need to be trained with data that carries enough information to help them generalize to data the system has never seen before. And "enough" is the keyword here… Normally, real-world data is not enough; even worse, it can be poorly labeled, making it inaccurate, or it can add bias to the system, causing the model not to perform as expected or leading it to erroneous results.

Synthetic data is not the most popular option yet, but it's going to be the new kid at school soon. According to the latest Synthesis.ai synthetic data survey report:

  • 82% of those surveyed recognize their organization is at risk when they collect "real-world" data
  • 60% of decision-makers believe their industry will use synthetic data, either on its own or in combination with real-world data, within the next five years
  • 89% believe organizations that fail to adopt synthetic data to train their systems will lag behind

The need for synthetic data in the development of deep learning models for perception systems is a fact, and developers can't afford to look the other way: their competitors have already jumped on the train and are ready to anticipate and overcome their data gaps.
