Bringing Synthetic Data into Computer Vision Training Pipelines

Yaser Khalighi
SceneBox
Published in SceneBox · Aug 2, 2022

TL;DR: Synthetic data is helping fuel the future of computer vision ML and should be a first-class citizen in any data management solution. At SceneBox, we are working with our partners at Synthesis AI to enable our customers to streamline their training workflows with combinations of synthetic and real data.

Close-Up Humans Dataset by Synthesis AI

The value of synthetic data is real

Across many verticals, we are seeing a shift toward synthetic data for vision-based perception and autonomous systems. At the same time, we are seeing new funding for synthetic data companies and an increased research effort within the community.

The trend toward synthetic data is real because it is far more cost-efficient, can be generated quickly to precise specifications, and comes with perfect ground-truth labels.

Traditionally, ML practitioners working on vision-based perception problems had to collect large volumes of real-world data, organize and curate it into a training dataset, have it labeled, and then train their models on it. This approach is not only much more expensive but also significantly more time-consuming: it burns through resources faster while moving slower.

There is a strong case for synthetic data, and it raises an obvious question: will synthetic data replace real-world data entirely?

In the long term, as synthetic data technology improves, this is very possible. Over the next few years, however, we believe synthetic data is on its way to making up 80% of the data used in ML training. The remaining 20% (likely more if the application is mission-critical) will still be real-world data.

Why?

The answer is in engineering history books

Let’s look at an example from engineering history that might help us understand where we are and where we are going. Take a highly regulated, research-intensive, mission-critical industry such as aviation for example.

In the 60s and 70s, when aerospace engineers were designing new aircraft fuselages and wings, they had to run dozens or even hundreds of expensive wind-tunnel tests for aerodynamic analysis. As you can imagine, these tests were costly, involved, and time-consuming. With the advent of computer simulations, the majority of the testing quickly moved to the digital world, dramatically reducing testing costs and accelerating development and innovation. However, even with these benefits, computer simulations could not replace real-world testing entirely. To this day, aerospace engineers still run a handful of final tests in the wind tunnel, as there are many complexities in real-world physics that a computer simulation cannot capture.

In the same vein, aircraft pilots spend many hours training in flight simulators, but they still log hundreds of hours in a real aircraft. Just as simulator hours cannot fully substitute for real flight hours, synthetic data will never completely replace real-world data.

As in the aerospace and pilot-training examples, today's ML practitioners are beginning to understand that real-world ML training data can be supplemented with, but not replaced by, synthetic data.

Real + Synthetic = Optimal training data

Synthetic data is important, but in the coming years it will coexist with real data rather than replace it entirely. Rather than a new age of synthetic-only training data, we appear to be entering a golden era of ML perception innovation and development. Never before have we had so many tools at our disposal, and we are still only scratching the surface of harnessing the power of data, real or otherwise.

Synthetic and real data should be managed together.

In order to optimally use all of the data tools at your disposal, your data management solution should extend to, and benefit from, synthetic data engines. ML practitioners are using both, so why shouldn’t tool providers do the same?

Currently, ML practitioners use data management solutions (such as SceneBox) to find gaps in their training data, characterize these gaps, and then fill them either from pre-existing data lakes or by collecting new real data that matches the requirements they have characterized.

Now, with the power of a synthetic data engine, we can simply generate new synthetic datasets from those same insights and characteristics. Easier yet, we can search a seemingly endless supply of existing synthetic data in a repository to fill these gaps.

With the importance of synthetic data and its harmony with data management solutions in mind, here are a couple of example workflows that integrated platforms such as SceneBox can help with:

Workflow 1: Identifying and filling data gaps with synthetic data

In this example, a computer vision engineer is training a model and evaluating it on a real dataset. Using a data management platform, the engineer finds edge cases and gaps in the training dataset; for example, they discover that the model exhibits low precision when identifying people with long hair wearing masks and headphones in fisheye camera footage. Given this insight, they go to a synthetic data engine, generate hundreds of examples of people with long hair wearing masks and headphones, and enrich the training dataset with this data. Bonus: there is no data privacy issue!
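To make this loop concrete, here is a rough Python sketch of the workflow. None of the object or method names below come from the real SceneBox or Synthesis AI SDKs; they are hypothetical stand-ins meant to show the shape of the loop.

```python
# Hypothetical sketch of the gap-filling loop. `data_manager`,
# `synthetic_engine`, and `training_set` stand in for whatever
# data-management and synthetic-data SDKs you actually use; none of
# these method names come from the real SceneBox or Synthesis AI APIs.

def fill_gap(data_manager, synthetic_engine, training_set):
    # 1. Characterize the failure slice found during evaluation.
    gap_attributes = {
        "hair_length": "long",
        "face_mask": True,
        "headphones": True,
        "camera_model": "fisheye",
    }

    # 2. Confirm the slice is under-represented in the real training data.
    real_examples = data_manager.search_images(**gap_attributes)
    print(f"Only {len(real_examples)} real images match this slice.")

    # 3. Generate hundreds of synthetic examples with the same attributes,
    #    each arriving with perfect ground-truth labels and no privacy risk.
    job = synthetic_engine.generate(count=500, attributes=gap_attributes)

    # 4. Enrich the training set and retrain the model on the result.
    training_set.extend(job.download_images_and_labels())
```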

Workflow 2: Finding labeling noise in training data

In this example, you have a workforce that has labeled thousands of frames collected from real cameras. The problem is that if the labelers mislabel the data, the model won’t be trained correctly. So, how do you know if the data is mislabeled without a massive manual review?

One way to test for mislabeled data is to mix synthetic frames in with the real ones and use the perfect ground-truth labels inherent to synthetic data to check the accuracy of the human labels. With a sample of perfectly labeled data, we can quantify the label noise.
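Here is a minimal sketch of that check in Python. The frame-ID-keyed dictionaries are an assumed data layout, not a real pipeline format; the idea is simply to score the workforce against the synthetic probe frames.

```python
def estimate_label_noise(human_labels, ground_truth):
    """Estimate the workforce error rate from a synthetic probe set.

    human_labels: frame_id -> label assigned by the labeling workforce
    ground_truth: frame_id -> perfect label shipped with the synthetic
                  frame (only synthetic probe frames appear here)
    """
    probe_ids = ground_truth.keys() & human_labels.keys()
    if not probe_ids:
        raise ValueError("No labeled synthetic probe frames found.")
    errors = sum(1 for fid in probe_ids
                 if human_labels[fid] != ground_truth[fid])
    return errors / len(probe_ids)

# Toy example: three synthetic probe frames mixed into a labeling batch.
synthetic_truth = {"syn_001": "person", "syn_002": "person",
                   "syn_003": "bicycle"}
workforce_labels = {"syn_001": "person", "syn_002": "mannequin",
                    "syn_003": "bicycle", "real_104": "car"}

noise = estimate_label_noise(workforce_labels, synthetic_truth)
print(f"Estimated label noise: {noise:.1%}")  # -> 33.3% in this toy case
```

Because the probe frames go through the same labeling pipeline as the real frames, the measured error rate is a reasonable estimate of the noise in the human labels overall.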

A partnership between Synthesis AI and SceneBox

Closeup Humans Open Source Dataset on SceneBox

We have been working with our colleagues at Synthesis AI to bring the power of synthetic data to the fingertips of our customers. As a first step, SceneBox is hosting the Close-Up Humans Dataset, a synthetic, open-source dataset generated by Synthesis AI. You can check it out here.

Visualizing Unique, Comprehensive Labels in Synthesis AI’s HumanAPI

With every image, Synthesis AI datasets provide an expanded set of pixel-perfect labels, including segmentation maps, dense 2D/3D landmarks, depth maps, surface normals, and more. Synthesis AI's customers use the JSON-based API to reduce model bias caused by imbalanced datasets while preserving privacy. By inspecting the data on SceneBox, customers can find these imbalances and request new data from Synthesis AI's API to ensure equal representation across identities, facial attributes, pose, camera, lighting, and more.
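As a rough illustration of what consuming those labels can look like, here is a short Python sketch that loads one image's annotations. The file name and every field below are illustrative assumptions, not the actual Synthesis AI schema; consult the API documentation for the real structure.

```python
import json

# Load the JSON annotations that accompany a single rendered image.
# NOTE: the file name and every field below are illustrative guesses,
# not the actual Synthesis AI schema.
with open("image_000123.annotations.json") as f:
    labels = json.load(f)

# Pixel-perfect labels arrive alongside the RGB image:
segmentation_path = labels["segmentation_map"]  # per-pixel class IDs
depth_path = labels["depth_map"]                # per-pixel metric depth
normals_path = labels["surface_normals"]        # per-pixel surface normals

# Dense landmarks are plain coordinates:
landmarks_2d = labels["landmarks"]["2d"]        # [[x, y], ...]
landmarks_3d = labels["landmarks"]["3d"]        # [[x, y, z], ...]

# Scene metadata is what you aggregate to audit dataset balance.
meta = labels["metadata"]
print(meta["skin_tone"], meta["camera"]["yaw"], meta["camera"]["pitch"])
```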

Easily survey the different styles of metadata available by toggling landmarks and segmentation
Examine masks, metadata, and more on SceneBox

Exploring Bias in Datasets

SceneBox's integration with Synthesis AI's HumanAPI also makes exploring distributions in datasets as simple as a few clicks of the mouse. For example, skin tone is an important and common bias factor for human-centric computer vision models. Synthesis AI's HumanAPI allows you to generate millions of combinations of environment, camera, and character parameters, all within a desired skin-tone range. With the SceneBox-Synthesis AI integration, you can easily filter the dataset by different Fitzpatrick skin tones, or write your own custom filter.

Filter data by any metadata to uncover bias
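The same filter is easy to express outside the UI over exported metadata. A minimal pandas sketch, assuming a hypothetical fitzpatrick_type column (the file and column names are assumptions, not a real export format):

```python
import pandas as pd

# Per-image metadata exported to a table; the file and column names
# are illustrative assumptions, not a real SceneBox export format.
meta = pd.read_csv("dataset_metadata.csv")

# The distribution across Fitzpatrick types (I-VI) exposes imbalance.
print(meta["fitzpatrick_type"].value_counts(normalize=True))

# Custom filter: isolate under-represented skin tones.
subset = meta[meta["fitzpatrick_type"].isin(["V", "VI"])]
print(f"{len(subset)} of {len(meta)} images have type V or VI skin tones.")
```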

Another common bias is camera angle. We've found that the majority of real datasets only show subjects within a limited range of camera angles, quite different from the actual use case at hand. With the SceneBox-Synthesis AI integration, you can easily filter the dataset by camera angle; for example, simply click through to the camera_db and filter by angle to see that only 284 of 10,000 images are at extreme yaw and pitch angles simultaneously. Of course, with Synthesis AI's HumanAPI, you can request as many extreme camera angles as you would like!

Only 284 of 10,000 images are at extreme angles for both yaw and pitch
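The extreme-angle check described above can be sketched the same way. The column names and the 45-degree cutoff below are illustrative assumptions:

```python
import pandas as pd

# Same illustrative metadata export as above.
meta = pd.read_csv("dataset_metadata.csv")

# Define "extreme" as exceeding a cutoff on BOTH yaw and pitch.
# The 45-degree threshold is an arbitrary choice for this sketch.
EXTREME_DEG = 45.0
extreme = meta[(meta["camera_yaw"].abs() > EXTREME_DEG)
               & (meta["camera_pitch"].abs() > EXTREME_DEG)]

# On the dataset discussed above, this kind of query surfaced 284 of
# 10,000 images, i.e. under 3% coverage of extreme viewpoints.
print(f"{len(extreme)} of {len(meta)} images are extreme in both axes.")
```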

What the future holds between SceneBox and Synthesis AI

The teams at Synthesis AI and SceneBox are both excited to share this first step in our partnership and hope you find it useful. In the long term, we envision helping our users easily identify biases in their datasets and generate new data at the click of a button! Don't hesitate to get in touch with Synthesis AI or SceneBox with any questions.

Published in SceneBox: DataOps for computer vision. Designed to find game-changing data in enormous datasets, faster than ever.

Written by Yaser Khalighi: ML technologist, ex-Founder at SceneBox.