With the upcoming CVPR conference, we thought it would be useful to highlight an emerging trend in computer vision: synthetic data. Synthetic data is information that is artificially manufactured rather than generated by real-world events. It is not limited to visual data; it also exists for voice, entities, and sensors (LIDAR, radar, and GPS). We delineate synthetic data’s value below and categorize 45 offerings. We are excited about innovation in the space and look forward to speaking with synthetic data startups.
With the advancement of off-the-shelf training frameworks like TensorFlow and PyTorch, it is easier to build Machine Learning (ML) models than ever before. Unfortunately, data remains ML’s cold start problem. Often, companies can’t acquire enough data within a given time frame to build highly accurate models. Additionally, large companies like Google have massive data moats that are hard to penetrate. Today, businesses that are capturing data are hand labeling it, which can be slow and costly and can produce low-quality labels. Synthetic data helps businesses bypass these constraints, democratizing data.
Synthetic data has multiple benefits:
- Decreases reliance on generating and capturing data
- Minimizes the need for third-party data sources if businesses generate synthetic data themselves
- Can be cheaper and faster than hand labeling (see our piece on data labeling)
- Can produce data that is difficult to capture in the wild (e.g. underwater or military conflict zone visual content)
- Can generate data that occurs infrequently in nature but is critical for training (e.g. edge cases)
- Produces high volumes of data
- Offers perfectly labeled data
- Supports faster labeling iteration
- Diminishes privacy concerns
This piece will mostly focus on visual synthetic data, which comes in two main forms: 1) photorealistic data and 2) programmatically created data. Photorealistic data is produced by artists and is intended to look as much like reality as possible. Generating photorealistic data takes longer than using programmatic techniques.
Programmatic synthetic data can be created using game engines and 3D tools like Unreal, Unity, and Blender. Procedural systems, like Houdini, are then used to accelerate the creation of assets. Next, teams can use techniques like domain adaptation with Generative Adversarial Networks (GANs) or domain randomization to increase the variation in the data.
Domain adaptation is the task of classifying an unlabeled dataset (target) using a labeled dataset (source) from a related domain. It allows teams to combine low-quality synthetic data with real data to make the synthetic data better.
Domain randomization also helps decrease the reality gap. According to NVIDIA’s paper, “domain randomization intentionally abandons photorealism by randomly perturbing the environment in non-photorealistic ways to force the network to learn to focus on the essential features of the image.” Adjustments to the data can include image scene, lighting position and intensity, texture, scale, and position. Instead of training a model on one simulated data set, teams randomize the simulator to expose the model to a wide range of varied data (exhibit below). This is quickly becoming the most popular technique because it has a low barrier to entry.
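To make the idea concrete, here is a minimal sketch of the sampling step in domain randomization. It only draws random scene parameters (lighting position and intensity, texture, scale, and position); a real pipeline would hand these to a renderer or simulator. The `SceneParams` class, the `sample_scene` function, and all numeric ranges are our own illustrative assumptions and are not taken from the NVIDIA paper or any particular engine.

```python
import random
from dataclasses import dataclass


@dataclass
class SceneParams:
    """Hypothetical container for one randomized scene configuration."""
    light_position: tuple   # (x, y, z), arbitrary scene units
    light_intensity: float  # arbitrary multiplier
    texture_id: int         # index into a hypothetical texture library
    object_scale: float     # multiplier on the object's base size
    object_position: tuple  # (x, y) on the ground plane


def sample_scene(rng, num_textures=50):
    """Draw one scene configuration; all ranges are illustrative assumptions."""
    return SceneParams(
        light_position=(rng.uniform(-5, 5), rng.uniform(-5, 5), rng.uniform(1, 10)),
        light_intensity=rng.uniform(0.2, 3.0),
        texture_id=rng.randrange(num_textures),
        object_scale=rng.uniform(0.5, 2.0),
        object_position=(rng.uniform(-1, 1), rng.uniform(-1, 1)),
    )


def randomized_scenes(n, seed=0):
    """Return n independently randomized scene configurations."""
    rng = random.Random(seed)  # seeded so a dataset can be regenerated
    return [sample_scene(rng) for _ in range(n)]


if __name__ == "__main__":
    for scene in randomized_scenes(3, seed=42):
        print(scene)
```

Each configuration would be rendered into one labeled image, so the model sees the same objects under many lighting, texture, and placement conditions rather than a single simulated look.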
Within domain randomization is a sub-category called guided domain randomization. This research area focuses on automatically creating the randomizations instead of manually designing them, which can be tedious. The ability to programmatically create synthetic data further accelerates time to value.
Businesses can choose between using third-party vendors that provide synthetic data or building their own internal teams. We’ve heard it is very hard to identify and hire individuals with the right mix of technical art, game development, and ML expertise. When teams decide to leverage synthetic data, we hear they blend synthetic and real data together for training. Often the ratio is 80%-90% synthetic to 10%-20% real.
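As a rough illustration of that blend, the sketch below assembles a training set with a chosen synthetic-to-real ratio. The function name `blend_datasets` and its parameters are hypothetical, and the examples are plain Python lists standing in for whatever image or record type a team actually uses.

```python
import random


def blend_datasets(synthetic, real, total, synthetic_fraction=0.8, seed=0):
    """Sample a shuffled training set with the requested synthetic share.

    Returns a list of (example, source) pairs, where source is
    "synthetic" or "real". Sampling is without replacement.
    """
    if not 0.0 <= synthetic_fraction <= 1.0:
        raise ValueError("synthetic_fraction must be between 0 and 1")
    n_synthetic = round(total * synthetic_fraction)
    n_real = total - n_synthetic
    if n_synthetic > len(synthetic) or n_real > len(real):
        raise ValueError("not enough examples to reach the requested mix")

    rng = random.Random(seed)
    mixed = [(x, "synthetic") for x in rng.sample(synthetic, n_synthetic)]
    mixed += [(x, "real") for x in rng.sample(real, n_real)]
    rng.shuffle(mixed)  # avoid feeding the model one source at a time
    return mixed


if __name__ == "__main__":
    synthetic_pool = [f"synthetic_{i}" for i in range(5000)]
    real_pool = [f"real_{i}" for i in range(500)]
    train = blend_datasets(synthetic_pool, real_pool, total=1000, synthetic_fraction=0.8)
    print(len(train), sum(1 for _, src in train if src == "synthetic"))
```

Keeping the source tag on each example makes it easy to measure later how the model performs on real data alone.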
Academic research is working on techniques to create synthetic data that can represent 100% of the training data and create models with the same level of accuracy as models trained on real data. Currently, cross-domain applications are where synthetic data shines. For example, if you are an autonomous vehicle company building a car that will drive in San Francisco and Tokyo, you will want training data from both sites. Perhaps you don’t have access to Tokyo data. If you solely trained on San Francisco data and then ran the vehicle in Tokyo, its performance would be worse than if you complemented the real San Francisco training data with synthetic Tokyo data.
Most synthetic data today suffers from a “reality gap,” meaning it does not appear realistic. In turn, it is rare that synthetic data used for training within a domain performs as well as or better than real data from that domain. Within a domain, synthetic data can fall short because it often needs to capture physical behavior like gravity and inertia. Accurately mirroring physics principles is hard, but gaming engines are making progress.
There is advanced academic research coming out of Berkeley, OpenAI, and NVIDIA that is pushing forward the ability to use only synthetic data to generate highly accurate models. For example, an OpenAI paper built a data generation pipeline using domain randomization to synthesize objects. The robot grasping model generated from 100% synthetic data achieved a >90% success rate on grasping previously unseen realistic objects.
Even blending together different types of synthetic data can have a positive impact. An NVIDIA paper found that blending domain randomized and photorealistic data generated an object pose estimation model that was able to perform competitively against a state-of-the-art network trained on a combination of real and synthetic data. We haven’t come across any business that has successfully used 100% synthetic data to build highly accurate models running in production.
Use cases for synthetic data are wide ranging. For computer vision applications, the most common use cases for synthetic visual data are autonomous systems (AV, robotics, and drones), agtech, real estate, video surveillance, CPG, retail, and defense. The use of synthetic entity data has been catalyzed by privacy concerns since it can strip out names, emails, social security numbers, etc. but will still mirror the underlying data set. This helps data scientists perform experiments without accessing sensitive information. We’ve seen synthetic voice data utilized in media production.
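For synthetic entity data, a deliberately simple sketch of the idea looks like the following. Sensitive fields are replaced with generated placeholders, and every other field is resampled from the values seen in the real records. The field names in `SENSITIVE_FIELDS` and the `synthesize_records` function are hypothetical. Note that sampling each field independently preserves per-field distributions but not correlations between fields, and it offers no formal privacy guarantee, so commercial offerings would need to go well beyond this.

```python
import random

# Hypothetical set of fields treated as sensitive in this sketch.
SENSITIVE_FIELDS = {"name", "email", "ssn"}


def synthesize_records(records, n, seed=0):
    """Build n synthetic rows that mirror the non-sensitive fields of `records`."""
    if not records:
        raise ValueError("need at least one real record to mirror")
    rng = random.Random(seed)
    fields = list(records[0].keys())
    synthetic = []
    for i in range(n):
        row = {}
        for field in fields:
            if field in SENSITIVE_FIELDS:
                # Replace with a placeholder that carries no real identity.
                row[field] = f"{field}_{i:05d}"
            else:
                # Resample from observed values to keep the field's distribution.
                row[field] = rng.choice(records)[field]
        synthetic.append(row)
    return synthetic


if __name__ == "__main__":
    real = [
        {"name": "Ada Lovelace", "email": "ada@example.com", "age": 36, "plan": "pro"},
        {"name": "Alan Turing", "email": "alan@example.com", "age": 41, "plan": "free"},
    ]
    for row in synthesize_records(real, n=3, seed=1):
        print(row)
```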
We categorize 45 synthetic data solutions across six categories: 1) tools, 2) sensor (camera, LIDAR, radar, and GPS), 3) entity, 4) voice, 5) forensics, and 6) products / avatars leveraging synthetic data. We appreciate our exhibit below is not comprehensive but highlights some of the more well-known offerings in the space.
Our exhibit includes products, like media production, that leverage synthetic data. Over the past few months there has been a wave of “deepfakes,” which are videos or audio that present something that didn’t actually occur. For example, Lyrebird can replicate Trump’s voice.** Synthesia’s recent video of David Beckham speaking against malaria used ML to generate the content. There are now deepfakes of Elon Musk, Salvador Dali, and Barack Obama.
Deepfakes are a rising concern because they can often be nearly indistinguishable from reality. McAfee, Symantec, and academia are working on forensic techniques to detect deepfakes. A Black Hat 2018 paper from Symantec describes a way to spot fake videos based on Google FaceNet. The University at Albany introduced software that could identify a deepfake video by analyzing how often the simulated faces blinked. In the future, we believe synthetic audio and visual content will be watermarked to avoid confusion.
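The blink-frequency idea can be illustrated with a toy heuristic. The sketch below is not the Albany researchers' software. It assumes some upstream eye-state detector has already produced one open/closed flag per video frame, and it simply flags clips whose blink rate falls below a threshold. The function names and the default threshold of 5 blinks per minute are our own assumptions for illustration.

```python
def count_blinks(eye_open):
    """Count open-to-closed transitions in a per-frame sequence of booleans.

    A clip that starts with the eyes closed counts that as its first blink.
    """
    blinks = 0
    previously_open = True
    for is_open in eye_open:
        if previously_open and not is_open:
            blinks += 1
        previously_open = is_open
    return blinks


def blinks_per_minute(eye_open, fps):
    """Convert the blink count into a rate using the clip's frame rate."""
    if fps <= 0 or not eye_open:
        raise ValueError("need a positive fps and at least one frame")
    duration_minutes = len(eye_open) / fps / 60.0
    return count_blinks(eye_open) / duration_minutes


def looks_suspicious(eye_open, fps, min_blinks_per_minute=5.0):
    """Flag a clip whose blink rate is below the (assumed) threshold."""
    return blinks_per_minute(eye_open, fps) < min_blinks_per_minute


if __name__ == "__main__":
    fps = 30
    frames = [True] * (60 * fps)              # one minute of video
    for start in range(60, len(frames), 120):  # a 3-frame blink every 4 seconds
        frames[start:start + 3] = [False] * 3
    print(blinks_per_minute(frames, fps), looks_suspicious(frames, fps))
```

A heuristic this simple is easy to defeat once generators learn to blink, which is part of why we expect detection to remain an arms race.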
Synthetic data is a rising trend in the ML and data science community. Synthetic data exists across voice, sensor, and entity data. It presents many benefits compared to data labeling techniques including speed, cost, scale, and diversity. There are a few vendors offering synthetic data as a service and others leveraging it to improve media production. With the emergence of deepfakes, verification of real vs. synthetic content will be needed. This field is nascent but rapidly evolving. If you are working on a synthetic data startup, we would love to talk to you.
Special thanks to Javaughn Lawrence, Josh Tobin, and Jonathan Tremblay.
** Redpoint is an investor in Lyrebird.