Rise of Synthetic Data: Welcome to 2020

Chenda Bunkasem
Augustus Intelligence
3 min read · Jul 1, 2019

Two weeks ago, I attended CVPR, the world’s largest international conference on computer vision to date. Aside from constant discussion of the ethical implications of facial recognition, I saw a range of both direct and indirect references to synthetic data. If you don’t have the training data your algorithm needs…why not just…make it? Because this is a fairly new topic of interest, those with only a passing familiarity with the computer graphics or entertainment industries may have found this solution less than intuitive.

The Argument for Synthetic Data

Machine learning engineers and data scientists today face a persistent dilemma: a deficit of data. The training data available for their algorithms is often scarce or poorly matched to the task, resulting in suboptimal performance.

In computer vision specifically, researchers run into this time and time again, with frustrating results. COCO provides a nice testing ground for human recognition thanks to its myriad images of people, but when it comes time to detect, let’s say, a distinctive piece of jewelry, a model trained on it fails because the dataset lacks jewelry examples.

Someone at the Lambda Labs installation succinctly emphasized the importance of data at CVPR with this T-shirt, which I now am scouring the internet to purchase:

Why? Because data structures the universe that ML models learn to navigate. As artificial intelligence researchers, we must understand that algorithmic design is only half of the opportunity. Training data is precious. And for your classification stage, it will make or break your performance.

Photorealism Bridges the “Reality Gap”

Of course, the objection still stands: is synthetic data good enough for training? Will it suffice as a stand-in for the real deal?

The answer to this question will not lie in the intransigent thinking of our skeptics, but in a practical approach: domain randomization. Graphical rendering engines can now add variance to their simulations, blurring the line between, well, the real and the fake. As that line blurs, models will increasingly struggle to tell a rendered object from a real one, giving researchers a nearly limitless playground in which to experiment with their machine learning designs.
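To make the idea concrete, here is a minimal sketch of domain randomization in Python. It only samples the scene settings that a renderer would consume; it does not render anything. Every name, parameter, and value range in it (`SceneParams`, `sample_scene`, the lighting and camera bounds) is an assumption chosen for illustration and is not tied to Unreal Engine, Unity, or any other engine’s API.

```python
import random
from dataclasses import dataclass
from typing import List


@dataclass
class SceneParams:
    """Hypothetical render settings for one synthetic image."""
    light_intensity: float    # arbitrary units, assumed range
    light_azimuth_deg: float  # direction of the light around the object
    camera_distance_m: float  # distance from camera to object
    texture_id: int           # index into an assumed texture library
    background_id: int        # index into an assumed background library


def sample_scene(rng: random.Random,
                 n_textures: int = 50,
                 n_backgrounds: int = 200) -> SceneParams:
    """Draw one scene configuration uniformly at random."""
    return SceneParams(
        light_intensity=rng.uniform(0.2, 2.0),
        light_azimuth_deg=rng.uniform(0.0, 360.0),
        camera_distance_m=rng.uniform(0.5, 3.0),
        texture_id=rng.randrange(n_textures),
        background_id=rng.randrange(n_backgrounds),
    )


def sample_dataset(n_images: int, seed: int = 0) -> List[SceneParams]:
    """Sample one configuration per image; a fixed seed keeps it reproducible."""
    rng = random.Random(seed)
    return [sample_scene(rng) for _ in range(n_images)]


if __name__ == "__main__":
    for params in sample_dataset(3):
        print(params)
```

Each sampled configuration would be handed to a renderer to produce one labeled training image. The point of the variation is that a model trained across many such scenes is less likely to latch onto any single lighting setup or background.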

NVIDIA’s GeForce RTX 2080 now performs real-time ray tracing with the help of DLSS (Deep Learning Super Sampling), which allows these synthesized images, and soon animations, to be updated procedurally. Game engines such as Unreal Engine and Unity are leading this venture, making the illusion of reality ever more convincing.

AI Gone Rogue

With the rise of Deepfakes and the advancement of Generative Adversarial Networks (such as NVIDIA’s style-based GAN, StyleGAN), we must also be wary of the sophistication that these synthesized images and objects can reach. Black Hat Europe 2018 featured a talk named “AI Gone Rogue,” which underscores the need for researchers focused on preventing the proliferation of Deepfakes. The 2016 election has already sparked a public desire for truth. Fake news was one problem, but what happens when fake images are generated to accompany it? What happens when these images turn into video clips?

The lines between the real and the fake are blurring rapidly, and we must come to grips with the exciting, yet simultaneously jarring, effects of advancements in computer vision. The question now becomes: are we ready for the capabilities that synthetic data can reach? As photorealism becomes the norm in the graphics industry, machine learning researchers need to catch on to this new trend.

Let’s preempt these pitfalls, allowing for a safer and less malleable internet in 2020. Considering the contingency of truth and the impending US election, it is time to face the reality of synthetic data and its powerful capabilities before it’s too late.

Chenda Bunkasem
Augustus Intelligence

University of London alumni | Artificial Intelligence R&D, Products, and Ethics