Scaling up Synthetic Supervision for Computer Vision

Toyota Research Institute
Sep 22, 2021


By Jie Li, Vitor Guizilini, Pavel Tokmakov, Sergey Zakharov, and Adrien Gaidon

Building on previous blog posts, in this post we will explain:

  • Why we need simulation,
  • What the “sim-to-real gap” is,
  • How to use privileged information from simulation,
  • How to combine synthetic supervision with self-supervised learning,
  • How to supervise the invisible with synthetic data, and
  • How to leverage differentiable rendering.

The Need For Sim

Large-scale, labeled datasets are key to modern machine learning systems. However, continuously building datasets at scale is often a slow, expensive trial-and-error process. Challenges include spurious correlations, harmful biases, and the long tail of unknown unknowns. For some tasks, the ideal dataset is not even possible to acquire in the first place, either for ethical or practical reasons.

An appealing alternative is to use simulators to re-create parts of the world (e.g. Jensen’s Kitchen) and generate synthetic datasets. These can be fully understood, controlled, tested, optimized for efficiency, scaled to arbitrary sizes, and perfectly labeled. On this last point, simulators can even generate labels beyond the annotation capabilities of humans, e.g., labeling forces for fully occluded objects. This is useful to teach machines how the world works, not just what it looks like on the surface. In addition, simulators can easily generate corner case scenarios or rare events that would be hard or dangerous to collect in the real world. Simulators basically enable your supervised ML algorithm to learn in “God mode” vs being micromanaged by a crowd-for-hire.

Although the effectiveness of synthetic data is undeniable, how can we make it even more useful? There has been a lot of exciting progress in simulation for autonomous driving and robotics, ranging from re-simulation (VirtualKITTI) and dynamics (DRAKE) to procedural generation of driving scenes (Parallel Domain). But neural networks, whether human or artificial, can still tell the difference between simulation and reality. This domain gap (also known as the “uncanny valley”) limits the real-world performance of machine learning models when trained only in simulation.

Overcoming this gap is an important research and practical challenge for the effective use of synthetic data. There are many different exciting directions explored by the community, including by our own robotics colleagues at TRI. In this blog post, we will provide an overview of our recent progress in leveraging photorealistic synthetic datasets for dynamic scene understanding, specifically for autonomous driving.

Closing the “sim-to-real” gap

Domain adaptation is a long-standing topic in both Computer Vision and Robotics. In a nutshell, it aims to overcome the performance degradation due to training on data (i.e. the source domain) that is different from the test data (i.e. the target domain). When the source domain is from a simulator and the target domain is the real world, this problem is called “sim-to-real transfer” (see the RSS’20 workshop on this topic). In particular, unsupervised sim-to-real transfer is key for the practical use of synthetic data at scale: how can we learn models in sim when the only real-world data we have are unlabeled videos?

In recent years, adversarial domain adaptation based on Generative Adversarial Networks (GANs) has emerged as a promising family of solutions. These methods alternate the training of two networks. First, GANs learn a generator, a model that increases the photorealism of synthetic images by post-processing them. Second, they learn a critic (i.e., a discriminator) on real-world datasets to judge whether a post-processed synthetic image looks fake or real. The generator and discriminator are trained against each other, hence the name “adversarial learning”. After convergence, models trained with the “forged” (i.e., adapted) synthetic images transfer better to the real world, as shown in many works like PixelDA and SimGAN. The famous CycleGAN method can even achieve this with unpaired and unlabeled examples!

The GAN framework can refine or translate original synthetic images into a style that is close to a target real-world image set.
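To make this concrete, here is a minimal PyTorch sketch of the alternating generator/discriminator updates described above. The tiny `Generator` and `Discriminator` architectures, optimizer settings, and data batches are illustrative assumptions, not the models used in PixelDA, SimGAN, or CycleGAN.

```python
# Minimal sketch of adversarial sim-to-real refinement; architectures are
# illustrative assumptions, not the exact models from the papers cited above.
import torch
import torch.nn as nn

class Generator(nn.Module):
    """Post-processes a synthetic image to make it look more realistic."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 3, 3, padding=1), nn.Tanh())
    def forward(self, x):
        return self.net(x)

class Discriminator(nn.Module):
    """Judges whether an image looks real or refined-synthetic."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(32, 1, 4, stride=2, padding=1))
    def forward(self, x):
        return self.net(x).mean(dim=(1, 2, 3))  # one realism score per image

G, D = Generator(), Discriminator()
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

def training_step(sim_images, real_images):
    # 1) Discriminator step: real images -> 1, refined synthetic images -> 0.
    refined = G(sim_images).detach()
    d_loss = bce(D(real_images), torch.ones(real_images.size(0))) + \
             bce(D(refined), torch.zeros(sim_images.size(0)))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()
    # 2) Generator step: try to fool the discriminator into scoring
    #    refined synthetic images as real.
    refined = G(sim_images)
    g_loss = bce(D(refined), torch.ones(sim_images.size(0)))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()
```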

Using Privileged Information

Although GANs provide an appealing solution for domain adaptation, they do not leverage the specificity of the sim-to-real gap, namely that the source data comes from a simulator! Most domain adaptation methods indeed use simulators as black boxes, generating a set of input images and training labels as we would do for dataset collection in the real world. But a simulator is more than a passive sensor. As the creator of the world, a simulator knows everything about how a scene is generated. All this internal knowledge about the world is what we call Privileged Information (PI), a rich source of additional information that can help reduce the sim-to-real gap.

In our ICLR’19 paper SPIGAN, we propose a GAN-based unsupervised domain adaptation method that leverages arbitrary privileged information as an auxiliary source of labels. This provides additional guidance to the sim-to-real adaptation process, ensuring that it is consistent with the internal state of the simulator or, in other words, with how the world works!

Domain adaptation through adversarial learning using privileged information from the simulator. We use additional modalities P, besides the target modality, to provide geometric constraints on the adversarial generative process.

In this work, we take depth information as an example of privileged information, enforcing geometric consistency as an auxiliary task. Our experiments show that this can improve performance by stabilizing adversarial sim-to-real adaptation and decreasing the amount of visual artifacts caused by an unconstrained generator.

Without additional constraints, the generator can perform image transfer regardless of the physical content, e.g., placing fake trees in the sky. Privileged information from the simulator, such as depth, helps reduce these artifacts.
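As a rough illustration of how privileged information can constrain the generator, the sketch below adds a depth-consistency term to the generator objective from the previous sketch. The `PrivilegedDepthNet`, loss weight, and wiring are hypothetical simplifications rather than the exact SPIGAN formulation.

```python
# Hedged sketch of using privileged depth from the simulator as an auxiliary
# task (SPIGAN-style); network shapes and loss weights are illustrative.
import torch
import torch.nn as nn

class PrivilegedDepthNet(nn.Module):
    """Predicts a depth map from a (refined) synthetic image."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 1, 3, padding=1), nn.Softplus())  # positive depths
    def forward(self, x):
        return self.net(x)

P = PrivilegedDepthNet()
l1 = nn.L1Loss()

def privileged_loss(generator, sim_images, sim_depth_gt, weight=0.1):
    """Keep the refined image consistent with the simulator's depth.

    sim_depth_gt comes for free from the simulator (privileged information).
    If the generator hallucinates content (e.g. trees in the sky), the depth
    predicted from the refined image disagrees with it and is penalized.
    Both the generator and the privileged network P are trained with this term.
    """
    refined = generator(sim_images)
    depth_pred = P(refined)
    return weight * l1(depth_pred, sim_depth_gt)

# This term is simply added to the generator's adversarial objective, e.g.:
# g_loss = adversarial_loss + privileged_loss(G, sim_images, sim_depth_gt)
```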

Synthetic Supervision + Self-Supervision = 💕

In our previous blog post series on self-supervised learning, we discussed how depth and ego-motion can be learned directly from raw videos, without the need for labels as supervision. This is possible because these are geometric tasks that can be learned using prior knowledge about geometry during the optimization stage. In fact, the same approach can also be used to help close the sim-to-real gap in general!
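For reference, here is a condensed PyTorch sketch of the kind of self-supervised photometric objective involved: a source frame is warped into the target view using the predicted depth and relative camera pose, and the reconstruction error is the training signal. The pose and intrinsics are passed in for simplicity (in practice the pose is predicted by a network), and the details differ from our actual implementation.

```python
# Sketch of a self-supervised photometric (view-synthesis) loss for depth and
# ego-motion learning from raw video; conventions here are assumptions.
import torch
import torch.nn.functional as F

def warp_source_to_target(source, depth_t, T_t_to_s, K):
    """Reconstruct the target frame by sampling the source frame through the
    predicted depth map and relative camera pose (the geometric prior)."""
    B, _, H, W = source.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=0).float()       # 3 x H x W
    pix = pix.view(1, 3, -1).expand(B, -1, -1)                            # B x 3 x HW
    # Back-project target pixels to 3D, move them into the source camera,
    # and re-project them to source pixel coordinates.
    cam_points = torch.inverse(K) @ pix * depth_t.view(B, 1, -1)          # B x 3 x HW
    cam_points = torch.cat([cam_points, torch.ones(B, 1, H * W)], dim=1)  # homogeneous
    src_pix = (K @ T_t_to_s[:, :3, :]) @ cam_points                       # B x 3 x HW
    uv = src_pix[:, :2] / src_pix[:, 2:].clamp(min=1e-6)
    u = 2 * uv[:, 0].reshape(B, H, W) / (W - 1) - 1                       # [-1, 1]
    v = 2 * uv[:, 1].reshape(B, H, W) / (H - 1) - 1
    grid = torch.stack([u, v], dim=-1)                                    # B x H x W x 2
    return F.grid_sample(source, grid, align_corners=True)

def photometric_loss(target, source, depth_t, T_t_to_s, K):
    """No labels needed: supervision comes from how well the warped source
    frame reconstructs the target frame."""
    return (target - warp_source_to_target(source, depth_t, T_t_to_s, K)).abs().mean()
```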

Our recent ICCV’21 paper, Geometric Unsupervised Domain Adaptation for Semantic Segmentation (GUDA), demonstrates that geometric self-supervision can indeed lead to representations that generalize better across domains for both depth prediction and semantic segmentation. We can of course learn these two tasks in a supervised way on the synthetic data alone, but that is generally not enough to generalize directly to real data. Instead, we learn a single network for all tasks by combining synthetic supervision with the aforementioned self-supervised depth estimation objective, on both the synthetic and real-world data. Because these tasks share the same encoder features, improvements in depth estimation on the target domain (thanks to the self-supervised objective) benefit the semantic segmentation task as well, even though the network has never seen real-world labels!

Diagram of our proposed multi-task multi-domain GUDA architecture for geometric unsupervised domain adaptation using mixed-batch training of real and virtual samples.

In contrast to GANs or other image post-processing techniques, GUDA does not require generating more realistic-looking copies of synthetic images, which might be a harder problem than the one we actually care about: a model that generalizes well across the sim-to-real gap. Instead, our adaptation happens at the representation level, directly learning a single model without the need for additional adversarial networks for image-to-image translation.
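A simplified sketch of this mixed-batch, multi-task training loop is shown below. The shared encoder and the two heads are toy stand-ins for the actual GUDA architecture, and `photometric_loss` refers to the self-supervised objective sketched earlier in this post.

```python
# Sketch of GUDA-style mixed-batch training: one shared encoder, supervised
# losses on synthetic data, self-supervised depth on real data. Network sizes
# and loss weights are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiTaskNet(nn.Module):
    def __init__(self, num_classes=19):
        super().__init__()
        self.encoder = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.ReLU())
        self.depth_head = nn.Sequential(nn.Conv2d(64, 1, 3, padding=1), nn.Softplus())
        self.seg_head = nn.Conv2d(64, num_classes, 3, padding=1)
    def forward(self, x):
        feats = self.encoder(x)  # shared features: gains on one task help the other
        return self.depth_head(feats), self.seg_head(feats)

net = MultiTaskNet()
opt = torch.optim.Adam(net.parameters(), lr=1e-4)

def guda_step(sim_img, sim_depth, sim_seg, real_target, real_source, real_pose, K,
              photometric_loss):
    """One mixed-batch step; `photometric_loss` is the self-supervised
    view-synthesis objective from the earlier sketch (in GUDA it is also
    applied to synthetic video sequences, omitted here for brevity)."""
    # Synthetic branch: perfect depth and semantic labels from the simulator.
    depth_s, seg_s = net(sim_img)
    loss_sim = F.l1_loss(depth_s, sim_depth) + F.cross_entropy(seg_s, sim_seg)
    # Real branch: no labels, only geometric self-supervision on raw video.
    depth_r, _ = net(real_target)
    loss_real = photometric_loss(real_target, real_source, depth_r, real_pose, K)
    loss = loss_sim + loss_real
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()
```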

Example of depth and semantic segmentation estimates obtained from our multi-task GUDA network trained using geometric self-supervision.

Supervising the Invisible

Another question you may have: can robots perceive the invisible? Yes, they can! As humans, we have been able to do so from a very early age. Remember the peek-a-boo game?

An autonomous car should know that a dynamic object that disappeared behind another object has not vanished from existence.

Learning object permanence is an early step in an infant’s sensorimotor development that is key to the representation of objects. It is also key for robots. For instance, an autonomous car should know that a pedestrian who disappeared behind a billboard has not vanished from existence, but may in fact reappear, possibly in front of the vehicle! So, how do we teach this concept of object permanence to machines?

As mentioned, beyond augmenting or replacing expensive manual labels, simulation can also provide certain information no human annotator can label easily from sensor data. This is the case for invisible objects present in a scene, because the simulator must keep track of them, even if not rendered to the screen. So a natural question is: can we teach robots to track occluded objects with a synthetic dataset of fully labeled tracks?

In our ICCV’21 work Learning to Track with Object Permanence, we show the answer is yes, thanks to a new video-based model for joint object detection and tracking we call PermaTrack. It recurrently aggregates a spatio-temporal representation of the world, and, as a result, can reason about locations of partially and even fully occluded instances, thus learning its own concept of object permanence. To train this model, we used Parallel Domain’s state-of-the-art simulation platform to generate a new dataset of synthetic videos that inexpensively and automatically provides accurate labels for all objects, irrespective of their visibility. We then use this unique dataset to analyze various approaches for supervising tracking behind occlusions, and propose a simple technique to transfer the resulting model to the real world. Our method outperforms the state-of-the-art on two multi-object tracking benchmarks due to its ability to localize and associate invisible objects.
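The toy sketch below illustrates the supervision idea only: a recurrent state is carried across frames, and because the synthetic ground truth includes boxes for fully occluded frames, those frames are supervised too. It is a deliberately simplified single-object example, not the PermaTrack architecture.

```python
# Conceptual sketch of supervising tracking behind occlusions with synthetic
# labels; a toy single-object recurrent regressor, not PermaTrack itself.
import torch
import torch.nn as nn

class RecurrentTracker(nn.Module):
    """Aggregates a spatio-temporal state so the model can keep predicting a
    box for an object even in frames where it is fully occluded."""
    def __init__(self, feat_dim=64):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, feat_dim, 3, stride=4), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.gru = nn.GRUCell(feat_dim, feat_dim)  # memory of past observations
        self.box_head = nn.Linear(feat_dim, 4)     # (x, y, w, h) per frame

    def forward(self, video):                      # video: T x B x 3 x H x W
        state = torch.zeros(video.shape[1], self.gru.hidden_size)
        boxes = []
        for frame in video:
            state = self.gru(self.backbone(frame), state)
            boxes.append(self.box_head(state))
        return torch.stack(boxes)                  # T x B x 4

def permanence_loss(pred_boxes, gt_boxes):
    """Synthetic data provides ground-truth boxes for *every* frame, including
    fully occluded ones, so those frames are supervised as well."""
    return (pred_boxes - gt_boxes).abs().mean()
```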

The Differentiable Rendering Revolution

All the aforementioned examples focus on using simulators to generate synthetic datasets. Thanks to the progress in simulation and sim-to-real transfer, this enables scaling up proven supervised learning methods. However, simulators can go way beyond labeled dataset creation. In fact, simulators can be seen as extremely powerful World Models, much more structured, and hence programmable, than even the best generative statistical models to date.

Better yet, simulators can compute the gradient (derivative) of many simulation operations, including rendering. This matters because it allows us to combine the power of simulation with the main optimization workhorse of Machine Learning: (Stochastic) Gradient Descent. Differentiable Rendering indeed allows learning to program simulators. For instance, we can learn to completely deconstruct and re-render an image via self-supervised learning with a simple “render-and-compare” approach (also called “analysis-by-synthesis”). Starting from an initial random set of scene parameters (e.g., object pose, scale, texture, color), one can iteratively optimize the reconstruction of a target image by simply backpropagating, w.r.t. the scene parameters, the difference between an input image and the output of the rendering algorithm. Once this optimization process has converged, we know all the underlying scene parameters that can best resimulate (reconstruct) the input real-world image. These inferred scene parameters are the true underlying causes of the image formation process, and hence can be used directly as predictions (an approach called “vision-as-inverse-graphics”) or “auto-labels” for the training of downstream networks (e.g., 3D object reconstruction, human pose estimation, hand pose estimation, face reconstruction, etc.).
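The render-and-compare loop itself fits in a few lines. In the sketch below, `differentiable_render` stands in for any differentiable rendering function (e.g. one built with PyTorch3D) and the scene parameterization is a hypothetical example; the point is simply that gradients of the image difference flow back to the scene parameters.

```python
# Minimal render-and-compare (analysis-by-synthesis) sketch: gradient descent
# on scene parameters through a differentiable renderer. `differentiable_render`
# is a placeholder, not a real API.
import torch

def render_and_compare(target_image, differentiable_render, num_steps=500):
    # Start from an initial random guess of the scene parameters.
    params = {
        "translation": torch.zeros(3, requires_grad=True),
        "rotation": torch.zeros(3, requires_grad=True),  # e.g. axis-angle
        "scale": torch.ones(1, requires_grad=True),
        "color": torch.rand(3, requires_grad=True),
    }
    opt = torch.optim.Adam(params.values(), lr=1e-2)
    for _ in range(num_steps):
        rendered = differentiable_render(**params)      # synthesis
        loss = (rendered - target_image).abs().mean()   # analysis (compare)
        opt.zero_grad()
        loss.backward()  # gradients flow through the renderer to the parameters
        opt.step()
    # After convergence, `params` are the scene parameters that best re-simulate
    # the input image; they can serve as predictions or as auto-labels.
    return {k: v.detach() for k, v in params.items()}
```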

Schematic overview of differentiable rendering [Kato, Hiroharu, et al. “Differentiable rendering: A survey.”]

In our most recent works, we focus on some of these practical applications of differentiable rendering. In our CVPR’20 oral SDFLabel, we propose an automatic annotation pipeline that recovers the 9D pose (translation, rotation, scale) of cars from pre-trained off-the-shelf 2D detectors and sparse LiDAR data. Unlike our previous CVPR’19 work ROI-10D, where we employed a learned PCA shape space to predict the shape of cars, SDFLabel uses a differentiable database of normalized implicit shape priors represented as signed distance fields (SDFs) based on DeepSDF. These shape priors are learned entirely from synthetic data provided by Parallel Domain. We introduce a novel differentiable SDF shape renderer based on surfels, allowing us to optimize not only for rotation and translation, but also for object shape given the learned shape priors. We show that our approach can recover a substantial amount of 3D labels with high precision, and that these “auto-labels” can be used to train 3D object detectors with state-of-the-art results (comparable to detectors supervised via manual annotations on LiDAR point clouds).
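Conceptually, the auto-labeling optimization looks like the hedged sketch below, where `render_silhouette` and `decode_sdf` are placeholders for a differentiable surfel renderer and a DeepSDF-style decoder; the actual SDFLabel losses and coordinate handling are more involved.

```python
# Hedged sketch of auto-labeling with a learned shape prior (SDFLabel-style):
# jointly optimize a 9D pose and a latent shape code by render-and-compare.
# `render_silhouette` and `decode_sdf` are placeholders, not real components.
import torch

def autolabel_object(target_mask, lidar_points, render_silhouette, decode_sdf,
                     num_steps=300):
    latent = torch.zeros(256, requires_grad=True)  # shape code in the learned prior
    trans = torch.zeros(3, requires_grad=True)     # translation
    rot = torch.zeros(3, requires_grad=True)       # rotation (e.g. axis-angle)
    scale = torch.ones(3, requires_grad=True)      # per-axis scale
    opt = torch.optim.Adam([latent, trans, rot, scale], lr=1e-2)
    for _ in range(num_steps):
        # 2D term: the rendered silhouette should match the off-the-shelf
        # detector's mask for this object.
        sil = render_silhouette(latent, trans, rot, scale)
        loss_2d = (sil - target_mask).abs().mean()
        # 3D term: sparse LiDAR points should lie on the decoded surface,
        # i.e. have signed distance ~0 (rotation omitted here for brevity).
        loss_3d = decode_sdf(latent, (lidar_points - trans) / scale).abs().mean()
        loss = loss_2d + loss_3d
        opt.zero_grad(); loss.backward(); opt.step()
    return trans.detach(), rot.detach(), scale.detach(), latent.detach()
```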

Our follow-up work, MonoDR, further improves accuracy without using LiDAR information. It leverages our work on monocular depth and textured 3D object reconstruction. Our experiments confirmed that it is possible to use predicted monocular depth and differentiable rendering in conjunction with learned object priors as a scalable alternative to expensive 3D ground-truth labels or LiDAR information.

Conclusion

Together with self-supervised learning, simulation is one of the keys to scaling deep learning and is the only way to fully control the dataset generation process with scalable human oversight. It is also the only way to safely, ethically, and economically collect some types of data critical to machine learning. Although no simulation is perfect, it can still be useful, especially when combined with real-world data. We have discussed here only a few of the many ways TRI is working to maximize the utility of simulation, and we are excited to continue on this scientific journey to fully overcome the sim-to-real gap for safe and scalable Machine Learning in the real world!


Toyota Research Institute

Applied and forward-looking research to create a new world of mobility that's safe, reliable, accessible and pervasive.