Overcome the simulation gap in 3D Deep Learning with Domain Randomization

Published in Vitalify Asia, Jun 27, 2019

A case study on using only simulated data to train a Deep Learning model that can robustly generate a 3D object from just a single real image.

The data challenge

High-quality labeled data has been a major driving force behind the success of supervised learning. As ever deeper and better models combine with vast amounts of computational resources, labeled data has become the bottleneck of DL’s progress. The problem is even more acute in fields where aligned, labeled data is especially hard and expensive to obtain, such as robotics or 3D deep learning.

Fig. 1: Naive rendering from ShapeNet dataset.

3D deep learning has been a very active subfield in recent years, with rapid advances. However, most of this work focuses only on large synthetic 3D datasets such as ShapeNet.

Some efforts have been made to address the lack of real datasets, such as Pix3D. However, due to the time and effort required, such datasets are very limited: e.g. 100 different objects per category in Pix3D, compared to several thousand in the synthetic ShapeNet dataset. This also makes transferring a model to custom objects very challenging.

The performance gap

Fig. 2: State-of-the-art 3D reconstruction model when trained on naive renders of ShapeNet (Fig. 1).

Figure 2 shows the models reconstructed from real images by the current state-of-the-art (SOTA) model. This model is trained on the naive renderings of ShapeNet data (Fig. 1), and it works reasonably well on real images with a clean white background (1st row). However, it fails badly on images with an only slightly more challenging background (2nd row).

Clearly, a model trained only on synthetic data shows a large performance gap between synthetic and real images.

Related work

Currently there are a few research directions to overcome this performance gap between real and synthetic domains:

Build a better simulation:

If you have a very realistic simulation, a model that works in the simulation should work in real life too. However, such a simulation is very hard to build and is quite an achievement in and of itself. AI Habitat + Replica is a recent work in this direction.

It is also very likely that a model trained in simulation will overfit to the slightest discrepancy between the simulation and the real world.

Domain adaptation:

This approach does not aim to create a perfect simulation of the real world; it only needs a simulation sophisticated enough to train the model for the desired task. Afterwards, various transfer and fine-tuning methods are applied to build a bridge between the simulated and real domains.

However, building the bridge to the real world still requires a small amount of real-world data, and some adaptation techniques rely on implicit assumptions (e.g. the cluster or continuity assumptions) that may not hold for some tasks.

Domain randomization:

In Domain Randomization (DR), the real world is just one of many possible worlds. If the model has seen a plethora of simulated environments, the real world should appear to be just another one.

One obvious disadvantage of DR is that it costs more to train a model on a distribution of environments than on a single one, whether real or simulated.

Apply DR to 3D reconstruction

Let's say we want to train a model that can generate 3D objects from an image taken with a cellphone.

Without having to create the 3D objects that correspond to real images, we can render the training data en masse from synthetic objects only. To cover the cases found in the real world, we randomize the properties of our simulation to create a distribution over the data:

Object shape: use a large number of synthetic models from ShapeNet.
Viewpoints: use Hinter sampling to cover all viewing angles.
Lighting: randomize light direction and color.

Different lighting and reflections for each view.

Reflection: use the simple Phong illumination model to simulate the type and intensity of reflection.
Background: use the Indoor dataset to simulate the complex backgrounds of the real world.
Image augmentation: add a final layer of noise and blur to blend foreground and background together.
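
To make the randomized properties above more concrete, here is a minimal sketch of how one scene configuration might be sampled and how the Phong illumination model turns it into a shaded color. It is not the pipeline used in this case study: the function names, the parameter ranges and the dictionary keys are all assumptions made for illustration, and plain uniform sampling on the sphere stands in for the viewpoint sampling scheme named in the list.

```python
import numpy as np

def sample_unit_vector(rng):
    """Uniformly sample a direction on the unit sphere."""
    v = rng.normal(size=3)
    return v / np.linalg.norm(v)

def sample_scene_params(rng):
    """Draw one randomized scene configuration (all ranges are illustrative)."""
    return {
        "view_dir": sample_unit_vector(rng),           # surface -> camera
        "light_dir": sample_unit_vector(rng),          # surface -> light
        "light_color": rng.uniform(0.5, 1.0, size=3),  # RGB in [0.5, 1]
        "k_ambient": rng.uniform(0.1, 0.3),
        "k_diffuse": rng.uniform(0.3, 0.8),
        "k_specular": rng.uniform(0.0, 0.5),
        "shininess": rng.uniform(1.0, 64.0),
    }

def phong_color(normal, params, base_color):
    """Phong illumination (ambient + diffuse + specular) at one surface point."""
    n = normal / np.linalg.norm(normal)
    l = params["light_dir"]
    v = params["view_dir"]
    n_dot_l = max(float(np.dot(n, l)), 0.0)
    # Mirror reflection of the light direction about the normal.
    r = 2.0 * np.dot(n, l) * n - l
    # No specular highlight when the light is behind the surface.
    r_dot_v = max(float(np.dot(r, v)), 0.0) if n_dot_l > 0.0 else 0.0
    ambient = params["k_ambient"] * base_color
    diffuse = params["k_diffuse"] * n_dot_l * base_color * params["light_color"]
    specular = (params["k_specular"] * r_dot_v ** params["shininess"]
                * params["light_color"])
    return np.clip(ambient + diffuse + specular, 0.0, 1.0)

rng = np.random.default_rng(0)
params = sample_scene_params(rng)
color = phong_color(np.array([0.0, 0.0, 1.0]), params, np.array([0.8, 0.2, 0.2]))
```

Calling `sample_scene_params` once per rendered view gives every training image its own viewpoint, lighting and reflection settings, which is the core idea of DR.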

Because the training data is rendered from the simulation, additional supervisory signals such as segmentation maps, depth values, etc. can also be created at virtually no additional cost. These are great resources for experiments with multi-task learning or multi-modal self-supervised systems.

Fig. 5: Sample rendered image (left), segmentation mask (center) and depth values (right).
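
One way to see why these extra labels are nearly free: the mask needed to paste the rendered object onto a background is already a segmentation label, and the renderer's depth buffer is already a depth label. The sketch below illustrates this with toy arrays. The function names, the box blur, the noise level and the convention of zero depth for background pixels are assumptions for illustration, not details from this case study.

```python
import numpy as np

def box_blur(img, k=3):
    """Blur an HxWxC image with a k x k box filter (k odd, edge-padded)."""
    pad = k // 2
    padded = np.pad(img, ((pad, pad), (pad, pad), (0, 0)), mode="edge")
    out = np.zeros_like(img, dtype=np.float64)
    h, w = img.shape[:2]
    for dy in range(k):
        for dx in range(k):
            out += padded[dy:dy + h, dx:dx + w]
    return out / (k * k)

def make_training_sample(fg_rgb, fg_mask, fg_depth, bg_rgb, rng, noise_std=0.02):
    """Composite a rendered object over a background; return image and labels."""
    alpha = fg_mask[..., None].astype(np.float64)
    image = alpha * fg_rgb + (1.0 - alpha) * bg_rgb   # paste object on background
    image = box_blur(image, k=3)                      # soften the pasted edges
    image = image + rng.normal(0.0, noise_std, size=image.shape)
    image = np.clip(image, 0.0, 1.0)
    depth = np.where(fg_mask, fg_depth, 0.0)          # 0 marks "no object" here
    return {"image": image, "mask": fg_mask, "depth": depth}

# Toy stand-ins for a renderer's outputs and a background photo.
rng = np.random.default_rng(0)
h, w = 8, 8
fg_rgb = np.full((h, w, 3), 0.9)
fg_mask = np.zeros((h, w), dtype=bool)
fg_mask[2:6, 2:6] = True
fg_depth = np.full((h, w), 1.5)
bg_rgb = rng.uniform(0.0, 1.0, size=(h, w, 3))

sample = make_training_sample(fg_rgb, fg_mask, fg_depth, bg_rgb, rng)
```

In a real pipeline the foreground arrays would come from the renderer and the background from a photo dataset; only the input image is degraded by blur and noise, while the labels stay exact.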

In this case, where the input domain is 2D images, DR can be thought of as a complicated image augmentation pipeline. In other domains such as RL environments, randomization also includes non-visual properties such as friction or gravity.
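
As a small illustration of non-visual randomization, the sketch below draws a new friction coefficient and gravity value for each episode and shows how the same action (pushing a block at a fixed initial speed) leads to different outcomes. The parameter ranges and function names are assumptions chosen for illustration, and the closed-form sliding distance stands in for a full physics simulator.

```python
import numpy as np

def sample_physics(rng):
    """Per-episode physical parameters (ranges are illustrative assumptions)."""
    return {
        "friction": rng.uniform(0.2, 1.0),  # kinetic friction coefficient
        "gravity": rng.uniform(8.0, 11.0),  # m/s^2, around Earth's 9.81
    }

def sliding_distance(v0, physics):
    """Distance a block slides on a flat surface before friction stops it.

    From constant deceleration a = friction * gravity: d = v0^2 / (2 * a).
    """
    return v0 ** 2 / (2.0 * physics["friction"] * physics["gravity"])

rng = np.random.default_rng(0)
# Same push (2 m/s) in five differently randomized worlds.
distances = [sliding_distance(2.0, sample_physics(rng)) for _ in range(5)]
```

A policy trained across such episodes cannot rely on one exact set of dynamics, which is the non-visual counterpart of randomizing lighting and backgrounds.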

Results:

Fig. 6: Input image, result without DR, result with DR.

Figure 6 compares the results of 3D reconstruction models trained on the naive renders (Fig. 1) versus the domain-randomized renders (Fig. 5). Note that the training inputs are created entirely from the synthetic ShapeNet data, which demonstrates that the gap between simulation and the real world has been successfully bridged.

Reconstruction samples; the input image is shown in the lower left.

Conclusions:

Domain Randomization is a promising approach and is being actively researched in fields that rely heavily on simulation, such as robotics. Even when real data can be collected, DR has some compelling advantages:

Can create more data, more cheaply and quickly.
Can easily create multi-modal labels (RGB, mask, depth, etc.) to use with good old supervised learning.
No expensive data collection and labeling means it is easier to adopt for specific business needs.