The Power of Synthetic Data — Part 1

Introduction

In our previous blog posts and webinars, we have written about the power of synthetic data and Neurolabs’ synthetic data engine, which transforms 3D assets into robust synthetic datasets for computer vision. At Neurolabs, we believe synthetic data holds the key to making computer vision and object detection more accurate, more affordable and more scalable. We provide an end-to-end platform for generating and training computer vision models on rich, varied, annotated synthetic data, removing labelling costs and considerably shortening the time-to-value of object detection solutions.

One of the main problems encountered when developing computer vision solutions using synthetic data is domain adaptation: how to transfer the knowledge of deep learning models trained on synthetic data to real data. Research in this area has focused on two main overarching approaches:

  • Automatic scene generation to robustly capture variation and features present in real datasets [1], [2], [3].
  • Framing synthetic dataset generation as an end-to-end optimisation problem targeted at specific computer vision tasks [4], [5].

This is the first in a series of blog posts in which we follow the second approach and illustrate the power of randomising the parameters of a synthetic scene to generate robust synthetic datasets. Furthermore, we evaluate the versatility of these datasets using two state-of-the-art object detectors, FasterRCNN and EfficientDet, on three real datasets of increasing complexity.

In this blog post, we aim to demonstrate the viability of synthetic data on a simple task and to explore the synthetic-to-real domain gap incrementally. From this stepping stone, we can start looking at more complex problems.

Dataset

We introduce two carefully crafted datasets, a real one and a synthetic one, and show that with a 95%-5% combination of synthetic and real data, we can achieve better results on the real test dataset than with real data alone.

We selected 10 classes for this task:

  • Orange
  • Banana
  • Red Apples
  • Green Apples
  • Bun Plain
  • Bun Cereal
  • Croissant
  • Broccoli
  • Snickers
  • Bounty

In both cases, the classes are balanced and we have around 100 instances/class for the real dataset and 250 instances/class for the synthetic one.

To assess whether the model is learning the specifics of each class, we have constructed pairs of product classes that share a high degree of inter-class similarity, both in terms of shape (e.g. bun cereal vs bun plain, red apple vs green apple) and texture (e.g. bun plain vs croissant).

Obtaining similar performance when training the models on synthetic data and real data offers us information about the quality of our synthetic data and how large of a gap remains between the real and the synthetic domain.

Real dataset

We use the real dataset as a benchmark for the synthetic one. The images were captured using a fixed setup, on a plain white background with small lighting variations and no object occlusions. For each class we use only one object instance to create our dataset, i.e. the same banana is used across the dataset.

These images were annotated by our team, in a painstaking process that took over 20 person-hours.

Figure 1. Real training image with all product classes.

Synthetic dataset

Our synthetic dataset and annotations were generated automatically, using Neurolabs’ Synthetic Data Engine. We focused on the following virtual environment scene randomisation parameters: lighting conditions, poses, textures, and location. We use one 3D asset for each class to generate all the object instances in the dataset, ensuring a fair comparison to the real dataset.

For this generation, we varied the following parameters (a sampling sketch follows the list):

  • Scaling — we set a predefined scale for each 3D asset according to the relative size of the object in the training data. We define this scale as 1x the default size. For this experiment, we vary the object scales in the range 1x to 2x.
  • Rotation — we introduce random rotations of our objects on all axes. To capture the variations that are most likely to appear in the real data, we constrain rotations to what we call “natural rotations”. For example, a banana standing perpendicular to the plane on which objects are placed would violate the dataset format we specified.
  • Location — we place objects in random locations in the image. We control how much objects overlap and ensure they remain within the image.
  • Lighting — we vary the number of lights in the image (1–3), the light colour ([R,G,B] = [{0.5–1},{0.5–1},{0.5–1}]) and the maximum light intensity.
  • HSV — we apply subtle changes to a single object’s texture values for hue, saturation, and value, while the camera is fixed, looking down.
  • Number of objects in each image (between 2 and 4).
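As an illustration (this is not Neurolabs’ actual engine code), the Python sketch below shows how per-image randomisation parameters in the ranges listed above could be sampled. The structure of the returned dictionary and the exact “natural rotation” limits are assumptions.

```python
import random

def sample_scene_parameters(class_names):
    # 2-4 objects per image, as stated above
    n_objects = random.randint(2, 4)
    objects = []
    for _ in range(n_objects):
        objects.append({
            "class": random.choice(class_names),
            "scale": random.uniform(1.0, 2.0),          # 1x to 2x the default size
            # "natural" rotations: free yaw, small tilt on the other axes (assumed limits)
            "rotation_deg": {
                "yaw": random.uniform(0.0, 360.0),
                "pitch": random.uniform(-15.0, 15.0),
                "roll": random.uniform(-15.0, 15.0),
            },
            # normalised location on the ground plane, kept inside the frame
            "location": (random.uniform(0.1, 0.9), random.uniform(0.1, 0.9)),
            # subtle texture jitter in HSV space (assumed +/-5% range)
            "hsv_jitter": [random.uniform(-0.05, 0.05) for _ in range(3)],
        })
    # 1-3 lights, RGB components in [0.5, 1], intensity range assumed
    lights = [{
        "colour_rgb": [random.uniform(0.5, 1.0) for _ in range(3)],
        "intensity": random.uniform(0.5, 1.0),
    } for _ in range(random.randint(1, 3))]
    return {"objects": objects, "lights": lights}
```

Each sampled dictionary would then drive one render of the virtual scene, with the camera kept fixed and looking down.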
Figure 2. Synthetically generated image.

Test set

The unifying element between the real and synthetic datasets is the test set. The object instances are the same ones used in the training data, and the images are captured in the same setting.

Figure 3. Test set images containing all 10 product classes.

Synthetic vs. real comparison

As we can see in Figure 4 below, there is a visual difference between the real and synthetic images. Further improvements to the photorealism of our generation pipeline should narrow this gap.

Figure 4. Real vs synthetic data side-by-side comparison.

Model Training

We used two well known object detectors for our task, namely:

  • EfficientDet, a single-shot state-of-the-art detector that can be scaled up, with eight backbone versions from EfficientDet-D0 to EfficientDet-D7, each increasing in complexity, input shape, and BiFPN channels and layers. This type of scaling, paired with concepts such as weighted feature fusion and BiFPNs, accounts for most of the improvement brought by EfficientDet. In our scenario we used a D3 model with the first two layers of the backbone frozen.
  • FasterRCNN, a well-known two-stage detector. Its architecture consists of a ResNet backbone network, a region proposal network (RPN) and ROI heads. For our experiments we used a GeneralizedRCNN architecture with a ResNet50 backbone with its first two layers frozen.

All the models have been pre-trained on the COCO dataset. This gives a great starting point for the detection task and helps classification for the classes present in both the COCO dataset and ours: “apple”, “banana”, “broccoli”, “orange”.
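The post does not name its training framework; for concreteness, here is a rough approximation of such a setup using torchvision: a COCO-pretrained FasterRCNN with a ResNet50-FPN backbone, its earliest stages frozen, and a new box predictor for our 10 product classes plus background.

```python
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

# 10 product classes + 1 background class
num_classes = 11

# COCO-pretrained detector; trainable_backbone_layers=3 keeps the first
# backbone stages frozen, roughly matching "first two layers frozen" above.
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(
    pretrained=True,
    trainable_backbone_layers=3,
)

# Replace the COCO box predictor with one sized for our classes.
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)
```

The EfficientDet-D3 model would be configured analogously in whichever EfficientDet implementation is used, with the first two backbone blocks frozen.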

Experiments

Table 1. Experimental results.

The experiments in Table 1 keep the same hyperparameter configurations for jobs trained with the same model, to ensure a fair comparison between them. For the synthetic data, we average over 3 runs of data generation.

The results of our experiments on real data act as a good benchmark. The mean average precision is 96.1% and 94.3% for EfficientDet and FasterRCNN respectively, giving us a strong baseline for a model trained on the real dataset with ~100 instances per class.
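The post does not say which tooling computed these mAP figures; as one common option, the snippet below shows COCO-style mean average precision computed with torchmetrics on a single hypothetical prediction/ground-truth pair.

```python
import torch
from torchmetrics.detection.mean_ap import MeanAveragePrecision

metric = MeanAveragePrecision()

# Hypothetical boxes in xyxy format; label 1 could stand for "orange".
preds = [{
    "boxes": torch.tensor([[50.0, 30.0, 180.0, 160.0]]),
    "scores": torch.tensor([0.92]),
    "labels": torch.tensor([1]),
}]
targets = [{
    "boxes": torch.tensor([[48.0, 28.0, 182.0, 158.0]]),
    "labels": torch.tensor([1]),
}]

metric.update(preds, targets)
print(metric.compute()["map"])  # mean AP averaged over IoU thresholds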

We can observe a difference in performance when comparing them to the models trained purely on synthetic data. The domain gap between the real and synthetic data accounts for an absolute drop of about 12% in mAP. Taking into account the cost and time needed to create a real dataset (1 day) compared to the one needed to generate the synthetic one (30 minutes), we already have an argument for using synthetic data in certain tasks.

However, the best results on our test dataset are achieved when combining real and synthetic data. With only 50 randomly sampled images from our real dataset, which contains around 300 images, we managed to obtain better performance than with the real-only dataset. We obtained similar results when adding 100 real images to the synthetic set, with diminishing returns as more data is added.
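Assuming the real and synthetic data are exposed as PyTorch datasets returning (image, target) pairs (a detail the post does not specify, so `real_dataset` and `synthetic_dataset` are placeholder names), the mixing step could look like this:

```python
import random
from torch.utils.data import ConcatDataset, Subset

# Sample 50 of the ~300 real images and concatenate them with the full
# synthetic set, as in the best-performing experiment described above.
random.seed(0)
real_indices = random.sample(range(len(real_dataset)), k=50)
mixed_dataset = ConcatDataset([synthetic_dataset, Subset(real_dataset, real_indices)])
```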

Figure 5. Validation losses across all experiments.

Figure 5 presents how our validation losses evolved during training. For all experiments involving synthetic data, the validation set was chosen from the available synthetic data.

We presumed no real data would be available at training time, so that our tests show the generalisation of the synthetic data we built, unbiased by the specifics of the environment it will operate in. This choice acts as a sanity check for the correctness of our models and against overfitting.

Top Losses

A simple but effective interpretability technique to better understand and validate the generated synthetic dataset, as well as the parameters used to generate it, is to look at the images with the highest losses during model training, as suggested here and here. We investigate the top 9 training/validation images that produce the highest losses, to reveal which samples are the hardest to predict on.
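Below is a minimal sketch of this inspection, assuming a torchvision-style detector that returns a dictionary of losses when given targets (the post’s exact implementation is not shown, and `dataset` is assumed to yield (image_tensor, target_dict) pairs).

```python
import torch

def top_loss_images(model, dataset, device, top_k=9):
    # torchvision detection models only return losses in train mode
    model.train()
    per_image = []
    with torch.no_grad():
        for idx in range(len(dataset)):
            image, target = dataset[idx]
            target = {k: v.to(device) for k, v in target.items()}
            loss_dict = model([image.to(device)], [target])
            per_image.append((sum(loss_dict.values()).item(), idx))
    per_image.sort(reverse=True)      # largest total loss first
    return per_image[:top_k]          # (loss, dataset index) pairs
```

The returned indices can then be used to display the hardest images, as in Figures 6 and 7.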

We first apply the method on the real dataset and, as illustrated in Figure 6, the hardest image samples to learn are the ones that contain many objects. This suggests that one of the key factors to take into account when generating the synthetic dataset is to have more objects in an image.

Secondly, when looking at the top losses for training on mixed synthetic and real data, we notice in Figure 7 that the model is sensitive to varying lighting conditions and to objects whose texture is similar in colour to the background.

Figure 6. Top Loss Validation Images (Real Training).
Figure 7. Top Loss Validation Images (Synthetic + Real Training).

Conclusion

In this first post of the series, we showed that synthetic data generation is valuable for obtaining high-accuracy models, for models that need to be deployed quickly, and for tasks where data acquisition is cumbersome or not possible.

Training models purely on synthetic data takes us roughly 90% of the way when recognising objects, while mixing in only a small amount of real data lets us outperform models trained solely on real data. These promising results will be the basis of future blog posts and experiments, in which we will increase the complexity of both real and synthetic datasets to mimic the variation encountered in the real world.

Written by Daniela Palcu, Flaviu Samarghitan & Patric Fulop, Computer Vision Team @Neurolabs

Sign up for the Alpha version of our Synthetic Generation and Computer Vision platform!

References

[1] Bridging the Reality Gap by Domain Randomisation

[2] Meta-Sim

[3] Meta-Sim 2

[4] Auto Simulate

[5] Learning to Simulate
