From Domain Randomisation to Structurally-Aware Synthetic Data Generation
Improving model generalisability with virtual synthetic data; A case-study for product recognition in retail
Last month at the Tesla AI Day, the world was stunned once again by Elon Musk’s ambitious goals to create a humanoid Tesla Bot designed to help us humans with boring and repetitive tasks. What some of you might have also seen is that, exactly half an hour beforehand, the Vision team at Tesla introduced the auto-labelling & simulation pipelines at the core of their self-driving efforts, backed by thousands of in-house manual labellers and 3D artists. Even with amazing procedural and automated tooling, they still need a lot of humans in the loop. With a market capitalisation of close to 700 billion and a position among the most successful companies in the world, Tesla is one of the lucky few that can afford such manpower to label and annotate their video clips. Once they have acquired enough data, they can auto-label it and construct ultra-realistic simulations of the world.
The hard reality, though, is that most companies that want to apply computer vision in their business workflows are limited by access to data. Furthermore, annotating and labelling datasets for object recognition is one of the most costly, labour-intensive, and error-prone tasks. An MIT study recently showed a staggering 6% error rate for labels on ImageNet, the most widely used dataset for benchmarking state-of-the-art models in classification tasks.
Data-Centric AI & Synthetic Data in Retail
At Neurolabs, we believe that the key to solving these issues and unlocking the next milestones in Computer Vision is to train models with synthetically generated data using 3D Graphics engines such as Blender, Unity or Unreal.
Within the recently introduced paradigm of Data-Centric AI, the goal is not to create better models but to increase performance by improving the data itself. Even with huge amounts of data, achieving this kind of control for computer vision tasks cannot be done solely by acquiring real data. One must control all parameters that influence the computer vision model, such as camera, lights, object pose, resolution, occlusions, and edge cases.
In this post, we’ll focus on the impact that synthetic data brings to product recognition in retail. The specific use case we’ve chosen here is real-time shelf monitoring to improve on-shelf availability in supermarkets. In particular, we’ll look at:
- Comparing performances of various forms of synthetic data, as evaluated on real datasets, from Domain Randomisation (DR) to Cluttered Domain Randomisation, and finally, Structurally-Aware synthetic scenes.
- The power of generalisation of synthetic data in the small data domain, i.e. fewer than 100 images, and across slight variations of the same domain, i.e. different aspect ratios, camera poses, light conditions & on-shelf product structure.
- Why acquiring more real data to increase variation does not scale, and the cost & time trade-offs involved.
Real Datasets & Synthetic Data Techniques
The request to create synthetic datasets and train computer vision models in this particular case came from an early-stage startup, SuperRobotics, which is building an innovative, autonomous smart robot for supermarkets. They wanted to demo a product detector for 70 classes of Spanish supermarket products. The goal was to train a product recognition model on synthetic data.
SuperRobotics were able to collect close to 500 real images, which were carefully annotated and prepared for object detection. The total number of object instances was close to 10,000. The images for the Stitched datasets were constructed using an image stitching technique, one of SuperRobotics’ areas of expertise.
The dataset “Cropped” contained 381 images of cropped shelves, with an average of 10–15 products per image and a fixed resolution (aspect ratio) of 3280x2464 (1.33) as well as varied light and camera positions.
The dataset “Real Stitched 1” contained 48 images of full shelves with a resolution (aspect ratio) of 3280 x 5144 (0.63).
The dataset “Real Stitched 2” contained 55 images of full shelves with slight variations in camera, lighting, and positioning of products on the shelves as compared to “Real Stitched 1”, but with the same resolution.
Starting from the 3D digital assets of the 70 classes, we generated synthetic data using techniques such as Vanilla Domain Randomisation (Fig. 3) and Cluttered Domain Randomisation (Fig. 4). Here, the idea is to introduce as much variation as possible by randomising parameters such as product pose, backgrounds, light, and camera, as well as HSV colour values. These techniques have been used successfully in both research and practice but, as we will see, fall short when dealing with structured scenes and a high number of classes. Each of these datasets contained close to 250 images and 6,000 balanced instances of the original 70 classes.
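To make the idea concrete, domain randomisation boils down to sampling one scene configuration per render. The sketch below is illustrative only: the parameter names, ranges, and scene API are hypothetical, not the values used in our pipeline.

```python
import random

# Hypothetical parameter ranges for one domain-randomised render.
# A 3D engine (e.g. Blender or Unity) would consume this configuration.
def sample_render_params(num_products):
    """Sample one randomised scene configuration."""
    return {
        "camera": {
            "distance_m": random.uniform(0.5, 2.5),
            "yaw_deg": random.uniform(-30, 30),
            "pitch_deg": random.uniform(-15, 15),
        },
        "light": {
            "intensity": random.uniform(200, 1200),   # arbitrary units
            "temperature_k": random.uniform(3000, 6500),
        },
        "hsv_jitter": {
            "hue": random.uniform(-0.05, 0.05),
            "saturation": random.uniform(0.8, 1.2),
            "value": random.uniform(0.8, 1.2),
        },
        "products": [
            {
                "class_id": random.randrange(70),  # one of the 70 SKU classes
                "pose_euler_deg": [random.uniform(0, 360) for _ in range(3)],
                "position_m": [random.uniform(-1, 1) for _ in range(3)],
            }
            for _ in range(num_products)
        ],
        "background_id": random.randrange(50),  # pick from a pool of backdrops
    }

params = sample_render_params(num_products=24)
print(len(params["products"]))  # 24
```

Because every parameter is sampled independently, the resulting scenes are highly varied but unstructured, which is exactly why this approach struggles on tidy supermarket shelves.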
The second synthetic dataset (labelled “Synthetic” in Table 1) contained 200 images of structurally-aware scenes (Fig. 5) with the same aspect ratio as the real data. Products are placed in a semantically meaningful way using base scenes of shelves, randomising the properties and placement of products so that the difference between the real data and the synthetic data is minimised.
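The structure-aware placement can be sketched as filling shelf rows left-to-right, as on a real planogram, with only small jitter in position and orientation. All dimensions and names below are hypothetical stand-ins for the scene parameters our platform exposes.

```python
import random

# Illustrative structure-aware placement: products stand upright on shelf
# rows instead of being scattered at random. Dimensions are hypothetical.
SHELF_WIDTH_M = 1.2
ROW_HEIGHTS_M = [0.0, 0.35, 0.70, 1.05]  # four shelf rows

def place_row(row_y, product_width_m=0.08):
    """Fill one shelf row with near-front-facing products and small gaps."""
    placements, x = [], 0.0
    while x + product_width_m <= SHELF_WIDTH_M:
        placements.append({
            "class_id": random.randrange(70),
            "x_m": x + random.uniform(0.0, 0.01),  # slight positional jitter
            "y_m": row_y,
            "yaw_deg": random.uniform(-5, 5),      # nearly facing the camera
        })
        x += product_width_m + random.uniform(0.0, 0.02)  # small gap
    return placements

scene = [p for y in ROW_HEIGHTS_M for p in place_row(y)]
```

Compared to pure domain randomisation, the randomness here is constrained to ranges a shopper would actually see, which is what shrinks the domain gap.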
A third dataset was a mix of the “Synthetic” one and a small subset of the real data (20%).
Finally, the “Generalised Synthetic” dataset had 1500 images with 5 different resolutions (aspect ratios) and was used to test the learning transferability to all real datasets.
- 2048 x 1152 (1.77)
- 2256 x 1504 (1.5)
- 2260 x 2160 (1.04)
- 4096 x 5120 (0.8)
- 2520 x 4880 (0.51)
All of the above synthetic data techniques are currently available as part of Neurolabs’ platform, where scenes and generation parameters can be easily configured.
For all experiments we used the same object detection model, an EfficientDet-D3 with a COCO-pretrained backbone as a starting point and the first 2 layers of the network frozen. Training used an 80/20 train/val split and early stopping based on validation loss. Standard optimisers were used throughout, with no hyper-parameter tuning to boost the model’s performance. Table 1 shows the mAP evaluation results for the real2real and syn2real learning experiments.
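The early-stopping criterion we used can be sketched in a few lines. The model and data-loading calls are omitted; this only shows the stopping logic on a sequence of per-epoch validation losses, with a `patience` value chosen here for illustration.

```python
# Minimal early-stopping logic on validation loss, mirroring the training
# setup described above. Model/dataloader code is omitted; `patience` is
# an illustrative choice, not the exact value used in our experiments.
def train_with_early_stopping(val_losses, patience=3):
    """Return the epoch index of the best validation loss before stopping."""
    best, best_epoch, waited = float("inf"), 0, 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                break  # no improvement for `patience` epochs: stop
    return best_epoch

# Val loss plateaus after epoch 3, so training stops and epoch 3 is kept.
print(train_with_early_stopping([0.9, 0.6, 0.5, 0.45, 0.47, 0.46, 0.48]))  # 3
```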
Model Generalisation with Synthetic Data
In Table 1, we present combinations of training and evaluation datasets together with the resulting mAP. Two main findings strongly suggest the generalisation power of synthetic data.
DR2real vs. Syn2real vs. Mixed2real
Domain randomisation techniques (Fig. 3, Fig. 4) that are relatively easy to construct and only require access to 3D assets (no scene construction required) reached a performance of 71% and 72% mAP on both real datasets. These methods can be used to jump-start the computer vision model before fine-tuning it with more real data.
The structurally-aware synthetic data was constructed automatically from composable base assets that generate a shelf and carry out smart placement of products on it, such that the domain gap with the real data is minimised. Compared to DR, this yields an mAP increase of 10–15% as evaluated on the real domain.
Finally, mixing the synthetic dataset with 20% real data (12 images) from the first real dataset increases the performance of the model to 94%. Together with some post-processing steps for finding the best IoU threshold for inference, this is all that is required to deploy the model to production.
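That threshold-selection step amounts to sweeping candidate IoU thresholds on a held-out set and keeping the best one. The sketch below uses greedy one-to-one matching scored by F1; both the matching scheme and the candidate grid are assumptions for illustration, not our exact post-processing.

```python
# Illustrative IoU-threshold sweep: score each candidate threshold on a
# validation set and keep the best. Matching scheme and candidate grid
# are assumptions, not the exact post-processing used in the study.
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter > 0 else 0.0

def f1_at_threshold(preds, gts, thr):
    """Greedily match predictions to ground truth at an IoU threshold."""
    matched, used = 0, set()
    for p in preds:
        for i, g in enumerate(gts):
            if i not in used and iou(p, g) >= thr:
                used.add(i)
                matched += 1
                break
    precision = matched / len(preds) if preds else 0.0
    recall = matched / len(gts) if gts else 0.0
    denom = precision + recall
    return 2 * precision * recall / denom if denom else 0.0

def best_iou_threshold(preds, gts, candidates=(0.3, 0.4, 0.5, 0.6, 0.7)):
    """Pick the candidate threshold with the highest validation F1."""
    return max(candidates, key=lambda t: f1_at_threshold(preds, gts, t))
```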
Syn2real vs. Real2real
When testing a model trained on real data only, i.e. “Real Stitched 1”, very small changes in the environment that make the test set different, i.e. “Real Stitched 2”, have a big impact on overall performance. There is a gap of ~20% mAP, which is indicative of poor generalisation across slight variations of the real domain.
As we can see, this is not the case with synthetic data, which brings a lot of variation in this low-data scenario and is capable of generalising when moving across very similar real domains, with 85% and 91% mAP on both real datasets. Controlling parameters such as camera, lights, and product pose is essential to generalisation, and more importantly, can be easily achieved with synthetic data. If we mix 12 real photos with the synthetic data, we achieve production-ready results across both real datasets, i.e. 95% mAP.
Taking a step further in the direction of generalisation, we observe that testing the real-data model on a more varied real dataset with different aspect ratios causes a drastic performance decrease, as evidenced by the corresponding rows in Table 1: a drop from ~90% to 4% mAP, a huge change in the real-data results.
We circumvent this issue by generating the “Generalised Synthetic” dataset of 1500 images with 5 different aspect ratios. As can be expected when increasing generalisation, performance decreases slightly, but the model is able to accurately detect products across a wide range of variations, with performances of 73%, 74%, and 52% on the three real datasets.
Cost & Time Trade-off
The entire process of gathering and annotating the real data was split into 3 steps and took close to 1 month to complete: first acquiring the “Cropped” dataset, then “Real Stitched 1”, and finally “Real Stitched 2”. The total cost of bounding box and class annotation for 10,000 bounding boxes was $700.
Generating all the synthetic datasets and training the CV models took a total of 3 working days for one of our Computer Vision Engineers using our data generation platform, at roughly 95% lower cost than the real-data process. Setting up the synthetic data configurations and scenes took 1 day, the rendering and automatic labelling of all datasets took 3 hours, and an extra day and a half was spent training and evaluating the models.
Although these numbers speak volumes to the time and cost trade-offs between real versus synthetic data, perhaps the most appealing argument is the ability to control and alter the data with quick iterations such that it is easily adapted to the specific use case at hand.
We’ve presented a retail case-study where virtual synthetic data provides a scalable solution for on-shelf product recognition, solving both the data availability and annotation problems. Creating variation and controlling the generation parameters increases the model’s generalisability across small domain shifts.
AI veteran Andrew Ng suggested that a mentality shift is required to advance the field of AI in a more meaningful way, proposing a data-centric AI mentality as the potential solution: “A sample of recent publications revealed that 99% of the papers were model-centric with only 1% being data-centric”. With that in mind, what better way to be more data-centric than controlling how your data is created?
Retailers worldwide lose a mind-blowing $634 billion annually to the cost of poor inventory management, with 5% of all sales lost due to out-of-stocks alone. 🤯
Neurolabs helps put an end to out-of-stocks using a powerful combination of computer vision and synthetic data, improving customer experience and increasing revenue. 🤖🛒