Synthetic Data: how your AI startup can compete with the tech titans

Spilly
5 min read · Apr 14, 2018

Fact: in the world of AI, the rich get richer. Google, Apple, Facebook… the tech behemoths use the data collected from you and nearly everyone you know to train their AI. The more high-quality data an AI has, the better it gets. How, then, does an AI-powered startup like Spilly compete? It’s getting harder and harder, but now isn’t the time to raise the white flag. There is another way: with it, a single engineer generated in only a few months the same amount of data that previously took the combined resources of Microsoft, Facebook and others several years to collect. Not only that, the process was far more efficient, and the data “better” and more flexible than the existing datasets. More on that later, but first, how we got to this point.

A brief glimpse of our simulant training — infinite people, poses and scenes — source: Spilly

A year and a half ago our startup was looking for datasets to help build our Human AR tech and found itself backed into a corner. After exploring every possibility, it was clear that hiring people to manually annotate datasets would be too costly, so we investigated state-of-the-art ways to get over this hurdle. We came across the SYNTHIA Dataset at the NIPS conference in late 2016 and have since been using synthetic data to verify our object detection algorithm. That said, its effectiveness as an AI training method was still a great unknown, but we had run out of options — we had to go all in on synthetic data.

So what is synthetic data, you ask? Sounds cyberpunk, doesn’t it? It is in fact very much a product of the here and now. It can be neatly summarized as artificial data generation that has offered us these key benefits:

  • Fast, low cost data generation — we were able to generate mass amounts of data in months, not years, with one engineer instead of dozens — that’s huge for a startup and a critical point in the race for better AI.
  • Greater data flexibility — instead of having to source real world images and pay for annotations, we can render any combination of person/object and scene virtually instantly.
  • Hyper-accurate labelling — with humans for example, we can train not just for “human” but for full-body, head, hair, clothing and skin, allowing us to train for hyper-specific segmentations. We can make this labelling as granular as we want. This approach would be too labour intensive for traditional hand-based annotation on the vast majority of budgets and in any case, far slower.
  • Easy addition of new labels for retraining — we can automatically add new labels to our dataset at any time, meaning there is no need to go back and reannotate an entire dataset (as the creators of the well-known COCO Dataset were forced to do last year). This is a huge efficiency gain for anyone looking to retrain AI over the mid to long term.
  • No privacy infringement — it would be disingenuous to claim this as intentional from the start, but synthetic data itself does not impinge on your privacy in any way because it doesn’t require real world data (disclaimer: we do use a small amount of real world data as a sanity check for our training).
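
The granular labelling above can be sketched in a few lines. This is a minimal illustration, assuming (hypothetically) that the renderer emits a per-pixel material-ID map alongside each frame; the label names and ID values are ours, not from any particular engine:

```python
import numpy as np

# Hypothetical material IDs a renderer might assign to each pixel.
BACKGROUND, SKIN, HAIR, CLOTHING = 0, 1, 2, 3

def masks_from_id_map(id_map):
    """Turn a per-pixel material-ID map into one binary mask per label.

    Labels can be as coarse or as granular as the ID scheme allows:
    a 'human' mask is simply the union of the body-part masks, so adding
    a new label never requires reannotating existing frames.
    """
    masks = {
        "skin": id_map == SKIN,
        "hair": id_map == HAIR,
        "clothing": id_map == CLOTHING,
    }
    # Composite label: full body = any human material.
    masks["human"] = masks["skin"] | masks["hair"] | masks["clothing"]
    return masks

# Tiny 2x3 example frame.
frame = np.array([[0, 1, 1],
                  [3, 3, 2]])
labels = masks_from_id_map(frame)
```

Because every mask is derived mechanically from the same ID map, the annotations are pixel-perfect at no extra labelling cost.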

All of this translates to much lower cost. We started by training on humans for our Human AR tech, but only because humans were a) the most frequently occurring ‘object’ in any scene and b) the hardest object to start with, given their seemingly infinite poses/‘configurations’ (arms flailing, curled up into a ball, strange dance moves, etc.). But we can apply the same technique we used for humans to any object. The logical next step is to go down the list of frequently occurring objects, but we can also be very specialized: if we want to work with a company to trigger an experience on top of their signature product or merchandise, all we have to do is train for that object. Because object shapes vary less and product packaging is typically identical across units, these objects are significantly easier to train for.

Our simulant renderings often produce “interesting” results — source: Spilly

But you may be asking yourself, does synthetic data really work? Is it just as effective? Our results show that indeed it is.

Neural net quality comparison — COCO Dataset vs. synthetic-trained models. Results for the Person category of the DAVIS Dataset — source: Spilly

IOU (intersection over union) is the key accuracy metric here, and you can see that the fifth model on the right matches the quality of the well-known, non-synthetic COCO Dataset (Microsoft, Facebook) while retaining all the additional benefits outlined above — greater efficiency, accuracy and flexibility, all of which translates to lower cost.
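
For readers unfamiliar with the metric, IOU for segmentation is the overlap between the predicted mask and the ground-truth mask divided by their combined area. A minimal sketch (the function name and the zero-union convention are our own choices):

```python
import numpy as np

def iou(pred, gt):
    """Intersection over union of two binary segmentation masks."""
    pred = pred.astype(bool)
    gt = gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0  # both masks empty: treat as perfect agreement
    return np.logical_and(pred, gt).sum() / union

pred = np.array([[1, 1, 0],
                 [0, 1, 0]])
gt   = np.array([[1, 1, 0],
                 [0, 0, 1]])
# intersection = 2 pixels, union = 4 pixels -> IOU = 0.5
```

An IOU of 1.0 means the prediction matches the annotation pixel for pixel; matching COCO-trained IOU with synthetic-trained models is what the chart above shows.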

That being said, in some cases we do supplement our data with a small number of publicly available real-world images — under 1% of the amount of real data you would otherwise need. We do this to catch any cases our synthetic data may have missed.

It wasn’t a slam dunk from the get-go. When we started, our results were well below real-data quality. Getting to this point required a lot of tweaking on our end. For our first case, humans, we went through a step-by-step trial-and-error process:

  • Our original “simulants” (simulated people) had no clothes, so the network only learned to identify skin.
  • We then added clothes to the model and the results improved.
  • The model had trouble with clothes that were not solid colours.
  • We added patterned clothing, and the network handled those cases better.
  • We then matched the simulant to the background to make the network better at finding people who blend into the scene.

Contrasting simulants were replaced with “blended” simulants to improve accuracy — source: Spilly
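
The last tweak above can be sketched as a compositing step. This is an illustrative toy, not our pipeline: the function name, the mean-colour blending heuristic and the `match` parameter are our own inventions for the example:

```python
import numpy as np

rng = np.random.default_rng(0)

def blend_simulant(background, simulant, alpha_mask, match=0.3):
    """Composite a rendered simulant onto a background image, nudging
    its colours toward the background mean so it 'blends into' the scene.

    match=0.0 keeps the simulant's original colours (high contrast);
    match=1.0 recolours it entirely to the background average.
    Images are float arrays in [0, 1]; alpha_mask is HxWx1.
    """
    bg_mean = background.mean(axis=(0, 1), keepdims=True)
    recoloured = (1 - match) * simulant + match * bg_mean
    return alpha_mask * recoloured + (1 - alpha_mask) * background

# Toy 4x4 RGB scene and simulant render.
background = rng.random((4, 4, 3))
simulant = rng.random((4, 4, 3))
alpha = np.zeros((4, 4, 1))
alpha[1:3, 1:3] = 1.0  # simulant occupies the centre of the frame
frame = blend_simulant(background, simulant, alpha, match=0.5)
```

Training on low-contrast composites like this pushed the network to rely on shape and context rather than colour contrast alone.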

With these tweaks we were able to significantly raise the quality to its current level. On top of this, a team of researchers at ETH Zurich is currently training Generative Adversarial Networks, or GANs (a relatively new form of neural net), on our dataset to make the data look more like real images. We still have to train our AI on the results, but we expect a significant further quality increase.

In sum, synthetic data will increasingly be used to develop some very real, powerful tech in the future. The benefits are clear for all, even the tech titans. One side benefit to all this? Hilarious randomized pictures that we run in a slideshow in the office. Perhaps coming to an art gallery near you. Until then, happy synthing.

Spilly Team @ Spil.ly

Additional Researchers (ETH Zurich)

Sergi Caelles — https://www.vision.ee.ethz.ch/en/members/detail/332/

Mathis Lamarre

Eirikur Agustsson — https://www.vision.ee.ethz.ch/en/members/detail/293/


Spilly is a creative team of game-designing neural net engineers working on the future of mobile AR/VR.