Infinite Training Data for AI

Ross Taylor
8 min read · May 15, 2018


Deep learning algorithms are data greedy: they learn concepts, and good features for representing them, from large numbers of training examples. But unless you are an established tech company, data is often hard to obtain in sufficient quantities to make deep learning feasible. To solve this problem we need to transform the way we collect data to train machines. One solution is synthetic data, which allows machines to learn arbitrary concepts through cheaper, faster and more scalable data collection. With sufficient variation, this effectively creates an infinite source of data. Synthetic generation of concepts, environments and their interactions allows us to bypass expensive real-world data collection, democratises deep learning, and extends the reach of AI in society.

Data is the Problem

In April, we interviewed machine learning practitioners across industry to research deep learning in the wild. We talked to a range of companies, including big technology companies, deep tech startups and non-technology companies. There was widespread interest in deep learning applications, but few had actually integrated deep learning into their companies’ products. The reason for this? Data.

Deep learning requires large datasets, and usually labelled datasets, since supervised learning is going to be king for the foreseeable future. From our research, teams that wanted to move into the deep learning space simply didn’t have the amount and type of data needed to train their algorithms. Meanwhile, teams that were already in the deep learning space were often spending most of their time acquiring more data, or refining existing data, rather than focusing on their core research strengths. Additionally, from our talks at Entrepreneur First, the biggest obstacle in the deep learning idea space did not seem to be technical capability but data availability.

These pain points were corroborated by our own experiences. Robert’s hair-on-fire problem at Factmata was acquiring a large dataset of hate speech to train machine learning models on. Similarly, Ross’s main restriction in applying deep learning to sports and financial trading was the absence of a high-quality dataset. For early-stage startups in particular, the data problem has been framed as a ‘cold start’ problem that prevents new companies from utilising cutting-edge deep learning in industry applications.

So how do we overcome these data issues so that more companies can exploit deep learning? We believe that we can solve the problem by generating training data in a smarter, cheaper, faster and more scalable way. This can be achieved by applying deep learning to the data collection problem as well as the model training problem. Through synthetic data we can democratise deep learning, and help get deep tech ideas off the ground where real data is initially lacking.

The Motivation for Synthetic Data

Deep Learning hallucinating sheep. Credit: Janelle Shane.

One of the major problems with deep learning at present is that representations are often not appropriately disentangled for good generalisation. A famous Twitter thread from a month ago pointed out the amusing tendency of neural networks to hallucinate sheep whenever hills are present, because they entangle feature representations (see the image above).

The hope is for algorithmic improvements that learn better types of representation that split out separate factors of variation (environment, form, interactions). In the meantime, more data is the main remedy — but this means more expensive real-world data collection and we hit the curse of dimensionality as we try to obtain more variations of form and environment.

But if we step back and think about the objective of our algorithms (the computational level of analysis), we are interested in helping machines capture underlying concepts in an image. For example, when learning hand gestures, we’d want a dataset of labelled hand gestures (“okay”, “point left”, “point right”). And we’d want to learn a representation that captures the source of variation we are interested in, independently of other sources of potential variation (environment, skin colour, etc.).

In the future, rather than expensively obtaining a large real-world dataset to capture these types of variation, it may be sufficient to simulate these sources of variation instead. We can then train a machine learning model on this cheap, fast, scalable data and transfer the knowledge to a real-world application. This is the motivation behind synthetic data generation.

Synthetic Data

To capture a concept that a machine needs to learn, we need not always acquire a real-world dataset. We can theorise the kind of data we need and create it synthetically. The quality of the approximation will then depend on the gap between the real and synthetic data distributions. Generating the data and labels synthetically can be much cheaper and less time-consuming than pursuing data collection and collation in the real world. We briefly illustrate a few examples of this synthetic data generation pipeline below, with a focus on 3D simulators in particular.

Data Creation

One source of synthetic data is 3D game engines. Engines such as Unity or Unreal can be used to program concepts, and generative models can be built on top of them. An example of this is the UnityEyes model of Wood et al. (2016), which the researchers used to render one million synthesised images of eyes in order to train a gaze estimation model.

Using the Unity engine to generate eye gazes. Reproduced from Wood et al. (2016)
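To make the pipeline concrete, here is a minimal sketch of how rendered output like this might be consumed downstream: a small convolutional regressor trained on synthetic eye crops and their gaze angles. The SyntheticEyeDataset class, the file layout and the pitch/yaw label format are hypothetical stand-ins, not the actual UnityEyes format.

```python
# Minimal sketch: training a gaze regressor on rendered synthetic eye images.
# The dataset class and file layout below are hypothetical, not the UnityEyes format.
import glob
import json
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
from torchvision import transforms
from PIL import Image

class SyntheticEyeDataset(Dataset):
    """Assumes each rendered image <name>.png has a <name>.json with pitch/yaw labels."""
    def __init__(self, root):
        self.paths = sorted(glob.glob(f"{root}/*.png"))
        self.to_tensor = transforms.Compose([
            transforms.Grayscale(), transforms.Resize((36, 60)), transforms.ToTensor()])

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, i):
        img = self.to_tensor(Image.open(self.paths[i]))
        with open(self.paths[i].replace(".png", ".json")) as f:
            meta = json.load(f)
        gaze = torch.tensor([meta["pitch"], meta["yaw"]], dtype=torch.float32)
        return img, gaze

# Small CNN regressor: 36x60 grayscale eye crop -> (pitch, yaw)
model = nn.Sequential(
    nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Flatten(), nn.Linear(32 * 9 * 15, 64), nn.ReLU(), nn.Linear(64, 2))

loader = DataLoader(SyntheticEyeDataset("renders/"), batch_size=64, shuffle=True)
optim = torch.optim.Adam(model.parameters(), lr=1e-3)
for imgs, gaze in loader:  # one pass over the synthetic set
    loss = nn.functional.mse_loss(model(imgs), gaze)
    optim.zero_grad(); loss.backward(); optim.step()
```

Because the renderer provides the labels for free, scaling this up to a million examples is a matter of compute, not annotation effort.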

3D game engines have also been used to train self-driving cars. Waymo has trained its software on billions of miles of simulated streets; NVIDIA recently released a self-driving car simulator; and researchers often use GTA V to roll their own self-driving car simulators. In these cases, it is much cheaper, faster and safer to use a virtual environment to kickstart training than to put a real car on real-life roads.

GTA V as a virtual environment for training self-driving cars

Similarly, in robotics, researchers from OpenAI introduced domain randomization as a technique for transferring knowledge from robots trained in a simulated environment to the real environment. This allows for “faster, more scalable, and lower-cost data collection than is possible with physical robots”.

A virtual robotic training environment from OpenAI. From Tobin et al. (2017)
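The idea behind domain randomization is to aggressively vary everything about the simulated scene that the model should not rely on (textures, lighting, camera pose, distractor objects), so that the real world ends up looking like just another variation. A minimal sketch of a per-sample randomizer is below; the parameter names and the render hook are placeholders rather than any particular simulator’s API.

```python
# Sketch of per-sample domain randomization: every rendered example gets random
# textures, lighting, camera jitter and distractors, so a model trained on the
# output cannot latch onto any single appearance cue. The `render` callable is a
# placeholder for a real simulator API (Unity, MuJoCo, etc.).
import random

def sample_scene_config():
    return {
        "object_texture": random.choice(["wood", "metal", "noise", "checker"]),
        "floor_texture": random.choice(["carpet", "tile", "solid_colour"]),
        "light_intensity": random.uniform(0.2, 2.0),
        "light_position": [random.uniform(-3.0, 3.0) for _ in range(3)],
        "camera_jitter_deg": random.gauss(0.0, 5.0),
        "num_distractors": random.randint(0, 6),
    }

def generate_randomized_batch(render, batch_size=32):
    """`render(config)` is assumed to return an (image, label) pair from the simulator."""
    batch = [render(sample_scene_config()) for _ in range(batch_size)]
    images, labels = zip(*batch)
    return list(images), list(labels)
```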

Data Refinement

Graphics capability is improving year-on-year, but there can still be a significant quality gap between synthetic and real data, which means machine learning algorithms trained on synthetic data can overfit to its particular artefacts. Fortunately, however, new refinement and domain transfer techniques in deep learning can help reduce the quality gap.

Apple used synthetic data for gaze estimation. They started with synthetic eye data rendered with the Unity game engine, but then trained a refiner network with an adversarial loss to make the synthetic data look more realistic. Their approach achieved a significant boost in performance over the use of synthetic data alone, which suggests there is promise in combining 3D synthetic generation of concepts with deep learning refinement techniques.

Apple’s use of adversarial methods as a refinement technique for synthetic images.
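The refinement recipe described above can be sketched as two alternating updates: the refiner tries to fool a discriminator, while an L1 self-regularisation term keeps the refined image close to the synthetic input so the original label stays valid. The networks below are tiny placeholders and the loss weight is a guess, not the paper’s value; only the wiring of the objective is the point.

```python
# Sketch of a SimGAN-style refinement objective: refiner R makes a synthetic
# image look realistic (adversarial term) while an L1 self-regularisation term
# keeps it close to the input so the synthetic label remains valid.
import torch
import torch.nn as nn
import torch.nn.functional as F

refiner = nn.Sequential(nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
                        nn.Conv2d(32, 1, 3, padding=1))               # R: synthetic -> refined
discriminator = nn.Sequential(nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
                              nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                              nn.Linear(32, 1))                       # D: image -> real/fake logit

opt_r = torch.optim.Adam(refiner.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(discriminator.parameters(), lr=2e-4)
lam = 0.5  # weight on the self-regularisation term (illustrative, not the paper's value)

def refiner_step(synthetic):
    refined = refiner(synthetic)
    adv = F.binary_cross_entropy_with_logits(
        discriminator(refined), torch.ones(synthetic.size(0), 1))     # try to fool D
    self_reg = F.l1_loss(refined, synthetic)                          # stay close to input
    loss = adv + lam * self_reg
    opt_r.zero_grad(); loss.backward(); opt_r.step()
    return loss.item()

def discriminator_step(synthetic, real):
    refined = refiner(synthetic).detach()
    logits = torch.cat([discriminator(real), discriminator(refined)])
    targets = torch.cat([torch.ones(real.size(0), 1), torch.zeros(refined.size(0), 1)])
    loss = F.binary_cross_entropy_with_logits(logits, targets)
    opt_d.zero_grad(); loss.backward(); opt_d.step()
    return loss.item()
```

Note that the real images here need no labels at all; they are only used to teach the discriminator what “realistic” looks like.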

More broadly, the generative modelling literature holds promise for other types of refinement. CycleGAN-type architectures, which learn a domain mapping function, can help retain the core concept while performing style transfer without matched image-to-image examples.

CycleGAN domain transfer: Monet’s style replaced with a real-life photographic style.
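The ingredient that lets CycleGAN do this without paired examples is the cycle-consistency constraint: translating an image to the other domain and back should reproduce the original. A stripped-down sketch of that term, with placeholder generators standing in for the real architectures, might look like this:

```python
# Sketch of CycleGAN's cycle-consistency term with placeholder generators.
# Each direction also has its own adversarial loss, omitted here for brevity.
import torch.nn as nn
import torch.nn.functional as F

G_syn2real = nn.Sequential(nn.Conv2d(3, 3, 3, padding=1))  # placeholder synthetic -> real generator
G_real2syn = nn.Sequential(nn.Conv2d(3, 3, 3, padding=1))  # placeholder real -> synthetic generator

def cycle_consistency_loss(synthetic, real, weight=10.0):
    # Translate each batch to the other domain and back; penalise the reconstruction error.
    forward_cycle = F.l1_loss(G_real2syn(G_syn2real(synthetic)), synthetic)  # syn -> real -> syn
    backward_cycle = F.l1_loss(G_syn2real(G_real2syn(real)), real)           # real -> syn -> real
    return weight * (forward_cycle + backward_cycle)
```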

More recently, an extension of CycleGAN, CyCADA, attempts to preserve semantic and feature information to help with transfer learning; its authors applied this form of domain adaptation to learning between GTA5 (simulated data) and Cityscapes (real data):

CyCADA: converting a video game into a photorealistic simulation.
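CyCADA’s main addition on top of the cycle constraint is a semantic consistency term: a task network pretrained on the source domain should predict the same labels before and after translation, which stops the generator from repainting roads as buildings while it makes the pixels look realistic. A rough sketch of that term, with the networks again as placeholders, is below.

```python
# Sketch of a CyCADA-style semantic consistency term: a segmentation network
# pretrained on the source (simulated) domain should predict the same per-pixel
# labels for an image before and after translation. Networks are placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F

task_net = nn.Sequential(nn.Conv2d(3, 19, 1))              # placeholder per-pixel classifier (19 Cityscapes classes)
generator = nn.Sequential(nn.Conv2d(3, 3, 3, padding=1))   # placeholder synthetic -> real generator

def semantic_consistency_loss(synthetic):
    with torch.no_grad():
        pseudo_labels = task_net(synthetic).argmax(dim=1)  # labels taken from the untranslated image
    translated_logits = task_net(generator(synthetic))     # predictions on the translated image
    return F.cross_entropy(translated_logits, pseudo_labels)
```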

Taken together, generating concepts synthetically through the use of 3D game engines, plus refinement through new domain adaptation techniques in deep learning, can help provide plentiful training data for machine learning applications. And the applications are not just for image data; we could also imagine, for example, synthetic data for natural language tasks at a new company that does not yet have any customer reviews, or synthetic time series to help train reinforcement learning algorithms.

By theorising the type of content we need to train machines, we can generate the data synthetically, rather than relying on expensive and difficult-to-obtain real-life data. Through cheaper, faster and more scalable data, synthetic generation can help kickstart new deep tech companies, and this means more good ideas can be implemented in practice.

Interested in using synthetic data?

Let us know if you are thinking of using synthetic data, or are already using it yourself; we would love to have a chat and see how we can help.

Feel free to get in touch with Robert at robert.stojnic@gmail.com!

Robert Stojnic is a two-time CTO and a Cambridge PhD graduate. He was CTO at Factmata, where he led the team that created the world’s first commercial fake news detection algorithm. Before that he was CTO of GeneAdviser, a company that worked in partnership with the NHS to help doctors choose the right genetic test. Robert holds a PhD in computational biology from Cambridge, with a specialisation in applying machine learning models to data.

Ross Taylor is a quantitative researcher and a Cambridge MPhil graduate. He has experience applying machine learning to sports modelling and investment strategies. He is also active in the open-source Python community, and authored the popular PyFlux time-series library. He holds a Master’s in economics from Cambridge, with a specialisation in time series modelling and approximate inference.
