Types of synthetic data and real-life examples

Elise Devaux
Published in Statice
8 min read · Jan 18, 2021

Journey into the world of data privacy — Episode 06

In this series, I share the learnings of my journey into the field of data privacy.

Episode 1: How “anonymous” is anonymized data?
Episode 2: PETs: the technologies organization should consider adopting
Episode 3: Introduction to privacy-preserving synthetic data
Episode 4: 10 use-cases for privacy-preserving synthetic data
Episode 5: Data privacy and protection techniques
Episode 7: List of events and resources for Data Privacy day 2021

This post presents the different synthetic data types that currently exist: text, media (video, image, sound), and tabular synthetic data.

After a brief definition and overview of the reasons behind the use of synthetic data, I go over several real-life examples of applications for synthetic data:

  • Amazon using synthetic data to train Alexa’s language system
  • Google’s Waymo using synthetic data to train its autonomous vehicles
  • Amazon using synthetic images to train Amazon Go vision recognition systems
  • American Express using synthetic financial data to improve fraud detection
  • Roche using synthetic medical data for clinical research

What is synthetic data, and why is it needed?

This post talks about synthetic data, that is to say, algorithmically generated data that approximates original data and can be used for the same purposes as the original.

There are a few reasons behind the need for synthetic data.

First, it can be a matter of availability, or in our case unavailability. Your organization or your team doesn’t have data or enough of it for a given use case. For larger organizations, legacy infrastructures and siloed data systems are often a cause of data unavailability.

In today’s data protection regulatory landscape, it can also be a matter of legal compliance. The data exists, but its processing is strictly regulated. For instance, the General Data Protection Regulation (GDPR) can restrict uses of personal data that go beyond the purposes the organization stated, or obtained consent for, when it collected the data.

Security concerns can also prevent data from flowing in an organization. The information is too sensitive to be migrated to a cloud infrastructure, for example. Governance processes might also slow down or limit data access for similar reasons.

Finally, it can come down to a matter of cost. A given data asset might be too expensive to buy or time-consuming to access and prepare.

Partially vs. fully synthetic data

Whatever the reasons a company turns to synthetic data, they can then choose to produce partially or fully synthetic datasets.

To produce partially synthetic data, you only generate some synthetic data points and use them to complement an existing dataset. This is helpful when just certain information is missing, or the data quantity isn’t sufficient for a given application.

You can also produce a fully synthetic dataset, with data that doesn’t contain any of the original information. Fully synthetic data is often used where privacy constraints restrict the use of the original data.
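The difference between the two approaches, as described above, can be sketched in a few lines of Python. This is a toy illustration under strong assumptions: the column of ages is made up, and a single fitted Gaussian stands in for a real generative model. It offers no privacy guarantee by itself.

```python
import random
import statistics


def sample_synthetic(values, n, rng):
    """Draw n synthetic values from a Gaussian fitted to the original column."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)
    return [rng.gauss(mean, std) for _ in range(n)]


rng = random.Random(0)
original_ages = [23, 35, 41, 29, 52, 38, 47, 31]  # made-up "real" column

# Partially synthetic: keep the original records and complement them
# with a few generated data points.
partially_synthetic = original_ages + sample_synthetic(original_ages, 4, rng)

# Fully synthetic: keep only generated values, none of the original records.
fully_synthetic = sample_synthetic(original_ages, len(original_ages), rng)

print(len(partially_synthetic), len(fully_synthetic))
```

In the first case the original records are still in the dataset, so any privacy restriction on them still applies. In the second case only generated values remain.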

Fully vs. partially synthetic

Types of synthetic data

There are several types of synthetic data that serve different purposes. Synthetic data can be text, media (video, image, sound), or tabular data.

Synthetic text

Synthetic text is artificially generated text. You build and train a model to generate text. Because of the complexity of languages, generating realistic synthetic text has always been challenging. However, the rise of new machine learning models has led to remarkably performant natural language generation systems.

In 2020, the OpenAI team introduced GPT-3, a language model able to generate human-like text. You can find numerous examples of text written by the GPT-3 model, with constraints or specific text inputs, such as the one depicted below.

This Shakespeare-like text was generated by the GPT-3 model, after training on original texts. Source: GPT-3 Creative Fiction

Synthetic images and videos

Synthetic data can also be synthetic video, image, or sound. You artificially render media with properties close enough to real-life data. This similarity allows the synthetic media to be used as a drop-in replacement for the original data.

None of the individuals in the picture below are real. These synthetic images were artificially generated by the Generative Adversarial Network, StyleGAN2 (Dec 2019) from the work of Karras et al. and Nvidia. The system learned properties of real-life people’s pictures in order to generate realistic images of human faces.

Source: thispersondoesnotexist.com

This method is helpful to augment the databases used to train machine learning algorithms. For example, when training video data is not available for privacy reasons, you can generate synthetic video data to resolve that. Similarly, you can use synthetic data to increase datasets’ size and diversity when training image recognition systems.

Tabular synthetic data

Tabular synthetic data refers to artificially generated data that mimics real-life data stored in tables. This data is structured in rows and columns. It could be anything ranging from a patient database to users’ analytical behavior information or financial logs.

Data is at the core of today’s data science activities and business intelligence. As mentioned earlier, there are multiple scenarios in the enterprise in which data cannot circulate between departments, subsidiaries, or partners. In those scenarios, synthetic data can be used as a drop-in replacement for any type of behavior, predictive, or transactional analysis.

Privacy-preserving synthetic data

On a different level, there is a specific type of synthetic data: privacy-preserving synthetic data. This is the focus of our work at Statice. The tabular synthetic data we generate comes with privacy guarantees. These measures ensure no individual present in the original data can be re-identified from the synthetic data.

Privacy-preserving synthetic data holds opportunities for industries relying on customer data to innovate. Modern data protection regulations often prevent any extensive use of such data. Here, privacy-preserving synthetic data represents a safe and compliant alternative to traditional data protection methods. It also enables internal or external data sharing.

Real-life examples using synthetic data

Using synthetic data for NLP

Synthetic data has applications in the field of natural language processing. Amazon’s Alexa AI team, for instance, uses synthetic data to complement the training data of its natural language understanding (NLU) system. It gives them solid ground to train new languages for which there is little or no existing customer interaction data.

“When a new-language version of Alexa is under development, training data for its NLU systems is scarce. […] The new bootstrapping tools, from Alexa AI’s Applied Modeling and Data Science group, treat the available sample utterances as templates and generate new data by combining and varying those templates.” Janet Slifka, director of research science in Alexa AI’s Natural Understanding group
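To make the idea of "combining and varying templates" more concrete, here is a minimal sketch in Python. It is not Amazon's tooling: the templates, slot names, and slot values below are all hypothetical, and the function simply fills every template with every combination of slot values.

```python
from itertools import product

# Hypothetical sample utterances, turned into templates with named slots.
templates = [
    "play {song} by {artist}",
    "I want to hear {song} from {artist}",
]

# Hypothetical slot values; a real system would draw these from catalogs.
slot_values = {
    "song": ["Yesterday", "Imagine"],
    "artist": ["The Beatles", "John Lennon"],
}


def expand(templates, slot_values):
    """Fill every template with every combination of slot values."""
    names = sorted(slot_values)
    utterances = []
    for template in templates:
        for combo in product(*(slot_values[name] for name in names)):
            utterances.append(template.format(**dict(zip(names, combo))))
    return utterances


synthetic_utterances = expand(templates, slot_values)
for utterance in synthetic_utterances:
    print(utterance)
```

A naive expansion like this also produces mismatched pairs (a song attributed to the wrong artist), so in practice you would probably constrain or filter the combinations.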

Using synthetic data to train vision algorithms

When it comes to synthetic media, a popular use for them is the training of vision algorithms. For over a year now, the Waymo team has been generating realistic driving datasets from synthetic data. Alphabet’s subsidiary company uses these datasets to train its self-driving vehicle systems. It is an efficient way of including more complex and varied scenarios, as opposed to spending significant time and resources to obtain observations of similar scenarios.

Synthetic data is used as a replacement for real-life data to train the vision algorithms of autonomous cars. Source: Waymo.com

“As its virtual cars drive through the same scenarios Waymo vehicles experience in the real world, engineers […] manipulate those scenes by virtually adding new agents into the situation, such as cyclists, or by modulating the speed of oncoming traffic to gauge how the Waymo Driver would have reacted.” Venturebeat
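As a rough illustration of what "manipulating a scene" could look like in code, here is a small Python sketch. The `Scenario` class, its fields, and the `vary` helper are hypothetical and heavily simplified; they are not based on Waymo's simulator. The sketch derives variants of one logged scenario by scaling the speed of oncoming traffic and optionally adding one extra agent.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Scenario:
    """Hypothetical, heavily simplified description of a driving scene."""
    oncoming_speed_kmh: float
    agents: tuple = ()


def vary(base, speed_factors, extra_agents):
    """Derive variants by scaling oncoming speed and optionally adding one agent."""
    variants = []
    for factor in speed_factors:
        for agent in [None, *extra_agents]:
            agents = base.agents if agent is None else base.agents + (agent,)
            variants.append(
                replace(
                    base,
                    oncoming_speed_kmh=base.oncoming_speed_kmh * factor,
                    agents=agents,
                )
            )
    return variants


logged = Scenario(oncoming_speed_kmh=50.0, agents=("car",))
variants = vary(
    logged,
    speed_factors=[0.8, 1.0, 1.2],
    extra_agents=["cyclist", "pedestrian"],  # illustrative agent types
)
print(len(variants))
```

One logged scene turns into several training scenarios without anyone having to drive and record each of them.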

Waymo isn’t the only company relying on synthetic data for this use-case: GM Cruise, Tesla Autopilot, Argo AI, and Aurora are too.

In the retail industry, Amazon also deployed similar techniques for the training of Just Walk Out, the system powering the Amazon Go cashier-less stores. The team generated a considerable amount and variety of synthetic customer behavior data to train its computer vision system.

“By using simulation to build a massive training set, the team was able to leverage the power of the cloud to train on months worth of data in a day, eliminating the time bottleneck and allowing rapid progress.” How the Amazon Go Store works

Using synthetic data for predictive analytics

The financial company American Express has been investigating the use of tabular synthetic data. Their data science team is researching how to generate statistically accurate synthetic data from financial transactions to perform fraud detection. They were already able to use the synthetic data to help train the detection models.

“To develop state of the art ML methods, including methods for anomaly detection and model interpretation, ML researchers and practitioners need to have access to data that is as close to the real one as possible. […] we show that synthesized data follows the same distribution as the original data, and that ML models trained on synthesized data have the same performance as those trained on the original data.” Efimov, Xu, Kong, Nefedov and Anandakrishnan (2020) in Using Generative Adversarial Networks to Synthesize Artificial Financial Datasets
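One simple way to check whether a synthetic column "follows the same distribution" as the original is to compare their empirical distributions. The Python sketch below computes a two-sample Kolmogorov-Smirnov statistic by hand on made-up numbers. It is only an illustration of the idea, not the evaluation used in the paper, and the Gaussian "transaction amounts" are an assumption for the example.

```python
import random
from bisect import bisect_right


def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic:
    the largest gap between the two empirical CDFs."""
    a, b = sorted(a), sorted(b)
    gap = 0.0
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


rng = random.Random(42)
original = [rng.gauss(100, 15) for _ in range(2000)]    # made-up amounts
good_synth = [rng.gauss(100, 15) for _ in range(2000)]  # same distribution
bad_synth = [rng.gauss(130, 15) for _ in range(2000)]   # shifted distribution

print("good:", ks_statistic(original, good_synth))
print("bad: ", ks_statistic(original, bad_synth))
```

A value close to 0 means the two columns are distributed similarly, while a value close to 1 means they barely overlap. A per-column check like this says nothing about correlations between columns or about downstream model performance, which need their own tests.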

In the field of insurance, where customer data is both an essential and sensitive resource, Swiss company La Mobilière used synthetic data to train churn prediction models. The data science team modeled tabular synthetic data after real-life customer data. They trained their machine learning models without compromising on the model performance or on their customer privacy.

“The Statice software protects the original data of our customers on the one hand, and on the other, enables us to work with the data across departments without compromising privacy or security issues.” Georg Russ, Data Scientist, Data & Analytics.

Using synthetic data as a data protection method

In general, all customer-facing industries can benefit from privacy-preserving synthetic data, as modern data protection laws regulate personal data processing.

For example, in the healthcare field, the use of patient data is heavily regulated. Roche validated with us the use of synthetic data as a replacement for patient data in clinical research. The German Charité Lab for Artificial Intelligence in Medicine is also working on synthetic data to support collaborative research and advance different medical use cases.
