Synthetic Data Generation

Christian Schitton
Published in Analytics Vidhya
Jan 24, 2022

One of the most promising and still underrated areas of artificial intelligence is the creation of synthetic data.

Whether we are talking about data analytics, data-driven decision making or the embedding of highly advanced artificial intelligence tools in business, research or healthcare, the question of sufficient data availability is a common issue.

Deficiencies in data supply have different reasons:

  • companies understandably do not want to expose their business data,
  • the data simply do not have the necessary frequency or volume,
  • legally enforced data protection rules prevent third parties from getting access to private or sensitive data.

Another common reason is that the variable of interest in the data set is heavily underrepresented. In other words, there is not enough labelled data to train a model.

But those are just some of the reasons why original data can be hard to come by.

At the same time, artificial intelligence tools like neural networks/ deep learning environments need a lot of data in order to work properly.

This is a clear conflict, and there is, at least partly, a way out: the generation of synthetic data.

Some of the use cases where synthetic data already plays an important role:

  • Amazon uses synthetic data to train Alexa’s language system
  • American Express uses synthetic financial data to improve fraud detection
  • Roche uses synthetic medical data for clinical research
  • Amazon again uses synthetic images to train Amazon Go vision recognition systems
  • Google’s Waymo uses synthetic data to train its autonomous vehicles

Synthetic Data — Basics

Creating synthetic data does not simply mean copying, reshuffling, anonymising or aggregating the original data. That would not go far enough.

The problem with such approaches is that it is still possible to trace back to the original data and thereby reveal private or sensitive information again.

Techniques such as record-swapping, suppression of sensitive values or adding random noise, on the other hand, can be problematic from an analytical point of view: important information in the data (e.g. dependencies, patterns, …) may be lost, leading to erroneous analytical results.

The solution here is not to focus on the original data stock itself but to shift the attention to the processes which led to the creation of the original data in the first place.

The result is a probabilistic approximation of the original data. The basic idea of synthetic data is to replace some or all of the observed values by sampling from appropriate distributions so that the essential statistical features of the original data are preserved.

The generated synthetic data therefore does not contain any of the original, identifiable information from which it was generated. At the same time it retains the valid statistical properties of the real data.

Reverse engineering or disclosure of the original data (e.g. identifying a real person in healthcare data) is therefore considered unlikely.

In the following chapters, four different approaches for generating synthetic data are discussed:

  • Parametric Synthesising Methods
  • Non-Parametric Synthesising Methods/ CART Method
  • Bayes Networks
  • Generative Adversarial Networks (GAN)

Which synthesising method is used depends on several factors, e.g. the volume/frequency of the data, the type of the original data or the purpose for which the data is needed.

Parametric Synthesising Methods

Here, the idea is to replace some or all of the original data by sampling from appropriate probability distributions. More specifically, it means taking the observed values of individual input parameters of the original data and approximating them with a probability distribution.

Like in this example, where the occurrences of quarterly investment volumes observed in a real estate market (left hand side of the graph) are statistically approximated by an appropriate distribution (right hand side of the graph):

image by author

Then, all the “parameter” distributions have to be put into relation to each other. Technically, this is to be done with a joint distribution.

Finally, synthetic data points are created by simulation. Here are some examples based on different input variable combinations for a real estate market (blue … observed data, red … simulated data):

image by author

This approach works exceptionally well with a rather small data volume or when the frequency of observed data is low.
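
To make this concrete, here is a minimal sketch in R. The data frame and its columns are hypothetical stand-ins for the investment volumes and yields above, and the lognormal marginal plus a multivariate normal on the log scale are just one plausible choice of distributions:

library(MASS)   # fitdistr(), mvrnorm()

set.seed(42)

# hypothetical original data: quarterly investment volumes and prime yields
original <- data.frame(
  invest_volume = rlnorm(200, meanlog = 5, sdlog = 0.4),
  prime_yield   = rnorm(200, mean = 0.05, sd = 0.01)
)

# step 1: approximate a single input parameter with a probability distribution
fit_volume <- fitdistr(original$invest_volume, densfun = "lognormal")

# step 2: put the parameter distributions into relation via a joint distribution
#         (here: a multivariate normal on log-volumes and yields)
joint <- cbind(log(original$invest_volume), original$prime_yield)
mu    <- colMeans(joint)
Sigma <- cov(joint)

# step 3: create synthetic data points by simulation
sim <- mvrnorm(n = 200, mu = mu, Sigma = Sigma)
synthetic <- data.frame(
  invest_volume = exp(sim[, 1]),
  prime_yield   = sim[, 2]
)

summary(original$invest_volume)
summary(synthetic$invest_volume)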

Non-Parametric Synthesising Methods

An alternative to parametric methods for synthesising data are machine learning techniques. One of the more promising applications is tree-based models, especially classification and regression tree (CART) models.

CART methods are an algorithmic modelling approach which can be applied to any type of data.

CART models basically work by recursively splitting the data into groups with increasingly homogeneous outcomes. Each split is done via yes/no questions on the predictor space. The values in each final group approximate the conditional distribution of the predicted variable for units whose predictors meet the criteria defining that group.

As a side note, the grid of interdependencies in such models can become too big and unmanageable quite fast. To avoid this, the joint distribution is defined, and therefore approximated, as a series of conditional distributions.

The real advantage of CART models is their ability to automatically capture non-linear relationships and interaction effects in the data.

Here is a short example.

We synthesise a credit card portfolio and apply a simple credit scoring model to see if we can predict credit default cases. The accuracy of the model is then examined once when trained on the original data and once when trained on the synthetic data. The credit scoring model is based on logistic regression.
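
A minimal sketch of this workflow with the synthpop package; the credit data frame below is a hypothetical, simulated stand-in for the real portfolio:

library(synthpop)

# hypothetical credit card portfolio, simulated purely for illustration
set.seed(2022)
n <- 1000
credit <- data.frame(
  age         = round(runif(n, 20, 70)),
  income      = rlnorm(n, meanlog = 10, sdlog = 0.5),
  utilisation = runif(n, 0, 1),
  default     = rbinom(n, 1, 0.1)
)

# synthesise the portfolio with the CART method
syn_obj    <- syn(credit, method = "cart", seed = 2022)
credit_syn <- syn_obj$syn   # the generated synthetic portfolio

# compare the distributions of original and synthetic variables
compare(syn_obj, credit)

# simple credit scoring model: logistic regression on the default flag
model_orig <- glm(default ~ ., family = binomial, data = credit)
model_syn  <- glm(default ~ ., family = binomial, data = credit_syn)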

A comparison of some selected parameters between the original and the synthetic variables reveals how well the synthetic data fits the distribution of the original data:

image by author

Running a few more iterations shows that the synthetic data is always very close to the original data. Here is one input parameter of the credit card portfolio as an example:

image by author

But does the credit scoring model keep its behaviour when we replace the original data and train the model with the generated synthetic data?

The performance metrics of the credit scoring model when trained with the original data:

table by author

In comparison, the performance metrics of the credit scoring model trained with the generated synthetic data:

table by author

Even though it is a simple model, the performance metrics are quite close to each other!
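
As a hedged sketch of how such a comparison could be run, reusing the models from the sketch above and a hypothetical hold-out slice of the simulated portfolio (AUC via the pROC package):

library(pROC)   # roc(), auc()

# hypothetical hold-out slice of the portfolio (for illustration only)
credit_test <- credit[sample(nrow(credit), 300), ]

# predicted default probabilities from both models
p_orig <- predict(model_orig, newdata = credit_test, type = "response")
p_syn  <- predict(model_syn,  newdata = credit_test, type = "response")

# discriminatory power (AUC) of the two models
auc(roc(credit_test$default, p_orig))
auc(roc(credit_test$default, p_syn))

# accuracy at a 0.5 cut-off (default is coded 0/1)
mean((p_orig > 0.5) == (credit_test$default == 1))
mean((p_syn  > 0.5) == (credit_test$default == 1))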

Bayes Networks

A Bayes Network is a probabilistic graphical model. It encodes the conditional dependency relationships of a set of variables using a Directed Acyclic Graph (DAG).

In a Bayesian Network, each node represents an input variable in the data set, and a set of directed edges connects the nodes, forming the structure of the network. The set of conditional probabilities associated with each variable forms the parameters of the network.

Here is an example of a developed Bayes Network reflecting the conditional dependencies in two real estate markets:

image by author

In other words, Bayes Networks represent data as a probabilistic graph, and new synthetic data is simulated from its structure.

This new data is simulated by first sampling from each of the root nodes (i.e. those nodes from which all edges originate, like ‘reference_rate’ or ‘yield_other’ in the graph above). Sampling then proceeds to the child nodes, conditional on their parent node(s), until data for all nodes have been drawn.

Again, the statistical properties of the original data are reproduced, but the data itself is not copied. A big advantage of these networks is that they also work with smaller data sets, although the more original data is fed in, the more accurate the synthetic data gets.

Another issue is that one has to have sufficient insight into the structure which produced the original data. Hence, professional knowledge of the topic in question is indispensable.
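
In R, this workflow can be sketched with the bnlearn package. The snippet below is a minimal sketch that uses the learning.test data set shipped with bnlearn as a stand-in for original data and learns the structure automatically; a DAG defined by a domain expert could be plugged in instead:

library(bnlearn)

# stand-in for the original data: a discrete data set shipped with bnlearn
data(learning.test)

# 1. learn the DAG structure from the data (hill-climbing search);
#    an expert-defined DAG could be used here instead
dag <- hc(learning.test)

# 2. estimate the conditional probability tables for each node
fitted <- bn.fit(dag, data = learning.test)

# 3. simulate synthetic records from the fitted network
synthetic <- rbn(fitted, n = nrow(learning.test))

head(synthetic)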

The following example gives a short insight into how Bayes Networks work for synthesising data (full credit for this example goes to Daniel Oehm; see references below).

We have a small medical data set containing different blood values for different sport types. The network structure:

image by author

Although the original data set is relatively small, with just 202 observations and 6 variables, the synthetic data set is already quite close to the original one, as can be seen in the graphs below for some individual input parameters (blue … original data, red … synthetic data):

image by author

Generative Adversarial Networks (GAN)

When the original data has a highly complex feature space and a huge number of observations, neural networks are the right choice. Audio and visual data fall exactly into this category.

In this context, Generative Adversarial Networks (GAN), for example, are well suited for the synthetic generation of images.

In short, GANs consist of two Deep Learning structures, a Generator and a Discriminator. The task of the Generator is to build up synthetic images. The Discriminator is initially trained with the original real images. Based on what it learned, it classifies the (fake) images delivered by the Generator into either ‘fake’ or ‘real’.

The goal of the Generator is to produce fake images good enough that the Discriminator classifies them as ‘real’, while the goal of the Discriminator is to recognise ‘fake’ as ‘fake’. The feedback given by the Discriminator is used by the Generator to improve its fake/synthetic images, and the Discriminator uses the original images as well as the fake images built by the Generator to improve its own classification model.

See the following chart for an overview:

image by author

The point is that the Generator never has direct access to the original data during this process. Its synthesis is initially triggered by a random set of data which is in no way connected to the original data (also called white noise). At the end of the process, synthetic images are created which share the statistical properties of the original data but have no direct connection to them.

Hence, a re-tracking to the original data via the synthetic data is quite improbable.

The amount of data and the complexity of the feature space place additional requirements on the model infrastructure in order to keep the model computation and speed manageable. GPU-based instead of CPU-based modelling and cluster computation are some of the preconditions. The TensorFlow/Keras and Torch environments offer the necessary setting to do this.
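
For illustration, here is a minimal sketch of such a GAN with the keras package in R, following the classic generator-through-frozen-discriminator setup. The layer sizes, the flattened 28 x 28 image format and the latent dimension are assumptions made for the example, not a fixed recipe:

library(keras)

latent_dim <- 100   # dimension of the white-noise input vector

# Generator: maps random noise to a flattened 28 x 28 synthetic image
generator <- keras_model_sequential() %>%
  layer_dense(units = 128, activation = "relu", input_shape = latent_dim) %>%
  layer_dense(units = 28 * 28, activation = "sigmoid")

# Discriminator: classifies a flattened image as real (1) or fake (0)
discriminator <- keras_model_sequential() %>%
  layer_dense(units = 128, activation = "relu", input_shape = 28 * 28) %>%
  layer_dense(units = 1, activation = "sigmoid")

discriminator %>% compile(optimizer = "adam", loss = "binary_crossentropy")

# Combined model: the Generator is trained through the (frozen) Discriminator
freeze_weights(discriminator)
gan_input  <- layer_input(shape = latent_dim)
gan_output <- gan_input %>% generator() %>% discriminator()
gan <- keras_model(gan_input, gan_output)
gan %>% compile(optimizer = "adam", loss = "binary_crossentropy")

# Training then alternates between two steps per batch:
#   1. train the Discriminator on real images (label 1) and on
#      Generator output (label 0),
#   2. train the combined model on fresh noise with label 1,
#      which updates only the Generator.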

Conclusion

The employment of highly advanced machine learning and deep learning applications triggers a ubiquitous quest for data, as those engines are pretty data-hungry.

The availability of data, on the other hand, is too often limited by ethical and legal reasons as well as business considerations. In other cases, the variable of interest simply cannot be observed often enough to train the machine learning or deep learning models accurately.

The call for data democratisation, or attempts to pool original data from different sources and make it available to the public, more often than not fails due to the reasons stated above.

In contrast, synthetic data is a fast and accurate option to provide artificial intelligence tools with the necessary data stream.

Modern tool kits for synthesising data focus on the process of how the original data were created. As a result, the statistical properties of the original data are preserved without copying the data themselves. Tracing back to the original data via the synthetic data is therefore unlikely.

This fact, and the possible reinforcement by additional precautionary measures like Statistical Disclosure Control (SDC), makes synthetic data quite an alternative!

References

Attached are some references. There is a lot of really good content out there, and this is just a small snapshot covering the topic.

Synthetic Data by Surya_Nuchu published in Analytics Vidhya/ April 26, 2021

synthpop: Bespoke Creation of Synthetic Data in R by Beata Nowok, Gillian M Raab and Chris Dibben first published in Journal of Statistical Software/ October 28, 2016

Utility of synthetic microdata generated using tree-based methods by Beata Nowok/ 2015

Providing bespoke synthetic data for the UK Longitudinal Studies and other sensitive data with the synthpop package for R by Beata Nowok, Gillian M Raab and Chris Dibben published in the Statistical Journal of the IAOS 33 (2017) 785–796/ 2017

Reliability of Supervised Machine Learning Using Synthetic Data in Health Care: Model to Preserve Privacy for Data Sharing by Debbie Rankin et al. published in JMIR Med Inform/ July 20, 2020

Simulating data with Bayesian networks by Daniel Oehm published in Rbloggers/ October 15, 2019

Bayesian Network Example with the bnlearn Package by Daniel Oehm published in Rbloggers/ September 30, 2018

bnstruct: an R package for Bayesian Network Structure Learning with missing data by Francesco Sambo and Alberto Franzin published in Rbloggers/ May 22, 2020

dbnlearn: An R package for Dynamic Bayesian Structure Learning, Parameter Learning and Forecast by Robson Fernandes published on LinkedIn/ July 30, 2020

Create Data from Random Noise with Generative Adversarial Networks by Cody Nash published in Developers/ n.a.

Basic Idea Of Generative Adversarial Networks (GAN) Machine Learning: Torch in R by DKWC published in RPubs by RStudio/ June, 2021

Generating images with Keras and TensorFlow eager execution by Sigrid Keydana published in AI Blog/ August 26, 2018

TensorFlow for R from RStudio/ 2015–2020

Binary image classification using Keras in R: Using CT scans to predict patients with Covid by Olivier Gimenez published in Rbloggers/ January 1, 2022
