Pills of AI 101: The Synthetic Data

When data scarcity is not a problem

Jeremy Sapienza
Data Reply IT | DataTech
7 min read · May 15, 2023


Figure 1.0 — Fake image or not?

Have you ever needed a large amount of data to train your model? Have you ever had to ask project owners for online archives from which to collect the data and information your project needs?

These are the requests typically made to customers, or to members of your software development team, when a project lacks the data needed to train a machine learning model and still has to be delivered by its deadline. The consequences can be serious.

This is where the need to GENERATE emerges: completely new data that do not correspond to any existing record in the real world, but that are typically combined with real data to feed the training phase of our machine learning models.

In this article I will introduce you to SYNTHETIC DATA: a revolution in the AI era that is currently taking hold and that we will keep hearing about for the next 10, 20, 30 … years!

What is Synthetic Data?

Synthetic data is artificially generated data that mimics the statistical properties of real-world data. It is produced by algorithms or models that simulate the underlying structure of real data.

This type of data is often used in machine learning applications when real-world data is limited or sensitive, making it difficult to access or share. By generating synthetic data, researchers can create larger and more diverse datasets that can be used to train and test machine learning models without compromising the privacy of the individuals represented in the real-world data.

Synthetic data can be used to augment real-world data by adding more samples, creating additional variations or anomalies, or balancing the distribution of classes in the data.

Figure 1.1 — A GIF that illustrates synthetic samples made by Snapchat to create their filter faces - Synthetic Data for Machine Learning (snap.com)

How to generate them?

The power of today's algorithms is remarkable. Why not use that power to generate synthetic data?

Gartner, Inc., a well-known US-based research and advisory company, says that AI makes it possible to generate huge amounts of data, and it estimates that by 2024 around 60% of the data used to train AI models will be synthetically generated. Gartner also estimates that by 2030 synthetic data will completely overshadow real data in AI models.

Figure 1.2 — Gartner's projected statistics on synthetic data towards 2030

There are different techniques for generating synthetic data. We can produce it with methods like:

  • Random generators
  • Interpolation between real data
  • Perturbation of real data

Random Generators

There are different algorithms that allow us to produce new data with certain properties. For example, values can follow a predefined distribution (e.g. Normal, Uniform, Poisson, and more), or be drawn from a predefined set of possible numerical or categorical values.
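
As a minimal sketch of this idea (using NumPy, which is my own choice here and not one of the tools listed below), we can draw synthetic numerical and categorical values from predefined distributions:

import numpy as np

# Reproducible random generator (the seed is arbitrary)
rng = np.random.default_rng(seed=42)
n_samples = 1_000

# Numerical features drawn from predefined distributions
ages = rng.normal(loc=40, scale=12, size=n_samples)             # Normal
incomes = rng.uniform(low=20_000, high=90_000, size=n_samples)  # Uniform
daily_logins = rng.poisson(lam=3, size=n_samples)               # Poisson

# Categorical feature drawn from a predefined set of possible values
plans = rng.choice(["free", "basic", "premium"], size=n_samples, p=[0.6, 0.3, 0.1])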

Today the data science community has several libraries for generating data randomly. It is important to say that there is no single perfect generator that reproduces exactly what you want: depending on your needs, one tool may be better than another. There are libraries like Faker, Mimesis, pydbgen, PyOD, DataSynthesizer, Synthetic Data Vault (SDV) … and more!

For example, with Faker, one of the most widely used Python libraries for this purpose, just a few lines of code are enough to generate people profiles with fake information:

import pandas as pd
from faker import Faker

synthetic_obj = Faker()

# Generate five fake person profiles and load them into a DataFrame
person_data = [synthetic_obj.profile() for _ in range(5)]
dataframe = pd.DataFrame(person_data)
Figure 1.3 — Sample output of the code above

The major drawbacks here are that:

  • We don’t know from which distribution the data are generated
  • We cannot always guarantee that the values are truly anonymous
  • We have limited control over generating specific fields

… but, on the plus side, we can leverage the power of these libraries to get more data in a few minutes!

Moreover, even if you don't know anything about programming languages, you can use GUI tools that, in a few minutes, let you work with a pre-configured version of the data you want!

Interpolation between real data

Interpolation is the process of generating synthetic data points between two or more existing data points, typically drawn from the minority class (or, sometimes, the majority class).

Given a set of data points, interpolation algorithms can be used to estimate the values of the function at intermediate points that were not explicitly measured or observed.

This technique is widely used because it is based on the known distribution of your real data. Today there are many well-known algorithms, and variants of them, used in different fields, such as the ones below (a short code sketch of one of them follows the list):

  • SMOTE (Synthetic Minority Oversampling Technique)
  • VAE (Variational Autoencoder)
  • GAN (Generative Adversarial Network)
  • KDE (Kernel Density Estimation)
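
As an illustrative sketch of the last item in the list (my own example with SciPy's gaussian_kde, not a workflow prescribed in this article), a density can be fitted to real samples and new points drawn from it:

import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(seed=0)

# Pretend these are 500 real observations of two correlated numeric features
real_data = rng.multivariate_normal(mean=[0, 5], cov=[[1.0, 0.3], [0.3, 2.0]], size=500)

# Fit a kernel density estimate on the real data (SciPy expects shape (n_features, n_samples))
kde = gaussian_kde(real_data.T)

# Draw 1,000 synthetic points that follow the estimated density
synthetic_data = kde.resample(size=1_000).T
print(synthetic_data.shape)  # (1000, 2)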

Keep in mind that these algorithms can help in situations where we need to guarantee:

  • Privacy of the data
  • No bias in the data
  • A balanced dataset
  • More data

An example of an oversampling technique can be illustrated in this way:

Figure 1.4 — A SMOTE oversampling example

It is important to note that these algorithms help balance prediction performance: through a classification report we can inspect the performance on every class we handle during our processes. If a model seems to perform very well without any oversampling, that performance is not entirely trustworthy, because it is biased towards the majority class of an unbalanced dataset. Applying an oversampling method therefore gives a more realistic picture of the model's true performance.
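
As a minimal sketch (assuming a toy dataset built with scikit-learn and the SMOTE implementation from the imbalanced-learn library), oversampling the minority class looks like this:

from collections import Counter

from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

# Toy imbalanced dataset: roughly 90% of samples in class 0, 10% in class 1
X, y = make_classification(n_samples=1_000, n_features=10, weights=[0.9, 0.1], random_state=42)
print("Before SMOTE:", Counter(y))

# Interpolate new minority-class samples until the classes are balanced
X_resampled, y_resampled = SMOTE(random_state=42).fit_resample(X, y)
print("After SMOTE:", Counter(y_resampled))

As a general rule, oversampling should be applied only to the training split: training on X_resampled, y_resampled and then reading the classification report on an untouched test set gives a more honest view of per-class performance.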

Perturbation of real data

Data perturbation is a technique for generating synthetic data. It involves applying statistical algorithms to the original data to introduce random “noise” over the values, thus creating a modified version of the data that is still representative of the original but different enough to protect the privacy of individuals.

The major drawback is the risk of human error: we need to take care that the perturbed values still respect each feature's value distribution.
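
As a minimal sketch of this idea (my own illustration with NumPy and pandas, not a method prescribed in the article), zero-mean Gaussian noise can be added to each numeric column:

import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=7)

# Pretend this is the real (sensitive) numeric table
real_df = pd.DataFrame({
    "age": rng.integers(18, 80, size=200),
    "income": rng.normal(45_000, 10_000, size=200),
})

# Add zero-mean Gaussian noise scaled to 5% of each column's standard deviation,
# so the perturbed values stay close to the original distribution
noise = rng.normal(loc=0.0, scale=0.05 * real_df.std().to_numpy(), size=real_df.shape)
perturbed_df = real_df + noise

Checking that the perturbed columns still match the original distributions (for example by comparing means, standard deviations, or histograms) is exactly the manual step where the human error mentioned above can creep in.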

Ethical laws: GDPR compliance … and more

Ethical laws are increasingly important to guarantee the privacy of the information handled during our main processes.

Figure 1.5 — Some of the compliance bodies

For example, the GDPR (General Data Protection Regulation) and HIPAA (Health Insurance Portability and Accountability Act) are two sets of regulations that govern the use of personal data and protected health information, respectively.

These regulations set out ethical considerations on the use of synthetic data, such as:

  • synthetic data used must be sufficiently anonymized to protect the privacy of individuals.
  • synthetic data is generated using methods that preserve the statistical properties of the original data.
  • synthetic data must not be used to discriminate against individuals or groups.
  • synthetic data must be used only for the specific purposes for which it was generated.
  • synthetic data must be deleted or destroyed when it is no longer needed.

In this case, these few points are fully respected by synthetic data. At the same time, with the new AI era, LLMs will likely change some of these rules.

Colab DEMO (A Practical Example)

If you want a practical example of how to use synthetic data in a real-world scenario, or simply want a ready-to-run notebook, check out my compiled Colab notebook: https://colab.research.google.com/drive/1TeiCMkZe9B99e0OjZZH2WlTrvpCVPewc?usp=sharing

Pros & Cons

The main advantages of using synthetic data are:

  • Cost-effective
  • Privacy protection
  • Unlimited availability
  • Controlled Data

On the other hand, the disadvantages are:

  • Limited accuracy
  • Lack of diversity
  • Limited usefulness
  • Ethical concerns

Conclusion

  • Synthetic data has gained popularity in solving problems in the fields of machine learning and data science
  • Using synthetic data has numerous benefits but also some drawbacks like the risk of overfitting and potential bias in the synthetic data generation process
  • The future of AI with synthetic data will bring improvements in the techniques used to generate it, as well as new privacy laws on data security

Synthetic data can be a useful tool in the data science toolkit but should be utilized judiciously and with careful consideration of its strengths and weaknesses.

References

[1] Is Synthetic Data the Future of AI? — a Q&A with Alexander Linden, Gartner, Inc. (gartner.com)

[2] Top 10 Python Packages For Creating Synthetic Data — https://www.activestate.com/blog/top-10-python-packages-for-creating-synthetic-data/

[3] For Faker fast example — https://github.com/Sanjay-Nandakumar/Faker_Library/blob/main/Faker_library.ipynb

[4] Keras examples of generating data — https://keras.io/examples/generative/

[5] 5 SMOTE Techniques for Oversampling your Imbalance Data — https://towardsdatascience.com/5-smote-techniques-for-oversampling-your-imbalance-data-b8155bdbe2b5

[6] What Is Synthetic Data? — NVIDIA Blog
