Synthetic data — is it useful?

In the data science, machine learning and artificial intelligence fields, there are always hype trains rolling by, with new and innovative buzzwords being whispered here and there. For quite some time now, one of those topics has been synthetic data. What is it, you may ask? My goal is to answer the questions I had when I first started researching this topic.

In my everyday work at SIFR, I have mostly been working with time series analysis and forecasting to give companies a competitive advantage in multiple fields. We believe that most data-related issues and predictions can be solved with simple models. Unless you are going after a 99.9% accuracy rate, there is no need for super complicated machine learning solutions that require heaps of time and, in reality, do not increase profit by more than 0.01%. If we can solve an issue with less time and less cost, and the benefit for the company is considerable, we are happy and the client is happier. Since in real life most basic client needs can be met using simpler ML solutions, it is interesting to sometimes dive into topics that are not part of my everyday work and that let me research and try new things. This time it was synthetic data.

The topic of synthetic data has been buzzing around the machine learning community for a few years now, but the topic itself is much older. Big companies that have access to vast amounts of data are in a lucky position to do exciting projects, compared to smaller companies or regular data science enthusiasts like you and me. We have to find other ways to mine data to fuel the projects and ideas we have at hand. This is where synthetic data comes into play. Or at least I thought it would come into play.

Firstly, synthetic data is not completely fake data, and it is not random or simulated data. The term itself suggests that it is not naturally available in the real world but has its roots in real-world values, as if it were synthesized from them. The term is debatable, since random data generators sometimes use it to describe their automatically generated random datasets, but in the machine learning field “Synthetic data is information that’s artificially manufactured rather than generated by real-world events. Synthetic data is created algorithmically, and it is used as a stand-in for test datasets of production or operational data, to validate mathematical models and, increasingly, to train machine learning models.” (Read more)

The hype has increased in recent years thanks to the obvious increase in processing power and storage capabilities, which also makes it possible to test new algorithms faster. This enables the development of new methods that can be used to create more data for analysts, engineers and scientists to use in their work. Giving people more data to work with is not the only use case of synthetic data. Since synthetic data follows the distribution and other attributes of the original raw data, it allows personally identifiable datasets to be transformed into anonymized datasets that still show the natural relationships between the variables. This can be useful, for example, in the medical and banking fields, where anonymity is extremely important, but so is having a lot of data to do better analysis and fit better models.

As mentioned before, synthetic datasets give machine learning engineers more data to train models on and to validate the results. They also let us test how systems behave when completely new, out-of-the-ordinary data is used. Synthetic data enables more efficient natural language processing and is used to train models in the autonomous vehicle industry and in other image processing fields. From another perspective, synthetic data can be used to create all sorts of visual and sound forms, merging reality and algorithms, letting people look at things from a new perspective and enhancing our creativity. Besides the wide variety of use cases, one major benefit is definitely the time and cost saved by not having to manually label the data or find the real data.

By now, I hope you grasp the main benefits of synthetic data. When I first started to research the topic, I assumed that there would be thousands of examples available that I could put to use in a matter of seconds to generate more data. That, unfortunately, is not the case. And this brings me to the disadvantages or, better said, the minor and not so minor inconveniences of synthetic data that prevent us from using it at scale and slow down its creation.

First of all, please do not get the impression that, with synthetic data, you do not need any real data beforehand. That is wrong, as you still need the original data to generate the synthetic data from. Depending on the issue at hand, you might still need a lot of data to start creating synthetic data. Besides the amount of data you have, you also have to keep in mind that synthetic data is highly dependent on the model that created it. While the model aims to imitate the original data, some parameters or specifications might be left out of it, which leads to poor synthetic data.

This adds another layer of complexity to the topic, since it is difficult to track all the relations between features that would be necessary to generate valid and usable synthetic data. It also means you cannot be 100% sure how newly generated synthetic data will behave, even when it closely follows the original data. Minor differences could distort the actual outcome a lot. All this makes creating synthetic data difficult. It requires a lot of software engineering skills, sometimes image processing and text analytics skills, and general data science knowledge, depending on the problem and project you are currently working on.

After gaining some insight into the benefits and liabilities of synthetic data, how do you actually create it? I stumbled upon a question from someone who asked what the general approaches to creating synthetic data are, and the answers were mostly that there is no general approach: you sort of have to try, experiment and see how it turns out. Since each synthetic dataset is really domain specific, there is no “easy way out” approach to creating more data that suits the needs of everyone.

Broadly speaking, there are two methodologies in the field of creating synthetic data. Both of them include various ideas and techniques and can be implemented in many different ways.

1. Distribution-based data creation

This method assumes that you have original data and want to add more of it or replace the original data for security and personal data reasons. The initial concept is quite simple. You would have to find the probability distribution of each feature and then feed those probabilities into a random generator. That sounds super easy when you have one feature. The complexity appears when your data has more than one variable and more than one association that needs to be matched to create workable synthetic data. Then you would be dealing with a joint probability distribution for a set of variables. Numerical and categorical data also need different treatment: for numerical data you would have to take into account the mean and the standard deviation of the values, not only the probabilities and associations between features.
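To make the idea more concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the toy dataset, the feature names and the choice to model the numeric feature as a normal distribution within each category. It estimates the category probabilities, then the mean and standard deviation of the numeric feature per category, and samples new rows so that the association between the two features is kept.

```python
import random
import statistics
from collections import Counter, defaultdict

# Hypothetical original data: (customer_segment, monthly_spend) rows.
original = [
    ("retail", 120.0), ("retail", 95.0), ("retail", 130.0), ("retail", 110.0),
    ("business", 900.0), ("business", 1100.0), ("business", 1000.0),
]


def fit(rows):
    """Estimate category probabilities and per-category mean/std of the numeric feature."""
    counts = Counter(category for category, _ in rows)
    total = sum(counts.values())
    probs = {category: n / total for category, n in counts.items()}

    values = defaultdict(list)
    for category, value in rows:
        values[category].append(value)
    # Assumption: within one category the numeric feature is roughly normal.
    stats = {c: (statistics.mean(v), statistics.stdev(v)) for c, v in values.items()}
    return probs, stats


def sample(probs, stats, n, rng):
    """Draw n synthetic rows: first the category, then the value given that category."""
    categories = list(probs)
    weights = [probs[c] for c in categories]
    rows = []
    for _ in range(n):
        category = rng.choices(categories, weights=weights, k=1)[0]
        mean, std = stats[category]
        rows.append((category, rng.gauss(mean, std)))
    return rows


rng = random.Random(42)  # fixed seed so the run is repeatable
probs, stats = fit(original)
synthetic = sample(probs, stats, 1000, rng)
print(synthetic[:3])
```

Sampling the value conditional on the category is the simplest way to respect a joint distribution of two features. With more features, the number of combinations to model grows quickly, which is exactly where the complexity described above comes from.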

This is the reason why the creation of synthetic data is getting the hype but, unfortunately, is not as widely used as it could be. Every single detail of the dataset adds complexity to generating it automatically. A lot of research has been done on the topic and different solutions have been proposed, but a simple implementation that could be used in many cases and on different datasets has yet to be published.

2. Agent-based modelling

Firstly, agent-based models (ABM) are models that are built up of individual decision-making entities called agents and the relationships between those agents. In the field of machine learning and synthetic data, ABM can incorporate neural networks or other learning techniques to enable realistic learning and adaptation. One layer underneath agent-based models can be summed up under the term generative models. The three best-known approaches from that segment are generative adversarial networks (GANs), variational autoencoders (VAEs) and autoregressive models.
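Before turning to the generative models, the basic agent idea can be sketched in a few lines of Python. This is a toy under stated assumptions, not a recipe: the `Customer` agent, its budget and buying probability, and the shared price rule are all hypothetical. Each agent makes its own decision every day, and the agents influence each other through a price that rises with demand. The purchase log that comes out is the synthetic dataset.

```python
import random
from dataclasses import dataclass


@dataclass
class Customer:
    """One agent with its own budget and its own tendency to buy."""
    agent_id: int
    budget: float
    buy_probability: float

    def step(self, day, price, rng):
        """Decide whether to buy today; return a synthetic record or None."""
        if self.budget >= price and rng.random() < self.buy_probability:
            self.budget -= price
            return {"day": day, "agent_id": self.agent_id, "amount": price}
        return None


def simulate(n_agents=50, n_days=30, base_price=10.0, seed=0):
    """Run the toy market and return the list of purchase records."""
    rng = random.Random(seed)
    agents = [
        Customer(i, budget=rng.uniform(50, 200), buy_probability=rng.uniform(0.05, 0.3))
        for i in range(n_agents)
    ]
    records = []
    price = base_price
    for day in range(n_days):
        purchases_today = 0
        for agent in agents:
            record = agent.step(day, price, rng)
            if record is not None:
                records.append(record)
                purchases_today += 1
        # Agents influence each other through a shared price:
        # more demand today makes tomorrow more expensive for everyone.
        price = base_price * (1 + 0.5 * purchases_today / n_agents)
    return records


records = simulate()
print(len(records), records[:2])
```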

Generative adversarial networks are unsupervised machine learning algorithms that consist of two neural networks, where one is generative and the other discriminative. This means that the generative network produces images for the discriminative network to assess and label as real or fake. This method also requires an initial dataset for the discriminative network to learn from, so that it understands what is real and what is not.
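The tug of war between the two networks can be written down as two loss functions. The sketch below assumes the common binary cross-entropy formulation with the so-called non-saturating generator loss. It leaves out the networks themselves and only takes the discriminator's scores, meaning its estimated probability that a sample is real, as plain lists of numbers.

```python
import math


def discriminator_loss(real_scores, fake_scores):
    """Low when real samples get scores near 1 and generated samples get scores near 0."""
    real_term = sum(-math.log(p) for p in real_scores) / len(real_scores)
    fake_term = sum(-math.log(1.0 - p) for p in fake_scores) / len(fake_scores)
    return real_term + fake_term


def generator_loss(fake_scores):
    """Low when the discriminator scores generated samples as if they were real."""
    return sum(-math.log(p) for p in fake_scores) / len(fake_scores)


# A confident discriminator: real images score high, generated ones score low.
print(discriminator_loss([0.9, 0.8], [0.1, 0.2]), generator_loss([0.1, 0.2]))

# A fooled discriminator: it can only guess, so every score is 0.5.
print(discriminator_loss([0.5, 0.5], [0.5, 0.5]), generator_loss([0.5, 0.5]))
```

During training, the discriminator is updated to push its loss down while the generator is updated to push its own loss down, so an improvement for one network is a setback for the other.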

While GANs generate whole images, variational autoencoders take a different approach. Autoencoders are a type of artificial neural network that learns efficient data codings in an unsupervised manner and is able to reproduce the original data by decompressing those codings into something that closely matches the original. A VAE makes it possible to learn from a set of predefined data (images or text) and modify it slightly, in a desired direction, to generate the output (new data). The VAE tries to find continuity between the features of the input data and generates something that lies in between two true values, while exploring variations of the data you already have.

Example of using a VAE on the MNIST dataset
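Three small pieces of a VAE can be shown without a trained network. The sketch below is an illustration under assumptions: the encoder and decoder are left out, and the latent codes are plain Python lists. It shows how a latent code is sampled from the encoder's output (the reparameterization step), the penalty that keeps codes close to a standard normal distribution, and how codes "in between two true values" are produced by interpolation.

```python
import math
import random


def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps with eps drawn from a standard normal."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, log_var)]


def kl_to_standard_normal(mu, log_var):
    """KL divergence between N(mu, sigma^2) and N(0, 1), summed over latent dimensions."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, log_var))


def interpolate(z_a, z_b, steps):
    """Evenly spaced latent codes from z_a to z_b (steps must be at least 2)."""
    return [
        [a + (b - a) * t / (steps - 1) for a, b in zip(z_a, z_b)]
        for t in range(steps)
    ]


rng = random.Random(0)
z = reparameterize([0.5, -1.0], [-2.0, -2.0], rng)  # hypothetical encoder output
print(z, kl_to_standard_normal([0.5, -1.0], [-2.0, -2.0]))

# Hypothetical codes of two real inputs; a decoder would turn each step into new data.
for code in interpolate([0.0, 0.0], [2.0, 4.0], 3):
    print(code)
```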

The third method is autoregressive models. I know autoregressive models from time series analysis and forecasting, where future time steps are predicted based on previous time steps. Generating images works in a similar way. The network is trained to generate each pixel based on the previous pixels that are above and to the left of the current pixel.
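The pixel-by-pixel idea can be illustrated without a neural network. The sketch below is a deliberately simplified stand-in: instead of a trained network, it uses a count table with Laplace smoothing, it only looks at the immediate left and upper neighbour, and it works on tiny hypothetical black-and-white images. It learns the probability that a pixel is on given those two neighbours and then generates a new image in reading order.

```python
import random
from collections import defaultdict


def context(image, row, col):
    """The already known neighbours: (left pixel, pixel above), 0 outside the image."""
    left = image[row][col - 1] if col > 0 else 0
    above = image[row - 1][col] if row > 0 else 0
    return (left, above)


def fit(images):
    """Estimate P(pixel = 1 | left, above) by counting over the training images."""
    ones = defaultdict(int)
    totals = defaultdict(int)
    for image in images:
        for r, line in enumerate(image):
            for c, pixel in enumerate(line):
                ctx = context(image, r, c)
                ones[ctx] += pixel
                totals[ctx] += 1
    # Laplace smoothing keeps contexts that were never seen at probability 0.5.
    all_contexts = [(0, 0), (0, 1), (1, 0), (1, 1)]
    return {ctx: (ones[ctx] + 1) / (totals[ctx] + 2) for ctx in all_contexts}


def generate(probs, height, width, rng):
    """Sample pixels in reading order, each conditioned on the pixels generated before it."""
    image = [[0] * width for _ in range(height)]
    for r in range(height):
        for c in range(width):
            image[r][c] = 1 if rng.random() < probs[context(image, r, c)] else 0
    return image


# Hypothetical training set: two small images with horizontal stripes.
training_images = [
    [[1, 1, 1, 1], [0, 0, 0, 0], [1, 1, 1, 1], [0, 0, 0, 0]],
    [[0, 0, 0, 0], [1, 1, 1, 1], [0, 0, 0, 0], [1, 1, 1, 1]],
]
pixel_probs = fit(training_images)
for line in generate(pixel_probs, 4, 4, random.Random(7)):
    print(line)
```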

In addition to the distribution-based and agent-based models, there are of course many other ways to create synthetic data. These would probably go under the title “Experimental”. Many companies offer solutions that automatically generate image data for machine learning models, and I am sure many more are to come.

Screenshot from Greppy Metaverse article

As mentioned before, there is no right or wrong way to generate synthetic data. There are just some key takeaways to keep in mind when you look into creating it. Generating synthetic data is not easy. It requires a lot of research, data preparation, analysis and skill to generate data that suits your needs specifically. Let’s hope that the convenience of using synthetic data increases soon and that some open source solutions arise.

SIFR is a company focused on empowering your data using artificial intelligence and big data. We pride ourselves on finding real-world business value in AI solutions, and we help companies set up data science projects, build AI teams and develop an AI strategy. Visit us @