
5 Things Data Scientists Should Know About Generative AI for Synthetic Data

Madelyn Goodman
Published in CodeX
8 min read · Mar 12, 2023


One thing is clear: data science teams routinely hit blocks when it comes to obtaining data. Strict data governance policies restrict access to the customer information needed to train advanced machine learning models that could be game changers for an organization. Protecting customers is key, but it comes at a price.

Data is also difficult to come by! Datasets are often biased — they have more information on certain population groups than others, don’t represent enough cases of a rare event, don’t provide enough power for machine learning models, or all of the above.

Synthetic data can be a solution to all of these woes.

It is natural for data scientists to be skeptical when they hear the term “synthetic data.” We are taught to be picky about the data we use to develop our models and make predictions. The first assumption of any statistical model is that the data was collected accurately and without error, because models are only as good as the data put into them.

Understanding the importance of quality data and the unique needs of data scientists working with synthetic data, I compiled a list of the 5 most important things data scientists should know about synthetic data. Hopefully these will help you see the advantages of folding synthetic data into your team’s workflow.

1. Synthetic data comes in many forms

The data you work with as a data scientist could range from text, to audio, to video, to images, to pure tabular data — and so does synthetic data. Choosing the most suitable data synthesis technique depends on a number of factors, including the type of data, the amount of existing data, and the intended use-case.

The first style of synthetic data can be described as rule-based (or simulation-based). Synthesis begins by specifying the rules of the data generation process: examples include rendering labeled training data for computer vision tasks or creating synthetic fraudulent financial transactions by simulating agents. Using this technique you can also simply specify the distributions of the fields in your data. Rule-based synthetic data has proven useful in computer vision and can be an excellent way to test application code when no usable data is available.
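To make this concrete, here is a minimal rule-based sketch in Python. Every column name, distribution, and rule below is a hand-picked assumption for illustration, not anything learned from real data.

```python
# A minimal rule-based synthesis sketch: all distributions and
# dependencies are hand-specified assumptions, not learned from data.
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=42)
n = 1_000

# Rule: ages are uniform between 18 and 79.
ages = rng.integers(18, 80, size=n)
# Rule: income rises loosely with age, plus lognormal noise.
incomes = 20_000 + ages * 800 + rng.lognormal(mean=9, sigma=0.5, size=n)
# Rule: roughly 1% of transactions are fraudulent, chosen at random.
is_fraud = rng.random(n) < 0.01

synthetic = pd.DataFrame({
    "age": ages,
    "income": incomes.round(2),
    "transaction_amount": rng.exponential(scale=85.0, size=n).round(2),
    "is_fraud": is_fraud,
})
print(synthetic.head())
```

Note how the fraud flag here is independent of every other column; that kind of oversimplification is exactly the drawback described next.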

The drawback of rule-based synthetic data generation is that it struggles to reflect the inherent complexity of the statistical properties of your data. Because the rules are defined manually, they often miss the subtleties of the relationships between features. For the same reason, this method is not suited to high-dimensional data. Finally, if the trends or distributions in the data change, the rules must be updated to match, which makes maintenance difficult.

Alright, rule-based synthetic data isn’t perfect for every scenario, so what’s the alternative? Our favorite friend of late: generative AI.


AI-generated synthetic data is created by deep learning models that learn the relationships and patterns of the original dataset. This enables generative models to produce data that more reliably reflects the complex and unique patterns in the original data.
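As an illustration, the sketch below uses the open-source SDV library to fit a GAN-based synthesizer on an existing table. The file path and DataFrame are placeholders, and the API shown reflects recent SDV releases, so treat it as a starting point rather than a definitive recipe.

```python
# Hypothetical sketch using the open-source SDV library (SDV 1.x API);
# "transactions.csv" is a placeholder for your own tabular data.
import pandas as pd
from sdv.metadata import SingleTableMetadata
from sdv.single_table import CTGANSynthesizer

real_df = pd.read_csv("transactions.csv")

# Describe the table so the synthesizer knows each column's type.
metadata = SingleTableMetadata()
metadata.detect_from_dataframe(real_df)

# Fit a GAN-based generative model on the real data...
synthesizer = CTGANSynthesizer(metadata, epochs=300)
synthesizer.fit(real_df)

# ...then sample as many synthetic rows as you need.
synthetic_df = synthesizer.sample(num_rows=len(real_df))
```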

If the generative model overfits to the original data, however, the synthetic data can end up too much like the real data, which compromises its privacy and calls for additional privacy-preservation techniques. Further, these powerful models take more compute to train and sample from than simple rule-based methods, and they require some existing data to train on.

Generative AI, however, does produce data that is more dynamic, changing as your data changes, and that looks, feels, and acts like your real data. The rest of this article discusses the key things to know about AI-generated data, as it is the fastest-growing type of synthetic data.

2. Synthetic data can capture the complexities of your data

Sure, synthetic data is fake, but it’s not random. Powerful deep neural networks can learn the nuances of the statistical properties of your data, and generative models replicate those properties to give you data that trains machine learning models as if they were trained on your real data.

This doesn’t just mean matching the distributions of your categorical and continuous variables, which generative models do extremely well; it also means accurately representing the relationships between features.

Take this sample data of average transaction amounts of customers from a Czech bank:

[Figure: distributions of average transaction amount in the real data vs. the synthetic data]

Clearly synthetic data can replicate the distributions of the real numeric data accurately. But what about the relationships between features in this banking dataset?

[Figure: correlations between numeric features in the real data vs. the synthetic data]

The strengths of the relationships between numeric variables in the real dataset are almost identical to those relationships in the synthetic dataset.
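If you want to check this on your own tables, a quick fidelity pass might look like the sketch below. It assumes real_df and synthetic_df are pandas DataFrames with matching columns, as in the earlier sketches.

```python
# Two quick fidelity checks, assuming `real_df` and `synthetic_df`
# share the same schema.
from scipy.stats import ks_2samp

numeric_cols = real_df.select_dtypes(include="number").columns

# 1. Compare each column's marginal distribution (smaller KS = closer).
for col in numeric_cols:
    stat, p_value = ks_2samp(real_df[col], synthetic_df[col])
    print(f"{col}: KS statistic={stat:.3f}, p={p_value:.3f}")

# 2. Compare pairwise relationships via correlation matrices.
corr_gap = (real_df[numeric_cols].corr()
            - synthetic_df[numeric_cols].corr()).abs()
print("max correlation difference:", corr_gap.to_numpy().max())
```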

Because these statistical properties are reflected in the generated data, it can then go on to train machine learning models that yield almost identical results to models trained on real data.

[Figure: performance of models trained on real vs. synthetic data across training durations]

Adjusting only how long the models train, and no other parameters, you can see that the synthetic data gives you results increasingly close to what your real data would give. Because the synthetic data captures the behavior of your real data, you can use these results to inform actual business insights.
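A common way to quantify this is a “train on synthetic, test on real” comparison: fit the same model on each dataset and score both on held-out real data. The sketch below assumes numeric features and a binary column named target; both names are placeholders.

```python
# Train-on-synthetic, test-on-real (TSTR) sketch with scikit-learn.
# Assumes numeric features and a binary "target" column (placeholders).
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X_real = real_df.drop(columns="target")
y_real = real_df["target"]
X_train, X_test, y_train, y_test = train_test_split(
    X_real, y_real, test_size=0.3, random_state=0)

datasets = {
    "real": (X_train, y_train),
    "synthetic": (synthetic_df.drop(columns="target"),
                  synthetic_df["target"]),
}
# Both models are scored on the same held-out slice of *real* data.
for name, (X_tr, y_tr) in datasets.items():
    model = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
    auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    print(f"trained on {name} data: AUC = {auc:.3f}")
```

If the two AUC scores land close together, the synthetic data is doing its job for this modeling task.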

3. Synthetic data can address biases in your datasets

Unsatisfied with the distributions of features in your real data? Synthetic data can help with that too. Because a generative model captures the natural relationships between data points and features, you can generate additional records with the specific properties you need. This opens up the possibility of debiasing models by balancing imbalanced classes and/or addressing representation bias.

Class imbalance within datasets is especially problematic when you are trying to predict rare events such as fraud or customer churn. Synthetic data helps make these predictive algorithms more robust by providing more examples of these rare events for models to learn from. Because the statistical properties of your data hold up in the generated dataset, the new data points reflect the specific and unique characteristics of customers who churned or of fraud events.

Further, datasets containing information collected from customers can suffer from representation bias, an issue especially common in survey data. Customers in certain demographic groups often respond to surveys more frequently than customers in others, leaving you drawing insights that may not hold for some groups of your customers. With synthetic data, you can balance out these characteristics. Say you have noticed that fewer women than men answer your survey: you can train a generative model on the entire dataset and sample additional records from women to augment your data, amplifying those voices, as in the sketch below.
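With a library like SDV, both kinds of rebalancing can be done through conditional sampling. The sketch below is hypothetical: the gender and churned column names are placeholders, and synthesizer is the fitted model from the earlier sketch.

```python
# Hypothetical conditional-sampling sketch with SDV; `synthesizer` is
# the fitted model from before, and the column names are placeholders.
import pandas as pd
from sdv.sampling import Condition

# Generate extra records for an underrepresented group...
more_women = synthesizer.sample_from_conditions(
    [Condition(column_values={"gender": "F"}, num_rows=500)])

# ...and for a rare event class such as churn.
more_churners = synthesizer.sample_from_conditions(
    [Condition(column_values={"churned": 1}, num_rows=500)])

balanced_df = pd.concat(
    [real_df, more_women, more_churners], ignore_index=True)
```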

4. Synthetic data can remove barriers to data access

Keeping customer data safe should always be a top priority for your organization. Unfortunately, this comes at the expense of efficient workflows. Interdisciplinary work speeds up innovation, and not being able to share data between teams and with trusted partners can cause bottlenecks that halt progress. Further, sharing results and insights widely throughout your organization makes a big difference in the quality of work you as a data scientist can produce.

Synthetic data allows you to share customer information… without sharing customer information. By generating data that looks and feels like your customers’ data, but isn’t, you can safely share this data within your org with peace of mind.

Data democratization doesn’t have to mean only freely sharing information within your own organization, either. You can live out your values of transparency and collaboration by including this synthetic data alongside your open source code so the developer community can reproduce your results with ease and build on your work. The same goes for external partners that supplement your data science teams: with synthetic data you can bring in experts to help build out your machine learning infrastructure or reporting mechanisms without all the red tape.


This game-changing use case of course requires a lot of trust in your synthetic data solution’s ability to generate data that can’t be traced back to your customers while remaining just as robust as the real data. Data synthesizers can use a variety of methods to achieve this, ranging from simple noise injection to deep learning frameworks. As the field continues to grow, new tools are being developed to assess the privacy of the data these models produce, and new methods are being developed to generate data that is protected against attacks.
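One simple check you can run yourself is distance to closest record (DCR): measure how close each synthetic row sits to its nearest real row, since near-duplicates hint that the model memorized training records. The sketch below is illustrative, reusing the DataFrames from the earlier sketches and assuming numeric features; the 0.01 cutoff is an arbitrary assumption, not a standard threshold.

```python
# Distance-to-closest-record (DCR) privacy sanity check.
# Synthetic rows that sit on top of real rows suggest memorization.
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

numeric_cols = real_df.select_dtypes(include="number").columns
scaler = StandardScaler().fit(real_df[numeric_cols])
real_scaled = scaler.transform(real_df[numeric_cols])
synth_scaled = scaler.transform(synthetic_df[numeric_cols])

# For each synthetic row, find the distance to its nearest real row.
nn = NearestNeighbors(n_neighbors=1).fit(real_scaled)
distances, _ = nn.kneighbors(synth_scaled)

print("median DCR:", np.median(distances))
# The 0.01 cutoff below is an illustrative assumption.
print("share of near-copies:", (distances < 0.01).mean())
```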

5. Synthetic data can’t do it all

While synthetic data can help remove many barriers in your workflows, sadly it is not the be-all and end-all solution to every problem your data science team faces.

As mentioned previously, data is hard to come by, especially quality data. It would be really nice if there were a simple, intuitive way to create the data you need but don’t have. Unfortunately, synthetic data is not that solution: data cannot be created from nothing.

Generative models need an existing dataset to learn patterns from, and this limits the possible use cases. If your organization is trying to supplement its real data with information on a population that isn’t represented in your data at all, or you are trying to figure out what would have happened if you hadn’t implemented a new feature, there is unfortunately nothing for the generative model to learn from. It’s artificial intelligence; there needs to be something to know.

There also needs to be enough to know. Generative models require enough information to learn from to produce quality data. You can absolutely supplement your datasets with synthetic data; however, if you have an especially small dataset, say fewer than 1,000 records, the synthetic data produced from it won’t be reliable enough to represent the additional records needed for your analysis.

Don’t be late to the synthetic data game

Synthetic data has the potential to unlock innovation at your organization. From addressing biases in your data to supporting data democratization without sacrificing governance, you will see your workflows transformed. While there are several open-source solutions, integrated platforms can also be extremely useful.

Don’t shy away from synthetic data. Give it a try, and I’m sure you’ll find your team’s productivity skyrocket.


Madelyn Goodman
Writer for CodeX

Data Science Evangelist at Tonic.ai - synthetic data - data augmentation - machine learning - AI 🐦 @MadelynatTonic