
Can Secure Synthetic Data Be Useful?

A comparative analysis

Madelyn Goodman
7 min read · Mar 3, 2023


Strong data governance is not only about security; it's also about accessibility and accuracy. While your customers are trusting you to keep their information secure, your data science teams need data to build and validate new models with this valuable yet sensitive information.

The latest in generative AI can produce data that is statistically similar enough to your production data for Data Scientists and ML Engineers to build with and push your company’s goals forward. At the same time, synthetic data should be distinct enough from your customers’ identifiable information to meet your data governance standards and protect your customers’ privacy. I look at how the synthetic data generated in Tonic Data Science Mode (DSM) balances these priorities.

The generative models in DSM are built with deep neural networks that train on your source data to accurately approximate its distributions. The variational autoencoders (VAEs) used to build these generative models are regularized to prevent memorization of the dataset. What you get is a completely new dataset that is distinct from your source data but has the same statistical properties.
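To make the idea concrete, here is a minimal, illustrative sketch of a variational autoencoder for numerically encoded tabular data in PyTorch. This is not Tonic's implementation; the layer sizes, latent dimension, and KL weight are assumptions I chose for illustration. The key part is the KL divergence term, which regularizes the latent space and discourages the model from memorizing individual records.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TabularVAE(nn.Module):
    """Illustrative VAE for numerically encoded tabular data (not Tonic's implementation)."""

    def __init__(self, n_features: int, latent_dim: int = 16):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_features, 64), nn.ReLU())
        self.to_mu = nn.Linear(64, latent_dim)      # mean of the latent distribution
        self.to_logvar = nn.Linear(64, latent_dim)  # log-variance of the latent distribution
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 64), nn.ReLU(), nn.Linear(64, n_features)
        )

    def forward(self, x):
        h = self.encoder(x)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization trick
        return self.decoder(z), mu, logvar

def vae_loss(x, x_hat, mu, logvar, beta: float = 1.0):
    # Reconstruction term: how well the decoder reproduces the input rows
    recon = F.mse_loss(x_hat, x, reduction="mean")
    # KL term: pulls the latent distribution toward a standard normal,
    # acting as the regularizer that discourages memorizing individual records
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + beta * kl

# Usage: loss = vae_loss(x, *model(x))
```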

More training iterations will generally result in data that is more similar to the real data, compromising the security of the synthetic data. This poses a natural challenge: the more closely your synthetic data behaves like your production data, the less able it is to protect the identities of your customers.

In this post I demonstrate that data from DSM performs comparably to real data on a classification task while safeguarding the privacy of the original dataset.

Investigating the Utility and Privacy of Data from Tonic Data Science Mode

I conduct an empirical analysis of the privacy-utility trade-off using the UCI Adult dataset. The data contain 15 columns of information on people's employment, education, socioeconomic status, and finances, as well as demographic information such as marital status, sex, and race, and come split into a training set of 32,561 records and a held-out test set of 16,281 records. This dataset is commonly used to benchmark synthetic tabular data.
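If you want to follow along, the dataset can be pulled straight from the UCI repository. The column names below are my own labels for the 15 fields; note that the test file ships with an extra first line and trailing periods on its labels.

```python
import pandas as pd

# Labels for the 15 fields in the UCI Adult dataset
COLUMNS = [
    "age", "workclass", "fnlwgt", "education", "education_num",
    "marital_status", "occupation", "relationship", "race", "sex",
    "capital_gain", "capital_loss", "hours_per_week", "native_country", "income",
]

BASE = "https://archive.ics.uci.edu/ml/machine-learning-databases/adult"

# 32,561 training records
train = pd.read_csv(f"{BASE}/adult.data", names=COLUMNS, skipinitialspace=True)

# 16,281 held-out test records; skip the header-like first line
test = pd.read_csv(f"{BASE}/adult.test", names=COLUMNS, skipinitialspace=True, skiprows=1)

# The test file writes labels with a trailing period ("<=50K."); normalize them
test["income"] = test["income"].str.rstrip(".")
```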


To investigate the effect of increasing training iterations on the utility and privacy of the generated data, I train several models in DSM for varying numbers of epochs.

Utility Measurements

To measure the utility of the synthetic data generated in DSM, I examine the performance of classification models trained to predict whether a person’s annual income exceeds $50,000 from their demographic characteristics. I fit the models to both the original dataset and the synthetic datasets. ROC AUC scores are calculated on the held out test set to evaluate the performance of these binary classification models. I compare the scores of the models built with data from each generative model configuration in DSM with the score of the model trained on the original dataset to assess utility.
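Here is a minimal sketch of that evaluation, assuming the training data (real or synthetic) and the held-out test set are loaded as pandas DataFrames with an income column, as in the loading snippet above. The random forest is an illustrative choice of classifier, not necessarily the model used in this analysis.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

def utility_auc(train_df: pd.DataFrame, test_df: pd.DataFrame,
                label: str = "income", positive: str = ">50K") -> float:
    """Fit a classifier on train_df and return its ROC AUC on the held-out test_df."""
    X_train, y_train = train_df.drop(columns=[label]), train_df[label] == positive
    X_test, y_test = test_df.drop(columns=[label]), test_df[label] == positive

    # One-hot encode the categorical columns; pass numeric columns through unchanged
    categorical = X_train.select_dtypes(include="object").columns.tolist()
    encode = ColumnTransformer(
        [("cat", OneHotEncoder(handle_unknown="ignore"), categorical)],
        remainder="passthrough",
    )
    model = make_pipeline(encode, RandomForestClassifier(n_estimators=200, random_state=0))
    model.fit(X_train, y_train)
    return roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])

# Same real test set, different training data
# real_auc = utility_auc(train, test)
# synthetic_auc = utility_auc(synthetic_train, test)  # synthetic_train: data sampled from DSM
```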

Privacy Measurements

When I refer to the “privacy” of synthetic data, what I am really concerned about is the ability of an attacker to re-identify individuals from the synthetic data. If the synthetic data points are too similar to the real data, this hypothetical attacker has an easy job.

To quantify how similar the synthetic data is to the original data, I use a distance to closest record (DCR) calculation. First, for each record in the original dataset, I calculate the Euclidean distance to its nearest neighboring real record (the real-real DCR). Next, I calculate the distance from each synthetic data point to its closest real data point (the synthetic-real DCR), as illustrated below.

The greater the synthetic-real DCR, the more distinct the synthetic data point is from its closest real data point, the more difficult re-identification would be, and the more "private" the data point is. Conversely, a synthetic-real DCR at or near 0 means that the synthetic data point is effectively a copy of a real data point, offering little to no privacy protection.

The challenge with interpreting synthetic-real DCRs is that, by design, synthetic data points lie within the same range as the original data, making it likely that they fall relatively close to some real record. Comparing the distribution of synthetic-real DCRs with the distribution of real-real DCRs tells us whether the synthetic data is further from the real data than the real data is from itself. If it is, the synthetic data is more private than a copy or a simple perturbation of the real data.
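Below is a minimal sketch of the DCR calculation using scikit-learn's nearest-neighbor search, assuming the real and synthetic tables have already been encoded into numeric arrays (for example, one-hot encoded and scaled). This is an illustration of the idea, not the function from the Tonic reporting library mentioned below.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def dcr(real: np.ndarray, synthetic: np.ndarray):
    """Return (real-real DCRs, synthetic-real DCRs) as Euclidean distances."""
    nn_real = NearestNeighbors(n_neighbors=2).fit(real)

    # Real-real: nearest *other* real record, so skip the zero-distance self-match
    real_dists, _ = nn_real.kneighbors(real)
    real_real = real_dists[:, 1]

    # Synthetic-real: nearest real record to each synthetic record
    syn_dists, _ = nn_real.kneighbors(synthetic, n_neighbors=1)
    syn_real = syn_dists[:, 0]

    return real_real, syn_real
```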

Below are two histograms showing the DCR distributions for data generated by DSM models trained for 1,500 epochs and for 1 epoch. In the 0–0.05 bin, the model trained for 1 epoch produces about 5,000 fewer synthetic-real DCRs than there are real-real DCRs. This model also produces more records at longer distances from the real data than the real records are from each other. The 1,500-epoch model, by contrast, has more synthetic records at the low end of the DCR scale and tracks the real-real DCR distribution more closely, so I conclude that the data from the 1-epoch model is more private.

The function to calculate this metric is found in the Tonic reporting library. For the purposes of this empirical analysis, I summarize this metric for each dataset by taking the median DCR.
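Given the two distance arrays from the sketch above, summarizing and visualizing them takes only a few lines. The `real_encoded` and `synthetic_encoded` names are placeholders for the encoded arrays, and the 0.05 bin width simply mirrors the histograms described above.

```python
import matplotlib.pyplot as plt
import numpy as np

real_real, syn_real = dcr(real_encoded, synthetic_encoded)  # arrays from the sketch above

print("median real-real DCR:     ", np.median(real_real))
print("median synthetic-real DCR:", np.median(syn_real))

# Overlay the two DCR distributions
bins = np.arange(0, max(real_real.max(), syn_real.max()) + 0.05, 0.05)
plt.hist(real_real, bins=bins, alpha=0.5, label="real-real DCR")
plt.hist(syn_real, bins=bins, alpha=0.5, label="synthetic-real DCR")
plt.xlabel("Distance to closest real record")
plt.ylabel("Count")
plt.legend()
plt.show()
```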

Results

I sample data from 50 models in DSM trained for 1–50 epochs. Each model is trained 20 times to control for randomness in the neural networks.

The lines represent the median value from each configuration and the clouds represent the interquartile range of the 20 runs at each epoch
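As a rough sketch of how such a median-and-IQR plot can be produced, assuming the per-run results are collected in a dictionary mapping each epoch count to its 20 run scores (the `results_by_epoch` name and structure are my own):

```python
import matplotlib.pyplot as plt
import numpy as np

def plot_median_iqr(results_by_epoch: dict, ylabel: str):
    """results_by_epoch: {epochs: [metric value from each of the 20 runs]}"""
    epochs = sorted(results_by_epoch)
    runs = np.array([results_by_epoch[e] for e in epochs])  # shape: (n_epochs, n_runs)

    median = np.median(runs, axis=1)
    q1, q3 = np.percentile(runs, [25, 75], axis=1)

    plt.plot(epochs, median, label="median")                  # the line
    plt.fill_between(epochs, q1, q3, alpha=0.3, label="IQR")  # the "cloud"
    plt.xlabel("Training epochs")
    plt.ylabel(ylabel)
    plt.legend()
    plt.show()
```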

There is an initial drop-off in synthetic-real median DCRs within the first ten epochs. As the DSM model trains from 1 to 5 epochs, the distance between the generated data points and their closest real data points decreases sharply, declines more slowly between five and ten epochs, and then levels out. The IQR of these values at each epoch is very narrow, indicating little variation from run to run.


The ROC AUC scores respond to increasing epochs even earlier than the median DCRs do, with an initial spike in scores around three epochs. After about eight epochs, the scores level out as well. There is more variability in scores from run to run than in median DCRs, as shown by the IQR cloud.

Because median DCR and ROC AUC values level out at such a low number of epochs, I test whether this trend holds for larger numbers of iterations. To examine the trends at the upper end of the epoch scale, I train DSM models on the census data for 1, 150, 300, 450, 600, 750, 900, 1050, 1200, 1350, and 1500 epochs.

After 150 epochs, the increase in data utility slows down, as does the decrease in data privacy. The ROC AUC of the synthetic data approaches the real-data baseline, but the leveling off of this metric suggests that changes to the model architecture, such as adding layers or features, would be necessary to close the remaining gap. Likewise, while the median DCR remains above the real-real baseline no matter how many epochs the model trains for, increasing model capacity in this way could narrow that gap as well.


The picture I paint here suggests that my hypothesis (that, given a fixed model architecture, the longer a DSM model trains, the more similar its output will be to the original dataset as measured by median DCR, and thus the less private it will be) is not fully true: similarity does increase with training, but it plateaus well before the synthetic data becomes a copy of the real data. I do find evidence of a utility-privacy tradeoff here, one that would be interesting to explore further with varying model architectures.

How Tonic Data Science Mode can optimize the utility-privacy tradeoff of synthetic data

Using synthetic data can be a great way to democratize the data in your org without compromising your values or responsibilities to your customers when it comes to maintaining the privacy of their data.

DSM models use advanced AI technology to learn the unique statistical structure of your data and generate synthetic datasets that perform just like your production data. Allowing a DSM model to train for longer on source data produces higher-utility data, while the privacy of the data, measured by the median synthetic-real DCR, decreases but, critically, never reaches the median real-real DCR. These results are extremely promising as companies continue to look for ways to meet the growing need for stricter data governance policies. Further investigation into the impact of model architecture on this important tradeoff would bring additional insight into how DSM balances the privacy of the original data with the utility of the generated dataset.


Madelyn Goodman
Writer for CodeX

Data Science Evangelist at Tonic.ai - synthetic data - data augmentation - machine learning - AI 🐦 @MadelynatTonic