Is Synthetic Data Really Safe?

Subhrajit Samanta, Ph.D.
Published in AI FUSION LABS
Apr 7, 2023 · 7 min read

A POV on privacy in the realm of synthetic data

Privacy is one of the key determining factors for transparency and trust in AI. Without ensuring that the data is private, we cannot hope for a fair and robust AI system built on that data. However, before we get to the heart of this discussion, let us first establish what data privacy is.

Data Privacy:

‘Privacy’ has to be one of the most thrown-around buzzwords in tech and AI debates, and yet there is no consensus on what exactly it means. This probably stems from the lack of an objective definition. Therefore, let us try to define privacy in a semi-formal manner without getting mathematical.

Privacy for a dataset can be defined as its imperviousness to adversarial attacks (e.g. re-identification attacks, inference attacks, etc.). Assume that an adversary has access to certain information about their victim (often referred to as auxiliary information, e.g. age, gender, etc.). If, using this information, the adversary can identify the victim in the dataset, then we say that the data is not ‘private enough’, and vice versa.

That brings us to the next point: privacy is not black and white. No data is absolutely private while retaining any utility (we will come to this point next). Therefore, we have to accept that there will always be some residual risk; the goal is to minimize this risk without destroying the data’s utility completely.

Data Utility:

Data utility refers to the usability of the data for analytics purposes such as clinical research, ML modeling, etc. Essentially, we want to retain as much of the data’s utility as possible while making it resilient to adversarial attacks. However, it is well established in the literature that privacy comes at the cost of decreased utility.

The Battle of Privacy & Utility:

We often find ourselves in a pickle: we want the data to be private, but that privacy comes at the cost of utility. Conversely, high utility generally means the data has not been perturbed and is therefore not private. This holds true across all prominent privacy mechanisms, including anonymization [1], statistical de-identification [2], and Differential Privacy methods [3–4].
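To make this trade-off concrete, here is a minimal sketch using the Laplace mechanism from differential privacy [3–4] to release a private mean: the smaller the privacy parameter epsilon (i.e. the stronger the guarantee), the noisier and therefore less useful the released statistic becomes. The income data, clipping bound, and epsilon values below are illustrative assumptions only.

```python
# A minimal sketch of the privacy-utility trade-off via the Laplace mechanism [3-4].
# The income data, clipping bound, and epsilon values are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
incomes = np.clip(rng.gamma(shape=2.0, scale=30_000.0, size=10_000), 0, 500_000)
true_mean = incomes.mean()

# Sensitivity of the mean: how much one record can move it, given the clipping bound.
sensitivity = 500_000 / len(incomes)

for epsilon in (10.0, 1.0, 0.1, 0.01):  # smaller epsilon = stronger privacy
    # Average error over many noisy releases at this privacy level.
    noisy_means = true_mean + rng.laplace(scale=sensitivity / epsilon, size=1_000)
    error = np.abs(noisy_means - true_mean).mean()
    print(f"epsilon={epsilon:>5}: mean absolute error ~ {error:,.0f}")
```

The stronger the privacy setting, the further the released mean drifts from the true mean, which is exactly the utility loss described above.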

So, we ask ourselves: is there a way around this, and if so, what is it?

Synthetic Data to save the day:

AI researchers as well as privacy practitioners are paying increasing attention to synthetic data, mainly for the following reasons:

1. Synthetic data, as the name suggests, is essentially fake data. Therefore, it is much harder for an adversary to map a synthetic/fake sample back to an original record.

2. Synthetic data is (ideally) supposed to retain the multivariate distributions and characteristics of the original data, and therefore remains analytically useful.

3. Synthetic data typically faces far less regulatory red tape than original data, so it can be shared more easily, which encourages collaboration and in turn fuels innovation.

4. We can generate as many synthetic samples as we want, so a small sample size is not a limitation with synthetic data.

5. Synthetic data does not change the schema/nature of the data, as opposed to some de-identification methods which change the feature space (e.g. numerical columns are binned and therefore become categorical columns); a small sketch of this difference follows the list.
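As a minimal sketch of that last point, consider a hypothetical age column: binning-based de-identification turns the numeric column into a categorical one, whereas a synthetic version keeps the original column name and dtype, so downstream pipelines run unchanged.

```python
# A minimal sketch of point 5: binning changes the schema, synthetic data does not.
# The age column and bin edges are hypothetical.
import pandas as pd

original = pd.DataFrame({"age": [23, 31, 38, 47, 59, 64]})

# De-identification by binning: the numeric column becomes categorical intervals.
deidentified = pd.DataFrame({"age": pd.cut(original["age"], bins=[20, 35, 50, 65])})
print(deidentified.dtypes)  # age -> category

# A synthetic table (values here are just stand-ins) keeps the numeric schema.
synthetic = pd.DataFrame({"age": [25, 33, 40, 45, 58, 62]})
print(synthetic.dtypes)     # age -> int64
```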

So now we know why synthetic data is gaining traction. However, with the rise of synthetic data adoption, there are some rising concerns as well that need addressing.

Challenges with Synthetic Data:

The first point is, of course, the non-triviality of the synthetic data generation process. Without going into the nitty-gritty of this topic (we will keep that for another Medium article), it is fair to say that creating high-fidelity synthetic data that not only looks and feels like real data but also retains its statistical properties is a cumbersome process. However, that’s not all. Even if we can create analytically rich synthetic data, the question that often comes next is whether this synthetic data is really as safe and devoid of privacy concerns as it is often assumed to be.

Privacy Risk with Synthetic Data:

Synthetic data is not real data; it is artificial. Therefore, it can be hard to comprehend how there can be privacy issues with it. We will explore and try to answer exactly that in the next sections. Let us start by understanding a few different types of attacks that an adversary can perform on synthetic data:

1. Re-identification Attack: When an adversary tries to re-identify a sample from a synthetic data pool and map it back to an original record, they perform a re-identification attack (e.g. a similarity/distance-based attack).

2. Distinguishability Attack: When an adversary tries to distinguish a sample or set of samples showing unusual behavior and uses that information to single them out, this is referred to as a distinguishability attack.

3. Attribute Inference Attack: When an adversary tries to infer some sensitive personal information about their victim (without necessarily re-identifying or distinguishing them), this is referred to as an attribute inference attack.

Having learned what these attacks are, let us explore how synthetic data can be susceptible to each of them in turn.

1. Synthetic data has a feature space identical to its original counterpart (unlike de-identification, which transforms the feature space by binning, aggregation, etc.). Hence, it is easy to perform a similarity check using a distance-based metric (i.e. a statistical distance between a synthetic and an original sample in the multi-dimensional space). Therefore, it is not hard to find the synthetic samples that are ‘very similar’ to or ‘look like’ original samples, allowing the adversary to perform a 1:1 mapping and re-identify a record, as illustrated in figure 1 (a minimal sketch of such a check follows the figure).
Figure 1: Re-identification Attack
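As a minimal sketch of such a distance-based check, the snippet below computes, for every synthetic record, the distance to its nearest original record and flags the ones that sit suspiciously close. The data is simulated purely for illustration, with a few synthetic rows deliberately leaking original values.

```python
# A minimal sketch of a distance-based re-identification check.
# `original` and `synthetic` are simulated stand-ins with identical columns.
import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(0)
original = rng.normal(size=(1_000, 5))                          # "real" records
leaked = original[:50] + rng.normal(scale=0.01, size=(50, 5))   # near-copies
synthetic = np.vstack([leaked, rng.normal(size=(450, 5))])      # released synthetic set

# Standardize columns so no single feature dominates the distance.
mu, sigma = original.mean(axis=0), original.std(axis=0)
orig_z, synth_z = (original - mu) / sigma, (synthetic - mu) / sigma

# Distance from every synthetic record to its nearest original record.
nearest = cdist(synth_z, orig_z, metric="euclidean").min(axis=1)

# Records far closer to a real row than typical are re-identification candidates.
threshold = np.quantile(nearest, 0.05)
risky = np.where(nearest <= threshold)[0]
print(f"{len(risky)} of {len(synthetic)} synthetic records sit very close to an original record")
```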

2. Good-quality synthetic data retains the multivariate relationships present in the original data. Take an example where we have age and income as two attributes in an HR dataset. Plotting a bivariate histogram, we observe that the density in the high-income, high-age region is quite low. The generative model learns this trait and replicates it in the synthetic data. Therefore, an adversary can use that same information to single out the low-density samples (i.e. those with high age and income) and obtain insights similar to those they would get from the original data. This is referred to as a distinguishability attack (illustrated in figure 2; a minimal sketch follows the figure).

Figure 2: Distinguishability Attack
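A minimal sketch of this idea on simulated age/income data: bin the two attributes, count how many synthetic records land in each 2-D cell, and single out the records that fall in nearly empty cells. The column names, values, and thresholds are hypothetical.

```python
# A minimal sketch of a distinguishability check on simulated age/income data.
import numpy as np

rng = np.random.default_rng(1)
age = rng.integers(22, 65, size=5_000).astype(float)
income = 30_000 + 1_500 * (age - 22) + rng.normal(scale=8_000, size=5_000)

# A few unusual high-age, high-income records the generator faithfully reproduced.
age = np.append(age, [63.0, 64.0, 64.0])
income = np.append(income, [450_000.0, 470_000.0, 480_000.0])

# Bin both attributes and count synthetic records per 2-D cell.
counts, age_edges, inc_edges = np.histogram2d(age, income, bins=10)
age_idx = np.clip(np.digitize(age, age_edges) - 1, 0, 9)
inc_idx = np.clip(np.digitize(income, inc_edges) - 1, 0, 9)

# Records in nearly empty cells stand out and are easy to single out.
cell_count = counts[age_idx, inc_idx]
singled_out = np.where(cell_count <= 3)[0]
for a, inc in zip(age[singled_out], income[singled_out]):
    print(f"age={a:.0f}, income={inc:,.0f}")
```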

3. Let us take another example where the adversary is interested in revealing the compensation (i.e. an attribute attack) of their victim, and they have auxiliary information about the victim’s age and gender (e.g. from LinkedIn). If the adversary has access to a synthetic dataset (created from the employee database of the victim’s organization), they can mount a machine-learning-based attack. All they need to do is train an ML model on (age, gender) to predict compensation, and then feed the trained model the victim’s auxiliary information to reveal the compensation, as illustrated in figure 3 (a minimal sketch follows the figure).

Figure 3: Attribute Inference Attack
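Here is a minimal sketch of such an ML-based attribute inference attack. The synthetic HR table is simulated, and the victim’s auxiliary information is hypothetical; any off-the-shelf regressor would do for the adversary’s model.

```python
# A minimal sketch of an ML-based attribute inference attack.
# The synthetic HR table is simulated; the victim's auxiliary data is hypothetical.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(2)
n = 2_000
synthetic = pd.DataFrame({
    "age": rng.integers(22, 65, size=n),
    "gender": rng.integers(0, 2, size=n),  # encoded 0/1 for simplicity
})
synthetic["compensation"] = (
    28_000
    + 1_800 * (synthetic["age"] - 22)
    + 2_000 * synthetic["gender"]
    + rng.normal(scale=5_000, size=n)
)

# The adversary fits a model on the released synthetic data ...
model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(synthetic[["age", "gender"]], synthetic["compensation"])

# ... then feeds it the victim's auxiliary information (e.g. scraped from LinkedIn).
victim = pd.DataFrame({"age": [47], "gender": [1]})
print(f"Inferred compensation: {model.predict(victim)[0]:,.0f}")
```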

Final Takeaway:

The main intention of this article was to establish that synthetic data is not as risk-free or safe as often presumed. It can be vulnerable to different kinds of attacks, as explained above. However, does that mean we should not be exploring synthetic data?

On the contrary. Synthetic data is instrumental for innovation and research in analytics, machine learning, and trustworthy AI (especially in data-sparse settings). But at the same time, we must be vigilant (and diligent) about how we generate our synthetic data and ensure privacy in the process.

In the next Medium blog, I will introduce a potential approach to privatizing synthetic data while retaining its analytical utility (ZS PRISIM), which we have built at our ZS AI Lab (the work was published at NeurIPS ’22). Catch you all at the next one!

References:

[1]. Porter, C. C. 2008. De-identified data and third party data mining: the risk of re-identification of personal information. Shidler JL Com. & Tech., 5: 1.

[2]. Hajian, S.; Domingo-Ferrer, J.; and Farràs, O. 2014. Generalization-based privacy preservation and discrimination prevention in data publishing and mining. Data Mining and Knowledge Discovery, 28(5): 1158–1188.

[3]. Dwork, C. 2006. Differential privacy. In Automata, Languages and Programming (ICALP 2006), eds. Bugliesi, M.; Preneel, B.; Sassone, V.; and Wegener, I. Springer.

[4]. Dwork, C.; McSherry, F.; Nissim, K.; and Smith, A. 2006. Calibrating noise to sensitivity in private data analysis. In Theory of Cryptography Conference, 265–284. Springer.

Author Bio:

Dr. Subhrajit Samanta is a senior AI Research Scientist with ZS Associates. He received his Ph.D. from Nanyang Technological University, Singapore in 2020. His primary expertise includes time series forecasting, statistics and classical ML, privacy preservation, synthetic data generation, and deep learning (RNN) techniques.
