An approach to data de-identification and synthetic data generation

Josephlyr · Published in d*classified · 10 min read · Sep 30, 2023

Lo Jingjie developed Synthetic Data Generation and Dataset Augmentation solutions that de-identify sensitive data whilst preserving dataset distributions. He was mentored by DSTA Data Scientists Chua Kah Sheng and Joseph Low.

Photo by Alina Grubnyak on Unsplash

Introduction

As demand for high-quality data increases, institutions require effective processes and techniques for removing personal information from datasets, so that insights can be uncovered from underlying trends while adhering to data privacy requirements. This is achieved through de-identification, a form of dynamic data masking that breaks the association between data and the individual to whom it is attributable. Tools are used to remove or transform personal identifiers (e.g. names, departments) to enable reuse of data across teams and authorized third parties. At the same time, we employ synthetic data generation to mimic the distribution characteristics of production data, creating de-identified, realistic, synthetic data for development and testing.

During my internship with DSTA, I developed two Python packages to achieve the aforementioned goal:

Project Data Anonymizer: A Python Data Anonymization package that provides functionalities to generalize and suppress data to fulfil k-anonymity privacy guarantees.

Project SynPiper: A Python Synthetic Data Generation package and front-end Web Application that integrates open-source statistical and deep learning Tabular Synthetic Data Generation techniques.

Project Data Anonymizer

Motivation

Data Anonymization is important for maintaining privacy and confidentiality. It involves the removal, alteration, and randomization of Personally Identifiable Information (PII) in user data. The lowered risk of re-identification encourages data practitioners to share their anonymized data for collaboration and development purposes. The goal of this package was to compile widely used anonymization techniques so that users have a single tool to preserve and evaluate the privacy of their data.

Approach

Data Anonymizer sought to apply generalization and suppression techniques on quasi-identifiers to reduce the identifiability of individuals in a dataset. For this purpose, the widely known privacy paradigm k-anonymity is applied: every record in the dataset should share the same quasi-identifier values with at least k − 1 other records. For example, if a dataset is 2-anonymous, each record has at least one other record with the same quasi-identifier values. According to the Personal Data Protection Commission (PDPC), the industry threshold for k is 3 or 5.
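To make this concrete, here is a minimal sketch (illustrative only, not the package's actual API) that computes the k-anonymity of a dataset as the size of its smallest equivalence class over the quasi-identifier columns:

```python
import pandas as pd

# Toy dataset with already-generalized quasi-identifiers (illustrative values).
df = pd.DataFrame({
    "gender":     ["M", "M", "F", "M", "F", "F"],
    "age_band":   ["31-40", "31-40", "21-30", "31-40", "21-30", "21-30"],
    "blood_type": ["A", "A", "O", "A", "O", "O"],
})
quasi_identifiers = ["gender", "age_band", "blood_type"]

# Each group of identical quasi-identifier values is an equivalence class;
# the dataset's k-anonymity is the size of the smallest class.
class_sizes = df.groupby(quasi_identifiers).size()
k = class_sizes.min()
print(f"The dataset is {k}-anonymous")  # prints 3 for this toy data
```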

Design Flow

Figure 1: Design Flow of Data Anonymizer

When Data Anonymizer is instantiated with the dataset, it infers the property types of each column: the Data Type, Information Type, and Sensitivity Type. A brief description of each property type is given below; users can modify any property type that does not match their expectations.

  • Data Type: Categorical, Numerical, Datetime, Unique/sparse, Others
  • Information Type: Specific nature of the column (NRIC, Email, Handphone, Others)
  • Sensitivity Type: Direct Identifiers, Indirect Identifiers, Sensitive, Non-Sensitive

After the property types have been set, the module recommends the appropriate anonymization techniques for the columns. Again, users will have the option to modify the type of anonymization technique, including some of its parameters (e.g. number of bins), to meet their desired privacy risk levels. The table below lists the available anonymization functions.

Table of supported anonymization functions
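As a rough illustration of two techniques from this family, generalization and suppression might look as follows in plain pandas (the package's actual function signatures may differ):

```python
import pandas as pd

df = pd.DataFrame({
    "name":   ["Alice Tan", "Bob Lim", "Carol Ng"],  # direct identifier
    "age":    [23, 35, 37],                          # quasi-identifier
    "salary": [3500, 5200, 6100],                    # sensitive attribute
})

# Suppression: replace a direct identifier with a masking token.
df["name"] = "*"

# Generalization: bin a numeric quasi-identifier into coarser ranges.
df["age"] = pd.cut(df["age"], bins=[20, 30, 40], labels=["21-30", "31-40"])
print(df)
```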

Results

Suppose that a dataset contains the Gender, Age, Blood Type, and Salary of individuals. We can consider Gender, Age, and Blood Type to be quasi-identifiers (indirect identifiers) as these attributes, in combination, can reveal the identity of a person. After performing generalization (no suppression in this scenario), individuals with similar quasi-identifiers are grouped together. In this case, the dataset is 3-anonymous because each record has at least 3 − 1 = 2 other records that share the same quasi-identifier values.

Figure 2: Demonstration of Anonymization

Evaluation Criteria

Re-identification Score

Records that share the same quasi-identifiers belong to the same equivalence class. Referring to the above diagram, suppose that the attacker wants to know the salary of his target (Gender: Male, Age: 35, Blood Type: A). As this record is 3-anonymous, there are 2 other individuals/records that share the same set of quasi-identifiers in this equivalence class.

This means that there is a 1/3 chance that the target's salary is revealed; the re-identification score for the target is 33%. The re-identification probability is therefore 1 / (equivalence class size). This metric allows us to know the privacy risk of every equivalence class and determine whether further generalization is needed to ensure that all equivalence classes have an appropriate size.

Figure 3: Formula for re-identification probability
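A sketch of this computation (again illustrative, not the package's API): group the records by their quasi-identifiers and take the reciprocal of each class size.

```python
import pandas as pd

df = pd.DataFrame({
    "gender":     ["M", "M", "F", "M", "F", "F"],
    "age_band":   ["31-40", "31-40", "21-30", "31-40", "21-30", "21-30"],
    "blood_type": ["A", "A", "O", "A", "O", "O"],
    "salary":     [5200, 4800, 3900, 5100, 4100, 3700],
})
quasi_identifiers = ["gender", "age_band", "blood_type"]

# Size of each record's equivalence class, broadcast back to the rows.
class_size = df.groupby(quasi_identifiers)["salary"].transform("size")
df["reid_score"] = 1 / class_size   # 1 / (equivalence class size)

print(df["reid_score"].mean())  # average privacy risk of the dataset
print(df["reid_score"].max())   # k-anonymity is 1 / max score (here 1/(1/3) = 3)
```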

Summary statistics of re-identification score

By looking at the different equivalence classes, we can compute the average and maximum re-identification scores of the dataset. The average score gives an overview of the privacy risk of the transformed dataset. Most importantly, the maximum score determines the k-anonymity of the dataset: if the maximum score is 50%, every row has at least one other row with the same quasi-identifier values, so every equivalence class has size at least 2 and the dataset is 2-anonymous.

Proportion of rows above k-threshold

The proportion of rows can be computed as follows. Let X be a record in the dataset and i be the equivalence class size of the record.

Figure 4: Formula to calculate proportion of rows above k-threshold
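In symbols, with $N$ records in dataset $D$ and $i_X$ denoting the equivalence class size of record $X$, this works out to (a reconstruction from the definitions above):

$$\text{Proportion above } k \;=\; \frac{1}{N} \sum_{X \in D} \mathbf{1}\{\, i_X \geq k \,\}$$

where $\mathbf{1}\{\cdot\}$ is the indicator function.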

Typically, as we increase the k-threshold, the proportion of rows that are at least k-anonymous decreases. This makes sense, as a larger k requires more records to have a larger equivalence class size to meet the threshold. Sharp declines in this proportion at particular k values reveal where privacy protection is uneven across the modified dataset and may signal the need for additional generalization and suppression to attain a higher k-anonymity guarantee.

This metric is represented in a chart like the figure below, with two line plots: `User` and `Default`. The `Default` line serves as a benchmark, showing the outcome of the auto-suggested transformations, while the `User` line shows the actual score based on user-selected transformations. Here, the user's approach involves more generalization and suppression, resulting in significantly higher anonymity guarantees than the baseline transformations.

Figure 5: k-Threshold plot
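A short sketch of how the data behind such a plot can be computed (column names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "gender":   ["M", "M", "F", "M", "F", "F", "F"],
    "age_band": ["31-40", "31-40", "21-30", "31-40", "21-30", "21-30", "41-50"],
})
quasi_identifiers = ["gender", "age_band"]

# Per-record equivalence class size over the quasi-identifiers.
class_size = df.groupby(quasi_identifiers)["gender"].transform("size")

# Proportion of rows that are at least k-anonymous, for a range of k.
for k in range(1, 5):
    proportion = (class_size >= k).mean()
    print(f"k={k}: {proportion:.0%} of rows are at least {k}-anonymous")
```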

Project SynPiper (Synthetic Data Generation)

Motivation & Approach

Many open-source libraries such as Synthetic Data Vault offer a variety of synthetic generators, some with differential privacy guarantees. While each of these libraries can handle synthetic data generation on its own, they all have different data pipelines. The goal of SynPiper is therefore to integrate these synthetic data generation techniques into a unified package with a common pipeline, reducing the variability across configuration settings. Users of the package can then apply a variety of evaluation metrics to assess the quality of their synthetic samples.

Approaches for synthetic data generation

Statistical Generation and Differential Privacy

Brief Overview of Differential Privacy

Differential Privacy is a framework for statistical and machine learning methods that limits how much any single individual's record can influence the output, so that individual data points cannot be identified from the results. For more information about Differential Privacy, you may view the paper here. What makes differential privacy so powerful is that we can adjust the amount of privacy risk using the privacy parameter ε, which decides how much noise is added to the model.

Figure 6: Mathematical Definition of Differential Privacy
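For reference, the standard definition: a randomized mechanism $M$ satisfies $\varepsilon$-differential privacy if, for every pair of datasets $D$ and $D'$ that differ in a single record, and for every set of outputs $S$,

$$\Pr[M(D) \in S] \;\leq\; e^{\varepsilon} \cdot \Pr[M(D') \in S].$$

A smaller $\varepsilon$ forces the two output distributions to be closer, requiring more noise and giving stronger privacy.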

Bayesian Networks Synthetic Data Generation

The toolkit to generate differentially private synthetic data using Bayesian Networks is provided by the open-source package DataSynthesizer. It provides functionalities to sample data from Bayesian Networks equipped with a differential privacy mechanism. Bayesian Networks are probabilistic graphical models that represent the probabilistic relationships among variables. They are used to generate synthetic data as follows:

1) The model learns the relationships between attributes by computing their pairwise Mutual Information; noise is injected into the distributions during this computation to ensure differential privacy guarantees. The attribute pairs with maximal mutual information are added as edges when constructing the Bayesian Network.

2) The constructed Bayesian Network determines a sampling order for the attributes. Sampling from the network in this order generates the synthetic data.
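A minimal usage sketch based on DataSynthesizer's correlated attribute mode (file names and parameter values are placeholders):

```python
from DataSynthesizer.DataDescriber import DataDescriber
from DataSynthesizer.DataGenerator import DataGenerator

# Step 1: learn a differentially private Bayesian Network from the real data.
describer = DataDescriber(category_threshold=20)
describer.describe_dataset_in_correlated_attribute_mode(
    dataset_file="real_data.csv",  # placeholder path
    epsilon=1.0,                   # differential privacy budget
    k=2,                           # max number of parents per node in the network
)
describer.save_dataset_description_to_file("description.json")

# Step 2: sample synthetic records from the learned network.
generator = DataGenerator()
generator.generate_dataset_in_correlated_attribute_mode(1000, "description.json")
generator.save_synthetic_data("synthetic_data.csv")
```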

Deep Learning Generation

Apart from the statistical method of generating synthetic data, SynPiper also offers deep learning methods, whose goal is to generate high-fidelity synthetic data. To some extent, these methods mitigate privacy risk, as synthetic samples are not exact copies of the real data.

The open-source package used for these deep learning methods comes from Synthetic Data Vault (SDV), whose library offers many different ways of generating synthetic data. The Conditional Tabular Generative Adversarial Network (CTGAN) and Tabular Variational Autoencoder (TVAE) methods were chosen for integration into SynPiper because they are designed to address challenges that arise in tabular data, such as imbalanced categorical columns and multimodal distributions. The paper here discusses how the CTGAN and TVAE architectures were adapted from their original counterparts.
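A usage sketch with SDV's single-table API (as of SDV 1.x; the file path is a placeholder, and TVAESynthesizer can be swapped in the same way):

```python
import pandas as pd
from sdv.metadata import SingleTableMetadata
from sdv.single_table import CTGANSynthesizer

real_data = pd.read_csv("real_data.csv")  # placeholder path

# Infer column types (categorical, numerical, datetime, ...) from the data.
metadata = SingleTableMetadata()
metadata.detect_from_dataframe(real_data)

# Fit the CTGAN model and sample synthetic rows from it.
synthesizer = CTGANSynthesizer(metadata, epochs=300)
synthesizer.fit(real_data)
synthetic_data = synthesizer.sample(num_rows=1000)
```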

Brief Introduction to GANs

As the name suggests, GANs involve two neural networks, a Generator and a Discriminator, trained in competition with each other. The role of the generator is to create synthetic samples that resemble the original data distribution; ideally, it learns a deterministic transformation that maps random noise (the generator's input) into the distribution of the training data so that the discriminator is fooled. The discriminator acts as a binary classifier: it takes in both real inputs and synthetic samples from the generator and classifies whether each is real or fake, ideally classifying training samples as real and generated samples as fake. For more information about GANs, this article summarizes them well. When the GAN is fully trained, we can use the generator to produce high-quality synthetic samples.
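To make the adversarial setup concrete, here is a minimal PyTorch sketch of a vanilla GAN training loop (illustrative only; CTGAN's actual architecture and losses are more involved):

```python
import torch
import torch.nn as nn

latent_dim, data_dim, batch = 16, 4, 256

generator = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(), nn.Linear(64, data_dim))
discriminator = nn.Sequential(nn.Linear(data_dim, 64), nn.ReLU(), nn.Linear(64, 1))

opt_g = torch.optim.Adam(generator.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(discriminator.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

real = torch.randn(batch, data_dim)  # stand-in for a batch of real data

for step in range(200):
    # Discriminator step: label real samples 1, generated samples 0.
    fake = generator(torch.randn(batch, latent_dim)).detach()
    d_loss = bce(discriminator(real), torch.ones(batch, 1)) + \
             bce(discriminator(fake), torch.zeros(batch, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator step: try to make the discriminator label fakes as real.
    fake = generator(torch.randn(batch, latent_dim))
    g_loss = bce(discriminator(fake), torch.ones(batch, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```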

Brief Introduction to VAEs

Architecture of Autoencoders

The autoencoder architecture consists of two neural networks, an Encoder and a Decoder. The encoder maps input data into a low-dimensional latent space (the bottleneck), and the decoder takes this encoded representation and tries to reconstruct the original data from it. The bottleneck is deterministic: each input is mapped to a fixed point in the latent space, and that fixed point, when decoded, always produces the same output.

Figure 7: Architecture of Autoencoders

Difference between Autoencoders and Variational Autoencoders

Rather than being deterministic, VAEs are probabilistic. The architecture contains a probabilistic encoder that maps input data to probability distributions in the latent space. The model can thus represent uncertainty in its encoding, since an input is no longer mapped to a fixed point. This uncertainty is what allows VAEs to act as a generative model.

How synthetic samples are generated from VAEs

To synthesise new data, a sample is drawn from the distribution over the low-dimensional latent space and passed through the decoder, which maps the latent variable back into the space of the original data. This lets us generate synthetic data with a distribution similar to the original, and we repeat the process until sufficient samples are generated. For a more comprehensive understanding of VAEs, this article is excellent in explaining the intuition and maths behind them.
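In code, the sampling step is short. A sketch assuming an already-trained decoder (here an untrained stand-in; TVAE's real decoder is more elaborate):

```python
import torch
import torch.nn as nn

latent_dim, data_dim = 8, 4
# Stand-in for a trained VAE decoder network.
decoder = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(), nn.Linear(64, data_dim))

with torch.no_grad():
    z = torch.randn(1000, latent_dim)  # draw latents from the N(0, I) prior
    synthetic = decoder(z)             # decode latents back into data space
print(synthetic.shape)  # 1000 synthetic rows with data_dim columns
```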

Evaluation Criteria

Now that we have generated some synthetic data, we can evaluate the quality of synthetic data by assessing their Fidelity, Utility, and Privacy.

  • Fidelity: Measures how closely the synthetic data resembles its original data.
  • Utility: Assesses the effectiveness of the synthetic data in downstream models.
  • Privacy: Evaluates the level of protection of user-sensitive information in synthetic data.

Although there are three different ways to evaluate the synthetically generated data, the current development focuses on the Fidelity metrics; the evaluation of Utility and Privacy metrics will come in the next phase of development.

Fidelity

Figure 8: Overview of Fidelity Metrics
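As one concrete example of a fidelity check (not necessarily the exact metrics in the figure), a two-sample Kolmogorov-Smirnov statistic compares a real column's marginal distribution against its synthetic counterpart:

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
real_col = rng.normal(40, 10, size=5000)   # stand-in for a real numeric column
synth_col = rng.normal(41, 11, size=5000)  # stand-in for its synthetic counterpart

stat, p_value = ks_2samp(real_col, synth_col)
print(f"KS statistic = {stat:.3f}")  # closer to 0 means more similar marginals
```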

Utility

Train Real Test Real (TRTR) refers to training a machine learning model on real data and testing it on an unseen validation set of real data. Train Synthetic Test Real (TSTR) refers to training the model on synthetic data and testing it on the same unseen validation set of real data. The TRTR results serve as a benchmark against which the TSTR metrics are compared: if the scores are close, the synthetic dataset supports downstream modelling nearly as well as the real dataset, which indicates good utility.

Figure 9: Utility Testing Framework (source: AWS Machine Learning Blog)
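A sketch of the TRTR/TSTR comparison with scikit-learn (the "synthetic" set here is a random stand-in for a generator's output):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Stand-in real data, split into train/test, plus a placeholder synthetic set.
X, y = make_classification(n_samples=2000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
X_syn, y_syn = make_classification(n_samples=1500, random_state=1)

def auc_on_real_test(train_X, train_y):
    """Train a classifier and score it on the held-out real test set."""
    model = RandomForestClassifier(random_state=0).fit(train_X, train_y)
    return roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])

trtr = auc_on_real_test(X_train, y_train)  # Train Real, Test Real (benchmark)
tstr = auc_on_real_test(X_syn, y_syn)      # Train Synthetic, Test Real
print(f"TRTR AUC = {trtr:.3f}, TSTR AUC = {tstr:.3f}")  # close scores => good utility
```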

Future Work

Features that could enhance the effectiveness of Data Anonymizer include auto k-anonymity, which achieves k-anonymous data without any configuration by the user. For SynPiper, further feature development could move towards integrating synthetic generators for time-series data, text data, and more. Additionally, new ways of evaluating privacy and utility could be incorporated into the program; this field is ever-evolving!

Photo by Jason Dent on Unsplash

Conclusion

My internship with DSTA has been very insightful: an enriching experience researching and developing deployable Synthetic Data and Data Anonymization packages. My mentors played a key role in my learning journey, and I am grateful for their guidance in navigating the design and technical challenges that I encountered. Thank you DSTA!

Jingjie (second from right), with his fellow interns and mentors from Enterprise Digital Services’ Data Science team
