Synthetic Data: Navigating Its Methodologies, Applications and Challenges

Anshu Singh
AI Practice GovTech
11 min read · Apr 12, 2024
✨ Image generated using DALL-E. Prompt: “A tree symbolising Generative AI for synthetic data generation in an oasis of data scarcity”. 🤓✨

It’s beyond doubt that our current technological era is defined by data-driven decision-making and by the pursuit of AI that possesses cognitive abilities, intelligence, and the capacity for experimental inquiry. These elements are crucial for innovations in domains ranging from climate change to supporting vulnerable and underprivileged communities. However, we frequently face a myriad of data bottlenecks: (i) privacy protection of sensitive data, along with regulatory and copyright concerns; (ii) scarcity; (iii) biases, imbalances, and unrepresentative data; (iv) impurity issues such as noisy and missing values; (v) cost-prohibitive acquisition; and (vi) at times, data that is simply non-existent.

Synthetic data emerges as a solution to alleviate these challenges, providing a means to mitigate risks tied to using sensitive data and steering the development towards more ethical and responsible AI applications. Furthermore, it unlocks novel opportunities for data monetisation strategies.

Synthetic data is artificial data, typically produced by data synthesis algorithms that replicate the patterns and statistical properties of real data across different modalities, including text, images, tabular data, audio, and video.

Given its diverse applications driving innovation, synthetic data has garnered considerable attention across the tech industry. Anthropic harnessed synthetic data to fine-tune the capabilities of their Claude 3 models, and IBM leveraged synthetic data to enhance the task-specific knowledge of chatbots. Hugging Face’s Cosmopedia, the largest open synthetic dataset to date, is an extensive archive boasting over 30 million files and 25 billion tokens, generated through prompt curation for model pre-training. Microsoft’s efforts to combat human trafficking were bolstered by a privacy-preserving synthetic dataset of victims and perpetrators. Amazon employed Generative AI to produce millions of synthetic palm images, thereby enhancing biometric recognition within their palm-scanning technology. The most recent development, at the time of writing, is OpenAI’s synthetic voice, which aims to assist individuals with learning disabilities and speech impairments.

These advancements align with Gartner’s forecast that by 2030, synthetic data will surpass real data in AI models, particularly due to the expansion of Large Language Models (LLMs). This heralds a new era in AI development where synthetic data becomes a cornerstone of innovation and application.

We’ll now focus our discussion on tabular data generation, as it is the most extensively used data type in enterprises. Most of the content, however, remains modality-independent.

Synthetic Data Generation Methodologies

Synthetic data generation (SDG) has transitioned from statistical methods to the advanced realm of Generative AI. Statistical methods, such as bootstrap resampling, parametric methods that fit assumed data distributions, Bayesian networks for probabilistic relationship modeling, and decision trees, were employed to create synthetic data that closely reflected the statistical properties of real data.
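
To make the classical end of this spectrum concrete, below is a minimal Python sketch of two of these statistical approaches, parametric fitting and bootstrap resampling, on a toy dataset. The column names and distributions are illustrative assumptions, not drawn from any dataset discussed in this post.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=42)

# Toy "real" dataset: one numeric and one categorical column (assumed).
real = pd.DataFrame({
    "income": rng.lognormal(mean=10, sigma=0.5, size=1_000),
    "region": rng.choice(["north", "south", "east"], size=1_000, p=[0.5, 0.3, 0.2]),
})

# Parametric method: fit an assumed distribution per column, then sample.
mu, sigma = np.log(real["income"]).mean(), np.log(real["income"]).std()
region_probs = real["region"].value_counts(normalize=True)

n_synth = 500
synthetic = pd.DataFrame({
    "income": rng.lognormal(mean=mu, sigma=sigma, size=n_synth),
    "region": rng.choice(region_probs.index, size=n_synth, p=region_probs.values),
})

# Bootstrap method: resample real rows with replacement. This preserves
# joint structure but copies real records verbatim, so it offers no privacy.
bootstrap_synth = real.sample(n=n_synth, replace=True, random_state=42)
```

Note that this parametric version samples each column independently and so loses cross-column correlations; Bayesian networks, and the deep generative models discussed next, exist largely to capture those joint relationships.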

The advent of Generative AI modeling, including Generative Adversarial Networks (GANs), Variational Autoencoders (VAEs), Transformers, Normalizing Flows, and Diffusion Models, has marked a substantial leap in SDG. These models excel at learning from complex, unusual, and diverse data characteristics, including multimodal and heavily skewed distributions. Their ability to generate entirely new data, rather than merely extrapolating from existing data, has significantly broadened the potential for synthetic data across various applications.
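
As a concrete example from this model family, here is a hedged sketch that trains a GAN-based tabular synthesizer with the open-source sdv package (one of the tools we benchmarked, as described later in this post). The API shown follows sdv 1.x and may differ across versions; the CSV path is a placeholder.

```python
import pandas as pd
from sdv.metadata import SingleTableMetadata
from sdv.single_table import CTGANSynthesizer

real = pd.read_csv("real_data.csv")  # placeholder path for the real data

# Describe column types so categoricals and numerics are handled correctly.
metadata = SingleTableMetadata()
metadata.detect_from_dataframe(real)

# CTGAN is a GAN variant designed for mixed-type tabular data.
synthesizer = CTGANSynthesizer(metadata, epochs=300)
synthesizer.fit(real)

synthetic = synthesizer.sample(num_rows=10_000)
```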

A notable development has been the use of LLMs in SDG. Leveraging the deep contextual insights gained from extensive pre-training, LLMs can be fine-tuned on real data or undergo meticulous prompt engineering for data augmentation, marking a significant milestone in the evolution of SDG.

Meticulous prompt engineering to generate diverse textbook samples, leveraging the diversity of audiences (from left to right: young children, professionals, researchers, and high school students) and styles. Image taken from Hugging Face’s Cosmopedia blog: https://huggingface.co/blog/cosmopedia.
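
In the same spirit as the Cosmopedia figure above, the sketch below varies the audience field of a prompt template to diversify generated samples. This is a hedged illustration: the OpenAI client is just one possible backend, and the model name, topic, and template text are assumptions.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

TEMPLATE = (
    "Write a clear, engaging textbook passage about {topic} "
    "aimed at {audience}. Keep it under 200 words."
)
audiences = ["young children", "professionals", "researchers", "high school students"]

samples = []
for audience in audiences:
    prompt = TEMPLATE.format(topic="photosynthesis", audience=audience)
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative choice; any capable LLM works
        messages=[{"role": "user", "content": prompt}],
    )
    samples.append(response.choices[0].message.content)
```

Varying audiences, styles, and seed topics in this way is what lets a prompted LLM produce a corpus that is diverse rather than repetitive.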

Applications

Applications of synthetic data. Image is created using icons from https://www.flaticon.com/.

Some key applications of synthetic data include:

Data Privacy

Synthetic data enhances privacy by generalising well enough to retain the real data’s statistical properties without disclosing individual records’ sensitive information. This aspect underpins SDG as a key privacy-enhancing technology, breaking the 1-to-1 mapping to real individuals and capturing the inherent uncertainty and variability of the data. While this reduces risks like singling out and linkability, vulnerabilities to privacy attacks like membership inference and attribute disclosure remain, underscoring the need for ongoing vigilance and robust mitigation strategies.

Data Augmentation

With a model trained on real data, a virtually unlimited number of data points can be generated, which can significantly enhance the performance of downstream models. This is particularly beneficial for improving predictions on underrepresented groups, thereby ensuring more inclusive and robust analytical outcomes.
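
As one concrete recipe, the sketch below continues the earlier sdv example and uses conditional sampling to oversample an under-represented group before training a downstream model. The column name and group value are illustrative assumptions, and the API again follows sdv 1.x.

```python
# Continues the earlier sdv sketch; "region"/"east" are hypothetical
# stand-ins for an under-represented group in the real data.
from sdv.sampling import Condition

minority_condition = Condition(
    column_values={"region": "east"},
    num_rows=5_000,
)

# Generate targeted synthetic rows for the minority group only.
augmented_rows = synthesizer.sample_from_conditions(conditions=[minority_condition])

# A downstream model can then be trained on real data plus these rows.
```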

Fairness and Bias

Synthetic data, generated with fairness constraints during training or generation, can counteract biases inherent in real data. Such synthetic data can not only reflect desired fairness criteria but also aid in training more equitable machine learning models, achieving both representation fairness (ensuring all groups are adequately represented in the data) and algorithmic fairness (ensuring that the outcomes of models do not disproportionately benefit or harm any group).

Domain Adaptation

Domain adaptation in SDG involves using algorithms to combine a large dataset that has known issues (such as bias) with a smaller, higher-quality dataset. The goal is to produce new, synthetic data that retains the good qualities of the small dataset and applies them across the larger one. This improves the overall quality of the data, making it more representative and fair, without the need to manually identify and correct the issues in the large dataset. While significant progress has been made in applying domain adaptation techniques to image data, extending these methods to non-image data, such as tabular or textual data, remains a challenge.

Data-driven Simulation

Data-driven simulation uses generative models to create simulations grounded in real data, particularly when there is no direct data available from the target domain. This approach is crucial for testing and understanding how machine learning models might perform under different, future, or hypothetical scenarios that have not yet been observed. The motivation is to anticipate how models will behave in evolving conditions or in environments different from where the initial data was collected.

The applications of this technology meet the needs of our public sector officers, as identified in our internal survey (see the image below) and ongoing engagements, reinforcing our decision to adopt and democratise synthetic data generation technology 💪.

Survey by GovTech’s data privacy team showed varied potential benefits of synthetic data use and interest across diverse Singapore government agencies. 🚀✨

Addressing The Key Challenges

We have identified two broad, key challenges concerning the truthfulness and privacy of synthetic data, primarily through our extensive empirical observations and ongoing engagements with government agencies. These findings come from: (1) our whitespace project, which focused on benchmarking tabular SDG tools from a local commercial solution provider — Betterdata.ai — and open-source packages like sdv, gretel-synthetics, synthcity, and ydata-synthetic. This project evaluated these tools for practical deployment across data of diverse characteristics and modalities, using comprehensive quality and privacy metrics; (2) our consultancy work with government agencies spanning several use cases, from privacy-preserving data sharing with stakeholders to using data augmentation for downstream machine learning tasks. These interactions have given us an understanding of expectations for the technology that extend beyond the public sector.

Let’s discuss the two challenges and our recommendations for addressing them.

Truthfulness of Synthetic Data

Concerns regarding synthetic data primarily revolve around its ability to capture the nuances of real-world data and the implications this has for data-driven decision-making, which is crucial for gaining stakeholder acceptance. A significant challenge emerges when synthetic data must deal with messy, mislabeled, or biased inputs, potentially leading to flawed conclusions and unreliable predictions for specific subpopulations. Furthermore, the absence of standard and consistent metrics for evaluating synthetic data compounds these issues.

Commonly used metrics for fundamental evaluation (or smoke testing) of utility (usefulness, including correction of biases and imbalances) and fidelity (statistical retention) provide only a high-level, and therefore insufficient, assessment of the quality of synthetic data. Metrics should also provide insight into different aspects of synthetic data, particularly at the individual sample level: diversity (the extent to which synthetic samples cover the full variability of real samples) and generalisability (the extent to which the generative model has merely memorised or overfitted its training data). Crucially, evaluating synthetic data at a more granular level can help identify potential vulnerabilities or misrepresentations in specific subgroups or data features. Moreover, metrics should be interpretable and understandable to downstream users, allowing them to make informed decisions about whether to trust the analysis derived from synthetic data and to navigate trade-offs between competing metrics (e.g., privacy vs. utility).
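
As a starting point beyond smoke testing, here is a minimal sketch of two cheap, complementary checks: a per-column fidelity test and a “Train on Synthetic, Test on Real” (TSTR) utility test. The toy data generator is a stand-in; in practice you would pass in the actual real and synthetic DataFrames, and the column names here are assumptions.

```python
import numpy as np
import pandas as pd
from scipy.stats import ks_2samp
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

def toy_frame(n, shift=0.0):
    """Stand-in data; replace with the actual real/synthetic DataFrames."""
    x1 = rng.normal(loc=shift, size=n)
    x2 = rng.normal(size=n)
    label = (x1 + x2 + rng.normal(scale=0.5, size=n) > 0).astype(int)
    return pd.DataFrame({"x1": x1, "x2": x2, "label": label})

real, synthetic = toy_frame(2_000), toy_frame(2_000, shift=0.1)

# Fidelity smoke test: per-column Kolmogorov-Smirnov distance
# (0 = identical marginal distributions).
for col in ["x1", "x2"]:
    stat, _ = ks_2samp(real[col], synthetic[col])
    print(f"{col}: KS distance = {stat:.3f}")

# Utility check (TSTR): train on synthetic, test on real. A large gap
# versus a train-on-real baseline signals lost predictive structure.
features = ["x1", "x2"]
model = RandomForestClassifier(random_state=0).fit(synthetic[features], synthetic["label"])
auc = roc_auc_score(real["label"], model.predict_proba(real[features])[:, 1])
print(f"TSTR AUC on real data: {auc:.3f}")
```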

While model selection plays a significant role in the quality of synthetic data, another reasonable approach is to profile the real data to guide the SDG process, for example by categorising samples as easy to learn, ambiguous, or hard. These categories can serve as proxies for data issues like mislabeling, data shift, or under-represented samples, helping the generated data better reflect the real world.
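
One simple profiling heuristic, sketched below, buckets samples by out-of-fold prediction confidence. It assumes a tabular classification setting with numeric features and integer labels; the stand-in dataset and the thresholds are arbitrary assumptions.

```python
# Hedged sketch: bucket real samples as easy / ambiguous / hard using
# out-of-fold predicted probabilities, before (or alongside) SDG.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

# Stand-in real data; replace with the actual features and integer labels.
X, y = make_classification(n_samples=1_000, n_features=10, random_state=0)

proba = cross_val_predict(
    LogisticRegression(max_iter=1_000), X, y, cv=5, method="predict_proba"
)
confidence = proba[np.arange(len(y)), y]  # probability of the true label

easy = confidence > 0.75       # well-modelled samples
hard = confidence < 0.25       # candidates for mislabeling / shift review
ambiguous = ~easy & ~hard      # may flag under-represented regions
print(f"easy={easy.sum()}, ambiguous={ambiguous.sum()}, hard={hard.sum()}")
```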

The emerging risk of “model collapse” presents a challenge in maintaining the truthfulness and utility of synthetic data

As Generative AI increasingly generates web content, foundational models relying on public sources for current and relevant data face a unique challenge known as ‘model collapse’. This term refers to the degradation of AI models that occurs when they are repeatedly trained on data produced by other models. Such recursive training can cause models to ‘hallucinate’, overemphasising common patterns while underrepresenting rarer ones, and ultimately to generate unrealistic synthetic data. However, experts like Andy Pernsteiner, CTO of VAST Data Inc., suggest that with diligent care and oversight, well-curated foundational models are less likely to succumb to model collapse.

Model Collapse refers to a degenerative learning process where models start forgetting improbable events over time, as the model becomes poisoned with its own projection of reality. The image and caption are from “The Curse of Recursion: Training on Generated Data Makes Models Forget” research paper.

Balancing the Privacy-Utility Trade-Off in Synthetic Data

So far, we understand how synthetic data can address various data bottlenecks. However, the generation of synthetic data comes with its own privacy risks. These include adversarial machine learning attacks — such as memorisation, model poisoning, and model inversion — as well as concerns specific to the synthetic data itself, like attribute inference, the potential for exact or approximate replication of real data, and the presence of outliers. We identified two key areas to address privacy issues while considering data utility:

Evaluating Beyond Similarity/Distance Metrics: The general reliance on similarity and distance metrics for privacy risk assessment — measuring mean absolute error between real and synthetic records or counting identical records across datasets — may not intuitively convey privacy risks to users. Additionally, global data protection regulations, such as the EU’s GDPR, highlight the need to mitigate three principal risks to achieve sufficient anonymisation: singling out (isolating individuals), linkability (linking data to identify individuals), and inference (deducing attribute values). Tools like Anonymeter can aid in evaluating these risks (see the sketch after this list), potentially helping data controllers intuitively understand the residual risk of re-identification. However, akin to traditional anonymisation techniques, careful consideration is necessary to avoid limiting the data’s analytical value, as overly stringent protections could compromise the core of AI and data analysis.

Combining with Other PETs (privacy-enhancing technologies): Simple approaches, such as preprocessing real data before model training and post-processing the synthetic data to remove records too closely aligned with the real data — using techniques like suppression and generalisation — can help minimise privacy risks. Another strategy involves training generative models with Differential Privacy (DP), which ensures the model’s output remains statistically indistinguishable whether or not any individual’s data was included, thus averting re-identification attacks, even from resourceful and strategic adversaries (a DP training sketch follows this list). Moreover, DP does not assume any specific characteristics about the adversary and provides provable mathematical guarantees that hold in worst-case scenarios (e.g., an adversary possessing knowledge of the training algorithm, strong computing power, etc.). However, DP comes with trade-offs: it can reduce utility precisely for the outliers and underrepresented groups it most strongly protects. Additionally, choosing an appropriate privacy budget and DP mechanism is a nuanced decision that depends on the context.
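
For the first area, here is a minimal sketch using the open-source Anonymeter package (reference 10) to estimate singling-out risk. The API follows the project’s README at the time of writing and may change; the three DataFrames are assumptions: the real split the generator was trained on, the synthetic output, and a held-out real control split the generator never saw.

```python
# Minimal Anonymeter sketch; real_train, synthetic, and control are
# pandas DataFrames with identical columns (assumed to exist already).
from anonymeter.evaluators import SinglingOutEvaluator

evaluator = SinglingOutEvaluator(
    ori=real_train,      # real data the generator was trained on
    syn=synthetic,       # the generated synthetic data
    control=control,     # real records never seen by the generator
    n_attacks=500,       # number of singling-out attack attempts
)
evaluator.evaluate(mode="multivariate")
risk = evaluator.risk()
print(f"Singling-out risk: {risk.value:.3f}, CI: {risk.ci}")
```

Analogous LinkabilityEvaluator and InferenceEvaluator classes in the same package cover the other two GDPR risks.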
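
For the second area, the sketch below shows the general shape of DP training with Opacus, a PyTorch library for DP-SGD. The tiny network is a stand-in for whatever generative model is being trained, and the data and hyperparameters are illustrative assumptions.

```python
# Hedged sketch of DP-SGD with Opacus; the model and data are stand-ins.
import torch
from torch.utils.data import DataLoader, TensorDataset
from opacus import PrivacyEngine

model = torch.nn.Sequential(
    torch.nn.Linear(16, 64), torch.nn.ReLU(), torch.nn.Linear(64, 16)
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
dataset = TensorDataset(torch.randn(1_000, 16))  # stand-in for real data
data_loader = DataLoader(dataset, batch_size=64)

privacy_engine = PrivacyEngine()
model, optimizer, data_loader = privacy_engine.make_private(
    module=model,
    optimizer=optimizer,
    data_loader=data_loader,
    noise_multiplier=1.0,  # more noise = stronger privacy, lower utility
    max_grad_norm=1.0,     # per-sample gradient clipping bound
)

# ...run the usual training loop, then check the spent privacy budget:
# epsilon = privacy_engine.get_epsilon(delta=1e-5)
```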

Given the multifaceted nature of data privacy, no single PET can fully mitigate privacy risks; this applies to SDG as well. The unpredictable characteristics of synthetic data require a balanced approach that accounts for privacy, fairness, and potential biases — important because more than 20 types of bias can be transferred from real data. Thus, a holistic strategy that assesses risks and adopts effective measures is crucial for the responsible development of synthetic data, ensuring a balance between privacy, utility, and fairness.

Conclusion

In the swiftly evolving world of AI, as we have been witnessing, SDG has emerged as a key solution to challenges like data scarcity, ethical concerns, and privacy issues. The evolution from statistical methods to advanced Generative AI, including techniques like GANs, VAEs, and LLMs, broadens SDG’s applications. The integration of LLMs, in particular, marks a paradigm shift, leveraging extensive pre-training to generate synthetic data. Yet, challenges remain, particularly in balancing privacy with utility and ensuring the truthfulness of synthetic data.

Synthetic Data Generation is not just reshaping the landscape of AI but also propelling us towards a future where ethical, responsible AI and data democratisation become a reality.

Singapore’s data privacy regime and AI governance and democratisation initiatives, like the rest of the world, are constantly evolving, encompassing a diverse array of laws, guidelines, and recommendations. Aligning with these initiatives, GovTech’s Data Privacy Protection Capability Center takes an active role as an enabler by working on Synthetic Data Generation. This emerging technology is not just a focus of innovation but a key tool in our aim to make advanced solutions accessible to every public officer, thereby effectively meeting their data needs.

Interested in learning more about our work or exploring a collaboration? We’d ❤ to hear from you! Please reach out to us at cloak@tech.gov.sg.

Thanks to Ghim Eng Yap and Alan Tang for their valuable inputs.

Author: Anshu Singh

References

  1. https://research.ibm.com/blog/LLM-generated-data
  2. https://huggingface.co/blog/cosmopedia
  3. https://www.microsoft.com/en-us/research/blog/iom-and-microsoft-release-first-ever-differentially-private-synthetic-dataset-to-counter-human-trafficking/
  4. https://www.aboutamazon.com/news/retail/generative-ai-trains-amazon-one-palm-scanning-technology
  5. https://openai.com/blog/navigating-the-challenges-and-opportunities-of-synthetic-voices
  6. https://www.linkedin.com/pulse/data-monetization-synthetic-betterdataai-vklxf/
  7. https://gretel.ai/gdpr-and-ccpa
  8. https://gretel.ai/blog/introducing-gretels-privacy-filters
  9. https://www.priv.gc.ca/en/blog/20221012/?id=7777-6-493564
  10. https://github.com/statice/anonymeter
  11. https://blocksandfiles.com/2023/09/25/vast-data-model-collapse-and-the-coming-data-arms-race/
  12. https://www.betterdata.ai/
  13. van Breugel, Boris, and Mihaela van der Schaar. “Beyond privacy: Navigating the opportunities and challenges of synthetic data.” arXiv preprint arXiv:2304.03722 (2023).
  14. Hansen, Lasse, et al. “Reimagining synthetic tabular data generation through data-centric AI: A comprehensive benchmark.” Advances in Neural Information Processing Systems 36 (2023): 33781–33823.
  15. van Breugel, Boris, Zhaozhi Qian, and Mihaela van der Schaar. “Synthetic data, real errors: how (not) to publish and use synthetic data.” International Conference on Machine Learning. PMLR, 2023.
  16. Stadler, Theresa, Bristena Oprisanu, and Carmela Troncoso. “Synthetic data–anonymisation groundhog day.” 31st USENIX Security Symposium (USENIX Security 22). 2022.
  17. Giomi, Matteo, et al. “A unified framework for quantifying privacy risk in synthetic data.” arXiv preprint arXiv:2211.10459 (2022).
  18. Ganev, Georgi. “When synthetic data met regulation.” arXiv preprint arXiv:2307.00359 (2023).
  19. Alaa, Ahmed, et al. “How faithful is your synthetic data? Sample-level metrics for evaluating and auditing generative models.” International Conference on Machine Learning. PMLR, 2022.
  20. Shumailov, Ilia, et al. “The curse of recursion: Training on generated data makes models forget.” arXiv preprint arXiv:2305.17493 (2023).
  21. Jordon, James, et al. “Synthetic Data — what, why and how?” arXiv preprint arXiv:2205.03257 (2022).
  22. Bellovin, Steven M., Preetam K. Dutta, and Nathan Reitinger. “Privacy and synthetic datasets.” Stan. Tech. L. Rev. 22 (2019): 1.
