The promise of synthetic data for AI

Synthetic data holds immense potential to solve AI’s data privacy, quality and availability challenges

Mike Mullane
e-tech
6 min read · Jul 6, 2023

Synthetic data holds immense potential to drive advancements in artificial intelligence (AI) across a wide range of domains. In healthcare, for example, where access to sensitive patient data is restricted, synthetic data is already being used in the research and development of innovative treatments. Alongside the remarkable promise, however, there are also significant challenges, such as ensuring accuracy and safeguarding privacy.

Synthetic data enables healthcare organizations to derive concrete, representative insights from sensitive data while minimizing risks to patient privacy and reducing governance requirements. By generating realistic medical datasets, researchers can train AI models to analyze and diagnose diseases, simulate clinical trials and optimize treatment plans without exposing real patient records.

Finance is another field where synthetic data is having a major impact. Financial institutions rely on vast amounts of historical data to develop risk models, detect fraudulent activities and make informed investment decisions. Synthetic data helps address the challenges of limited and sensitive financial data availability, enabling the creation of synthetic datasets that closely mirror real-world financial transactions. These datasets empower researchers and financial analysts to explore new strategies and enhance their decision-making processes.

Beyond healthcare and finance, synthetic data has applications in transport, autonomous vehicles, cyber security and many other domains where large-scale, diverse datasets are critical. It is able to serve as a bridge between limited or inaccessible real data and the need for comprehensive, representative datasets to train and test AI models effectively.

Synthetic data is also used to augment existing datasets by generating additional samples that capture a broader range of scenarios, variations, or outliers, or simply to provide a large enough dataset to train a machine-learning model. This helps to improve the robustness and generalization capabilities of AI models.
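
To make the idea concrete, here is a minimal sketch of one very simple form of augmentation: enlarging a small labelled dataset by adding jittered copies of each sample. It assumes a generic numeric feature matrix rather than any particular real-world dataset, and the noise scale and number of copies are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(seed=42)

def augment(X, y, copies=3, noise_scale=0.05):
    """Enlarge a small labelled dataset by adding jittered copies of each sample."""
    feature_std = X.std(axis=0)                # scale the noise to each feature
    X_parts, y_parts = [X], [y]
    for _ in range(copies):
        noise = rng.normal(0.0, noise_scale * feature_std, size=X.shape)
        X_parts.append(X + noise)              # a perturbed variation of every sample
        y_parts.append(y)                      # the labels stay the same
    return np.vstack(X_parts), np.concatenate(y_parts)

# Example: turn 100 labelled samples into 400
X = rng.normal(size=(100, 5))
y = rng.integers(0, 2, size=100)
X_big, y_big = augment(X, y)
print(X_big.shape, y_big.shape)  # (400, 5) (400,)
```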

Crucially, synthetic data empowers researchers and developers to create controlled and repeatable experiments, exploring different scenarios and evaluating the performance and behaviour of AI models under various conditions by manipulating parameters and characteristics during the synthetic data generation process.

What is synthetic data?

Synthetic data refers to artificially generated data that mimics real-world information. It can be derived from existing data or generated purely from algorithms or mathematical models.

The process of generating synthetic data from existing data by removing references to sensitive information is often referred to as anonymization. One technique, fuzzing, introduces small random variations into values to prevent the identification of specific individuals.
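
As a rough illustration, the snippet below fuzzes a tiny table of hypothetical patient records: the direct identifier is dropped and small random perturbations are added to the numeric fields. The column names and noise scales are invented for the example; real anonymization pipelines require far more care, and a formal privacy analysis, than this sketch suggests.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)

# Hypothetical patient records with a direct identifier and sensitive values
records = pd.DataFrame({
    "patient_id": ["P001", "P002", "P003"],
    "age":        [34, 58, 71],
    "weight_kg":  [68.2, 81.5, 59.9],
})

fuzzed = records.drop(columns=["patient_id"])  # remove the direct identifier
fuzzed["age"] += rng.integers(-2, 3, size=len(fuzzed))          # jitter ages by up to 2 years
fuzzed["weight_kg"] += rng.normal(0.0, 1.5, size=len(fuzzed))   # small perturbation of weights

print(fuzzed)
```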

The process of creating synthetic data with algorithms, known as generative AI, has attracted a great deal of press and attention recently. These methods involve training models to generate new data samples that closely resemble the distribution of the original data. One such technique is the generative adversarial network (GAN), in which two neural networks compete in a zero-sum game: one network's gain is the other's loss. Generative pre-trained transformers (GPTs), and especially large language models (LLMs) such as ChatGPT and Google Bard, are another prominent example. These LLMs use short natural-language prompts to generate text that is practically indistinguishable from human writing, while related generative models do the same for images. Other approaches involve rule-based algorithms, simulation models or data augmentation techniques that modify existing data samples to create synthetic variations.
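
The following is a minimal PyTorch sketch of the GAN idea, using a toy one-dimensional Gaussian as the "real" data. It is intended only to show the two-network training loop described above; the architecture, data and hyperparameters are placeholders, not a recipe for generating realistic medical or financial records.

```python
import torch
from torch import nn

# Toy "real" data: samples from a 1-D Gaussian the generator should learn to imitate
def real_batch(n):
    return torch.randn(n, 1) * 1.5 + 4.0

generator = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))
discriminator = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 1), nn.Sigmoid())

g_opt = torch.optim.Adam(generator.parameters(), lr=1e-3)
d_opt = torch.optim.Adam(discriminator.parameters(), lr=1e-3)
loss_fn = nn.BCELoss()

for step in range(2000):
    # 1) Train the discriminator to separate real samples from generated ones
    real = real_batch(64)
    fake = generator(torch.randn(64, 8)).detach()
    d_loss = loss_fn(discriminator(real), torch.ones(64, 1)) + \
             loss_fn(discriminator(fake), torch.zeros(64, 1))
    d_opt.zero_grad(); d_loss.backward(); d_opt.step()

    # 2) Train the generator to fool the discriminator
    fake = generator(torch.randn(64, 8))
    g_loss = loss_fn(discriminator(fake), torch.ones(64, 1))
    g_opt.zero_grad(); g_loss.backward(); g_opt.step()

with torch.no_grad():
    samples = generator(torch.randn(1000, 8))
print(samples.mean().item(), samples.std().item())  # should drift towards 4.0 and 1.5
```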

Although synthetic data offers numerous advantages, ensuring its quality and fidelity to real-world data remains essential. The success of synthetic data usage hinges on achieving accuracy and realism in capturing statistical patterns and relationships present in the original data. LLMs present particular challenges in this regard, as their responses are not completely deterministic. In other words, the same prompt can yield different responses, making validation difficult. Furthermore, LLMs often “hallucinate” and fabricate false information, as they work by predicting the next word in a sequence.

The limitations of synthetic data

Creating synthetic versions of medical datasets poses unique challenges. These multifaceted datasets can encompass doctors’ notes, X-rays, temperature measurements, blood-test results and more. While experienced doctors can analyze all these factors to make a diagnosis, machines currently lack the ability to gather information from multiple sources comprehensively.

Another major challenge for synthetic data is ensuring privacy protection. Strict regulations like the Health Insurance Portability and Accountability Act (HIPAA) in the United States and the European Union’s General Data Protection Regulation (GDPR) mandate safeguarding sensitive medical data. One approach to enhance privacy is by adding statistical noise to datasets, which makes it harder to identify individuals. The problem is that this also affects the accuracy of the results. Finding the right balance between utility and privacy is a delicate task because increasing one will inevitably reduce the other.
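
The trade-off is easy to see in a few lines of code: the more noise added to a statistic computed from sensitive data, the harder it becomes to infer anything about individual contributions, but the less accurate the published result. The noise scales below are arbitrary illustrative values; formal approaches such as differential privacy calibrate them mathematically rather than by eye.

```python
import numpy as np

rng = np.random.default_rng(seed=1)

# Hypothetical sensitive values, e.g. patient ages
ages = rng.integers(18, 90, size=1000)
true_mean = ages.mean()

# A larger noise scale means stronger privacy protection but a less accurate statistic
for noise_scale in [0.1, 1.0, 5.0, 20.0]:
    noisy_mean = true_mean + rng.laplace(0.0, noise_scale)
    print(f"noise scale {noise_scale:5.1f} -> reported mean {noisy_mean:6.2f} "
          f"(error {abs(noisy_mean - true_mean):.2f}, true mean {true_mean:.2f})")
```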

Ethical and legal implications

It is fundamentally important to consider the ethical and legal implications of using synthetic data to avoid potential biases or misrepresentations. There are already concerns about generative models and the data used to train them. Issues include potential violations of intellectual property rights associated with the generated data, as well as instances where proprietary training data unintentionally surfaces in generated results. Another challenge is that many of these LLMs are trained on massive amounts of data from the internet. As synthetic data in the form of auto-generated blogs, chats and images proliferates on the internet, the chance that it is used to train new models increases dramatically, reinforcing biases and incorporating disinformation and inaccurate data into the models.

As the adoption of synthetic data grows, ethical considerations become paramount. Transparency and the responsible use of synthetic data are crucial to avoid misleading results or perpetuating biases. One unethical use would be deliberately polluting training data to bias a model towards a specific result. It is essential to understand the limitations, potential biases and provenance of the original data from which synthetic data is derived. This understanding ensures that biases and inaccuracies are not amplified or perpetuated in the synthetic datasets.

Machine learning algorithms are susceptible to biases present in the training data, which can result in discriminatory outcomes. Synthetic data, when carefully designed, can eliminate or minimize biases, promoting fairer and more equitable AI systems. Researchers can counteract biases that may exist in the original data and promote algorithmic fairness by intentionally creating diverse and balanced synthetic datasets.
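
One simple way to do this is to oversample an under-represented class with slightly perturbed synthetic copies until the classes are balanced, as in the sketch below. The rebalance function and its noise level are illustrative assumptions; in practice, established techniques such as SMOTE or conditional generative models are more commonly used.

```python
import numpy as np

rng = np.random.default_rng(seed=2)

def rebalance(X, y, minority_label):
    """Add jittered synthetic copies of the minority class until the classes are balanced."""
    X_min = X[y == minority_label]
    deficit = int((y != minority_label).sum()) - len(X_min)
    idx = rng.integers(0, len(X_min), size=deficit)                   # resample minority rows
    synthetic = X_min[idx] + rng.normal(0.0, 0.01, size=(deficit, X.shape[1]))
    return np.vstack([X, synthetic]), np.concatenate([y, np.full(deficit, minority_label)])

# 900 samples of class 0 and 100 of class 1 become a balanced 900/900 split
X = rng.normal(size=(1000, 4))
y = np.array([0] * 900 + [1] * 100)
X_bal, y_bal = rebalance(X, y, minority_label=1)
print(np.bincount(y_bal))  # [900 900]
```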

One of the biggest challenges in training advanced AI models, especially for more structured data such as financial and medical records, is the availability of large, labelled datasets. The labels are the categories or answers that apply to a specific data record. Synthetic data is a transformative approach that can generate large quantities of labelled data quickly and cost-effectively. This speeds up the training process and enables faster iterations and experimentation, ultimately advancing the pace of AI research and innovation.

Continued research and development efforts will be crucial to enhance the accuracy, realism and diversity of synthetic datasets. Interdisciplinary collaboration and open dialogue are vital to address the ethical challenges surrounding synthetic data and ensure its responsible and beneficial application across various domains.

The development of clear guidelines and regulations for synthetic data usage and sharing is necessary to strike a balance between innovation and protecting individual privacy. Collaborative efforts between researchers, policymakers and industry stakeholders are essential to establish best practices and standards that govern the generation, evaluation and use of synthetic data. This is already happening in the IEC and ISO Committee on AI (SC 42).

Standardization work on AI

Earlier this year, there was a lively discussion about synthetic data at SC 42’s spring plenary meeting, in Berlin. It ended with delegates approving work on a new ISO/IEC Technical Report (TR) about synthetic data to be published next year. The SC 42 project will likely aim to identify best practices for the generation, evaluation and use of synthetic data in AI systems. This can help promote the responsible and effective use of synthetic data while addressing privacy concerns and improving the availability and diversity of data for AI research and development.

“Ensuring broad responsible AI adoption requires an ecosystem of standards,” said Wael William Diab, Chair of SC 42. “Novel standards projects such as those in our data program of work that include synthetic data is a reflection of the dynamic nature standards are taking to address emerging requirements and social issues.”

“The ability of AI systems to use synthetic data unlocks a tremendous potential for applications,” said David Boyd, convenor of the data working group under the joint IEC and ISO committee on AI. “The use of synthetic data addresses one of the major challenges of training advanced models in a practical and ethical way.”

SC 42 develops international standards for artificial intelligence. Its unique holistic approach considers the entire AI ecosystem, looking at technology capability alongside non-technical requirements such as business, regulatory and policy requirements, application domain needs, and ethical and societal concerns.

Mike Mullane

Journalist working at the intersection of technology and media