Model Corruption: Risks of AI-Generated Data in ML

Anyverse™
Dec 14, 2023 · 4 min read

AI-generated data is in the spotlight these days. The widespread adoption of generative AI tools capable of producing realistic images or text, exemplified by technologies like DALL-E, MidJourney, Stable Diffusion, ChatGPT, or Leonardo, has pushed the societal consequences of these advancements to the forefront of public discussion. As the papers cited throughout this article show, data produced by generative AI is not free of uncertainty; above all, there are serious doubts about its negative impact on model training.

AI-generated data’s influence on model integrity

If we focus on the area that interests us, computer vision, and therefore image data, these tools are possible thanks to the vast number of images available on the Internet. At the same time, these generative AI tools become image generators themselves, expanding the data sources available to train machine learning models. However, as presented in the recent paper Towards Understanding the Interplay of Generative Artificial Intelligence and the Internet, this new approach to generating data raises several unknowns.

As a result, upcoming iterations of generative AI tools will be trained on a combination of human-crafted content and AI-generated content, creating a potential feedback loop between generative AI and public data repositories. This dynamic prompts numerous questions: How will future generations of generative AI tools perform when trained on a blend of authentic and AI-generated data? Will the new datasets improve, or will they, on the contrary, be corrupted? Could this evolutionary process introduce biases or diminish diversity in subsequent generations of ML models and generative AI tools?

The paper's authors explored the impact of this interaction and presented some preliminary findings derived from basic diffusion models trained on several image datasets.

The outcomes indicated a degradation in both the quality and diversity of the generated images over time, implying that the inclusion of AI-created data may lead to undesirable effects on subsequent versions of generative models.
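
To make the feedback loop concrete, here is a minimal, self-contained sketch (our own toy illustration, not the papers' experiment): a "generative model" that is nothing more than a Gaussian fitted to its training data, retrained each generation exclusively on its own samples. Because every fit is estimated from a finite sample, estimation noise compounds, and the fitted spread, a crude proxy for diversity, typically drifts toward zero over successive generations.

```python
# Toy self-consuming training loop: fit a Gaussian, sample from it,
# refit on the samples, repeat. All names and constants here are our
# own choices for illustration.
import numpy as np

rng = np.random.default_rng(seed=42)
N = 50                                    # small dataset: estimation noise compounds quickly
data = rng.normal(0.0, 1.0, size=N)       # generation 0 is trained on "real" data

for gen in range(301):
    mu, sigma = data.mean(), data.std()   # "train": estimate mean and spread
    if gen % 50 == 0:
        print(f"generation {gen:3d}: fitted sigma = {sigma:.4f}")
    data = rng.normal(mu, sigma, size=N)  # next dataset = the model's own samples
```

In this toy, a larger dataset slows the drift but does not remove it; only refitting on fresh real data re-anchors the model. Real diffusion models are vastly more complex, but the papers above observe the same qualitative pattern.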

AI-generated data’s reliability questioned

Another study, conducted jointly by researchers at Stanford and Berkeley, reached similar conclusions. The paper delves into the consequences of training generative systems on diverse combinations of human-generated and AI-generated content.

Despite the early stage of generative AI evolution, there is already evidence suggesting that retraining a generative AI model on its own creations, termed “model poisoning,” results in a spectrum of artifacts in the output of the newly trained model. Notably, large language models (LLMs) exhibit irreversible defects, leading to the production of nonsensical content — referred to as “model collapse.”

In the realm of image generation, studies demonstrate that retraining StyleGAN on its own creations produces images with visual and structural defects, with a notable impact on output as the ratio of AI-generated content used for retraining varies from 0.3% to 100%. The vulnerability extends beyond GAN-based image generation to diffusion-based text-to-image models, as seen in research showing image degradation and a loss of diversity when these models are retrained on their own creations.

Expanding on prior studies, the paper highlights the vulnerability of the widely used open-source model Stable Diffusion to data poisoning. Through iterative retraining on self-generated faces, the model experiences an initial modest improvement but swiftly collapses, yielding highly distorted and less diverse faces. Intriguingly, even with only 3% of self-generated images in the retraining data, the model collapse persists.
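
For intuition about the experimental design, the sketch below isolates the single knob these retraining studies vary: the fraction of self-generated images mixed into each round's training pool. This is our own illustration, not the papers' code; the function name, arguments, and use of plain Python lists are assumptions made for readability.

```python
import random

def build_retraining_pool(real_images, generated_images,
                          self_ratio, size, seed=0):
    """Assemble a retraining pool in which a fraction `self_ratio` of the
    images are the model's own outputs (hypothetical helper).

    The cited studies sweep this ratio (0.3% to 100% for StyleGAN, around
    3% for the Stable Diffusion face experiment) and feed each round's
    generated images back into the next round's pool.
    """
    rng = random.Random(seed)
    n_self = round(self_ratio * size)
    pool = (rng.choices(generated_images, k=n_self)
            + rng.choices(real_images, k=size - n_self))
    rng.shuffle(pool)
    return pool
```

Iterating this loop (retrain, generate, rebuild the pool) is what produces the collapse described above, even at surprisingly small ratios.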

The study further explores the extent of model poisoning beyond the prompts used for retraining and investigates the potential recovery of the poisoned model through further retraining solely on real images. This comprehensive exploration sheds light on the intricate challenges and consequences associated with the interplay between generative AI training and its own outputs.

The lack of accuracy in AI-generated data raises questions

It seems there is still a long way to go to demonstrate the viability of generative AI for crafting training data for new models, especially given its inability to generate, beyond images that merely look realistic to the human eye, all the metadata that training requires.

This is especially true for models that will be deployed in critical applications, such as autonomous driving or in-cabin monitoring systems, where the accuracy of the data is fundamental and matters more than ever. Indeed, the European Union has already drafted a first proposal for a regulation on artificial intelligence, which begins to detail how accurate a model's training data must be for the model to be considered valid.

Synthetic data is an impressive resource for continuing to advance and refine computer vision models, but developers must clearly be careful when evaluating and selecting the method used to produce it: not all synthetic data is equal, and not all of it is equally valid for training models that demand data accuracy.

Anyverse™

The hyperspectral synthetic data platform for advanced perception