Model Collapse: When AI eats itself…

Sunil Manghani
Published in Electronic Life
5 min read · Nov 20, 2023

Scientific American TikTok: ‘…in a world increasingly flooded with generated content AI could end up being the snake that swallows its own tail’.

How can we prevent model collapse in AI systems and ensure they remain accurate and unbiased? Short answer: Humans required.

In ‘Automatic Art & a Post-Knowledge Economy?’, I speculated on the idea of a ‘post-knowledge’ economy — i.e. one that goes beyond merely accumulating wealth from knowledge-based services (as opposed to physical goods and labour-based services), to an AI-driven economy that produces knowledge ‘effortlessly’, seemingly bypassing human input.

We seem to be at a ‘pivotal moment’, I wrote, ‘with AI poised to take us into a post-knowledge economy’. The weak form of this scenario would involve ‘increasing adoption of AI models and tools, but only persisting with the parroting of all that is already archived, without necessarily contributing anything new’. This ‘cut and paste’ scenario is arguably already the position we are in, ‘taking us back to the postmodern debates of simulation (as in the film The Matrix)’. Indeed, one of the criticisms of generative AI is that it merely imitates or parrots all that has been previously said. As such, ‘it is only as good as the human record upon which it draws (the massive, but partial archive of images and texts that self-supervised AI can trawl through)’.

There is another case to be made, I added, whereby ‘there is a generative account of information (such as the pictures churning out of DALL-E 2 and other such models), which act as new creative inputs in a site of exchange that is no longer the preserve of ours alone’. This is the strong form of post-knowledge economy, which admittedly I only hinted at, but which has a certain inevitability about it. And a critical problem is at stake: AI begins to eat itself!

I concluded my article with the idea of the post-knowledge economy as ‘the harbinger of a new kind of knowledge, one that might not be the sole preserve of the human, and which represents a massive (and massively open) general intellect’. The problem, however, is that as we become awash with AI-generated texts and images, a major statistical error creeps in. Speaking on the Scientific American TikTok account, Sophie Bushwick explains:

When AI trains on AI generated data it can introduce errors that build up with each iteration. In a recent study, researchers started with a language model trained on human produced content and then they fed it AI generated text over and over again. By the 10th iteration … when they asked it a question about English historical architecture it spewed out nonsense about jackrabbits. This phenomenon is called model collapse.

‘…when they asked it a question about English historical architecture it spewed out nonsense about jackrabbits’.

The study (Shumailov et al., 2023) used a small language model, but the problem is not confined to any single type of system. It is evident across a range of AI models, including Gaussian Mixture Models, Variational Autoencoders, and Large Language Models like GPT-4. In short, the problem arises whenever a model is trained on data produced by preceding models. Over time, widespread AI-generated content can lead to the gradual deterioration of model quality. As AI-generated content gets reabsorbed into training datasets, the models lose touch with genuine, human-produced data. This phenomenon underscores the growing importance of having access to real, human-generated data to maintain the integrity and effectiveness of AI systems.
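The mechanism can be illustrated with a toy sketch (my own illustration, not the study’s code; the numbers are arbitrary). A simple Gaussian ‘model’ is fitted to data, and each new generation is trained only on samples drawn from the previous generation’s model, never on the original human-produced data:

```python
import numpy as np

rng = np.random.default_rng(0)

# Generation 0: 'human' data drawn from the true distribution.
data = rng.normal(loc=0.0, scale=1.0, size=200)

for generation in range(1, 21):
    # 'Train' a model on the current data: here, simply estimate
    # a Gaussian's mean and standard deviation.
    mu, sigma = data.mean(), data.std()
    print(f"generation {generation:2d}: mu = {mu:+.3f}, sigma = {sigma:.3f}")

    # The next generation sees only data sampled from this model,
    # not the original human-produced distribution.
    data = rng.normal(loc=mu, scale=sigma, size=200)
```

Each refit slightly underestimates the spread of the data, and because the errors are never corrected against real data they accumulate: the fitted sigma tends to drift downwards over the generations, and the tails of the distribution, the rare data, are the first to disappear.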

A key strategy to prevent model collapse is ensuring access to original, human-generated content. Thus, the study highlights the necessity of incorporating genuine human interactions and data into AI training processes to sustain the benefits of these models and avoid the pitfalls of a closed-loop system where AI only learns from its own outputs. Yet, surely, a problem emerges: we are fast losing the ability to distinguish between AI-generated and human-generated materials.

And there is further cause for concern: The problem of model collapse is most acute with data that is less common. Again, Sophie Bushwick explains:

…when models collapse they’re more likely to lose this rare data that’s further from the norm. So researchers fear that this could make the problem of AI bias against marginalised groups even worse. One way to avoid model collapse could be to use only human curated datasets. But in a world increasingly flooded with generated content AI could end up being the snake that swallows its own tail.

‘One way to avoid model collapse could be to use only human curated datasets’.
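The loss of rare data can be sketched in the same toy fashion (again my own illustration, with made-up numbers). Here a minority group makes up just 2% of the ‘human’ data, and each generation of the model is refitted only to a small sample drawn from its predecessor:

```python
import numpy as np

rng = np.random.default_rng(1)

# Three groups in the 'human' data; the last is rare (2% of examples).
probs = np.array([0.68, 0.30, 0.02])

for generation in range(1, 31):
    # Each generation is 'trained' on a small synthetic sample
    # drawn from the previous generation's model ...
    sample = rng.choice(3, size=100, p=probs)

    # ... and the new model is simply the empirical group frequencies.
    counts = np.bincount(sample, minlength=3)
    probs = counts / counts.sum()
    print(f"generation {generation:2d}: rare-group share = {probs[2]:.2f}")
```

The rare group’s share fluctuates around 2% until a sample happens to contain none of its members; from that point it is zero for good, since the model can only reproduce what it has seen. That is the statistical shape of the worry about amplified bias against marginalised groups.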

If we think back to Donna Haraway’s seminal ‘Manifesto for Cyborgs,’ we encounter a framework that challenges conventional boundaries between human and machine. Her work, ironically, was admired for its ‘humanity’, striving for new connections amidst old loyalties, offering an alternative lens through which we might now view the evolution of AI and the emerging post-knowledge economy. More recently, writing after ChatGPT, Hito Steyerl’s ‘Mean Images’ provides a critical analysis of how AI-generated content, far from being neutral, reflects and amplifies existing societal biases and ideologies. Steyerl’s account is a critique of the statistical nature of AI outputs, highlighting how they represent averages (i.e. ‘mean’) and probabilities rather than truths or realities.

Whether we still await Haraway’s cyborgian moment (the blending of human and machine, the interplay of data and flesh), or we resign ourselves to the ‘mean’ average of calculable content, a further twist in the tail is the strange, jackrabbit surrealism that might await us if we just let the machine keep on churning (akin to Stanislaw Lem’s The Cyberiad). The rise of AI — our impending new ‘electronic life’ — is not just a technological revolution but also a labour and economic one. But perhaps not quite as we might imagine. Returning to our starting point of the post-knowledge economy, the irony is stark: as we lean more into artificial labour to prop up a new economy, the necessity for human-generated and human-farmed data will demand ever more human labour. Put another way: the need to prevent AI from descending into nonsense requires human intervention. This labour, no doubt precarious and underpaid, presents a paradox in which a technology supposed to lessen human workload instead increases it in unseen ways.

References

Emily M. Bender, et al. (2021) ‘On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?’, FAccT ’21: Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, pp. 610–623.

Donna J. Haraway (1991) ‘A Cyborg Manifesto: Science, Technology, and Socialist-Feminism in the Late Twentieth Century’, in Simians, Cyborgs, and Women: The Reinvention of Nature. Routledge, pp. 149–181.

Stanislaw Lem (2014) The Cyberiad. Penguin.

Ilia Shumailov, et al. (2023) ‘The Curse of Recursion: Training on Generated Data Makes Models Forget’, arXiv:2305.17493 [cs.LG], 27 May 2023 (v1), revised 31 May 2023 (v2).

Hito Steyerl (2023) ‘Mean Images’, New Left Review, No. 140/141, March–June 2023, pp. 82–97.
