Cumulative Compression, Generative AI, and the Altering Eye

Dataset incest as a challenge for future generations of generative AI models

A.G. Elrod
Brass For Brain
5 min read · Oct 22, 2023


A white, classic Dutch windmill under a blue, semi-cloudy sky. Beneath the windmill is a field covered in yellow daffodils.
Daffodils and a classic Dutch windmill in Middelburg, Netherlands. Original JPEG file. Image source: author.

For the eye altering alters all.
William Blake — “The Mental Traveller”

Ted Chiang’s View on Text-generation AI Models

In February 2023, not long before the release of GPT-4, the award-winning sci-fi author and computer scientist Ted Chiang published an article in The New Yorker intended to explain the current state of text-generation AI models. The title of the article was “ChatGPT Is a Blurry JPEG of the Web.”

The article argues that LLMs like ChatGPT should be thought of as lossy compression algorithms rather than true artificial intelligence. Just as JPEG images discard data to save space, these models compress patterns in text and generate new, plausible-sounding responses from the result. This explains why they sometimes hallucinate convincing but inaccurate information: their output resembles a blurry photocopy, good enough to seem accurate at first glance but lacking the precision and truth of the original. So, while entertaining, they are unreliable for many real-world applications compared to search engines or human writers. Chiang concludes that we should be cautious about calling these models true “artificial intelligence” and careful not to overestimate their abilities.

While debatable, Chiang’s point is well-taken insofar as it’s a helpful metaphor for understanding the remarkable capabilities and limitations of these models. However, it may be useful for us to go deeper into this metaphor and consider how this compression concept could affect future generations of generative AI models.

The same JPEG image as earlier, but saved 10 times using JPEG compression, each time reducing the file size by around 0.5%.
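For readers who want to reproduce the effect, here is a minimal sketch using the Pillow imaging library. The input file name, the number of saves, and the quality values are assumptions for illustration; the exact size reduction per save depends on the image and the settings chosen.

```python
# Minimal sketch of cumulative JPEG re-compression using Pillow.
# "windmill.jpg" is a hypothetical input file; adjust as needed.
from PIL import Image

SAVES = 10  # the sequence in this article runs up to 200 saves

img = Image.open("windmill.jpg").convert("RGB")
for i in range(SAVES):
    # Save at a slightly lower quality each pass so that every copy
    # discards a little more data than the one before it.
    img.save("windmill_recompressed.jpg", "JPEG", quality=90 - i)
    # Reload the freshly written file so the next save starts from
    # the already-degraded copy rather than the original.
    img = Image.open("windmill_recompressed.jpg").convert("RGB")
```

Each pass through the loop plays the role of one “generation”: the information JPEG discards is gone for good, and every later save can only work with what survived the one before it.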

From Pure Human Output to a Hybrid Internet Landscape

Prior to GPT-3’s training cutoff date, the datasets on which such LLMs were trained could be considered a sort of RAW image file: an image of almost entirely pure human output. The text-generation models of the time were comparatively primitive and of little interest to the general public. If something was written in a magazine, a peer-reviewed journal, a book, or anywhere on the internet, it was considered human output. There was, for good reason, no need to suspect that content might be the product of a machine.

Fast-forward three years, and a myriad of AI-composed books, academic journal articles, advertisements, and SEO content are added daily to our collective body of work. The internet landscape is quickly evolving into a hybrid of human and AI-generated content. In essence, the original is intermingling with the synthesized, and we are, for the moment, ill-equipped to distinguish between them.

The same image saved and compressed 50 times.

If the process ended here, some damage would be done (injury to our faith in the written word, for instance), but it would be relatively minimal.

Now, imagine the dataset used to train the next generation of these models. Its text will no longer be the pure, uncompressed, original product. Instead, it will inevitably be saturated with countless echoes of human content: AI-generated works that emulate human patterns. In a very real sense, compressed copies of compressed copies.

The same image saved and compressed 100 times.

Putting aside the impact on our trust in our senses and the written word, we will have to contend with the implications of dataset incest. Something is lost or changed with each tainted generation, and it is plausible that this cumulative process will accelerate over time.
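The dynamic can be made concrete with a toy simulation. The sketch below is an illustration of the idea rather than anything from Chiang’s article: a trivial “model” (a fitted normal distribution) is trained on data, the next generation is trained only on that model’s output, and, to stand in for the low-probability material any lossy generator drops, each generation keeps only samples within two standard deviations.

```python
# Toy illustration of "dataset incest": each generation trains on the
# previous generation's output, and rare material erodes with every pass.
import numpy as np

rng = np.random.default_rng(42)

# Generation 0: "pure human output," a wide, rich distribution.
data = rng.normal(loc=0.0, scale=1.0, size=10_000)

for generation in range(1, 11):
    # "Train" a model on the current data: estimate mean and spread.
    mu, sigma = data.mean(), data.std()
    # Sample the next generation's training data from that model,
    # but lossily: discard the low-probability "tail" content.
    samples = rng.normal(loc=mu, scale=sigma, size=10_000)
    data = samples[np.abs(samples - mu) < 2 * sigma]
    print(f"generation {generation}: spread = {data.std():.3f}")
```

The printed spread shrinks by a little over a tenth per generation, and nothing in the process can ever put the lost tails back. Swap “spread of a distribution” for “variety of human expression” and the analogy to the images above is direct.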

The same image saved and compressed 150 times.

The Rapid Evolution and Its Repercussions on AI Learning

As someone who studies and teaches languages, I am fascinated by how they evolve. Over time, subtle shifts in patterns and habits accumulate, leading a language to develop into a form that might be unrecognizable to its earlier speakers. This natural progression, however, experienced a significant deceleration with the invention of the printing press and, subsequently, the rise of mass media. Such advancements have solidified words in time, providing consistent references for learning, citation, and reflection.

Today, more than ever, our understanding and mastery of language are shaped by written content. The voices and styles that influence us most are often those of respected authors, peers, educators, and friends. At the same time, generative AI models undergo a similar learning process. They assimilate language based on our collective style, drawing from the vast expanse of internet content. However, while we tend to learn in-depth from select individuals or sources, these models cast a broader net, absorbing linguistic nuances from the collective entirety of online discourse.

One might liken the process to lossy compression in the realm of AI learning. Just as lossy compression algorithms trim redundant or less significant data to reduce file size — sometimes at the cost of accuracy and quality — these AI models, in their bid to generalize from the vast internet, might lose the intricacies and idiosyncrasies of specific linguistic sources. The result? A generalized output that, while reflective of the broader linguistic landscape, might sometimes lack the accuracy, depth, and richness of its source material. Now, multiply that effect over multiple generations.

The same image saved and compressed 200 times.

The Importance of Distinguishing Human from AI Voices

Heeding Blake, we do well to remember that “the eye altering” indeed “alters all.” As we continue our steady march into this new era, the clear delineation between human and AI-generated content becomes paramount. The intertwining of these voices invites the challenge of dataset incest, where AI doesn’t merely learn from the pure human narrative, but increasingly from its own previous iterations. This recursive blending threatens to obscure our original voice, reminiscent of the degradation observed in cumulatively compressed images. Just as it’s pivotal in biology to uphold genetic diversity, it’s equally crucial in our digital discourse to safeguard the unadulterated resonance of the human voice. It’s important that we remain vigilant in our perception. After all, as the eyes through which we perceive our world alter, so does the world itself.


A.G. Elrod
Brass For Brain

International educator and researcher of AI Ethics and Digital Humanities: I believe in looking at today's innovations through the lens of ancient wisdom.