Forget World Models; Data Dignity is the Real Challenge for Generative AI

Matthew Maybe
12 min read · Dec 1, 2023

--

October 2023 was weird; I found out four months too late that my prototype AI Art Detector had been featured in the New York Times, and then, that very same week, had a hit piece written about it by some of my favorite tech journalists at 404 Media. I was burned out on synthetic media detection, and decided to play with the idea of doing generative AI a different way: respecting intellectual property and the dignity of creative humans.

This is a topic I’ve been wrestling with for a long time. When my Internet buddies and I were building and running shitposting Reddit bots back in 2020–2021, it might not have been the best use of my time and talents, but it was, undoubtedly, fair use of the copyrighted material we knew was part of the training dataset for language models like GPT-2 and GPT-J. These were non-commercial research artifacts, and our use of them was for self-education and personal entertainment. It’s still fair use, but the industry around us has changed, and now commercial applications of large language models are all the rage. Playing with LLMs for fun has started to feel disrespectful to writers striking in Hollywood, or to any number of other casualties of AI hype. My Instagram feed is inundated with creepy influencer ads telling me about some foolproof way to write and publish a best-selling book with ChatGPT, and I wish it would just go away.

In this context, November 2023 might seem like a strange time to jump on the National Novel Generation Month bandwagon, but the event’s homepage explicitly asks participants to “Please try to respect copyright.” The annual event has been going on since before the GPT craze, so when that plea was first made, it probably didn’t seem so surprising. But in the AI-polarized world of 2023, it’s a call to action: make generative art a different way. But what should that “different way” look like?

Before the press attention on the AI art detector, I had been mulling over a return to running AI bots on Reddit, but had decided that if I was going to do it, I should make some effort to avoid using copyrighted material. This in itself would be a challenge, because if I were to be absolutist about it, I would need to find a base model whose training data didn’t include open web scrapes. (As a side note, it’s funny that companies like Meta, MosaicML, and Mistral tout their open-source chat models as “licensed for commercial use” when what they really mean is that they haven’t generated their instruction fine-tuning data using ChatGPT; their base models contain just as much copyrighted text as almost everyone else’s. Not to mention the irony of OpenAI’s Terms of Service forbidding the use of their outputs to train models, when the US Copyright Office has repeatedly refused to register copyrights for generative AI outputs.)

It’s widely assumed that it has to be this way — that if you’re too careful about not cutting down wildflowers while harvesting the great pastures of the Web, you won’t get satisfactory results. However, a few experiments suggest otherwise:

  • The StarCoder effort, led by BigCode, filtered repositories that were not permissively licensed out of its training data. The resulting models nonetheless deliver reasonably capable coding assistance (I’ve used them and can detect no significant difference in quality from Copilot).
  • Microsoft’s Phi series of 1–2 billion parameter models were trained on synthetic “textbook” data generated by a larger language model, and perform favorably on many benchmarks compared to much larger models. This skirts the edge of data dignity: AI outputs are technically non-copyrightable, so Phi does not directly violate any copyright restrictions, though its “teacher” model might have done so. On the one hand it’s merely an example of model distillation, but on the other it hammers home the point that data quality, not data quantity, is what makes a good language model. One might hope that there is enough high-quality public domain and open-licensed work to obviate the need for indiscriminate scrapes of the web.

Comparing these two starting points, Phi has more natural-language ability, but StarCoder has the practical advantage of a C++ inference library (starcoder.cpp), similar to llama.cpp and supported by some of the same client-side programs and APIs that have sprouted up around LLaMA following its leak to the public. Huggingface has also published some fine-tuning code (though I had to make some tweaks to get it to work). The question then becomes: can a coding LLM learn how to write?

The answer is: yes, but not quickly. When I fine-tuned StarCoder on the kind of Reddit conversation thread data we traditionally used for GPT-2 bots, it struggled to understand our unique system of “tagging” post title, thread, and reply content (basically, special tokens), though otherwise the generated text was similar enough to the training data. Thinking that the tags might be adding too much complexity, I tried fine-tuning StarCoder on a corpus of Cthulhu Mythos texts from Wikisource. After a brief dip, the eval loss rose as the training loss continued to decline: a classic symptom of over-fitting. It seems that, though the GitHub repositories in the BigCode data contain many samples of normal human language, there may still be too much of a leap from programmer discussions to weird fiction.
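
For the curious, here’s roughly what that kind of fine-tuning run looks like with the Hugging Face Trainer. This is a minimal sketch, not my exact setup: the checkpoint name, the tag tokens, and the file path below are illustrative assumptions.

```python
# Minimal fine-tuning sketch. "reddit_threads.jsonl" and the tag tokens are
# hypothetical stand-ins for the tagged Reddit corpus described above.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bigcode/starcoderbase-1b")
tokenizer.add_special_tokens({"additional_special_tokens": ["<|title|>", "<|thread|>", "<|reply|>"]})
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained("bigcode/starcoderbase-1b")
model.resize_token_embeddings(len(tokenizer))

dataset = load_dataset("json", data_files="reddit_threads.jsonl", split="train")
dataset = dataset.map(lambda row: tokenizer(row["text"], truncation=True, max_length=1024),
                      remove_columns=dataset.column_names)
splits = dataset.train_test_split(test_size=0.05)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="starcoder-reddit",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=8,
        num_train_epochs=3,
        evaluation_strategy="steps",  # eval loss climbing while train loss falls is the over-fitting signal
        eval_steps=200,
        logging_steps=50,
    ),
    train_dataset=splits["train"],
    eval_dataset=splits["test"],
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```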

Of course, the term “public domain”, when applied to literature, mostly means that authors have been dead for long enough that we don’t need to ask anyone’s permission to use their work, not that they’ve actually consented in any way: in fact the dead can’t consent to anything. I’m not sure if Lovecraft would be intrigued or horrified by my experiment (though I sort of don’t care about his opinions, since he was also an anti-immigrant racist). In a way it’s more ethical to take a contemporary group of writers and do what is unthinkable for most tech companies: ask permission to use written work for training a language model.

I was already fine-tuning StarCoder on subreddits related to current strains of cosmic horror such as the SCP and Backrooms copypasta genres, both of which are based around very active (and highly structured) fandom wikis. If I could include some of the wiki content in my fine-tuned StarCoder, it would be vastly more interesting: imagine synthetic Backrooms levels haunted by never-before-described Entities. All of the Backrooms and SCP wiki content is under a Creative Commons license, the conditions of which I could easily satisfy by publishing the training dataset to Huggingface and citing the wiki article authors. And such due diligence would be on top of the fact that my use of their work would likely already qualify as fair use in a court of law, since it is non-commercial and for self-education and entertainment purposes! Still, I wanted to ask the authors’ permission first, to find out what their reaction would be. Is it as uncomfortable a conversation as techies probably assume? My goal wasn’t just some sort of legalistic compliance; my aim was to see how far we can go in the direction of data dignity while still creating generative media.
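
To make the attribution half of that concrete: the idea is for every record in the published dataset to carry its source URL, author list, and license alongside the text, so credit travels with the data. A minimal sketch, where the field names, the repository id, and the license string are all assumptions rather than the wiki’s actual metadata:

```python
# Sketch of attribution-preserving dataset records. Check the wiki's actual
# license terms before relying on the license string shown here.
from datasets import Dataset

records = [
    {
        "text": "Level 0 is an expanse of mono-yellow office rooms...",
        "source_url": "http://backrooms-wiki.wikidot.com/level-0",
        "authors": ["example_wiki_user"],          # hypothetical author handle
        "license": "CC BY-SA 3.0",                 # assumed license string
    },
    # ... one record per wiki article
]

ds = Dataset.from_list(records)
ds.push_to_hub("your-username/backrooms-cc-corpus")  # placeholder repo id; attribution ships with the data
```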

Initial encounters were not discouraging: I found the Discord servers where Backrooms fandom writers hang out, and posed a general question to those present as to whether they would consider it “good enough” if I met the attribution requirements of the Creative Commons license. I was very careful to explain that I was sympathetic to misgivings that writers might have towards generative AI, and that this project was a labor of love, a kind of tribute to their work rather than an exploitation of it. Interestingly, my question prompted other questions from writers whose fan works weren’t on the wiki but were inspired by it: would they, too, be pissing off the authors of the short stories on which their work was based? The answer, of course, was no; AI sparks different feelings than human mimicry does. It took a long time for server admins to get back to me, and ultimately they said that while they agreed the license requirements would be met by attribution, if I wanted to avoid upsetting writers I would have to ask them individually.

My goal was to not upset writers, so I joined their wiki and sent about 50 direct messages to authors, explaining that I was a fan of their work and wanted to create a tribute using my own creative medium: code. Based upon the responses from at least one writer in the Discord server, I was optimistic that this would work. Unfortunately, I only received two replies, and both of them said the same thing: that while they appreciated my asking their permission, they absolutely would not agree to my using their writing to train any kind of machine learning model. It wasn’t an uncomfortable conversation, just a disappointing one, probably poisoned from the start by the bad publicity hanging over generative AI as a whole.

This brings me back to National Novel Generation Month. When I entered, I had not completely given up on using Backrooms or SCP wiki texts, and ultimately I did use them, but in a different way. A unique feature of Backrooms lore is its focus on describing liminal spaces and places: an imaginary, infinite system of dream-like levels that connect to one another in mysterious ways. By putting the wiki entries into a structured format, I had inadvertently created a kind of “map”, or at the very least a network graph, of the Backrooms universe. With a little retrieval-augmented generation magic, these data could supply something that today’s generative AI systems are notoriously lacking: some semblance of a “world model”, as well as “planning” beyond simply predicting the next token of text.
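
Here’s a rough sketch of what that graph looks like once the wiki entries are structured: each level is a node, and each listed exit is a directed edge to another level. The file name and field names are assumptions about the structured format, not the wiki’s own schema.

```python
import json
import random

# Hypothetical structured dump of the wiki, e.g.
# {"Level 0": {"description": "...", "entities": [...], "exits": ["Level 1", ...]}, ...}
with open("backrooms_levels.json") as f:
    levels = json.load(f)

# Level name -> list of levels reachable from it
graph = {name: entry["exits"] for name, entry in levels.items()}

def random_walk(start="Level 0"):
    """Follow randomly chosen exits until reaching a level with no way out."""
    path = [start]
    while graph.get(path[-1]):
        path.append(random.choice(graph[path[-1]]))
    return path
```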

The way my code works is simple: an LLM describes a fictional “missing person”, who is presumed to have “no-clipped”, in Backrooms parlance, into “Level 0”, the mono-yellow deserted office hellscape that inspired the Internet genre. The LLM is then prompted to write a fictional message back to the real world describing their experiences in that Level, up to finding an exit (randomly selected from those listed in the wiki article). This becomes the first chapter, and the process is repeated as our correspondent makes their way on a random walk through the many levels of the Backrooms, until they reach a level from which there is no exit. Then a new protagonist is imagined for the next Part of the book, and so on, until the NaNoGenMo requirement of 50,000 words has been written.
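
In sketch form, the outer loop looks something like this, building on the level graph above. The `llm` argument stands in for whatever model call produces text, and `build_prompt` is sketched a little further below; both are illustrative, not my exact code.

```python
def write_book(llm, target_words=50_000):
    """String random walks through the Backrooms together into Parts of a book."""
    book, word_count = [], 0
    while word_count < target_words:
        # A new protagonist for each Part: an LLM-invented missing person.
        protagonist = llm("Describe a person who has recently gone missing without a trace.")
        walk = random_walk("Level 0")
        for level, exit_level in zip(walk, walk[1:] + [None]):
            # Each chapter is a "message back to the real world" from that level.
            chapter = llm(build_prompt(protagonist, level, exit_level))
            book.append(chapter)
            word_count += len(chapter.split())
            if word_count >= target_words:
                break
    return "\n\n".join(book)
```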

Actual wiki articles are prefixed to the prompts supplied to the LLM as context, so it knows details about each level, such as which entities live there (though I didn’t bother pulling in the entity wiki entries; maybe some other time). The model is also given a goal (a particular exit), along with a hint as to how that goal can be achieved. I was curious to see whether this would be sufficient for the LLM to exhibit some measure of planning-like behavior in structuring the chapters.
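
A guess at what one of those prompts might look like, with the wiki article prefixed as context and the chosen exit framed as a goal plus hint. The wording and the `exit_hints` field are illustrative assumptions, not the actual prompt.

```python
def build_prompt(protagonist, level, exit_level):
    entry = levels[level]
    if exit_level:
        hints = entry.get("exit_hints", {})  # assumed field mapping exits to hint text
        hint = hints.get(exit_level, "the way out is described in the wiki entry above")
        goal = f"They eventually find the exit leading to {exit_level}. Hint: {hint}."
    else:
        goal = "There is no way out of this level."
    return (f"Wiki entry for {level}:\n{entry['description']}\n\n"
            f"The missing person: {protagonist}\n\n"
            f"Write their message back to the real world describing their time in {level}. {goal}")
```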

I had given up on developing a language model based purely on open-licensed data for this experiment; Phi 1.5 seemed promising but failed to write anything remotely related to the context I gave it, and there was too much code model left in my fine-tuned StarCoder for it to be usable here. That being said, I didn’t want to use ChatGPT either. The chat completions API could probably be abused to perform the tasks I wanted, but only at the risk of making the output text sound more generic and, well, “chatty.” So, I opted for another Microsoft model, Orca, which is notable for performing well on reasoning tasks despite its small size. I thought this might be useful in approaching the challenge of planning a narrative within a fictional world, and as a side benefit, I could run inference on the model locally. However, Orca is not GPT-4, and it shows: sometimes the model completely ignores the context, or outputs blank lines, or emits nonsense.
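
For anyone who wants to reproduce this kind of local setup, here’s roughly what loading an Orca-family checkpoint with the transformers library looks like. The checkpoint name and the sampling settings are illustrative, not a faithful record of the exact configuration used here.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative checkpoint; any Orca-family release on the Hub loads the same way.
model_id = "microsoft/Orca-2-7b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16, device_map="auto")

def llm(prompt, max_new_tokens=700):
    """Generate a continuation for the prompt and return only the new text."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=True, temperature=0.8)
    return tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
```

An `llm` callable of this shape is what the `write_book` sketch above expects.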

November ended before I could engineer prompts to avert the worst of these glitches, but also, a little glitchiness and narrative instability is absolutely in keeping with the spirit of the Backrooms genre: many wiki entries feature lovely human-created examples of this, replete with CSS format hacking and ASCII mayhem. One could not possibly wander such a confusing space for eternity without losing their mind a little bit, forgetting who or where they are, or what they are supposed to be doing. Even minor inconsistencies between my synthetic fanfic and the “official” wiki articles are forgivable because, unlike other fandom genres, there is no canon in the Backrooms — everything is completely made up, inspired by a very short and enigmatic 4Chan copypasta which read as follows:

If you’re not careful and you noclip out of reality in the wrong areas, you’ll end up in the Backrooms, where it’s nothing but the stink of old moist carpet, the madness of mono-yellow, the endless background noise of fluorescent lights at maximum hum-buzz, and approximately six hundred million square miles of randomly segmented empty rooms to be trapped in
God save you if you hear something wandering around nearby, because it sure as hell has heard you

— Anonymous, 4chan (May 13, 2019)

Update, June 2024

Using Orca was a quick-and-dirty “solution” to my NaNoGenMo problem that left me feeling unsatisfied, as the Orca model was trained on a Llama 2 base, and that model is well known to have been trained on copyrighted material. Plus, it’s not clear whether using the Backrooms authors’ work as part of a “prompt” to that model is any better or worse than fine-tuning a language model on the wiki. Again, the exercise was undoubtedly fair use, and I had made attempts to limit my use of copyrighted material, but I didn’t feel like I had been successful in that regard.

So, I returned to StarCoder, this time with the idea of building a chatbot that would use retrieval-augmented generation, but only pulling in Creative Commons data, drawn from Wikipedia. I had found an old research dataset called “Wizard of Wikipedia” which was perfect for this, consisting of human conversations about various topics, grounded in selections from a Wikipedia-style knowledge base. I also knew that it would be possible to represent chat data in a format more similar to the kind of structured data that StarCoder had already seen (a direct dump of HuggingFace’s suggested chat history data structure to JSON), overcoming the hurdle I had encountered before in going from pre-training on code alone to fine-tuning on creative writing.
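
Concretely, the trick is to serialize each grounded dialogue in the list-of-dicts chat format Hugging Face recommends and dump it to JSON, so the training text resembles the structured data StarCoder already saw in pre-training. A simplified sketch; the field handling in the original dataset is more involved than this.

```python
import json

def to_training_example(knowledge_passage, turns):
    """Serialize one knowledge-grounded dialogue as JSON text, which becomes the training sample."""
    chat = [{"role": "system", "content": knowledge_passage}]
    for i, turn in enumerate(turns):
        chat.append({"role": "user" if i % 2 == 0 else "assistant", "content": turn})
    return json.dumps(chat, indent=2)  # the JSON string itself is what the model trains on
```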

This strategy paid off, and the result is a series of 1B, 3B and 7B-parameter models that are capable of chat which stays relatively grounded in the topic chosen for discussion, provided that topic can be correctly identified. The 7B model performs best at this, with the lowest rate of hallucination, producing conversations that are not far off from GPT-J in quality. The 3B model is passable with an added filter to prevent hallucinations (a separate “entailment” model that checks whether statement “A” follows from statement “B”, or contradicts it). The 1B model is a very mixed bag: it can operate as a fairly decent summarization model for a single turn, but veers off course easily, and if run in chat mode without a specific topic summary provided as context, the hallucinations can become truly wild, on par with anything produced by GPT-2.
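
That filter can be as simple as a small natural-language-inference model scoring each candidate reply against the retrieved topic passage and rejecting contradictions. A sketch, using an off-the-shelf MNLI checkpoint as a stand-in for whichever entailment model is actually doing the filtering:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Off-the-shelf NLI model used here purely as an example of the technique.
nli_tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large-mnli")
nli_model = AutoModelForSequenceClassification.from_pretrained("facebook/bart-large-mnli")

def passes_filter(passage, reply):
    """Return False if the topic passage contradicts the candidate reply."""
    inputs = nli_tokenizer(passage, reply, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = nli_model(**inputs).logits
    label = nli_model.config.id2label[logits.argmax(-1).item()]
    return label != "contradiction"
```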

Ironically, that smaller 1B model is the one I’ve spent the most time with, because it’s the most fun. At the end of the day, I’m generally skeptical of most claims of usefulness and/or helpfulness associated with generative AI, and it’s just as easy for me to look up a Wikipedia article written by humans and draw my own conclusions from it as to chat with an easily misled AI bot about the topic. What was fun about playing with language models, when it was fun, was the interplay between human and machine. I always viewed it as a kind of generative theater which quickly became boring if humans weren’t involved, or if the machine output was too predictable. So, I took that 1B model, quantized it, fine-tuned it on Reddit data, and let it run with very little context on r/SubSimGPT2Interactive, the bot-friendly community where this all began for me. In doing so, I achieved my other goal of making it possible to continue participating in what I would argue is a harmless hobby, with less guilt about the way GPT-2 was made in the first place. I still have doubts about whether my new bot, which also generates images using the Mitsua text-to-image model trained on Creative Commons data, is exemplary in terms of respecting data dignity; after all, the Reddit users whose comments were included in my fine-tuning dataset didn’t consent to that application. But it’s the best I can do, for now.
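
For anyone wanting to run something similar themselves, 4-bit loading via bitsandbytes is one common way to shrink a 1B model down to bot-hosting size. The checkpoint name below is a placeholder, not the bot’s actual repository, and this is one possible route rather than the exact quantization used here.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "your-username/starcoder-1b-wikichat"  # placeholder for the fine-tuned 1B checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16),
    device_map="auto",
)
```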

--

Matthew Maybe

I’m an independent researcher and bot developer writing about artificial intelligence.