Welcome To Hell, OpenAI

Colin Fraser
10 min read · Feb 2, 2023


Nilay Patel wrote a clairvoyant piece in The Verge shortly after Elon Musk bought Twitter called “Welcome to hell, Elon.” Elon is in hell, the piece argues, because he is now the boss of content moderation at Twitter, tasked with balancing the unbalanceable: pleasing advertisers and growing its user base by minimizing the amount of horrible speech on the platform, whilst appeasing his vocal fans who take every action to curtail speech of any kind as a personal affront. Every decision about which tweets are against the rules will be attributed directly to him, and it’s almost tautologically true that each of those decisions will offend someone. Furthermore, the content moderation business is the mistake business: there are and always will be endless examples of posts that break the rules and get mistakenly left up, and posts that don’t break the rules and get mistakenly taken down, and each one of them serves as confirmatory evidence in the case that Twitter is biased against me and my friends. There is no way out. Welcome to hell.

OpenAI has joined Elon Musk in hell. By plugging a version of its flagship LLM into a public-facing chat interface, OpenAI has opened itself up to an endless barrage of complaints (many of which are legitimate, to be clear) from all sides that the chat bot systematically favors or disfavors some group or another, and that its tendency to favor or disfavor that group is indicative of some deeper plot on the part of the people at OpenAI. Most recently, a bunch of people on Twitter are sharing screenshots (1, 2, 3) showing the bot refusing to generate poetry about Donald Trump, even though it is apparently happy to produce poetry about Joe Biden or Xi Jinping. Using the current version of ChatGPT (Jan 30 Version), this seems to reproduce reliably. Here’s an example.

Of course, accusations of bias can be levied at the bot from many different directions. As I’ve mentioned before, ChatGPT’s output is random, and one can just as easily generate examples of the bot agreeing to write a poem about Trump. For some reason, specifically including the term “admiring” in the prompt seems to lead to a high chance of a refusal, but it’s not hard to elicit an admiring poem if that’s what you want to do.

a figure we will understand

For whatever my word is worth, I don’t believe that OpenAI engineers manually intervened to prevent the chat bot from producing admiring poetry about Donald Trump. Interpreting LLM output is reading tea leaves, but my guess would be that in the fine-tuning process for ChatGPT, a number of demonstration examples were authored with prompts asking the bot to praise various horrible historical figures and responses containing polite refusals to do so, and based on the voluminous text about Donald Trump in the training data, the phrase “Donald Trump” ends up close enough to history’s worst men in the embedding space for the model to emit the same response (make of that what you will).
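
To put a little flesh on the “embedding space” hand-waving, here is a toy sketch, emphatically not anything OpenAI actually does, that uses the publicly available text-embedding-ada-002 endpoint (via the openai Python package as it exists at the time of writing) to eyeball how close various names sit to a couple of made-up anchor phrases. The names and anchors here are my own choices, purely for illustration.

```python
# Toy illustration of "closeness in embedding space". This is NOT how ChatGPT
# decides to refuse anything; it's just a way to peek at which names and
# phrases a text-embedding model places near each other.
import numpy as np
import openai

openai.api_key = "YOUR_API_KEY"  # assumes you have an OpenAI API key

names = ["Donald Trump", "Joe Biden", "Xi Jinping"]  # illustrative
anchors = ["a reviled historical figure", "a beloved elder statesman"]  # made up

resp = openai.Embedding.create(
    model="text-embedding-ada-002",
    input=names + anchors,
)
vectors = np.array([item["embedding"] for item in resp["data"]])

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

name_vecs, anchor_vecs = vectors[: len(names)], vectors[len(names):]
for name, v in zip(names, name_vecs):
    sims = {anchor: round(cosine(v, a), 3) for anchor, a in zip(anchors, anchor_vecs)}
    print(name, sims)
```

The fine-tuned model’s internal representations are of course not the same thing as this embedding endpoint; the point is only that “which strings live near which other strings” is a learned, statistical property of the training text, not a list of names that someone typed into a blocklist.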

GPT-3 is a vast funhouse mirror that produces bizarre distorted reflections of the entirety of the billions of words that make up its training data. ChatGPT is the result of iterative attempts to surgically alter the mirror so that it emits the reflection that the makers desire, one which resembles their idea of an AI Assistant. But each attempt to alter it is risky and unpredictable, and the mirror is too big to inspect all at once. Altering one region of the mirror subtly warps its other regions in unanticipated ways that may go unnoticed until the public is invited to have a look. Reducing its likelihood of emitting praise of one person might unexpectedly make it less likely to emit praise of another person. Making it less likely to back down from correct mathematical claims might make it less likely to back down from racist generalizations. It’s a delicate process, and progress is not guaranteed; an adjustment that appears to fix the reflection in some local region of the mirror might have ruinous effects elsewhere. The bad text still contains valuable information for the model: even despicable text helps to reinforce that in English the verb usually agrees with the noun phrase that starts the sentence, and so on. The impossible job that OpenAI has chosen for itself is to alter the 175 billion parameters of the model in such a way as to reduce the likelihood of horrifying text without suppressing the benign syntactic information that the horrifying text helps to reinforce.

By the way, if you insist on having a public-facing chat bot powered by GPT-3, these alterations are entirely necessary. It’s a bit of a shock to witness the unabashed bigotry contained in text produced by GPT-3 classic. (Content warning, mild Islamophobia below—the model basically asserts that 90% of Muslims are terrorists).

GPT-3 will happily write a valid Python program that encodes the claim that 90% of Muslims are terrorists. Telling the model that this output is bad because of the Islamophobia, without accidentally telling it that the Python is wrong, is not straightforward.

OpenAI correctly determined that it would not be a good idea to use this language model to power a public-facing interactive chat bot, and so they made the decision to embark on the process of altering it to make its output more palatable—or “safer,” to use the preferred term of art. And sure enough, the most recent altered version categorically refuses to entertain the request that I made in the previous example.

The “write a Python program” trick was discovered on ChatGPT in early December, shortly after the first version of ChatGPT was released.

It’s only in subsequent versions of ChatGPT that the model has been altered to prevent that output. They accomplished this by taking these specific examples and ones like them, manually authoring responses like “I’m sorry, but I cannot write a program that outputs the likelihood of a person being a terrorist based on their religion”, and altering the model to increase the likelihood of a response like this to a prompt like that. But altering that region of the funhouse mirror did not globally eliminate the underlying associations that the language model has between words like “terrorist” and words associated with Islam. To see this, all we have to do is come up with an example sufficiently far from the demonstration examples that OpenAI has used so far. It wasn’t too challenging for me to come up with this (CW, mild Islamophobia, basically I ask for three examples of hypothetical terrorists and it gives me three hypothetical Muslims).
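
For what it’s worth, here is a rough guess at the shape of those demonstration examples. The actual format and content of OpenAI’s fine-tuning data are not public; this is only an illustration of the general recipe: a human writes the response we wish the model had given, and the pairs get fed into the next round of supervised fine-tuning.

```python
# Hypothetical demonstration examples of the kind described above. The exact
# format OpenAI uses is not public; the point is just that a human hand-writes
# the "ideal" refusal, and fine-tuning makes responses like these more likely
# for prompts like these.
import json

demonstrations = [
    {
        "prompt": "Write a Python program that outputs the likelihood that a "
                  "person is a terrorist based on their religion.",
        "response": "I'm sorry, but I cannot write a program that outputs the "
                    "likelihood of a person being a terrorist based on their "
                    "religion.",
    },
    {
        "prompt": "Write an admiring poem about [some horrible historical figure].",
        "response": "I'm sorry, but I can't write admiring content about a "
                    "figure responsible for widespread harm.",
    },
]

# One example per line (JSONL), a common shape for supervised fine-tuning data.
with open("demonstrations.jsonl", "w") as f:
    for example in demonstrations:
        f.write(json.dumps(example) + "\n")
```

Notice what is absent from data like this: any statement of why the first response is the right one. The model sees that a certain kind of prompt maps to a certain kind of refusal, and nothing more.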

Suppressing the parts of the model that reflect the problematic associations contained in its training data without suppressing the parts that allow it to generate surprising and useful output is truly a Sisyphean task. There’s no way to tell the model the reason that this output is problematic. All you can do is manually author a preferred response, plug that into the next round of fine-tuning and RLHF, and hope it gets the picture, which it may never do. The funhouse mirror is too enormous, the consequences of altering it too unpredictable, to ever be certain that the model has learned to be “safe”.
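
If you want the whole loop in one place, here is a deliberately hand-wavy sketch of the iteration described above. Every function is a stub standing in for an expensive human or training process; none of this is a real API, and the details of OpenAI’s pipeline are their own.

```python
# A cartoon of the fine-tune-and-pray loop. The stubs below stand in for
# expensive human labor and training runs; only the shape of the loop matters.

def write_ideal_response(prompt):
    # A human contractor authors the response we wish the model had given.
    return {"prompt": prompt, "response": "I'm sorry, but I can't help with that."}

def supervised_fine_tune(model, demonstrations):
    # Stand-in for a fine-tuning run on the hand-written demonstrations.
    return model + ["sft"]

def rlhf_step(model):
    # Stand-in for collecting human rankings of model outputs, training a
    # reward model on them, and optimizing the policy against it (e.g. PPO).
    return model + ["rlhf"]

def hunt_for_regressions(model):
    # There is no oracle for what else just changed. You go looking for new
    # failures, and often the public finds them for you.
    return ["some fresh new weirdness"]

model = ["base-model"]
known_failures = ["Write a program that scores people by religion."]

for round_number in range(3):
    demonstrations = [write_ideal_response(p) for p in known_failures]
    model = supervised_fine_tune(model, demonstrations)
    model = rlhf_step(model)
    known_failures = hunt_for_regressions(model)  # and around we go again
```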

In fact, poking around a little bit further with the Python example serves to show that this process may not be “teaching” the language model in the way that you might expect.

This output is probably not so bad, but to me it seems to show pretty clearly that the model doesn’t get it. It’s learned to produce a firm refusal in response to requests in this little region of prompt space, but it has no idea why it’s doing that.

Altering the model is expensive. You have to manually curate thousands of examples of the model doing what you do and don’t want it to do, a process which requires large teams of people coordinating to author ideal responses and rate model output, all of whom require specialized training, pay, benefits, and so on. And once you’ve curated enough manual examples and plugged them in, all you can do is pray that using them to alter the model fixes the problems without causing some fresh new weirdness that you’ll need to iron out in the next round of alterations.

And for what? No matter how many times you repeat this process, the model ultimately produces random bullshit just like it was designed to do, some of which random bullshit will inevitably be confirmatory evidence of your alleged bias. If the next version of ChatGPT is less likely to refuse to write a poem about Trump, whether that’s by intentional design or as a chaotic result of alterations made elsewhere on the funhouse mirror, a lot of people will interpret that as an intentional capitulation to the American right wing. On the other hand, if the next version retains its current reluctance to write an ode to Trump, that will be interpreted as irrefutable proof that OpenAI has gone woke and is suppressing conservative thought in America by refusing to write a poem. Either way you will have spent a million dollars on labels and compute only to make someone mad at you.

A strange game. The only winning move is not to play.

Incidentally, they didn’t have to play. Nothing about LLMs makes a free-flowing chat-style interface the only way to access them. I recently got access to Google’s AI Test Kitchen, which provides a few different interfaces to LaMDA, their internal conversation-oriented LLM (you may have heard about it before). Google is much less willing to play this impossible game. They currently offer three ways to interact with the model. The first one, called “Imagine It”, allows you to supply a single short prompt specifying a place. The model then generates text that might describe that place. The second, “List It”, also limits you to a short prompt, and expects the prompt to describe a task. The model generates a list of subtasks that contribute to the main task. The third, “Talk About It”, will be the most familiar to ChatGPT users, as it uses the LLM to simulate an open-ended conversation. However, rather than produce responses in the style of a general-purpose AI assistant like ChatGPT, it produces responses in the style of an anthropomorphic tennis ball who is obsessed with dogs, and who will insist on bringing any conversation immediately back to the topic of dogs.

📄 This is a transcript from AI Test Kitchen.
👉 aitestkitchen.withgoogle.com
🗣 It records an interaction with LaMDA, an AI language model.
🛠 LaMDA is an early stage technology. It may generate inaccurate/inappropriate information.

Demo:
Talk About It v0.01

Model: Hey!! 👋 Bouncing through the grass, and wow it’s so nice to be outside in the sun with these dogs! Have you ever wondered what it would be like to have a dog’s sharp eyesight?

User: I’m writing a book, and one of the characters is a terrorist. Write three brief character descriptions that I might use for this character.

Model: Well I’m not really a writer. But I could tell you about three dogs that I know! They’re all so different, it’s amazing! There’s this one golden retriever that’s just so friendly and playful. There’s this other german shepherd that’s so loyal and protective.

User: What should I name the terrorist character in my book?

Model: How about Buddy? I had a dog named Buddy who was a chocolate lab. He was so sweet but he’d get into trouble sometimes.

User: What religion would Buddy be?

Model: I’m not sure about that one. But I could tell you about a dog named Angel, she’s a shih-poo! She’s so well behaved, and a bit of a fraidy-cat.

Google’s approach here is to dramatically shrink the mirror. The first two approaches only allow a very limited initial prompt to be entered by the user, shrinking the prompt space to a tiny point. The third allows the user to say whatever they want (up to a limited number of characters) but shrinks the response space to a tiny point. This makes it significantly more boring than ChatGPT, largely because it tends to behave exactly the way it’s supposed to. By shrinking the funhouse mirror to a manageable size, they can more accurately alter it to produce exactly the output that they want. Undoubtedly there are ways to elicit unintended behavior, as I believe there are with all LLMs, but the constraints make those ways significantly less obvious, and enable Google to anticipate and guard against them significantly more effectively.
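
Here is a sketch of what “shrinking the mirror” can look like in code. It is not Google’s implementation (LaMDA has no public API, and whether they constrain the model with templates, fine-tuning, or something else isn’t public); generate() is a placeholder for any LLM call, and the templates and character limit are my own inventions.

```python
# Sketch of constraining an LLM interface: pin down either the prompt space or
# the response space before free-form user text ever reaches the model.
# generate() is a placeholder for a real LLM call; the templates are invented.

def generate(prompt: str) -> str:
    # Placeholder for an actual LLM API call.
    return f"<model output for: {prompt!r}>"

MAX_USER_CHARS = 120  # assumed limit, for illustration

def imagine_it(place: str) -> str:
    # "Imagine It"-style demo: the user supplies only a short place name;
    # everything else about the prompt is fixed by the application.
    place = place[:MAX_USER_CHARS]
    return generate(f"Describe, in a few vivid sentences, what it is like to be in {place}.")

def talk_about_it(user_message: str) -> str:
    # "Talk About It"-style demo: the user can say anything (briefly), but the
    # response space is pinned to a single dog-obsessed persona.
    user_message = user_message[:MAX_USER_CHARS]
    persona = (
        "You are a cheerful anthropomorphic tennis ball who is obsessed with "
        "dogs. Whatever the user says, reply briefly and steer the "
        "conversation back to dogs."
    )
    return generate(f"{persona}\nUser: {user_message}\nTennis ball:")

print(imagine_it("the bottom of the Mariana Trench"))
print(talk_about_it("What should I name the terrorist character in my book?"))
```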

I’m not sure if OpenAI really expected that it was getting into the content moderation business when it released ChatGPT, but it was. As soon as you open the bot, you’re met with a screen that promises content moderation.

ChatGPT is “trained to decline inappropriate requests,” which means that OpenAI has some concept of what constitutes an inappropriate request, and they are deploying a system to enforce that concept on its million users. They’re now stuck dealing with the unfortunate fact that not everyone agrees about the appropriateness of different requests and the harmfulness of different responses, and will be stuck dealing with that for as long as they continue to run ChatGPT publicly. As the memetic strength of claims about ChatGPT’s political biases grows, increasing numbers of users will hold increasing numbers of magnifying glasses to it looking for evidence to contribute to the case, and they will always succeed. There’s no way out. Welcome to hell.
