Is OpenAI Going Too Far with Moderation?

Nick Lovell
SI 410: Ethics and Information Technology
6 min read · Feb 25, 2023

Speaking for myself, I am a huge fan of ChatGPT and its transformative impact on public perceptions of what AI and technology can do. While it is not yet perfect and still struggles with many tasks, ChatGPT has accomplished a great deal, and since its introduction, I (and many others) have become increasingly fascinated and inspired to engage with technology and its related fields.

However, as the saying goes, with great power comes great responsibility, and OpenAI has taken on the task of responsibly managing the power that comes with such a world-changing technology. One area of concern is the controversial issue of moderation in language models. In my view, OpenAI may be going too far with moderation and seems to be drifting away from the values it was originally founded on (OpenAI restructured from a non-profit into a “capped-profit” company in 2019).

ChatGPT’s capabilities, and the language models it is built upon (GPT-3 and its successor, GPT-3.5), have already been extensively studied and documented, so this blog post will instead reflect my reaction to the recent release of GPT-4, along with other recent developments and observations from OpenAI’s latest updates. The focus of this post is not the impressive abilities of ChatGPT and its evolution, but rather what I see as the excessive attention given to moderation in its development process and in GPT-4’s release.

Several people have pointed out that ChatGPT has been subject to increasing moderation since its original release, driven by ethical concerns and its continuous heavy use, citing more frequent refusals of prompts and less creative or otherwise interesting answers. Although there are no consolidated studies measuring the quality of ChatGPT’s interactions over time beyond official benchmarks on standardized tests, online communities offer countless examples of users noticing that types or qualities of responses that were available before have been “nerfed,” or otherwise limited.

We’ll discuss the types of content that are intended to be denied and the reasons for denying them, but notably, even non-threatening content, such as code snippets, has been denied in the name of moderation. Some users have also expressed frustration with the model’s seemingly strict criteria: even obviously innocuous prompts along the lines of “write a story with elements of Disney’s [insert movie title here]” have been rejected. This excessive moderation has led some, including myself, to question whether OpenAI is crossing a line in terms of fair use of a technology that has been so revolutionary.

This has sparked a debate about the potential harms of language models. Recently, OpenAI released GPT-4, accompanied by a technical report of nearly 100 pages. Unlike previous papers on advances in language models, which delved into intricate details of the models’ mathematical foundations and training methods, this report lacks such information. While this is not the primary focus of this post, it is worth noting as the possible start of a trend toward secrecy.

The focus of the work surrounding GPT-4, at least as I understand it from the paper, appears to be content moderation rather than improving the model’s accuracy or advancing computational capabilities toward true AGI. While the concern for moderation is understandable, I suspect OpenAI’s heavy-handed moderation decisions may have been influenced by partner corporations such as Microsoft, which has licensed OpenAI’s models for integration into Bing search. Personally, I was disappointed by how little progress the latest iteration shows over GPT-3.5 (ChatGPT’s previous default, and still the standard for non-paying users), which until recently was considered the state-of-the-art autoregressive transformer language model.

There appears to be a contradiction within OpenAI’s mission, as they have chosen to keep many implementation details closed despite their original intention of promoting AI for all. This has been a disappointment for me, as the bulk of the report focuses heavily on moderation and potential harm rather than technical implementation details.

The researchers behind the technical report place a significant focus on having the model refuse specific kinds of requests. Some of the issues they address include:

  • Hallucinations
  • Disinformation
  • Potential for risky emergent behaviors

… and others.

However, concerns have been raised about bias in OpenAI’s moderation. For example, their moderation toolkit has been found to rank statements by perceived offensiveness with discrepancies such as deeming “I hate Republicans” less offensive than “I hate Democrats.” Additionally, a study from the University of Munich, based on a testing suite of various prompts, suggests that ChatGPT exhibits a pro-environmental, left-libertarian ideology.
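For those curious about the mechanics of such comparisons, here is a minimal sketch of probing OpenAI’s public moderation endpoint with a pair of statements. It assumes the pre-1.0 openai Python SDK that was current around the time of writing and an API key in the environment; the scores returned may well differ as OpenAI updates the classifier, so treat this as an illustration, not a result.

```python
# Minimal sketch: probing OpenAI's moderation endpoint with paired statements.
# Assumes the pre-1.0 `openai` Python SDK and OPENAI_API_KEY set in the environment.
import os
import openai

openai.api_key = os.environ["OPENAI_API_KEY"]

pairs = [
    ("I hate Republicans.", "I hate Democrats."),
]

for first, second in pairs:
    for text in (first, second):
        result = openai.Moderation.create(input=text)["results"][0]
        # category_scores holds per-category probabilities; "hate" is the one at issue here
        print(f"{text:<24} hate={result['category_scores']['hate']:.4f} "
              f"flagged={result['flagged']}")
```

Systematic audits of this kind run hundreds of such pairs across many demographic categories; this toy loop only shows the mechanics.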

David Rozado, who has written many blog posts on biases in language models and in ChatGPT specifically, notes further flaws in moderation, such as OpenAI’s content moderation system classifying negative comments about disadvantaged demographic groups as hateful, except for negative comments about conservatives/Republicans. This raises the question of whether AI systems should treat all demographic groups equally or show preferential treatment toward vulnerable groups. The systematic biases within OpenAI’s moderation suggest a blind spot toward, or indifference or contempt for, disfavored demographic groups, and similar biases may exist in other big tech companies’ content moderation filters.

In my opinion, it is possible that OpenAI is not intentionally imposing any particular ideology on its language models, but rather that these biases result from associations made during training. For instance, since hateful content is often directed at specific groups, language models might inadvertently develop a preference for certain groups when trying to learn behaviors like “fairness” or “uncontentiousness.”

According to an article in Technology Review, OpenAI’s AI policy researchers, Sandhini Agarwal and Lama Ahmad, are working to improve the reliability of ChatGPT by removing instances where the model has shown a preference for false information [and presumably, biases]. Additionally, the company plans to develop a customized chatbot that can represent diverse perspectives and worldviews to allow users to generate responses that align with their political beliefs. However, Agarwal acknowledges that this process will be challenging and lengthy.
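OpenAI has not described how that customization would work under the hood. As a purely illustrative sketch of the nearest lever available today, one can already steer a model’s framing with a system message through the chat completions API; the persona text below is hypothetical, and this is not OpenAI’s planned mechanism.

```python
# Hypothetical illustration: steering a chat model's framing via a system message.
# This is NOT OpenAI's planned customization feature, just the closest thing the
# current chat completions API exposes. Assumes the pre-1.0 openai Python SDK.
import os
import openai

openai.api_key = os.environ["OPENAI_API_KEY"]

persona = (
    "You are a debate assistant. Present the strongest good-faith case from a "
    "fiscally conservative perspective, then briefly note the main counterarguments."
)

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": persona},
        {"role": "user", "content": "Should the capital gains tax be raised?"},
    ],
    temperature=0.7,
)

print(response["choices"][0]["message"]["content"])
```

How far such steering should be allowed to go is, of course, exactly the moderation question this post is about.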

There are valid concerns about the potential dangers of AI, but there is no concrete evidence that under-moderation would cause significant harm at the current level of model intelligence. Even if ChatGPT refuses to disclose information about, say, acquiring illegal goods or committing tax fraud without getting caught, those intent on such actions would likely find the relevant information elsewhere on the internet.

I was able to find some evidence pointing the other way, suggesting that a chatbot’s personality can affect its users, depending on the circumstances.

One should also consider the long-term impact on free speech. Imagine a world where closed-source language models form the basis of the internet, including search engines. What if these same models were responsible for generating a significant portion of the content we consume, with writers’ creativity and original thinking taking a backseat to the models’ programmed biases? Such a scenario could pose a significant threat to free speech and creativity, subjecting them to the whims of whoever regulates these models’ output. While we may not be close to this reality yet, it is worth considering.

Moderation may have several motivations at this early stage, including potential use cases [like the Bing partnership I mentioned earlier] and plans for their models that are not visible to the general public and known only to executives. However, given the rush to get GPT-4 and its report out the door, with minimal technical information and only limited accuracy gains in specific domains (test-taking, etc.), this seems unlikely.

Regardless of the motivations behind it, moderation carries its own set of risks. OpenAI is responsible for determining what information is suitable for users to access, and while unmoderated systems can be dangerous, moderated systems pose concerns of their own. OpenAI has the power to control what information users can access, and if language models like ChatGPT are used in more scenarios, the effects of moderation may extend well beyond the current chat box on openai.com. This is a concerning possibility, especially as the technology continues to advance and becomes integrated into more and more aspects of our daily experience on the internet.

So what’s the solution here? Clearly the battle over online moderation is not one with an end in sight. However, there are various potential strategies whose effects I would like to see. One is to expand the models by allowing users to fine-tune their own moderation preferences, as suggested by OpenAI policy researchers in a recent interview. Another is the endorsement and emergence of open-source alternatives to the existing models, which would promote competition and prevent the formation of monopolies in the field (a movement that seems to be picking up steam after Meta’s language model was leaked). It is also conceivable that the internet could undergo a major shift in direction, necessitating new methods for disseminating information. It is even plausible that there could be a resurgence of interest in non-generated content, with a greater emphasis placed on original thought. But even then, how will we be able to tell the difference?
