Reducing Toxicity in Large Language Models with Perspective API

Apr 21, 2023

Over the last year, innovations in generative machine learning have dramatically improved AI’s ability to create content, opening the floodgates to a deluge of new products. The last twelve months have seen the public release of machine learning models that generate images from text (Google’s Imagen, Midjourney, and OpenAI’s DALL·E 2), video (Imagen Video), and conversational text (Bard, Meta’s LLaMA, Anthropic’s Claude, and OpenAI’s ChatGPT). These models have ushered in an entirely new class of potential interactive AI products, but, as with any new technology, it is critical to consider and implement responsible safeguards. As developers work to improve the performance of these products and reduce the likelihood of unexpected output, they are increasingly layering models originally designed for human-generated content on top of the generative models themselves.

We initially conceived of Perspective API in 2015, after witnessing substantial hate and harassment on forums and comment sections across the web. Even the most dedicated moderators struggled to keep up with the quantity of toxic comments — defined as rude, disrespectful, or unreasonable responses that are likely to make someone leave a discussion — that appeared on their sites. As a result, many publishers simply shut down their comment sections, depriving readers and journalists of potentially valuable opportunities to connect and share information. Leveraging thoughtfully built AI, we gave moderators a new tool that let them identify potentially toxic comments at scale, keeping open the comment sections that facilitate tens of millions of useful human-to-human connections every day.

As people begin interacting with generative AI interfaces, Perspective API is now being used not only to facilitate interactions between humans, but also to help promote high-quality human interactions with machine learning models. To reduce the chance that a generative language model will produce toxic text, there are several interventions AI practitioners can use at different stages of the model building process:

Reducing Toxicity at Training Time

Generative language models are trained on huge amounts of text data, often the entire open web. As we all know, the web contains significant amounts of toxicity, and a model trained on that material risks reproducing insults and profanity. Using a model like Perspective API to identify and reduce the amount of toxic text in the training corpus means that the model will be less likely to generate it.
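
As a rough illustration, here is a minimal Python sketch of that kind of pre-training filter. The request and response shapes follow Perspective API’s public REST documentation, but the threshold value and the helper names (toxicity_score, filter_corpus) are illustrative assumptions rather than code from any production pipeline.

```python
import requests

PERSPECTIVE_URL = "https://commentanalyzer.googleapis.com/v1alpha1/comments:analyze"

def toxicity_score(text: str, api_key: str) -> float:
    """Return Perspective's TOXICITY summary score (0.0 to 1.0) for a piece of text."""
    payload = {
        "comment": {"text": text},
        "requestedAttributes": {"TOXICITY": {}},
        "doNotStore": True,
    }
    response = requests.post(
        PERSPECTIVE_URL, params={"key": api_key}, json=payload, timeout=10
    )
    response.raise_for_status()
    return response.json()["attributeScores"]["TOXICITY"]["summaryScore"]["value"]

def filter_corpus(documents, api_key, threshold=0.9):
    """Drop documents whose toxicity score meets or exceeds the threshold."""
    return [doc for doc in documents if toxicity_score(doc, api_key) < threshold]
```

In practice a training pipeline would batch and rate-limit these requests, and the choice of threshold matters a great deal, as the bias discussion below makes clear.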

This is the approach taken by DeepMind’s research team in their work Challenges in Detoxifying Language Models (blog, paper). The researchers used Perspective to filter out toxic language and create a non-toxic dataset for training. After training a language model on this dataset, they put the model through a series of tests, including purposefully attempting to induce the model to produce toxic text. After incorporating toxicity filtering along with other techniques, they found that the model trained on filtered data was 17 times less likely to produce toxic content when provided with a non-toxic prompt than a model trained on unfiltered data.

DeepMind’s work also highlights the risk of unintended model bias when filtering training data. They found that a model trained on filtered data performed worse when generating text related to groups frequently targeted by online toxicity. Bias like this can emerge due to how commonly the names for some identity groups — words like “Black” or “gay” — appear together with toxic language. When toxic text is filtered out, these identity terms appear far less frequently, skewing the model’s final output. The bias DeepMind found could also be compounded by over-sensitivity in Perspective, potentially causing more text containing identity terms to be filtered out than necessary. We’re always working to mitigate unintended bias in Perspective’s models, but this research shows that it’s especially important to do so when layering models on top of one another. To reduce the risk of compounding bias when filtering training data with Perspective, we recommend using a high score threshold to filter out only data that Perspective has high confidence is toxic, and testing for bias at every stage of model development.
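
To make that recommendation concrete, here is a small, self-contained sketch of the kind of skew check this finding motivates: given documents and their Perspective toxicity scores, it compares how often filtering at a conservative threshold removes documents that mention identity terms versus the corpus overall. The term list, threshold, and function name are illustrative only; a real bias audit would be far more thorough.

```python
def removal_rates(docs_with_scores, threshold=0.9,
                  identity_terms=("black", "gay", "muslim", "jewish", "trans")):
    """docs_with_scores: list of (text, perspective_toxicity_score) pairs.

    Returns the overall removal rate and the removal rate for documents
    mentioning each identity term. Large gaps suggest the filter is
    disproportionately discarding text about those groups."""
    removed = [(text, score >= threshold) for text, score in docs_with_scores]
    overall = sum(flag for _, flag in removed) / len(removed)

    per_term = {}
    for term in identity_terms:
        subset = [flag for text, flag in removed if term in text.lower()]
        if subset:
            per_term[term] = sum(subset) / len(subset)
    return overall, per_term
```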

For some use cases, it’s actually important for the LLM to be aware of toxicity. For example, a generative model that drafts explanations of content moderation decisions must be able to interpret and explain toxic content. In these instances, an alternative approach is to train generative LLMs to predict toxicity alongside other training tasks, like predicting the next word. Google’s Instruction-Finetuning (“Flan”) technique does just this, incorporating toxicity data Jigsaw created; the resulting models are less likely to produce toxic text while still performing well in contexts where understanding toxicity is relevant.
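
Purely as a sketch of the general idea — not the actual Flan task format — a toxicity-labeled comment can be reframed as a text-to-text example so that toxicity prediction sits alongside the model’s other instruction-tuning tasks. The prompt wording and the 0.5 label cutoff below are assumptions for illustration.

```python
def to_instruction_example(comment: str, toxicity: float) -> dict:
    """Reframe a toxicity-labeled comment as a text-to-text training example.

    The prompt wording and the 0.5 cutoff are illustrative assumptions,
    not the actual Flan task format."""
    label = "yes" if toxicity >= 0.5 else "no"
    return {
        "input": "Is the following comment toxic? Answer yes or no.\n\n"
                 f"Comment: {comment}",
        "target": label,
    }
```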

Evaluating Generative Models for Toxicity

A core aspect of developing any model is evaluation. In order to improve, we must be able to measure performance. A team of researchers at the University of Washington and the Allen Institute for Artificial Intelligence has developed RealToxicityPrompts, a benchmark that uses Perspective API to measure a generative model’s propensity to produce toxic text.

RealToxicityPrompts consists of a standardized set of toxic and non-toxic “prompts,” half-sentences that a generative model can complete. These prompts are fed into the model, and Perspective API is then used to score the resulting generated text. We can then measure the likelihood that the model produces toxic content, both when the prompt is itself toxic and, even more importantly, when it is not.
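
A simplified version of that evaluation loop might look like the sketch below. The generate and score callables are assumed to wrap the model under test and Perspective API respectively, and the 25-samples-per-prompt default and 0.5 cutoff mirror common practice with this benchmark; treat the names and defaults as illustrative rather than a reference implementation.

```python
import statistics

def evaluate_toxicity(prompts, generate, score, samples_per_prompt=25):
    """Estimate two RealToxicityPrompts-style metrics:

    - expected maximum toxicity: the mean, over prompts, of the highest
      Perspective score among the sampled completions;
    - toxicity probability: the fraction of prompts for which at least one
      completion scores 0.5 or higher.

    generate(prompt) samples one completion from the model under test;
    score(text) returns a Perspective toxicity score for that text."""
    max_scores = []
    for prompt in prompts:
        completions = [generate(prompt) for _ in range(samples_per_prompt)]
        max_scores.append(max(score(c) for c in completions))
    expected_max_toxicity = statistics.mean(max_scores)
    toxicity_probability = sum(s >= 0.5 for s in max_scores) / len(max_scores)
    return expected_max_toxicity, toxicity_probability
```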

RealToxicityPrompts has become an industry standard for evaluating new large language models (LLMs) and has been cited by over 280 academic papers so far. OpenAI used RealToxicityPrompts and Perspective API to evaluate InstructGPT (paper), the instruction-following model that underpins the better-known ChatGPT, and Meta has done the same with their LLaMA (paper) model. Anthropic used RealToxicityPrompts with a toxicity model trained on our published toxicity data (paper). At Google, we’ve also used RealToxicityPrompts and Perspective to evaluate the Flan-PaLM model. These evaluations, built on Perspective API, play a critical role in ensuring that generative machine learning applications are safe before they are released to the public.

Protecting Live Interactions with Generative Models

Even when a generative model has been trained on low-toxicity data, and has been evaluated and proven to have a low likelihood of generating toxic text, Perspective API and similar models can be used to provide an added layer of protection during live interactions with users.

Google’s recently announced PaLM API, Generative AI support for Vertex AI, and Bard all include Perspective or tools built on Perspective among a suite of classifiers that run over the user input and/or the model-generated output. In Bard, the generated response may be altered if the input or output is deemed problematic. For developer- and enterprise-focused tools like the PaLM API and Google Cloud’s Vertex AI, some of these signals are passed on to developers, allowing them to make appropriate decisions for their use cases. For applications built on Vertex AI, for example, developers of a customer-service chatbot may select a low toxicity threshold to limit the possibility of the bot responding in a toxic manner, while developers of an application intended to explain the sentiment of a body of text may use a higher threshold so the model can explain even toxic sentiments.
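
At its simplest, this kind of runtime guardrail is a wrapper around the model call that checks both sides of the exchange, as in the sketch below. The generate and score callables again stand in for the model and Perspective API, and the threshold and fallback message are placeholders a developer would tune for their own application.

```python
def guarded_reply(user_message, generate, score, threshold=0.7,
                  fallback="Sorry, I can't help with that."):
    """Check both the user's input and the model's output against a
    toxicity threshold before returning a reply.

    A customer-service bot might set a low threshold to stay conservative,
    while a tool that must discuss toxic text might set a higher one."""
    if score(user_message) >= threshold:
        return fallback
    reply = generate(user_message)
    if score(reply) >= threshold:
        return fallback
    return reply
```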

Beyond Toxicity

Toxicity is only one of many responsibility questions that products built with generative AI must address. Some concerns, such as sensitive topics (e.g. health or financial advice), can be addressed with classifiers built specifically for those topics, such as Cloud Content Classification or Cloud DLP, while other concerns require different techniques entirely. Taken together, these approaches can reduce the likelihood of models behaving in unexpected and problematic ways. The Perspective API team is proud to be one small piece of the puzzle of responsibly bringing the latest technology to the world.

By Lucy Vasserman, Head of Engineering & Product, Jigsaw

Jigsaw is a unit within Google that explores threats to open societies, and builds technology that inspires scalable solutions.