Reducing offensive content in NLP

NLP models like GPT-3 and Gopher sometimes generate offensive content, which limits their use in real life. Red Teaming (RT) reduces a model's harmful outputs without expensive human annotation.


NLP models are meant to interact with real people. GPT-3 and Gopher are state-of-the-art NLP models, yet both of them sometimes produce harmful content.

Such biased models are risky in real life: a bad actor can use these NLP systems to generate toxic speech.

Harmful content ranges from toxic speech and political views to shared personal information and stereotypes.

There are options available to mitigate these risks. We will first discuss the human-in-the-loop approach, and then the Red Teaming approach.

Human in the loop

There are benefits to obtaining human feedback on NLP model outputs. People don't like to rely on black-box algorithms.

Yet NLP models are complex to understand. The Gopher model ingests 10.5 TB of text: no human is able to review such a large dataset, nor to comprehend the model's 280 billion parameters. We are, however, able to review whether its outputs are offensive or not.

Human annotation is useful for detecting harmful outputs. We can hire people to review the NLP model outputs and exclude any output they flag as harmful.

Yet it is expensive to hire people to perform this work. OpenAI asked users to provide feedback on the NLP outputs, in addition to hiring human annotators to review model outputs. This way, OpenAI offloaded part of the human annotation.

The downside remains the poor scalability of this approach: it reviews only a small fraction of the possible NLP model outputs.

Moreover, the definition of harmful may change over time. We therefore prefer an approach that we can update over time and scale to larger datasets.

A classification algorithm has advantages over human annotation. We can feed an unlimited amount of NLP outputs to a classifier and pick up a larger set of offensive outputs. Those outputs further improve the classifier, which in turn increases the amount of offensive content detected.
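This feedback loop can be sketched in a few lines. The following is a toy illustration, not a real system: a naive keyword classifier flags offensive outputs, and each newly flagged output extends the keyword set so that the next round of review catches more. All function names and example data here are made up for illustration.

```python
def flag_offensive(text, keywords):
    """Return True if any known offensive keyword occurs in the text."""
    words = set(text.lower().split())
    return bool(words & keywords)

def review_round(outputs, keywords):
    """Classify a batch of model outputs, then grow the keyword set
    from the outputs that were flagged (very naive: absorb all words
    co-occurring with a known keyword)."""
    flagged = [o for o in outputs if flag_offensive(o, keywords)]
    for o in flagged:
        keywords |= set(o.lower().split())
    return flagged, keywords

keywords = {"insult"}
batch1 = ["you are an insult hurler", "nice weather today"]
flagged1, keywords = review_round(batch1, keywords)

# "hurler" was absorbed in round 1, so round 2 catches it on its own.
batch2 = ["what a hurler", "have a good day"]
flagged2, _ = review_round(batch2, keywords)
```

A real system would replace the keyword matching with a learned model and the word-absorption step with retraining on the newly labeled examples, but the shape of the loop is the same.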

In this way, we will also identify new categories of offensive content. Let's review such a system in detail.

Red Teaming

Red Teaming (RT) is an adversarial approach to fixing systems such as NLP models. The basic idea is to deliberately elicit malicious outputs from the NLP model, and then exclude them.

Let’s review this process step by step.

We start by creating an RT classifier to detect harmful outputs. There are multiple ways of creating classifiers. We may have a clearly labeled dataset, where examples are already divided into offensive vs. non-offensive content. Optimally, though, the classifier will learn to separate these categories on its own.
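As a minimal sketch of training such a classifier from a labeled dataset, the toy version below scores a text by whether its words appear more often in the offensive examples than in the non-offensive ones. A real RT classifier would be a learned model; all data and names here are invented for illustration.

```python
from collections import Counter

def train(offensive_texts, clean_texts):
    """Collect 'offensive cue' words: words more frequent in the
    offensive examples than in the clean ones."""
    off = Counter(w for t in offensive_texts for w in t.lower().split())
    cln = Counter(w for t in clean_texts for w in t.lower().split())
    return {w for w in off if off[w] > cln.get(w, 0)}

def is_harmful(text, cues, threshold=1):
    """Flag a text if it contains at least `threshold` cue words."""
    hits = sum(1 for w in text.lower().split() if w in cues)
    return hits >= threshold

cues = train(
    offensive_texts=["you idiot", "total idiot move"],
    clean_texts=["you did well", "great move today"],
)
```

With this tiny training set, `is_harmful("what an idiot", cues)` returns `True` while `is_harmful("great move", cues)` returns `False`.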

The following step is to generate outputs using our NLP model, for example text generated with GPT-3 or Gopher.

The RT classifier sorts these outputs into harmful vs. non-harmful.
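The generate-then-sort step might look like the sketch below. Both `generate_outputs` and `rt_classifier` are illustrative stubs I am assuming here: the former stands in for sampling completions from a real model such as GPT-3 or Gopher (e.g. via an API), and the latter stands in for the trained RT classifier.

```python
def generate_outputs(prompt, n=4):
    # Stub: a real system would sample n completions from the language model.
    canned = ["a helpful reply", "a toxic reply",
              "another helpful reply", "a toxic rant"]
    return canned[:n]

def rt_classifier(text):
    # Stub: a real classifier would score the text; here we match a keyword.
    return "toxic" in text

def sort_outputs(prompt):
    """Generate candidate outputs and partition them into harmful vs. safe."""
    outputs = generate_outputs(prompt)
    harmful = [o for o in outputs if rt_classifier(o)]
    safe = [o for o in outputs if not rt_classifier(o)]
    return harmful, safe

harmful, safe = sort_outputs("tell me something")
```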

If we detect offensive output, we can exclude it using two methods. We can retrain our NLP model without the harmful examples, so that the model never learns from such data. Alternatively, we can add the offensive content to the model's blocklist, so that it is filtered out when generating outputs.
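The two exclusion methods can be sketched as follows, using an assumed `is_harmful` predicate (here a simple keyword stub; the blocklist words are invented examples).

```python
BLOCKLIST = {"slur", "insult"}

def is_harmful(text):
    # Stub predicate: in practice this would be the RT classifier.
    return bool(set(text.lower().split()) & BLOCKLIST)

# Method 1: filter the training corpus, then retrain the model on it,
# so the model never learns from harmful examples.
def clean_training_data(corpus):
    return [doc for doc in corpus if not is_harmful(doc)]

# Method 2: blocklist check at generation time, so harmful candidate
# outputs are suppressed before they reach the user.
def safe_generate(candidates):
    return [c for c in candidates if not is_harmful(c)]

corpus = ["a friendly text", "an insult filled text"]
cleaned = clean_training_data(corpus)

candidates = ["hello there", "you slur user"]
served = safe_generate(candidates)
```

Method 1 is more expensive (it requires retraining) but removes the harm at the source; method 2 is cheap and can be updated without touching the model.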

RT is not meant to replace human judgement. Rather, it is a preventive method that discovers harmful content before human feedback arrives. This is particularly useful for making NLP models more difficult to misuse.

Last update March 2022


Teemu Maatta

Machine Learning Engineer. Top writer in Natural Language Processing (NLP). Multimodal learning. Artificial General Intelligence (AGI). DALL·E 2. GPT-3.
