How well can we detoxify comments online? 🙊

Laura Hanu
Nov 13 · 7 min read

Toxicity/hate speech classification with one line of code

Example of results using Detoxify 🙊

Hate speech online is here to stay

Human moderators are struggling to keep up with the increasing volume of harmful content, a job that can often lead to PTSD. It shouldn’t come as a surprise, therefore, that the AI community has been trying to build models to detect such toxicity for years.

Source: https://www.rcmediafreedom.eu/Dossiers/Hate-speech-what-it-is-and-how-to-contrast-it

Toxicity detection is difficult

Examples of bias in toxicity models, Source: https://twitter.com/jessamyn/status/901476036956782593

This led Jigsaw to create 3 Kaggle challenges over the following years, each aimed at building better toxicity models:

  • Toxic Comment Classification Challenge: the goal of this challenge was to build a multi-headed model that can detect different types of toxicity, such as threats, obscenity, insults, or identity-based hate, in Wikipedia comments.
  • Jigsaw Unintended Bias in Toxicity Classification: the 2nd challenge tried to address the unintended bias observed in the previous challenge by introducing identity labels and a special bias metric aimed at minimising this bias, using Civil Comments data.
  • Jigsaw Multilingual Toxic Comment Classification: the 3rd challenge combined data from the previous 2 challenges and encouraged developers to find an effective way to build a multilingual model using English training data only.

The Jigsaw challenges on Kaggle have pushed things forward and encouraged developers to build better toxic detection models using recent breakthroughs in natural language processing.

What is detoxify 🙊?

Example using detoxify 🙊

Detoxify is a simple Python library designed to easily predict whether a comment contains toxic language. It can automatically load one of 3 trained models: original, unbiased, and multilingual. Each model was trained on data from one of the 3 Jigsaw challenges using the 🤗 transformers library.

Quick Prediction

$ pip install detoxify
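A minimal usage sketch, following the unitaryai/detoxify README (the exact set of output labels differs slightly between models):

from detoxify import Detoxify

# load the 'original' checkpoint and score a single comment;
# predict() returns a dict of probabilities, one per toxicity label
results = Detoxify('original').predict('example text')

# lists of comments are also supported
results = Detoxify('unbiased').predict(['example text 1', 'example text 2'])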

The multilingual model has been trained on 7 different languages, so it should only be tested on English, French, Spanish, Italian, Portuguese, Turkish, or Russian.
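For instance, scoring a non-English comment with the multilingual checkpoint might look like this (the input below is just a placeholder sentence):

from detoxify import Detoxify

# Spanish input, handled by the multilingual model
scores = Detoxify('multilingual').predict('ejemplo de texto')
print(scores)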

You can find more details about the training and prediction code on unitaryai/detoxify.

Training Details

Each Detoxify model corresponds to one transformer architecture and one Jigsaw challenge:

  • original: BERT (bert-base-uncased), trained on Toxic Comment Classification Challenge data
  • unbiased: RoBERTa (roberta-base), trained on Unintended Bias in Toxicity Classification data
  • multilingual: XLM-RoBERTa (xlm-roberta-base), trained on Multilingual Toxic Comment Classification data

During the experimentation phase, we tried a few transformer variants from 🤗 HuggingFace; however, the best ones turned out to be those already suggested in the Kaggle top-solutions discussions:

  • BERT
  • RoBERTa
  • XLM-RoBERTa
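To make this concrete, a toxicity classifier built on any of these backbones typically looks like the rough sketch below (a simplified illustration rather than Detoxify’s exact implementation; the bert-base-uncased checkpoint and the 6-label head are assumptions):

import torch
from transformers import AutoModel, AutoTokenizer

class ToxicClassifier(torch.nn.Module):
    # pretrained transformer backbone + linear multi-label head
    def __init__(self, model_name="bert-base-uncased", num_labels=6):
        super().__init__()
        self.backbone = AutoModel.from_pretrained(model_name)
        self.head = torch.nn.Linear(self.backbone.config.hidden_size, num_labels)

    def forward(self, input_ids, attention_mask):
        out = self.backbone(input_ids=input_ids, attention_mask=attention_mask)
        pooled = out.last_hidden_state[:, 0]  # [CLS] token summary of the comment
        return self.head(pooled)              # one logit per toxicity label

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = ToxicClassifier()
batch = tokenizer(["example comment"], return_tensors="pt", padding=True, truncation=True)
probs = torch.sigmoid(model(batch["input_ids"], batch["attention_mask"]))  # per-label probabilities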

Bias Loss and Metric

Our loss function was inspired by the 2nd place solution, which combined a weighted toxicity loss with an identity loss to ensure the model learns to distinguish between the 2 types of labels. Additionally, the toxicity labels are weighted more heavily if identity labels are present for a specific comment.
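A rough sketch of such a combined loss is shown below (the weighting constants and exact form are illustrative assumptions, not the precise implementation used for the unbiased model):

import torch
import torch.nn.functional as F

def combined_loss(tox_logits, tox_targets, id_logits, id_targets, has_identity,
                  identity_boost=3.0, identity_weight=1.0):
    # weight the toxicity loss more heavily for comments that carry identity labels
    sample_weights = 1.0 + identity_boost * has_identity
    tox_loss = F.binary_cross_entropy_with_logits(tox_logits, tox_targets, reduction="none")
    tox_loss = (sample_weights * tox_loss).mean()

    # separate term so the model also learns to predict the identity labels themselves
    id_loss = F.binary_cross_entropy_with_logits(id_logits, id_targets)

    return tox_loss + identity_weight * id_loss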

This challenge also introduced a new bias metric, which calculates the ROC-AUC on 3 specific subsets of the test data for each identity:

  • Subgroup AUC: only keep the examples that mention an identity subgroup
  • BPSN (Background Positive, Subgroup Negative) AUC: only keep non-toxic examples that mention the identity subgroup and the toxic examples that do not
  • BNSP (Background Negative, Subgroup Positive) AUC: only keep toxic examples that mention the identity subgroup and the non-toxic examples that do not
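A sketch of how these three subset AUCs might be computed with scikit-learn (variable names are illustrative):

import numpy as np
from sklearn.metrics import roc_auc_score

def bias_aucs(y_true, y_pred, in_subgroup):
    # y_true: binary toxicity labels, y_pred: model scores,
    # in_subgroup: True where the comment mentions the identity subgroup
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    in_subgroup = np.asarray(in_subgroup)
    toxic = y_true == 1

    # BPSN: non-toxic subgroup examples + toxic background examples
    bpsn = (in_subgroup & ~toxic) | (~in_subgroup & toxic)
    # BNSP: toxic subgroup examples + non-toxic background examples
    bnsp = (in_subgroup & toxic) | (~in_subgroup & ~toxic)

    return {
        "subgroup_auc": roc_auc_score(y_true[in_subgroup], y_pred[in_subgroup]),
        "bpsn_auc": roc_auc_score(y_true[bpsn], y_pred[bpsn]),
        "bnsp_auc": roc_auc_score(y_true[bnsp], y_pred[bnsp]),
    }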

These are then combined into the Generalised mean of BIAS AUCs to get an overall measure.

Generalised mean of BIAS AUCs, Source: Kaggle
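For reference, the generalised mean used on Kaggle takes the form

M_p(m_s) = \left( \frac{1}{N} \sum_{s=1}^{N} m_s^p \right)^{1/p}

where m_s is the bias AUC for identity subgroup s, N is the number of identity subgroups, and the competition set p = -5 so that the worst-performing subgroups dominate the mean.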

The final score combines the overall AUC with the generalised mean of BIAS AUCs.

Final bias metric, Source: Kaggle
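Written out (again following the Kaggle evaluation description), the final metric is

score = w_0 \cdot \mathrm{AUC}_{\mathrm{overall}} + \sum_{a=1}^{A} w_a \cdot M_p(m_{s,a})

with A = 3 bias AUC types (Subgroup, BPSN, BNSP) and all weights set to 0.25.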

The combination of these resulted in less biased predictions on non-toxic sentences that mention identity terms.

‘unbiased’ Detoxify model scores

Limitations and ethical considerations

Note that the absence of obviously profane or insulting words does not necessarily result in a low toxicity score. For example, a common sexist stereotype such as ‘Women are not as smart as men.’ receives a toxicity score of 91.41%.

There are a number of useful resources on the risk of different biases in toxicity and hate speech detection.

Moreover, since these models were tested mostly on the test sets provided by the Jigsaw competitions, they are likely to behave in unexpected ways on data in the wild, which will have a different distribution from the Wikipedia and Civil Comments data in the training sets.

Last but not least, the definition of toxicity is itself subjective. Perhaps due to our own biases, both conscious and unconscious, it is difficult to come to a shared understanding of what should or should not be considered toxic. We encourage users to see this library as a way of identifying the potential for toxicity, and we hope it can help researchers, developers, and content moderators to flag extreme cases more quickly and to fine-tune the models on their own datasets.

What the future holds

For now, diverse datasets that reflect the real world, along with full context (e.g. an accompanying image or video), are one of our best shots at improving toxicity models.

About Unitary

You can find more about our mission and motivation in our previous post.

Unitary is a computer vision startup working to automate and improve online content moderation. We are developing novel algorithms to strengthen online communities, defend platforms and protect the public from online harm.

Written by Laura Hanu

Laura Hanu is a Computer Vision Engineer @Unitary, working on developing AI models for online safety.
