How well can we detoxify comments online? 🙊

Laura Hanu
Nov 13, 2020 · 7 min read

Toxicity/hate speech classification with one line of code

Example of results using Detoxify 🙊

Hate speech online is here to stay

Human moderators are struggling to keep up with the increasing volumes of harmful content, work that can often lead to PTSD. It shouldn’t come as a surprise, therefore, that the AI community has been trying to build models to detect such toxicity for years.


Toxicity detection is difficult

Examples of bias in toxicity models

These difficulties led Jigsaw to create 3 Kaggle challenges over the following years, each aimed at building better toxicity models:

  • Toxic Comment Classification Challenge: the goal of this challenge was to build a multi-headed model that can detect different types of toxicity, such as threats, obscenity, insults, or identity-based hate, in Wikipedia comments.
  • Jigsaw Unintended Bias in Toxicity Classification: the 2nd challenge tried to address the unintended bias observed in the previous challenge by introducing identity labels and a special bias metric, aimed at minimising this bias, on Civil Comments data.
  • Jigsaw Multilingual Toxic Comment Classification: the 3rd challenge combined data from the previous 2 challenges and encouraged developers to find an effective way to build a multilingual model using English training data only.

The Jigsaw challenges on Kaggle have pushed things forward and encouraged developers to build better toxic detection models using recent breakthroughs in natural language processing.

What is detoxify 🙊?

Example using detoxify 🙊

Detoxify is a simple Python library designed to easily predict whether a comment contains toxic language. It can automatically load one of 3 trained models: original, unbiased, and multilingual. Each model was trained on data from one of the 3 Jigsaw Toxic Comment Classification challenges using the 🤗 transformers library.

Quick Prediction

$ pip install detoxify

The multilingual model has been trained on 7 different languages, so it should only be tested on English, French, Spanish, Italian, Portuguese, Turkish, and Russian.

You can find more details about the training and prediction code on unitaryai/detoxify.

Training Details

Table showing the corresponding model architecture and challenge for each Detoxify model

During the experimentation phase, we tried a few transformer variations from 🤗 HuggingFace; however, the best ones turned out to be those already suggested in the Kaggle top-solution discussions.




Bias Loss and Metric

Our loss function was inspired by the 2nd-place solution, which combined a weighted toxicity loss with an identity loss to ensure the model learns to distinguish between the 2 types of labels. Additionally, the toxicity labels are weighted more heavily if identity labels are present for a specific comment.
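The weighting idea can be sketched in plain Python. The helper names and the weight value below are illustrative assumptions, not the exact constants from the winning solution:

```python
import math

def bce(pred, target):
    # Binary cross-entropy for a single probability/label pair.
    eps = 1e-7
    pred = min(max(pred, eps), 1 - eps)
    return -(target * math.log(pred) + (1 - target) * math.log(1 - pred))

def combined_loss(tox_pred, tox_target, id_preds, id_targets, identity_weight=3.0):
    # Comments that mention an identity (any identity label set) get a larger
    # weight on the toxicity term, pushing the model to separate toxicity from
    # mere mentions of identity terms. identity_weight is an illustrative value.
    mentions_identity = any(t > 0 for t in id_targets)
    sample_weight = identity_weight if mentions_identity else 1.0

    toxicity_loss = sample_weight * bce(tox_pred, tox_target)
    identity_loss = sum(bce(p, t) for p, t in zip(id_preds, id_targets)) / len(id_targets)
    return toxicity_loss + identity_loss
```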

This challenge also introduced a new bias metric, which calculates the ROC-AUC on 3 specific test subsets for each identity:

  • Subgroup AUC: only keep the examples that mention an identity subgroup
  • BPSN (Background Positive, Subgroup Negative) AUC: only keep non-toxic examples that mention the identity subgroup and the toxic examples that do not
  • BNSP (Background Negative, Subgroup Positive) AUC: only keep toxic examples that mention the identity subgroup and the non-toxic examples that do not
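The three subsets above can be sketched as simple filters over a labeled dataset; the dictionary keys used here are illustrative:

```python
def subgroup_examples(rows, identity):
    # Subgroup AUC: only examples that mention the identity subgroup.
    return [r for r in rows if r[identity]]

def bpsn_examples(rows, identity):
    # Background Positive, Subgroup Negative: non-toxic examples that mention
    # the identity, plus toxic examples that do not.
    return [r for r in rows
            if (r[identity] and not r['toxic']) or (not r[identity] and r['toxic'])]

def bnsp_examples(rows, identity):
    # Background Negative, Subgroup Positive: toxic examples that mention
    # the identity, plus non-toxic examples that do not.
    return [r for r in rows
            if (r[identity] and r['toxic']) or (not r[identity] and not r['toxic'])]
```

A low BPSN AUC, for instance, means the model confuses non-toxic mentions of an identity with genuinely toxic background comments.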

These are then combined into the Generalised mean of BIAS AUCs to get an overall measure.

Generalised mean of BIAS AUCs, Source: Kaggle

The final score combines the overall AUC with the generalised mean of BIAS AUCs.

Final bias metric, Source: Kaggle
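Both formulas are straightforward to compute. A sketch using the constants published for the competition (power p = -5 and equal weights of 0.25 across the overall AUC and the 3 bias AUCs):

```python
def generalized_mean(aucs, p=-5):
    # Power mean of per-identity AUCs; a negative p pulls the mean toward
    # the worst-performing identity subgroup.
    return (sum(a ** p for a in aucs) / len(aucs)) ** (1 / p)

def final_score(overall_auc, subgroup_aucs, bpsn_aucs, bnsp_aucs, p=-5):
    # Equal weights of 0.25 over the overall AUC and the three bias AUCs.
    bias_means = [generalized_mean(m, p) for m in (subgroup_aucs, bpsn_aucs, bnsp_aucs)]
    return 0.25 * overall_auc + sum(0.25 * m for m in bias_means)
```

Because the generalized mean with p = -5 is dominated by the lowest subgroup AUC, a model cannot score well by performing well on average while failing badly on one identity.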

The combination of these resulted in less biased predictions on non-toxic sentences that mention identity terms.

‘unbiased’ Detoxify model scores

Limitations and ethical considerations

Words associated with profanity or insults will usually receive a high toxicity score; however, this doesn’t necessarily mean that the absence of such words will result in a low one. For example, a common sexist stereotype such as ‘Women are not as smart as men.’ gives a toxicity score of 91.41%.


Moreover, since these models were tested mostly on the test sets provided by the Jigsaw competitions, they are likely to behave in unexpected ways on data in the wild, which will have a different distribution from the Wikipedia and Civil Comments data in the training sets.

Last but not least, the definition of toxicity is itself subjective. Perhaps due to our own biases, both conscious and unconscious, it is difficult to come to a shared understanding of what should or should not be considered toxic. We encourage users to see this library as a way of identifying the potential for toxicity. We hope it can help researchers, developers, and content moderators flag extreme cases more quickly and fine-tune the models on their own datasets.

What the future holds

For now, diverse datasets that reflect the real world and full context (e.g. accompanying image/video) are one of our best shots at improving toxicity models.

About Unitary

You can find more about our mission and motivation in our previous post.




Unitary are a computer vision startup working to automate and improve online content moderation. We are developing novel algorithms to strengthen online communities, defend platforms and protect the public from online harm.

Laura Hanu


Laura Hanu is a Computer Vision Engineer @Unitary, working on developing AI models for online safety.
