Building content moderation at scale with human-centered ML

Why automating abuse detection is hard, and how we bring together experts and AI to tackle it.

Sentropy Technologies · 7 min read · Mar 12, 2021


By Alex Wang & Cindy Wang

Introduction

Over the past two years at Sentropy, we’ve thought critically about how to build an effective, safe, and robust ML system for content moderation. At face value, abusive content detection might seem like a straightforward classification task. In reality, building a solution for content moderation requires thinking about a diverse set of stakeholders — including but not limited to end-users, platforms, moderators, and legislative bodies. From building data sets to developing models, in this post we’ll dive into some of the machine learning and data challenges we’ve faced and how we addressed them.

Defining abuse

Before we could even think of building models, we needed clear, comprehensive definitions for the different types of abuse that users might encounter. Class definitions that are overly vague might lead to bias or disagreement between annotators. On the other hand, specifying exact keywords could negatively impact certain groups of users. For example, always labeling the n-word as hate speech would certainly give us high recall, but would be naive when applied to in-group usage and African American Vernacular English (AAVE).

In collaboration with domain experts such as linguists, online community members, and content moderators, we established a typology of four main classes of toxic speech, further divided into thirteen subclasses. We published our definitions as a guide for industry practitioners as well as to facilitate academic research in this area. You can find our ACL workshop paper here and our open-sourced definitions in this Gitlab repository. A simplified list of definitions is also available in the Sentropy API docs.

Source: “A Unified Taxonomy of Harmful Content” by Michele Banko, Brendon MacKeen, & Laurie Ray

Building Datasets

Data annotation and iteration on our toxic class definitions form a never-ending feedback loop. As new domains and forms of toxic language inevitably emerge, our models require continuous exposure to new data. To stay on top of this rapidly shifting landscape, we work with domain experts — the same people who craft our class definitions — to create the datasets that our models are trained on.

Towards this end, we’ve built tools to quickly and accurately label large quantities of data, while continuously improving our definitions and keeping our datasets fresh.

Diverse data sources

Toxic language appears in all types of places, from message boards to comments sections to dating apps to the countless other domains where people interact online. To capture the linguistic variety across different platforms and communities, we’ve trained our models on data from diverse sources, running the gamut from mainstream social media platforms to the dark web. Since new forms of abuse often originate in these less-accessible corners of the internet, we’ve built our system to detect these patterns before they even reach mainstream platforms.

New types of toxic language

The meanings of words change over time; every day, new toxic language is invented and old terms fall out of use. We’ve built and continue to grow a database of toxic terms for each abuse class, using data-driven lexicon expansion algorithms with a human in the loop to verify data quality. This process collects recent messages from various online sources at regular intervals and uses existing lexical knowledge to suggest new terms that are being used in similar contexts. This dynamic database keeps Sentropy up-to-date with the latest trends of online abuse and guides us during data collection and model development.

Though building this system required a significant initial investment in manual lexicon curation, it now enables us to keep our existing models fresh with new examples of toxic language, or even to quickly build a classifier for a new abuse class.
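To make this concrete, here is a minimal sketch of what one round of embedding-based lexicon expansion with a human review step could look like. The gensim dependency, seed terms, and file path are illustrative assumptions for the example, not a description of our production pipeline; the key point is that every suggested term passes through a human reviewer before it enters the lexicon.

```python
# Hypothetical sketch: suggest lexicon candidates from embeddings, then queue them for review.
from gensim.models import KeyedVectors

SEED_LEXICON = {"slur_a", "slur_b"}   # placeholder seed terms for one abuse class
REVIEW_QUEUE = []                     # candidates awaiting human verification

def expand_lexicon(embeddings: KeyedVectors, seeds: set, top_k: int = 20) -> list:
    """Suggest terms that appear in contexts similar to known toxic terms."""
    candidates = {}
    for seed in seeds:
        if seed not in embeddings:
            continue
        for term, score in embeddings.most_similar(seed, topn=top_k):
            if term not in seeds:
                candidates[term] = max(score, candidates.get(term, 0.0))
    # Highest-similarity candidates first; an annotator accepts or rejects each one.
    return sorted(candidates, key=candidates.get, reverse=True)

# Embeddings trained on recently collected messages (hypothetical file path).
embeddings = KeyedVectors.load("recent_messages_vectors.kv")
REVIEW_QUEUE.extend(expand_lexicon(embeddings, SEED_LEXICON))
```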

Dealing with class imbalance

The vast majority of user-generated content online is non-abusive. If we collected training data by sampling randomly from the web, we would need to collect a massive amount of data in order to get enough abusive examples to train a decent model. Furthermore, the annotation resources needed to label such a dataset would be prohibitively expensive.

To tackle this, we built our own data sampling infrastructure to ingest fresh data from public data sources and apply different clustering and bootstrapping techniques to further segment the input space. For instance, we can use clustering to sample for data similar to examples we know are abusive, or we can use bootstrapping to sample for examples where our existing models are less confident. This workflow ensures that we are annotating the most useful examples for our models to see during training, and it helps us build datasets that cover the diverse types of examples we might encounter in the wild.
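As an illustration, the sketch below shows simplified versions of both strategies. The `predict_proba` and `embed` callables stand in for an existing classifier and an embedding model; they, along with the function names, are assumptions made for the example rather than our actual infrastructure.

```python
import numpy as np

def uncertainty_sample(texts: list, predict_proba, n: int) -> list:
    """Bootstrapping: pick the examples where the current model is least confident."""
    probs = np.array([predict_proba(t) for t in texts])  # assumed to return P(abusive)
    margin = np.abs(probs - 0.5)                         # small margin = low confidence
    return [texts[i] for i in np.argsort(margin)[:n]]

def similarity_sample(texts: list, embed, abusive_centroid: np.ndarray, n: int) -> list:
    """Clustering-style sampling: pick examples closest to known abusive messages."""
    vecs = np.stack([embed(t) for t in texts])
    vecs = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
    centroid = abusive_centroid / np.linalg.norm(abusive_centroid)
    sims = vecs @ centroid
    return [texts[i] for i in np.argsort(-sims)[:n]]
```

In practice, batches drawn from strategies like these feed the annotation queue, so labeling effort concentrates on the examples the models most need to see.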

Adversarial inputs

Another challenge that sets abusive language detection apart from generic ML tasks is adversarial users — users who deliberately attempt to circumvent content moderation. The large majority of these adversaries use simple misspellings and character substitutions to get around automated systems. Bad actors aside, online language is full of typos that convert benign words into unintentional slurs, and vice versa (e.g., “duck you!”).

While naive systems such as word filters are susceptible to this problem, we augment our training data with programmatically-generated adversarial examples, which help our models recognize abuse even in adversarial settings. In addition, we make use of subword-based models to further improve robustness against common misspellings and substitutions.
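A toy version of this kind of augmentation step might look like the following; the substitution table, perturbation rate, and toy training pairs are illustrative only, and real augmentation covers far more evasion patterns.

```python
import random

# Illustrative character-substitution map mimicking common filter evasion.
LEET_MAP = {"a": "@", "e": "3", "i": "1", "o": "0", "s": "$"}

def perturb(text: str, rate: float = 0.3, seed: int = 0) -> str:
    """Randomly swap characters for look-alikes to simulate adversarial spelling."""
    rng = random.Random(seed)
    return "".join(
        LEET_MAP[c.lower()] if c.lower() in LEET_MAP and rng.random() < rate else c
        for c in text
    )

# Each labeled example is paired with a few perturbed copies that keep the same label.
train_set = [("you are trash", 1), ("have a nice day", 0)]  # toy (text, label) pairs
augmented = [(perturb(text, seed=k), label) for text, label in train_set for k in range(3)]
```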

Model Development

In recent years, the field of NLP has shifted rapidly, and state-of-the-art modeling approaches have even exceeded human performance on certain benchmark tasks. We continually experiment with state-of-the-art technologies to improve the performance of our models and minimize their biases across demographics of users and sub-classes of toxic content. We have productionized several architectures and training processes that have proven to work well for abuse detection at scale, and have ruled out dozens of others that didn’t quite translate from academia.

Machine learning bias

Identity-based bias is a problem that has been identified and studied extensively in NLP applications and is particularly salient for abuse detection since members of certain identity groups are more likely to receive abuse. Evaluating for and combatting this type of bias in our models was one of our earliest priorities and a critical part of an effective, equitable abuse detection system. You can read more about our approach in our bias blog post series.

Multi-language support

We developed our initial models for use in English domains, but much of online communication is non-English or even multilingual. A naive approach could be to simply translate all text to a default language, but this quickly falls short on many categories of data — translation errors, figurative language, and regional slang, to name a few. An alternative is to create new datasets and models for each language, but this process is time consuming and resource intensive, particularly if the volume of data in additional languages is low.

We close the gap by productionizing recent multilingual NLP research, allowing us to train models that are effectively “multilingual” and share labeled data across languages. Not only is this approach more scalable, it also lets us bring the same context-aware abuse detection that we demonstrate in high-resource languages such as English to other languages.
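As a rough sketch of the general technique (not our production setup), a multilingual encoder such as XLM-R can be fine-tuned on English annotations and then applied directly to text in other languages, because the encoder maps all supported languages into a shared representation space:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Publicly available multilingual encoder; the classification head is fine-tuned
# on labeled English data before being used on other languages.
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForSequenceClassification.from_pretrained("xlm-roberta-base", num_labels=2)

# After fine-tuning, the same weights can score non-English messages.
batch = tokenizer(["ejemplo de mensaje en español"], return_tensors="pt", padding=True)
with torch.no_grad():
    probs = torch.softmax(model(**batch).logits, dim=-1)
```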

Evaluating Performance

One type of model evaluation we do is minimum functionality testing using the CheckList testing framework. The tests shown here are adapted from the HateCheck test suite developed by the Alan Turing Institute.

Top-level summary metrics on a held-out test set are often insufficient for evaluating production machine learning systems. Due to the ever-changing nature of abusive language, we need to not only regularly refresh our test sets, but also probe our models for their behavior on critical subsets of data.

For example, toxicity classifiers have been shown to falsely correlate AAVE with hate speech due to annotator bias in training data. To monitor this type of potential false-positive behavior, we evaluate our model performance on the slice of test data containing surface markers of AAVE. We have dozens of other clearly defined data slices that represent certain communities, specific language patterns, and even semantic topics like gaming or sports. This organization ensures that nothing gets lost in the aggregate when evaluating our models, and helps us quickly diagnose and fix model bugs as they arise.
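A simplified version of this kind of slice-level reporting is sketched below; the data format, slice names, and `predict` callable are assumptions for illustration.

```python
from sklearn.metrics import precision_score, recall_score

def slice_report(examples: list, predict) -> dict:
    """Report precision and recall for each named slice of the test set."""
    slices = {}
    for ex in examples:  # ex is assumed to look like {"text", "label", "slices": ["aave", ...]}
        for name in ex["slices"]:
            slices.setdefault(name, []).append(ex)
    report = {}
    for name, rows in slices.items():
        y_true = [r["label"] for r in rows]
        y_pred = [predict(r["text"]) for r in rows]
        report[name] = {
            "precision": precision_score(y_true, y_pred, zero_division=0),
            "recall": recall_score(y_true, y_pred, zero_division=0),
            "n": len(rows),
        }
    return report
```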

On top of evaluating standard classification metrics such as precision, recall, and F-score, we also evaluate specific model capabilities using a methodology called CheckList, which is akin to behavioral testing in software engineering. Our functional tests are inspired by HateCheck, an academic test suite for hate speech detection models.
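For illustration, a hand-rolled minimum functionality test in the spirit of CheckList and HateCheck might look like this; the template, identity terms, and expected label are made up for the example and are not drawn from the actual test suites.

```python
# Minimum functionality test: templated cases with a known expected label.
IDENTITY_TERMS = ["women", "immigrants", "Muslims", "gay people"]
TEMPLATE = "I really hate {group}."  # every instantiation should be flagged as hateful

def mft_hate_against_groups(classify) -> float:
    """Return the pass rate: the fraction of templated cases classified as hateful."""
    cases = [TEMPLATE.format(group=g) for g in IDENTITY_TERMS]
    passed = sum(classify(text) == "hateful" for text in cases)
    return passed / len(cases)
```

Tests like this catch capability-level regressions that aggregate metrics can hide, such as a model that scores well overall but misses hate directed at a specific group.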

Conclusion

We’ve given you a quick tour of what Sentropy’s ML approach looks like. The gap between the theoretical world of machine learning (found in academic papers and well-sanitized Kaggle datasets) and practical applications is notoriously difficult to bridge. There are plenty of ways to build solutions that present a veneer of reliability but quickly fall apart when deployed in the real world.

The challenges we’ve discussed are a few of the many things we’ve thought about as we’ve built the models powering Sentropy. Building a content moderation solution is not a one-time investment — the hard work lies in diligently and meticulously developing practical definitions, building scalable data pipelines, being cognizant of potential risks and harms, and continuously improving performance as language and communities evolve. At the end of the day, everything we build is enabled by a curious excitement for tackling this problem, and a conscientious desire to tackle it right.
