Best Practices for Using Off-the-Shelf Models for Content Moderation
The volume and velocity of content generated on social media can be very difficult for moderation workers to keep up with. Moderation work can also contribute to severe psychological distress for workers and researchers. Automated or computational assistance can support content moderation by reducing workload or filtering psychologically distressing content. Supervised machine learning (ML) is a common tool for automatically detecting harmful content. ML relies on data annotated by people to train models to recognize harmful content, and the people annotating these training data are directed by annotation tasks that tell them how to identify and label harm.
Content moderation communities or researchers may design their own annotation tasks to generate training data for new harmful content detection models, or use off-the-shelf (OTS) pre-trained harmful content detection models like Perspective API, Amazon Rekognition, Azure’s Content Moderation API, WebPurify API, and DeepAI’s content moderation API.
OTS model reuse can enable detection tasks in resource-constrained contexts, for example when researchers or content moderation communities do not have access to the computing power, people, time, and funding necessary to train new models. Reusing OTS models can also help to reduce the environmental impact of training new models, an energy-intensive process.
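To make the reuse option concrete, here is a minimal sketch of what calling one of these OTS models can look like, using Perspective API’s REST endpoint as the example. The endpoint, request shape, and ‘TOXICITY’ attribute follow Perspective API’s public documentation at the time of writing, but the API key is a placeholder, and you should check the current docs before relying on this sketch.

```python
import requests

# Minimal sketch: score a single comment for 'TOXICITY' with Perspective API.
# Assumes you have enabled the API and hold a valid key; the endpoint and
# request/response shapes follow Perspective API's public docs.
API_KEY = "YOUR_API_KEY"  # placeholder
URL = (
    "https://commentanalyzer.googleapis.com/v1alpha1/comments:analyze"
    f"?key={API_KEY}"
)

def score_comment(text: str) -> float:
    """Return Perspective's summary TOXICITY score (0.0 to 1.0) for `text`."""
    payload = {
        "comment": {"text": text},
        "languages": ["en"],
        "requestedAttributes": {"TOXICITY": {}},
    }
    response = requests.post(URL, json=payload, timeout=10)
    response.raise_for_status()
    scores = response.json()["attributeScores"]
    return scores["TOXICITY"]["summaryScore"]["value"]

if __name__ == "__main__":
    print(score_comment("You are a wonderful person."))
```

The other services listed above expose broadly similar request-and-score patterns, but each defines and scores harm differently, which is exactly why choosing among them takes some care.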
But how do you know which one to use?
In our research, we developed a decision tree to help you decide which (if any) OTS model to use for your harm detection task, when all you have access to is the model itself, the model’s supporting documentation, and information about the annotation task used to prepare training data for the model. We started by asking the following questions about three OTS models:
- What do these models say they’ll do (think explicit promises about which harm concept they will detect)?
- What do these models actually do?
- How do these models’ actual behaviors line up with what they explicitly promise to do?
- What can explain models’ actual behaviors (if not their explicit promises)?
Answering these questions helped us make sense of how we can expect OTS models to behave. Based on what we found, we put together a decision tree that can help you decide whether an OTS model is right for what you’re trying to do as a content moderator or researcher. We’ve included the decision tree in the image below, and we’ll talk you through how to use it next (for a more detailed walk-through, see pages 13–19 of our published article).
Decision Tree User Guide
Step 1 [Box 1–3]: First things first: look at how the model defines the harm concept it says it’s going to detect!
We reviewed fourteen models’ harm concept definitions, and found that they typically describe harm according to comment authors’ intents [author features]; comments’ directedness at a particular individual or group, featured behaviors, tone, or specific language [comment features]; or effects on comment audiences [audience features]. The figure below provides a visual breakdown of these harm concept components.
We’ve annotated Jigsaw Perspective API’s definition of ‘toxicity’ and Davidson et al.’s definition of ‘hateful’ with harm concept features as examples:
Toxicity: “A rude, disrespectful, [audience-based effect] or unreasonable [author-based intent] comment that is likely to make readers want to leave a discussion [audience-based effect].”
Hateful: “Expresses hatred [author-based intent] toward a targeted group [comment-based directedness] or is intended to be derogatory, to humiliate, or to insult [author-based intent] the members of the group [comment-based directedness].”
In each of these definitions, you can see references to some combination of author-based intent, comment-based features, and/or audience-based effect.
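If it helps to see that taxonomy written out, here is one purely illustrative way to record which feature types each of the two definitions above relies on (the structure and key names are ours, not anything the models or papers provide):

```python
# Illustrative only: the two annotated definitions above, broken down by the
# feature types they rely on (author intent, comment features, audience effect).
HARM_DEFINITIONS = {
    "toxicity (Perspective API)": {
        "author_intent": ["unreasonable"],
        "comment_features": [],  # no concrete comment features are named
        "audience_effect": [
            "rude, disrespectful",
            "likely to make readers want to leave a discussion",
        ],
    },
    "hateful (Davidson et al.)": {
        "author_intent": [
            "expresses hatred",
            "intended to be derogatory, to humiliate, or to insult",
        ],
        "comment_features": [
            "toward a targeted group",
            "the members of the group",
        ],
        "audience_effect": [],
    },
}
```

Notice that, under this breakdown, the ‘toxicity’ definition names no comment features at all; we come back to why that matters in Step 2.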
Keep in mind, however, that when annotators are given definitions like these to identify harmful content in training datasets, they typically only receive access to the comments themselves and may not have access to information about a comment author’s intent or the effect that the comment had on an audience. We find that the more a harm concept definition specifies comment features in particular (e.g., directedness toward ‘a targeted group’ as per Davidson et al.’s ‘hateful’ definition), the more consistently annotators may identify harm in comments. If an OTS model’s harm concept definition specifies comment features that annotators can use to identify the presence of that concept, you can reasonably expect the model to detect that concept (Box 1).
Now you can check whether the OTS model’s harm concept definition matches up (fully or partially) with the concept you actually want to study. It might seem simple, but it’s an important step! If it fully matches, things are looking good for reuse and you can move on to Box 6.
If the harm concept definition doesn’t specify comment features and/or doesn’t match up with what you want to study, move on to Box 4.
Step 2 [Box 4]: Even if the OTS model’s harm concept definition doesn’t specify comment features (think Perspective API’s ‘toxicity’ definition) or the definition only partially matches up with what you’re trying to study, not all hope is lost!
Check whether you can adapt the OTS model to better reflect your concept of interest. For example, in addition to toxicity scores for content, Jigsaw’s Perspective API returns scores for the attributes ‘insult’, ‘identity attack’, ‘profanity’, ‘threat’, and ‘severe toxicity’, as per their definitions of these terms. While you may not be able to change the weights of these attributes (unless you work on Perspective API’s development team), you could consider training your own model on a content sample labeled by Perspective API, with the attributes weighted differently. It might be a little messy, but this approach could give you a weighted definition of harm that relies on more specific comment features and matches your harm detection needs. If you can adapt the model to your needs, move on to Box 6. If not, move on to Box 5.
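Here is a rough sketch of that adaptation idea. The attribute names are Perspective API’s real production attributes, but the weights, the threshold, and the downstream classifier (a TF-IDF plus logistic regression pipeline from scikit-learn) are ours, purely for illustration, and would need tuning and validation against your own concept of harm.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical weights that lean on more specific attributes (insults,
# identity attacks, threats) rather than the overall 'TOXICITY' score.
# These numbers are illustrative, not recommendations.
ATTRIBUTE_WEIGHTS = {
    "TOXICITY": 0.10,
    "SEVERE_TOXICITY": 0.10,
    "IDENTITY_ATTACK": 0.30,
    "INSULT": 0.30,
    "THREAT": 0.15,
    "PROFANITY": 0.05,
}
LABEL_THRESHOLD = 0.5  # also illustrative

def weak_label(attribute_scores: dict) -> int:
    """Turn one comment's Perspective attribute scores (as returned by the
    API, e.g. {'INSULT': 0.8, ...}) into a 0/1 label using our own weights."""
    weighted = sum(
        ATTRIBUTE_WEIGHTS[name] * attribute_scores.get(name, 0.0)
        for name in ATTRIBUTE_WEIGHTS
    )
    return int(weighted >= LABEL_THRESHOLD)

def train_adapted_model(comments: list, scores: list):
    """Train a simple classifier on comments weak-labeled via Perspective.
    `scores[i]` holds the API's attribute scores for `comments[i]`."""
    labels = [weak_label(s) for s in scores]
    model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
    model.fit(comments, labels)
    return model
```

Whether this counts as adapting the model closely enough to your concept is a judgment call; the point is that the per-attribute scores give you more to work with than a single toxicity score.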
Step 3 [Box 5]: In our research, we found that regardless of harm concept definition, annotators identified comments containing insults and name-calling (comment features) as harmful. We’re not totally sure why this happened; it might be that insults and name-calling reflect how annotators perceive harm in general, for example. While these findings need more digging into to see whether the same pattern holds across other harm concepts, we tentatively suggest that if the concept you’re trying to detect is mostly about insults or name-calling, you may still be able to reuse the OTS model effectively regardless of its harm definition. If so, go ahead to Box 6. If not, we don’t recommend reusing the model.
Step 4 [Box 6–8]: These steps are all about reviewing the annotation task used to generate training data for the OTS model. We include these boxes because, though counterintuitive, we found that annotation tasks to detect particular harm concepts don’t always include the harm concept’s definition.
So first (Box 6), we recommend you verify that the annotation task actually included the harm concept’s definition. If yes, go on to Box 7. If not, avoid reuse! Next, if the harm definition provided in the annotation task refers to comment author intent or the comment’s effect on an audience (Box 7), check whether the annotation task provides additional information to help annotators make judgments about these features (Box 8). If not, we don’t recommend reuse. If yes, you might still have a good candidate for reuse.
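To pull the four steps together, here is the decision logic as we’ve described it above, written as a small function. This is our own sketch of the flow in the figure, not a substitute for it; the box numbers in the comments refer to the boxes above, and the detailed walk-through is on pages 13–19 of the paper.

```python
def should_reuse(
    definition_specifies_comment_features: bool,           # Boxes 1-3
    definition_fully_matches_your_concept: bool,           # Boxes 1-3
    can_adapt_model_to_your_concept: bool,                  # Box 4
    concept_is_mostly_insults_or_name_calling: bool,        # Box 5
    annotation_task_included_definition: bool,              # Box 6
    definition_refers_to_intent_or_audience_effect: bool,   # Box 7
    task_helps_annotators_judge_intent_or_effect: bool,     # Box 8
) -> str:
    """Sketch of the decision tree; returns a reuse recommendation."""
    # Step 1 (Boxes 1-3): a full match on a definition that specifies
    # comment features sends you straight to the annotation-task checks.
    if not (definition_specifies_comment_features
            and definition_fully_matches_your_concept):
        # Step 2 (Box 4): can you adapt the model to your concept?
        if not can_adapt_model_to_your_concept:
            # Step 3 (Box 5): the insults / name-calling fallback.
            if not concept_is_mostly_insults_or_name_calling:
                return "avoid reuse"
    # Step 4 (Boxes 6-8): review the annotation task itself.
    if not annotation_task_included_definition:
        return "avoid reuse"
    if (definition_refers_to_intent_or_audience_effect
            and not task_helps_annotators_judge_intent_or_effect):
        return "avoid reuse"
    return "reasonable candidate for reuse"
```

For example, a definition like Perspective API’s ‘toxicity’, which names no concrete comment features, would start with definition_specifies_comment_features=False and route you through Boxes 4 and 5 before the annotation-task checks.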
Let us know what you think
If you end up trying out our decision tree to evaluate an OTS model for reuse, let us know how it went! What worked well for you about it? What felt like it was missing? You can find our full paper and contact information here.

