Content Moderation

Many platforms, particularly social media sites and ecommerce marketplaces, establish policies to determine what they consider acceptable content. These platforms implement such policies through content moderation systems.

Values and Tradeoffs

The policy underlying a content moderation system reflects a particular set of values. When there are conflicts among those values, the policy provides a framework for managing the tradeoffs. At least that’s the theory.

In practice, content moderation tends to be messy. It’s difficult to clearly articulate a set of values, let alone to resolve their conflicts and tradeoffs. And there are inevitably exceptions to any policy. Nonetheless, a platform cannot avoid having a policy, even if it’s only implicit. And a good-faith system to implement an explicitly stated policy is better than no system at all.

This post does not take a point of view on the values or policy that a platform should use to determine what content is acceptable. Beyond legal and regulatory compliance, that’s an ethical or editorial choice. Rather, this post explores how to apply content understanding methods in the implementation of a content moderation policy.

Classification and Annotation

Broadly speaking, a content moderation system can act at two levels: on the document as a whole, or on a portion of a document such as a word or phrase. These two levels correspond to content classification and content annotation.

Content classification maps a piece of content to a predefined set of categories. Depending on the type of platform, categories may be topics, product types, or any other set of enumerated values. A content moderation policy for an employee discussion platform could specify unacceptable topics, such as religion or politics, while a policy for an ecommerce marketplace could specify unacceptable product types, such as firearms or pornography. Content classification can also address more mundane policies, such as requirements that content conform to a particular language or format.
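
To make the idea concrete, here is a minimal Python sketch of classification-based moderation for the marketplace example. Everything in it is an assumption for illustration: the category names, the keyword lists, and the `classify` and `violates_policy` helpers are hypothetical, and the keyword counter is only a stand-in for whatever rule set or trained model a real platform would use.

```python
from typing import Dict, Set

# Hypothetical policy: product types that may not be listed.
PROHIBITED_CATEGORIES: Set[str] = {"firearms", "pornography"}

# Stand-in for a real classifier: a few keyword cues per category.
CATEGORY_KEYWORDS: Dict[str, Set[str]] = {
    "firearms": {"rifle", "pistol", "ammunition"},
    "electronics": {"laptop", "phone", "charger"},
    "apparel": {"shirt", "jacket", "sneakers"},
}


def classify(text: str) -> str:
    """Map a whole document to the category with the most keyword hits."""
    tokens = set(text.lower().split())
    scores = {cat: len(tokens & kws) for cat, kws in CATEGORY_KEYWORDS.items()}
    best = max(scores, key=scores.get)
    # If nothing matched, report "unknown" rather than guessing.
    return best if scores[best] > 0 else "unknown"


def violates_policy(text: str) -> bool:
    """The policy check is just a lookup on the predicted category."""
    return classify(text) in PROHIBITED_CATEGORIES
```

Note that the policy lives in a separate set from the classifier. Under this design, changing what the platform considers acceptable does not require changing how content is categorized.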

Content annotation focuses on specific words or phrases within the content. Content annotation could be used to implement a policy that specifies unacceptable language, such as profanity or hate speech. Content annotation could also be used to implement a policy that prohibits sharing contact information, such as a phone number or email address, in a post. While content classification can help determine whether or not a piece of content is acceptable in general, content annotation focuses on the acceptability of individual elements within the content.
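
As a rough illustration of annotation, the Python sketch below marks contact information in a post using regular expressions. The patterns are deliberately simple assumptions (a US-style phone number and a basic email shape), and the `Annotation`, `annotate`, and `redact` names are hypothetical. Real contact details come in many more formats than these patterns cover.

```python
import re
from typing import List, NamedTuple


class Annotation(NamedTuple):
    label: str   # which policy element matched
    start: int   # character offset where the match begins
    end: int     # character offset just past the match
    text: str    # the matched text itself


# Simplified, assumed patterns; not exhaustive.
PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "phone": re.compile(r"\b\d{3}[-. ]\d{3}[-. ]\d{4}\b"),
}


def annotate(text: str) -> List[Annotation]:
    """Return every pattern match, ordered by position in the text."""
    found = [
        Annotation(label, m.start(), m.end(), m.group())
        for label, pattern in PATTERNS.items()
        for m in pattern.finditer(text)
    ]
    return sorted(found, key=lambda a: a.start)


def redact(text: str) -> str:
    """Replace each annotated span with a placeholder such as [phone]."""
    # Work right to left so earlier offsets stay valid after each replacement.
    for a in reversed(annotate(text)):
        text = text[:a.start] + f"[{a.label}]" + text[a.end:]
    return text
```

Because annotation returns spans rather than a single verdict, the platform can choose to act on the offending element alone, for example by redacting it, instead of rejecting the whole post.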

Not Just Rules and Models

Content classification and content annotation can use rule-based approaches, machine learning, or a combination of the two. In general, classification tends to be more amenable to machine learning, while annotation is more amenable to rule-based approaches, such as regular expressions.

But content moderation raises particular challenges that need to be taken into account by classification and annotation systems:

  • Sparsity. Typically — or perhaps hopefully! — most content on the platform is acceptable, with policy violations being the exception. That’s great for the platform, but it can be a problem if there isn’t enough data to train and evaluate models that detect policy violations. It may be necessary to adjust sampling or use synthetic data to address the class imbalance.
  • Adversarial Behavior. Content creators who are aware of the content policy may try to avoid their content being flagged as violating that policy. That can turn content moderation into an arms race, which in turn requires frequent retraining as violation and detection co-evolve.
  • Asymmetric Costs. There is an inevitable tradeoff between false positives (acceptable content incorrectly flagged as unacceptable) and false negatives (unacceptable content incorrectly treated as acceptable). But not all errors incur the same costs. Content may be annoying or outright harmful, while penalties to creators could range from delay to suspension or bans. Content moderation needs to take the costs of errors into account.
  • Bias. Achieving an optimal tradeoff between false positives and false negatives, challenging as it is, may still lead to a biased outcome. If the errors disproportionately affect a segment of the content or users, citing the average accuracy will not make up for the biased impact. Bias, beyond the damage it causes directly, can also hurt the reputation of the platform.
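
Two of these challenges, asymmetric costs and bias, lend themselves to a small worked example. The Python sketch below assumes a model that outputs a violation score per item and a labeled sample to evaluate against. It picks the flagging threshold with the lowest total error cost, then breaks the false positive rate down by segment. The function names, the cost values, and the idea of a per-item "segment" label are all assumptions for illustration.

```python
from typing import Dict, Sequence


def total_error_cost(scores: Sequence[float], labels: Sequence[bool],
                     threshold: float, fp_cost: float, fn_cost: float) -> float:
    """Cost of errors on a labeled sample when score >= threshold is flagged."""
    cost = 0.0
    for score, is_violation in zip(scores, labels):
        flagged = score >= threshold
        if flagged and not is_violation:
            cost += fp_cost   # acceptable content wrongly flagged
        elif not flagged and is_violation:
            cost += fn_cost   # violation wrongly let through
    return cost


def best_threshold(scores: Sequence[float], labels: Sequence[bool],
                   fp_cost: float, fn_cost: float,
                   candidates: Sequence[float]) -> float:
    """Pick the candidate threshold with the lowest total error cost."""
    return min(candidates,
               key=lambda t: total_error_cost(scores, labels, t, fp_cost, fn_cost))


def false_positive_rate_by_segment(scores: Sequence[float], labels: Sequence[bool],
                                   segments: Sequence[str],
                                   threshold: float) -> Dict[str, float]:
    """Share of acceptable items flagged, computed separately per segment."""
    counts: Dict[str, tuple] = {}  # segment -> (false positives, acceptable items)
    for score, is_violation, seg in zip(scores, labels, segments):
        if is_violation:
            continue  # false positive rate only considers acceptable content
        fp, total = counts.get(seg, (0, 0))
        counts[seg] = (fp + int(score >= threshold), total + 1)
    return {seg: fp / total for seg, (fp, total) in counts.items()}
```

With the same scores and labels, swapping the two cost values can move the chosen threshold, which is the point about asymmetric costs. And a threshold that looks fine on average can still produce very different false positive rates across segments, which is why the per-segment breakdown is worth checking alongside overall accuracy.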

These challenges make content moderation far more complex and nuanced than simply implementing automatic content classification and annotation.

Human Review

Because of these challenges, most content moderation systems include a human review mechanism to supplement their automated components.

Human review is expensive: it costs money to pay the reviewers and time to conduct the reviews. Moreover, reviewing content can be unpleasant and even traumatic for the people who perform it. So platforms generally aim to automate as much as possible, using human review only when necessary.

In some cases, the automated components can flag a piece of content as unacceptable with high enough confidence to act without human review. That is a cheap and efficient process when it achieves acceptable error rates, though there may be an appeal process for content creators to request human review when they disagree with the automated decisions.

In other cases, the automated components only have enough confidence to place content in a queue for review. Doing so delays publication of the content until the reviewers get to it. And, if the queue grows more quickly than the reviewers can process it, the queue becomes unsustainable.
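
One common way to express this split is with two confidence thresholds. The Python sketch below is a minimal version, assuming the automated components produce a single violation score between 0 and 1. The threshold values and the `Decision` and `route` names are hypothetical, and in practice the thresholds would be tuned against measured error rates and reviewer capacity.

```python
from enum import Enum


class Decision(Enum):
    PUBLISH = "publish"   # no action needed
    REVIEW = "review"     # place in the human review queue
    REMOVE = "remove"     # act automatically, subject to appeal


# Hypothetical thresholds, chosen only for illustration.
REMOVE_THRESHOLD = 0.95
REVIEW_THRESHOLD = 0.60


def route(violation_score: float) -> Decision:
    """Route content based on how confident the system is in a violation."""
    if violation_score >= REMOVE_THRESHOLD:
        return Decision.REMOVE
    if violation_score >= REVIEW_THRESHOLD:
        return Decision.REVIEW
    return Decision.PUBLISH
```

Lowering `REVIEW_THRESHOLD` catches more violations but sends more content to the queue, so the two thresholds effectively control how much work the reviewers receive.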

Indeed, having a queue creates yet another tradeoff: whether to make content available while pending review. What’s worse: exposing potentially violating content, or delaying the publication of acceptable content? This tradeoff depends on how long it takes to get through the queue: minutes or hours might be an acceptable price to pay for the lowered risk, but days or weeks might not. A compromise is to reduce the distribution of content pending review without hiding it completely, but that’s also a tradeoff. There’s no perfect solution.
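
A back-of-the-envelope calculation can show which side of that line a queue falls on. The Python sketch below uses a deliberately simplified model with constant arrival and review rates, ignoring bursts and variability. The function names and the numbers in the usage note are assumptions, not measurements from any platform.

```python
def backlog_after(hours: float, arrivals_per_hour: float,
                  reviews_per_hour: float, initial_backlog: float = 0.0) -> float:
    """Items waiting after `hours`, assuming constant rates."""
    growth = arrivals_per_hour - reviews_per_hour
    # A backlog cannot go below zero once reviewers catch up.
    return max(0.0, initial_backlog + growth * hours)


def hours_to_reach_front(backlog: float, reviews_per_hour: float) -> float:
    """Wait for a newly queued item if the queue is processed in order."""
    return backlog / reviews_per_hour
```

For example, if 120 items arrive per hour and reviewers handle 100, `backlog_after(24, 120, 100)` gives a backlog of 480 items after one day, and `hours_to_reach_front(480, 100)` puts the wait for a new item at 4.8 hours. Under these assumed rates the wait keeps growing every day, which is what makes such a queue unsustainable.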


Content moderation is how platforms implement their values as a process to detect and remove unacceptable content. It relies on rule-based and machine learning methods for classification and annotation, supplemented with human review. Implementing content moderation requires managing many complex, nuanced tradeoffs. Yet, despite the challenges, it’s important to codify a platform’s values and make a good-faith effort to implement them.



