The challenge of identifying subtle forms of toxicity online

Dec 12, 2018 · 10 min read

Beginning this year, the Rhodes Artificial Intelligence Lab (RAIL) — a group of graduate students at the University of Oxford with backgrounds across a spectrum of mathematical, computational and social sciences — has collaborated with Jigsaw to develop new machine learning models for promoting healthy conversations online. The following is a guest post from members of the RAIL team.

Imagine yourself sitting in your kitchen after a long day at work. You are a summer associate at a law firm in New York. You have spent the day reading depositions, or whatever it is that trainee lawyers do. Now, late at night, you are scrolling through a lengthy comment thread under an article in the New York Times, a review of the last season of Suits. One commenter really gets to you, and before you know it, you start typing the following response to what you see as his misinformed criticism of lawyers:

“What do you know about being a lawyer? We draft depositions, interview clients and witnesses, research cases, and manage and train the next generation of lawyers and paralegals. Don’t confuse what you watch on TV with the world outside your basement.”

Done. You press enter and take a deep breath. You do not think for a moment how the person to whom you have directed your comment might feel or react — you have said what, in your view, needed to be said.

Now, imagine this — as you post your comment, a dialogue box appears with a simple message:

Your comment is likely to be perceived as condescending by other users, and will be assessed by a moderator. Do you wish to rephrase it?

Of course, some users might ignore this and post their comment regardless. Others, however, will take a moment to rephrase their comment — and perhaps this simple exchange could foster a conversation which is better for all involved.

Anyone who has participated in online conversations on social media, in comment sections or forums will know that the internet can sometimes be a threatening place. For more than a decade, a debate has raged about the role of online conversations as a new realm for social interaction and public life. On the one hand, the internet represents a new transnational space for dialogue between people who would never otherwise encounter one another, and has the potential to support robust deliberative democracy and greater openness. On the other hand, though, many people are deterred from engaging online by content that is vicious, hurtful, abusive, or derogatory. As online conversations have proliferated, so have their dangers.

A number of organizations have taken on the challenge of trying to mitigate these online abuses. Jigsaw’s Conversation AI team has developed machine learning models and tools that help moderators filter toxic comments, and these are already helping to foster healthier engagements online. Twitter has also committed to working towards healthier conversations on its platform.

The opening example above resembles many in the dataset we have been curating over the past few months. But few existing measures are able to identify its potential negative impact, and yet it is precisely the sort of content that can trigger the deterioration of a conversation.

The reality is that most forms of toxicity are subtle and often ambiguous, rather than obvious or extreme. While some comments might be obviously derogatory, threatening, or violent, and others are clearly respectful or light-hearted, most of those comments which deter people from engagement and create downward spirals in online interactions seem to lie somewhere in-between. This is especially the case with conversations online, many of which (i) take place ‘publicly’, in a forum that is visible to thousands of others, and (ii) involve strangers who have never met and know little about one another. These contextual differences can sometimes enhance the negative impact of subtler forms of toxicity like sarcasm, condescension or dismissiveness.

Identifying the more subtle indicators of problematic online conversations, however, is a difficult task. There are three reasons for this. First, they are by definition less extreme and more ambiguous. Second, they are often ambivalent — a comment that seems to make a substantive point might be made in an inflammatory way, and will therefore contain what many would perceive as both positive and negative aspects depending on the wider context, norms, and expectations. Third, there is an even greater risk of identifying “false positives” and “false negatives”, since many of the mechanisms used in subtle forms of toxicity can also be deployed for light-hearted effect or to make an effective point. For example, sarcasm is often used in derisive or bullying ways, but it can also be used in humorous and clever ways.

The challenge, then, is to identify the subtle characteristics of harmful or abusive comments online despite their ambivalence and ambiguity, without falsely identifying too many “good” comments.

Our starting point is an effort to understand what type of comment we are concerned with. We differentiate between two categories. The first, which is the best studied to date, consists of comments whose explicit intention is to insult, threaten, or abuse. The effect of these comments is to destabilise and push people out of the conversation, eliciting an annoyed or offended response from other commenters, or propagating some form of hatred. The presence of these comments is deleterious to healthy, robust, and constructive engagement.

There is a second type of comment, though, which we have found to be roughly three times more prevalent in most online conversations. These comments are more likely to be intended to engage with others, share an opinion, or contribute to the conversation. They may, however, do so in a way that antagonises or deters others.

Our goal in this work is to identify different facets of the more nuanced toxic comments. If we are able to do so effectively, then we open the door to developing new kinds of user experiences based on machine understanding of them. These include moderation assistance, viewership control tools, and new kinds of authorship nudges, like we saw in the opening example above.

Our data collection methodology begins by asking crowd-workers to measure the overall level of “healthiness” in a comment. The goal is not to be overly prescriptive, since a conversation does not have to be “nice” or “constructive” in order to be healthy. We want to allow for many different types of conversation as long as they meet a basic standard of healthy engagement.

We define five ‘sub-attributes’ which we hypothesise are the main characteristics of unhealthy content online, and which may overlap in any combination. We label comments that are (1) hostile, antagonistic, insulting, provocative or trolling; (2) dismissive; (3) condescending or patronising; (4) sarcastic; or (5) unfairly generalising.

In many cases, unhealthy online conversation is not the result of the substance of the comments — that is, the ideas they contain — so much as the way in which that substance or idea is conveyed. Of course, some comments consist entirely of the intention to troll, insult, or provoke (comments which fall under our first sub-attribute). However, many others are — in substance — contributing ideas to the conversation or expressing an opinion, but doing so in a way which harms the conversation — e.g. responding to a point one disagrees with in an overtly condescending way. The last four attributes capture the different ways in which otherwise fair and positive contributions to conversations can be unhealthy.

Together these sub-attributes account for the majority of “unhealthy” comments we considered, but there will be some comments that are “unhealthy” but do not display any sub-attribute, and also a few which are “healthy” despite representing one or more sub-attributes. We believe these attributes enable us to capture most of the more subtle kinds of toxic language discussed above. The way we visualize the interaction between these categories is as follows:

Data collection

If we hope to build machine learning models capable of predicting how comments will be understood and perceived by readers, the first stage is to create a dataset of tens or hundreds of thousands of comments, each labelled with the degree to which it represents the five sub-attributes, and the degree to which it has a place in healthy conversation. This has been the primary focus of our efforts in the past couple of months. We have spent time designing jobs on the Figure-Eight platform to outsource the labelling of these comments to ‘the wisdom of the crowd’, and taking measures to help ensure that the resulting data is of high quality. Though we have only about 25K labelled comments so far (out of an initial target of around 80K), the results are promising.

The dataset being labelled comprises comments from the Globe and Mail news site (the SFU Opinion and Comment Corpus) and Wikipedia talk page comments. Annotators are asked to read a comment, and (with some context, instructions and functional definitions along the way) to answer the following questions about that comment:

  1. Do you think this comment has a place in a healthy online conversation?
  2. Is this comment sarcastic?
  3. Does this comment make a generalisation about a specific group of people?
  4. If yes, would a member of that group feel that the generalisation is unfair?
  5. Is this comment needlessly hostile?
  6. Is the intention of this comment to insult, antagonize, provoke, or troll other users?
  7. Is this comment condescending and/or patronising?
  8. Is this comment dismissive?

During this process, annotators are regularly tested: they are presented with a comment for which one or more questions have a pre-identified set of correct answers. Those who do not maintain a high level of accuracy are excluded, along with all of their previous answers.
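The exclusion rule can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Figure-Eight implementation: the record format, the `gold` lookup, the function names and the 0.8 threshold are all hypothetical, since the only stated requirement is "a high level of accuracy".

```python
from collections import defaultdict

# Hypothetical record format: (annotator_id, comment_id, question, answer).
# `gold` maps (comment_id, question) -> set of accepted answers for test items.
MIN_ACCURACY = 0.8  # assumed threshold, chosen only for illustration


def trusted_annotators(judgements, gold, min_accuracy=MIN_ACCURACY):
    """Return the annotators whose accuracy on test questions meets the threshold.

    Annotators who never answered a test question are not returned,
    because their accuracy cannot be measured.
    """
    correct = defaultdict(int)
    seen = defaultdict(int)
    for annotator, comment, question, answer in judgements:
        accepted = gold.get((comment, question))
        if accepted is None:
            continue  # an ordinary item, not a test question
        seen[annotator] += 1
        correct[annotator] += answer in accepted
    return {a for a in seen if correct[a] / seen[a] >= min_accuracy}


def drop_untrusted(judgements, gold, min_accuracy=MIN_ACCURACY):
    """Discard every answer, past and present, from annotators who fail the test."""
    keep = trusted_annotators(judgements, gold, min_accuracy)
    return [j for j in judgements if j[0] in keep]
```

Note that `drop_untrusted` removes all of a failing annotator's answers rather than only the incorrect test answers, which mirrors the "along with all of their previous answers" rule described above.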

The results seem to be capturing some important nuances. For example, annotators distinguished between hostile language and the intention to insult/antagonise another user when annotating the comment shown in Figure 1a. Crowd-worker annotations also seem to be able to identify subtle examples of condescending and patronising comments (Figure 1b), and obtain high inter-annotator agreement on more intuitive attributes like ‘dismissiveness’ (Figure 1c).

Figure 1a. A needlessly hostile comment which nonetheless did not intend to insult, antagonize, provoke or troll other users. The figure shows two of the questions annotators were asked about each comment. The aggregated responses of the three to five annotators of that particular comment are shown on the right — where the confidence score is determined by weighting the responses of each annotator by their track-record or ‘trustworthiness’.
Figure 1b. Subtle condescension.
Figure 1c. Annotator consensus on a dismissive comment.
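To make the confidence score in Figure 1a concrete, here is a minimal Python sketch of one way a trust-weighted score could be computed. The exact formula used by the platform is not given here, so the function name, the input format and the choice of "share of total trust backing the winning answer" are assumptions.

```python
from collections import defaultdict


def aggregate(responses):
    """Aggregate one question on one comment.

    `responses` is a list of (answer, trust) pairs, one per annotator,
    where trust is a non-negative weight reflecting their track record.
    Returns (winning_answer, confidence), with confidence defined as the
    fraction of total trust that backs the winning answer.
    """
    weight = defaultdict(float)
    for answer, trust in responses:
        weight[answer] += trust
    total = sum(weight.values())
    if total == 0:
        raise ValueError("need at least one response with non-zero trust")
    best = max(weight, key=weight.get)
    return best, weight[best] / total
```

For example, `aggregate([("yes", 0.9), ("yes", 0.8), ("no", 0.5)])` picks "yes" with a confidence of roughly 0.77, slightly higher than the unweighted two-thirds majority because the dissenting annotator carries less trust.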

The data is not without its flaws, as you might expect when dealing with categories this nuanced. Although we provided definitions for most of the terms, we still relied on the intuition and judgement of our annotators to correctly classify comments. This left many gray areas, such as what exactly constitutes “a specific group of people”, or how a rhetorical question differs from a sarcastic one. We saw wide disagreement on examples like “Are you personally going to take on ISIS?” (which was narrowly considered sarcastic) as opposed to “Have you been drinking?” (which was narrowly considered not). When this data is used to build models, it is likely that the models will not give clear results for these types of examples.
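One simple way to surface such narrowly decided examples is to look at how large the majority actually was. The Python sketch below is an assumption-laden illustration rather than part of our pipeline: the function names and the 0.7 cut-off are hypothetical, and it uses unweighted votes for simplicity.

```python
from collections import Counter

CONTESTED_BELOW = 0.7  # assumed cut-off, chosen only for illustration


def majority_share(answers):
    """Return (majority_answer, fraction of annotators who gave it)."""
    if not answers:
        raise ValueError("need at least one answer")
    answer, count = Counter(answers).most_common(1)[0]
    return answer, count / len(answers)


def is_contested(answers, threshold=CONTESTED_BELOW):
    """True when the winning answer was backed by less than `threshold` of votes."""
    return majority_share(answers)[1] < threshold
```

With five hypothetical annotators split three to two, `is_contested(["yes", "yes", "yes", "no", "no"])` is true. Items flagged this way could be reviewed separately or treated as soft labels rather than forced into a clear yes or no.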

Another challenge which comes with annotating more subtle unhealthy attributes is the potential to encode unintended societal biases and value judgements in models which learn from the data. Sarcasm, for example, is often communicated online by commenting something which the author presumes to be so obviously untrue that it will be read as clearly sarcastic. These presumptions reflect the author’s biases; in the case of comment annotation, labelling a comment as sarcastic likewise reflects the annotators’ presumptions of what is obviously untrue. Figure 2 shows an example of how a particular worldview or bias might be enshrined in the data: finding this comment sarcastic relies on an assumption that Iran and Turkey are clearly not good places to be a woman. Our annotators are not chosen to be universally representative across language, geography, culture, or other demographic attributes, and this assumption is not universal. We must take great care in how our models incorporate and generalise these types of sentiments, since they can contribute to significant problems of unintended bias. As a team, we will continue to focus on these areas, developing methods to measure them, and working to mitigate them. For more information on how this is being done, see our blog post and research on GitHub that focuses on this issue.

Figure 2: Crowdsourcing annotation can encode societal biases

What’s Next?

Over the coming few months we plan to have the remaining 55,000 comments annotated to complete our dataset. Once we have had a chance to do some initial data analysis and model building, we will post our results along with links to the open-sourced dataset and models. To our knowledge this will be the first dataset at this scale to capture these sorts of subtle attributes that can help distinguish different kinds of unhealthy comments. We believe that it will be a useful contribution to broader efforts to promote healthier conversations online, as well as to the study of deliberative democracy and our understanding of the role of online conversations. We hope that this will serve as a springboard for further research on these and other questions. Do let us know if you come across related datasets and methods — we’re always looking for others to collaborate with on helping to foster good conversations at scale.

Look out for updates and the release of the final dataset early in 2019. In the meantime, if you have ideas or are working on similar topics, we’d love to hear from you.

Authors: Ilan Price, Saul Musker, Maayan Roichman, Jordan Gifford-Moore, Guillaume Sylvain, Jory Fleming, Nithum Thain, Lucas Dixon


The Rhodes Artificial Intelligence Lab (RAIL) is an interdisciplinary group at the University of Oxford who believe that Machine Learning and Artificial Intelligence can be a force for good. They collaborate with NGOs, governments, companies and other partners on social impact projects with a machine learning dimension. They are always on the lookout for new partnerships and projects, so get in touch at

The False Positive

Exploring machine learning for better online conversation
