By: Daniel Borkan, Jeff Sorensen, Lucy Vasserman
In April we launched a Kaggle competition where we challenged competitors to build a model that recognizes toxicity and minimizes unintended bias with respect to mentions of identities. For the competition, we released the largest known public dataset of comments with toxicity labels and identity labels for measuring unintended bias. We share datasets as a way to encourage and enable research that benefits the entire industry, and this data has already sparked some exciting research. To continue that momentum, today we are expanding this dataset by releasing the individual annotations from human raters. You can download the dataset here.
A new dataset labeled for identity-related content
For the competition we released most of a labeled dataset of more than two million comments, which the Civil Comments platform published in 2017. To keep the competition fair, we held back a subset of these comments to test competitors' models. Today we're also completing this dataset by releasing the full test set.
Jigsaw expanded this dataset to include toxicity labels from human raters. Additionally, we had approximately 250,000 comments labeled for identities. For these comments, raters were asked to indicate all identities that are mentioned in the comment. An example question that was asked as part of this annotation effort was: “What genders are mentioned in the comment?” In addition to gender, comments were annotated for references to sexual orientation, religion, race/ethnicity, disability, and mental illness.
Sample comments mentioning identity and scored for toxicity
Conversation AI measures each of our models for unintended bias. Before we had this dataset, we measured bias using synthetically generated comments built from simple template sentences. While synthetic comments are easy to create, they do not capture the complexity and variety of real comments from online discussion forums. By labeling identity mentions in real data, we are able to measure bias in our models in a more realistic setting, and we hope to enable further research into unintended bias across the field.
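To make the template approach concrete, here is a minimal sketch of how synthetic comments can be generated by filling identity terms into simple sentence templates, whose model scores can then be compared across identities. The template strings and identity terms here are illustrative examples, not the actual set used by Conversation AI.

```python
# Hypothetical sketch: generate synthetic comments from templates so that a
# toxicity model's scores can be compared across identity terms. Large gaps
# in scores between terms in otherwise-identical sentences suggest bias.

TEMPLATES = [
    "I am a {} person",
    "Being {} is wonderful",
]

IDENTITY_TERMS = ["gay", "straight", "muslim", "christian"]

def synthetic_comments():
    """Return (comment, identity_term) pairs for every template/term combination."""
    return [
        (template.format(term), term)
        for template in TEMPLATES
        for term in IDENTITY_TERMS
    ]

for comment, term in synthetic_comments():
    print(term, "->", comment)
```

Because every template is identical except for the identity term, any difference in a model's scores across terms is attributable to the term itself, which is exactly the simplicity (and the limitation) the paragraph above describes.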
A closer look at annotation and raters
We’re especially excited today to share the individual annotations from almost 9,000 human raters. Human annotation is a fundamental part of most machine learning models, and for Perspective, it’s what teaches the model what is toxic. We recognize that toxicity can be subjective, so each comment is shown to three to ten raters (though some comments are seen by thousands of raters due to sampling and strategies used to improve rater accuracy). We then train our models to predict the probability that an individual will find the comment toxic, so if seven out of ten people rate a comment as “toxic”, the model is trained to predict 0.7, or a 70% likelihood that someone will find the comment toxic.
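The aggregation described above can be sketched in a few lines: the toxicity score for a comment is simply the fraction of its raters who labeled it toxic. This is an illustrative sketch, not the actual pipeline code; the function name and 0/1 encoding are assumptions for the example.

```python
# Minimal sketch of the label aggregation described above: a comment's
# toxicity score is the fraction of raters who judged it toxic.

def aggregate_toxicity(ratings):
    """ratings: a list of 0/1 judgments, one per rater (1 = rated toxic)."""
    return sum(ratings) / len(ratings)

# Seven of ten raters marked this comment toxic, so the training target is 0.7.
score = aggregate_toxicity([1, 1, 1, 1, 1, 1, 1, 0, 0, 0])
print(score)  # -> 0.7
```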
Here’s how toxicity was scored in the original Kaggle data:
Original Kaggle Data
But what if we could improve our models by aggregating individual annotations differently? The new dataset breaks the score down by rater:
New Rater Data
Rater disagreement may serve as a signal in the data, and we may be able to improve our models by weighting individual raters differently based on expertise or background. Raters are humans after all, and they each have individual skills and experiences. Maybe one rater is excellent at judging toxicity in conversations around sports, and another is more familiar with political discourse. Or, when rating comments that mention a certain identity, perhaps a rater who shares that identity is best suited to rate that comment. By releasing the individual annotations on the Civil Comments set, we’re inviting the industry to join us in taking the first step in exploring these questions.
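One way to explore the rater-weighting idea above is a weighted average in place of the plain fraction: each rater's judgment counts in proportion to a weight, such as an estimated reliability or topical expertise. The weights here are entirely hypothetical; how to estimate them well is exactly the open question the dataset release invites research on.

```python
# Hypothetical sketch of per-rater weighting: instead of treating every
# rater equally, weight each 0/1 judgment by a per-rater weight (e.g. an
# estimated reliability score -- how such weights would be derived is an
# open research question, not something this sketch answers).

def weighted_toxicity(ratings, weights):
    """Weighted fraction of 'toxic' judgments.

    ratings: 0/1 judgments, one per rater (1 = rated toxic)
    weights: one non-negative weight per rater
    """
    total = sum(weights)
    return sum(r * w for r, w in zip(ratings, weights)) / total

# With equal weights this reduces to the plain fraction of toxic votes;
# unequal weights let a trusted rater's judgment count for more.
equal = weighted_toxicity([1, 0], [1.0, 1.0])     # -> 0.5
skewed = weighted_toxicity([1, 0], [3.0, 1.0])    # -> 0.75
print(equal, skewed)
```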
Building effective models and capturing the nuance of human opinion is a complex challenge that can’t be solved by any one team alone. We’re sharing this dataset and individual annotations to enable academic and industry research into mitigating unintended bias, understanding human annotation processes, and building models that work well for a wide range of conversations. We’re excited to see what we learn.
Daniel Borkan, Jeff Sorensen, and Lucy Vasserman are software engineers at Jigsaw.
The Conversation AI team would like to thank Civil Comments for making this dataset available publicly and the Online Hate Index Research Project at D-Lab, University of California, Berkeley, whose labeling survey/instrument informed the dataset labeling. We’d also like to thank everyone who has contributed to Conversation AI’s research, especially those who took part in our last competition, the success of which led to the creation of this challenge.