Crowdsourcing annotations for subjective NLP tasks: Disagreement is signal, not noise

Crowdsourcing annotations, why?

Common practices in crowdsourcing annotations

  • Annotation artifacts: A form of spurious correlation that comes from the way crowd-workers have annotated the data. In Natural Language Inference, for example, the label “contradiction” can often be inferred from the presence of negation words in the hypothesis alone.
  • Social bias: Stereotypical associations between terms that can derive from both the data itself and the crowd-workers/annotators.
  • Spurious correlations: A label can be predicted from a feature that correlates with it, even though that feature should not be a determining factor for assigning the label.
  • Annotator disagreements: These matter for 1) quality control, 2) the question of whether disagreements are genuine, and 3) validation and filtering out of poor-quality data.
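
To make the annotation-artifact example concrete, here is a minimal sketch of how one might check whether negation words in the hypothesis are skewed towards one label. The cue list, the function names and the toy data are all assumptions made up for illustration; they do not come from any particular dataset or paper.

```python
from collections import Counter, defaultdict

# Hypothetical negation cue list; a real analysis would use a fuller lexicon.
NEGATION_WORDS = {"no", "not", "never", "nobody", "nothing", "n't"}

def has_negation(hypothesis: str) -> bool:
    """Return True if any token of the hypothesis is a negation cue."""
    # Split off the clitic so that "isn't" yields the token "n't".
    tokens = hypothesis.lower().replace("n't", " n't").split()
    return any(tok.strip(".,!?") in NEGATION_WORDS for tok in tokens)

def label_distribution_by_negation(examples):
    """Map has_negation -> {label: relative frequency} over (hypothesis, label) pairs."""
    counts = defaultdict(Counter)
    for hypothesis, label in examples:
        counts[has_negation(hypothesis)][label] += 1
    return {
        flag: {label: n / sum(c.values()) for label, n in c.items()}
        for flag, c in counts.items()
    }

# Toy data, invented for illustration only.
toy = [
    ("A man is not sleeping.", "contradiction"),
    ("Nobody is outside.", "contradiction"),
    ("The dog never barks.", "contradiction"),
    ("There is no cat.", "entailment"),
    ("A woman is walking.", "entailment"),
    ("A child plays outside.", "neutral"),
    ("Two men are eating.", "contradiction"),
    ("The dog is running.", "entailment"),
]
dist = label_distribution_by_negation(toy)
print(dist[True])   # label shares among hypotheses with a negation cue
print(dist[False])  # label shares among hypotheses without one
```

If “contradiction” is far more frequent among hypotheses with a negation cue than among those without, a model can exploit that shortcut without ever reading the premise.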

Disagreements are genuine

Disagreements and demographic fairness

Figure: Cohen’s kappa with the majority vs. number of annotators (Prabhakaran et al., 2021, page 135).
Prabhakaran et al. (2021)
  1. In subjective tasks, there is not always a single correct label and the perspective of individual annotators may be as valuable as an ‘expert’ perspective.
  2. The aggregation step may potentially under-represent certain groups’ perspectives.
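
The two points above can be illustrated with a small sketch that computes Cohen’s kappa between each annotator and the majority vote. This is only one plausible reading of “kappa with majority” and not necessarily the exact procedure of Prabhakaran et al. (2021); the annotator names and labels are toy values invented for illustration.

```python
from collections import Counter

def majority_vote(labels):
    """Most frequent label; ties go to the label seen first."""
    return Counter(labels).most_common(1)[0][0]

def cohens_kappa(a, b):
    """Cohen's kappa between two equal-length label sequences."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    freq_a, freq_b = Counter(a), Counter(b)
    # Chance agreement from the two marginal label distributions.
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both sequences are constant and identical
    return (observed - expected) / (1 - expected)

# Toy annotations (assumed): one list of binary labels per annotator,
# aligned by item.
annotations = {
    "A": [1, 0, 1, 0, 1, 0],
    "B": [1, 0, 1, 0, 1, 0],
    "C": [0, 0, 0, 1, 1, 0],
}

# Aggregate item by item, then compare each annotator to the aggregate.
majority = [majority_vote(votes) for votes in zip(*annotations.values())]
kappas = {name: cohens_kappa(labels, majority) for name, labels in annotations.items()}
print(majority)  # [1, 0, 1, 0, 1, 0]
print(kappas)    # A and B: 1.0, C: 0.0
```

In this toy case the aggregated labels coincide exactly with annotators A and B, while annotator C agrees with the majority no more than chance would predict. If C’s labels reflect a genuine perspective rather than carelessness, that perspective disappears in the aggregation step.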

Questions to live with (for now)

Take-away points

  • Agreement is not always a good measure of data quality. Disagreements are often genuine in subjective NLP tasks and should therefore be seen as signal, not noise.
  • Always be cautious about throwing data away, and always keep a record of the original data.
  • Be cautious when others throw away data due to disagreement, since this may lead to unrealistically high performance scores.
  • If possible, check your gathered data for systematic disagreements that may be linked to socio-demographic differences.
  • When releasing your crowdsourced annotations: release all annotations rather than just aggregated annotations, and release annotator demographics when possible.
  • Collecting more annotations per instance is of course better than collecting only one, but when aggregating annotations, ask yourself this question: Does it make sense for my task to have a single “ground truth” label for each instance?
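
As a rough sketch of what some of these take-aways could look like in practice, the snippet below keeps every annotation as a per-item label distribution and flags items where annotator groups disagree systematically. The record layout, group names, labels and function names are all hypothetical, and the data is invented for illustration.

```python
from collections import Counter, defaultdict

# Toy records invented for illustration: (item_id, annotator_group, label).
records = [
    ("t1", "group_a", "offensive"), ("t1", "group_a", "offensive"),
    ("t1", "group_b", "not_offensive"), ("t1", "group_b", "not_offensive"),
    ("t1", "group_b", "not_offensive"),
    ("t2", "group_a", "not_offensive"), ("t2", "group_a", "not_offensive"),
    ("t2", "group_b", "not_offensive"), ("t2", "group_b", "not_offensive"),
    ("t2", "group_b", "not_offensive"),
]

def soft_labels(records):
    """Per-item label distribution, keeping every annotation instead of one vote."""
    counts = defaultdict(Counter)
    for item, _group, label in records:
        counts[item][label] += 1
    return {
        item: {label: n / sum(c.values()) for label, n in c.items()}
        for item, c in counts.items()
    }

def group_majorities(records):
    """Per-item majority label within each annotator group."""
    counts = defaultdict(lambda: defaultdict(Counter))
    for item, group, label in records:
        counts[item][group][label] += 1
    return {
        item: {group: c.most_common(1)[0][0] for group, c in groups.items()}
        for item, groups in counts.items()
    }

def items_with_group_split(records):
    """Items on which the groups' majority labels differ."""
    return [
        item
        for item, by_group in group_majorities(records).items()
        if len(set(by_group.values())) > 1
    ]

soft = soft_labels(records)
split = items_with_group_split(records)
print(soft["t1"])  # {'offensive': 0.4, 'not_offensive': 0.6}
print(split)       # ['t1']
```

A plain majority vote would label t1 as “not_offensive” and silently drop the view of group_a, whereas the label distribution and the group split keep that disagreement visible.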


Other relevant papers




PhD student at Copenhagen Center for Social Data Science (SODAS) and CoAStAL NLP

Terne Sasha Thorn Jakobsen
