Crowdsourcing annotations for subjective NLP tasks: Disagreement is signal, not noise
The latest EMNLP (’21) conference featured a tutorial titled Crowdsourcing Beyond Annotation: Case Studies in Benchmark Data Collection, motivated by the fact that crowdsourcing annotations has, so far, mostly been guided by common practices and personal experience, as there is very limited research on these practices to lean on. In this blogpost, I will highlight some findings on annotation practices that some may regard as obvious. In particular, I will explain the role of (dis)agreement in annotations and how it drives decisions on the removal and aggregation of data points, affecting the performance and fairness of models trained on this data. I will argue that removing low-agreement data may produce unrealistically high performance scores and that it makes little sense for tasks where disagreement is genuine and can provide a measure of ambiguity.
The main message I hope readers take from this post is beautifully summarized in this quote:
disagreement should be seen as a signal and not as noise. (Leonardelli et al., 2021)
In the following sections, I will first set the scene for the novice Natural Language Processing (NLP) researcher or student who may be wondering why we crowdsource annotations (and how annotations are defined) and give examples of the most common practices. If you are already familiar with the subject, you can go directly to the section titled “Disagreements are genuine”.
Links throughout the blogpost will direct you to the specific sections in the EMNLP tutorial mentioning the respective topics.
Crowdsourcing annotations, why?
Common to supervised machine learning/NLP tasks is the need for annotated data, i.e. a set of data points judged by humans to obtain labels. To develop robust NLP systems, we want trustworthy annotations, and preferably lots of them. Depending on the task at hand, some researchers may turn to crowdsourcing annotations because hiring ‘experts’ is too expensive, while others may use it because they have no reason to believe their own ‘expert’ annotations are better than the wisdom of the crowd or, in some cases, better than any single judgement of a ‘non-expert’. The latter is especially relevant for annotating relatively subjective tasks, such as toxic language and hate speech detection (asking annotators to judge whether a given text or speech contains toxic or hateful language), sentiment classification (asking annotators to judge the sentiment of a given text or speech, commonly as positive, neutral or negative), opinion classification (e.g. asking annotators to label a text as expressing a liberal, moderate or conservative opinion), and any other task where it is reasonable to believe that individuals may perceive and annotate instances differently, depending on their own beliefs and biases.
The downside of crowdsourcing is that the “trustworthy” criterion gets challenged: some crowd-workers may optimise their pay by going quickly through the job without reading the instructions, and, as they explain in the tutorial, workers may not always have much incentive to do a good job. This can produce noisy data, as it may not always be easy to identify and remove annotations produced this way. The hope is that with enough data, and enough different annotators (crowd-workers), the effect of such noise will be small. In the next section, I will explain the common practice of aggregating annotations to improve the quality of your data and reduce the impact of noisy labels. Following this, I will start explaining the negative effects of such aggregation that have, so far, been found for annotations of relatively subjective tasks.
Common practices in crowdsourcing annotations
Say we want to do sentiment classification, where the goal is to classify a sentence as positive, neutral or negative. We therefore want to hire crowd-workers to annotate a list of sentences with these three possible labels. The most common platform for crowdsourcing annotations is Amazon Mechanical Turk (other platforms include Prolific, Upwork, FigureEight and more). We write a short but thorough guideline for our crowd-workers with clear examples of the three sentiments, such that they can confidently figure out the correct label for the instances they annotate. We worry that some workers will not follow the guideline and therefore produce noisy labels, so we try to control the quality of our annotations by having multiple workers annotate each instance. We then aggregate the annotations with the most common approach, taking the majority vote, to arrive at what we believe to be a more confident final label. Unfortunately, we may not be able to afford having all our sentences annotated three or five times (we need an odd number to take a majority vote, of course), so we may choose to only get a subset of the data annotated multiple times. This subset will be our test set. In some cases, we may instead hire new workers to validate the annotations. Lastly, we may take low agreement between workers on any given instance as an indication that either the instance is a hard/ambiguous case or the annotation quality is low: in either case, it is common to filter out (i.e. remove) such instances from the data entirely.
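To make the aggregation step concrete, here is a minimal sketch in Python of majority voting and agreement-based filtering. The labels are made up for illustration and I assume five annotators per sentence; nothing here comes from a specific paper or library beyond the standard library.

```python
from collections import Counter

# Toy annotations: sentence id -> labels from five hypothetical crowd-workers.
annotations = {
    "s1": ["positive", "positive", "positive", "neutral", "positive"],
    "s2": ["negative", "neutral", "negative", "positive", "negative"],
    "s3": ["neutral", "neutral", "negative", "neutral", "neutral"],
}

aggregated = {}
for sentence_id, labels in annotations.items():
    majority_label, majority_count = Counter(labels).most_common(1)[0]
    agreement = majority_count / len(labels)  # share of annotators backing the majority label
    aggregated[sentence_id] = {"label": majority_label, "agreement": agreement}

# The common (and, as argued below, questionable) next step: drop low-agreement instances.
filtered = {sid: v for sid, v in aggregated.items() if v["agreement"] >= 0.8}
print(aggregated)
print(filtered)
```

The `agreement` value computed here is exactly the quantity that the filtering step in the last sentence above would threshold on, and, as the rest of this post argues, the quantity we should think twice about before using as a filter.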
In the EMNLP crowdsourcing tutorial, they mention, among other things, all the common practices outlined in my example above. They also mention several challenges and known issues in gathering annotations, most importantly:
- Annotation artifacts: A form of spurious correlation that arises from the way crowd-workers have annotated the data. In Natural Language Inference, for example, the label “contradiction” can often be predicted from the presence of negation words in the hypothesis alone (see the sketch after this list).
- Social bias: Stereotypical associations between terms that can derive from both the data itself and the crowd-workers/annotators.
- Spurious correlations: Being able to predict a label given a feature that correlates with the label even though it should not be a determining factor for assigning the label.
- And lastly, annotator disagreements, in connection with 1) quality control, 2) disagreements being genuine, and 3) validation and filtering out poor quality data.
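As promised above (first point in the list), here is a minimal sketch of how an annotation artifact can be exposed with a hypothesis-only baseline. The negation-word list and the toy NLI examples are invented for illustration; real analyses typically train a proper hypothesis-only classifier rather than a hand-written rule.

```python
# Hypothesis-only rule: predict "contradiction" whenever the hypothesis contains a negation word.
# If such a trivial rule beats chance, labels can partly be inferred without the premise,
# which is the hallmark of an annotation artifact.
NEGATION_WORDS = {"not", "no", "never", "nobody", "nothing"}

def hypothesis_only_baseline(hypothesis: str) -> str:
    tokens = set(hypothesis.lower().split())
    return "contradiction" if NEGATION_WORDS & tokens else "entailment"

# Toy NLI pairs (premise, hypothesis, gold label), invented for illustration.
examples = [
    ("A man is playing a guitar.", "A man is not playing an instrument.", "contradiction"),
    ("A dog runs through the park.", "An animal is outside.", "entailment"),
    ("Two kids are eating lunch.", "Nobody is eating.", "contradiction"),
]

accuracy = sum(hypothesis_only_baseline(h) == gold for _, h, gold in examples) / len(examples)
print(f"Hypothesis-only accuracy: {accuracy:.2f}")
```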
For the last point, they refer to Leonardelli et al. (2021), saying that filtering out low agreement data may not always be a good idea. In the next section, we take a closer look at this paper to understand why and when such filtering can be problematic!
Disagreements are genuine
Leonardelli et al. (2021) look at how different levels of agreement affect classification performance for the task of offensive language detection. Through this, they are able to make critical observations about the data selection process in which data points are filtered out because of low agreement between annotators. They divide their data into sets of unanimous agreement (A++: 5/5 annotators agree), mild agreement (A+: 4/5 annotators agree) and weak agreement (A0: 3/5 annotators agree). Unsurprisingly, training with high-agreement data, A++ and A+, produces better classification performance on similar data. More importantly, they find that the performance of models trained on high-agreement data is markedly lower when tested on A0 data, and that high performance on benchmark data is sometimes due to the lack of low-agreement data in the test sets.
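The agreement bins themselves are easy to reproduce in code. The sketch below assumes five binary offensiveness judgements per tweet and borrows the paper’s A++/A+/A0 terminology, but the code and data are my own illustration, not the authors’ implementation.

```python
from collections import Counter

def agreement_bin(labels):
    """Bin an instance by how many of its five annotators back the majority label."""
    assert len(labels) == 5
    majority_count = Counter(labels).most_common(1)[0][1]
    # With binary labels and five annotators, the majority count is always 3, 4 or 5.
    return {5: "A++", 4: "A+", 3: "A0"}[majority_count]

# Toy offensiveness judgements (1 = offensive, 0 = not offensive), invented for illustration.
instances = {
    "t1": [1, 1, 1, 1, 1],  # unanimous agreement -> A++
    "t2": [1, 1, 1, 1, 0],  # mild agreement      -> A+
    "t3": [1, 1, 1, 0, 0],  # weak agreement      -> A0
}
bins = {tweet_id: agreement_bin(labels) for tweet_id, labels in instances.items()}
print(bins)  # {'t1': 'A++', 't2': 'A+', 't3': 'A0'}
```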
When the authors take a closer look at the instances of weak agreement, it is evident that many of these instances represent ambiguous cases. They note that, “Overall, disagreement does not seem to stem from poor annotation of some crowd-workers, but rather from genuine differences in the interpretation of the tweets.” (Leonardelli et al., 2021) Because of these findings, the authors recommend the inclusion of more low agreement data in benchmark data.
Following the findings of Leonardelli et al., I would urge researchers to be more wary of removing data due to low agreement as part of the common practices to improve data quality. If the intention is to apply models to real-world data, we should instead learn from the instances that annotators genuinely disagree on and make models aware of such instances. This is especially important for the types of tasks that seem more subjective, where it sometimes does not make sense to say that there is one “correct” label for an instance, because different perspectives may result in different judgements.
Disagreements and demographic fairness
While Leonardelli et al. (2021) focus on how selecting data by agreement undermines different viewpoints, a shorter workshop paper by Prabhakaran et al. (2021) focuses on the role of aggregation by majority vote and dives into the question of whether this effective undermining of viewpoints is equal across socio-demographic groups.
They include data for three relatively subjective binary classification tasks: sentiment, hate speech and emotion classification (with a binary classification task for each of six emotions), and compare each individual annotator’s annotations to the majority-vote labels for the same instances. If the majority vote were a fair representation, you would expect most annotators to have roughly the same agreement (measured with Cohen’s kappa) with the majority-vote labels, with only a few noisy annotators deviating more. However, Prabhakaran et al. find that, for some tasks, annotators do not have the same agreement. Interestingly, the tasks where annotator perspectives seem to be equally captured by the majority are the easier ones: recognising the emotions joy, sadness and anger, and, perhaps surprisingly, also recognising hate speech. The tasks that do not equally capture annotators’ perspectives in the majority seem to be harder: recognising disgust, fear and sentiment. Let us return to sentiment classification as an illustrative example and look at the distribution of annotators along different Cohen’s kappa scores:
[Figure: distribution of annotators by their Cohen’s kappa agreement with the majority-vote labels for sentiment classification (Prabhakaran et al., 2021)]
We see here that the majority vote has very low agreement with a substantial number of annotators. Furthermore, when separating annotators into groups with different demographic attributes (gender, ethnicity and political orientation), a significant difference in agreement between the majority vote and groups of different ethnicities is found, specifically between white and black annotators.
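For readers who want to run this kind of analysis on their own data, here is a minimal sketch of computing each annotator’s Cohen’s kappa against the majority vote using scikit-learn. The annotator-level rows are invented for illustration; as the caveat below notes, kappa computed from only a handful of instances is very unstable.

```python
from collections import Counter, defaultdict
from sklearn.metrics import cohen_kappa_score

# Toy annotator-level labels as (instance_id, annotator_id, label) rows, invented for illustration.
rows = [
    ("s1", "a1", "pos"), ("s1", "a2", "pos"), ("s1", "a3", "neg"),
    ("s2", "a1", "neg"), ("s2", "a2", "neg"), ("s2", "a3", "neg"),
    ("s3", "a1", "pos"), ("s3", "a2", "neu"), ("s3", "a3", "neu"),
    ("s4", "a1", "neg"), ("s4", "a2", "pos"), ("s4", "a3", "pos"),
]

# Majority-vote label per instance (ties broken arbitrarily in this toy setup).
per_instance = defaultdict(list)
for instance_id, _, label in rows:
    per_instance[instance_id].append(label)
majority = {i: Counter(labels).most_common(1)[0][0] for i, labels in per_instance.items()}

# Cohen's kappa between each annotator and the majority vote, on the instances they labelled.
# With this few instances, kappa is very unstable (the same caveat raised in the text below).
own_labels = defaultdict(list)
majority_labels = defaultdict(list)
for instance_id, annotator_id, label in rows:
    own_labels[annotator_id].append(label)
    majority_labels[annotator_id].append(majority[instance_id])

for annotator_id in own_labels:
    kappa = cohen_kappa_score(own_labels[annotator_id], majority_labels[annotator_id])
    print(annotator_id, round(kappa, 2))
```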
Although it is not clear from the paper whether the annotators have each annotated equal numbers of sentences (their results include kappa scores of both 0.0 and 1.0, which could indicate that some scores are calculated from very few annotations), or how differences in these numbers could affect the kappa scores, the paper makes some important observations, such as:
1. In subjective tasks, there is not always a single correct label and the perspective of individual annotators may be as valuable as an ‘expert’ perspective.
2. The aggregation step may potentially under-represent certain groups’ perspectives.
Point 1 should be fairly clear by now, but I will add that this is important to remember when stumbling upon annotation quality evaluations based on comparing annotations to one or two ‘expert’ annotations. To understand point 2 better, have a look at this simple illustration:
[Illustration: a pool of annotators with different demographic attributes, from which five are sampled per instance and aggregated by majority vote]
On the left, we have a pool of available annotators with different demographic attributes. Which demographic attributes are of interest depends on the task and requires an understanding of how background shapes perception and biases. If you are classifying political opinions in US tweets, you may take circle-heads to represent liberals, square-heads to represent conservatives, and triangle-heads to represent moderates. If you prefer to stick to the sentiment case, they may represent individuals with different ethnic identities. The content of this pool depends, at the very least, on the chosen platform (Amazon Mechanical Turk, Upwork, Prolific, etc.), and since workers self-select into tasks, the pool also depends on who finds the job interesting. We can therefore not expect this pool to be representative of the general population. Say the pool of annotators already over-represents some group (circle-heads), and for each instance we choose to have five individual annotators and take the majority vote of those five. Given that we have more circle-heads in the sample, and assuming that circle-heads tend to agree more with other circle-heads than with square-heads (which is a fairly safe assumption), we naturally end up in a situation where the majority vote will, more often than not, represent the judgements commonly made by circle-heads.
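This effect is easy to verify with a small simulation. The sketch below assumes a pool that is 80% circle-heads and 20% square-heads and assumes made-up labelling tendencies for each group; the exact numbers do not matter, only the skew.

```python
import random

random.seed(0)

# Hypothetical pool: 80% "circle-heads", 20% "square-heads".
pool = ["circle"] * 80 + ["square"] * 20

# Assumed labelling tendencies on ambiguous instances (made up for illustration):
# circle-heads judge an instance "positive" 70% of the time, square-heads only 30%.
P_POSITIVE = {"circle": 0.7, "square": 0.3}

def annotate(group):
    return "positive" if random.random() < P_POSITIVE[group] else "negative"

matches_circle = matches_square = 0
n_instances = 10_000
for _ in range(n_instances):
    annotators = random.sample(pool, 5)          # five annotators drawn from the skewed pool
    labels = [annotate(g) for g in annotators]
    majority = max(set(labels), key=labels.count)
    matches_circle += majority == "positive"     # matches circle-heads' typical judgement
    matches_square += majority == "negative"     # matches square-heads' typical judgement

print(f"Majority vote matches circle-heads' typical judgement: {matches_circle / n_instances:.2f}")
print(f"Majority vote matches square-heads' typical judgement: {matches_square / n_instances:.2f}")
```

With these assumed numbers, the majority vote ends up matching the circle-heads’ typical judgement far more often than the square-heads’, even though one in five annotators in the pool is a square-head.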
Prabhakaran et al. (2021) recommend that researchers release annotator-level labels rather than only the aggregated labels and, if possible to do so in a responsible manner, gather and release annotators’ socio-demographic information. This would allow others to investigate such potential systematic disagreements, ensure disagreements are not unnecessarily removed, and gain insights into demographic differences and fairness among annotators.
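What such a release could look like in practice: one long-format file with a row per individual judgement and, kept separate, an anonymised annotator table containing only attributes that can be shared responsibly. The schema and column names below are my own sketch, not the format used by Prabhakaran et al.

```python
import csv

# One row per individual judgement, so nothing is lost to aggregation.
annotations = [
    {"instance_id": "s1", "annotator_id": "a1", "label": "positive"},
    {"instance_id": "s1", "annotator_id": "a2", "label": "negative"},
    {"instance_id": "s1", "annotator_id": "a3", "label": "positive"},
]

# Separate, anonymised annotator table; include only attributes that can be shared responsibly.
annotators = [
    {"annotator_id": "a1", "age_bracket": "25-34", "political_leaning": "moderate"},
    {"annotator_id": "a2", "age_bracket": "35-44", "political_leaning": "liberal"},
    {"annotator_id": "a3", "age_bracket": "25-34", "political_leaning": "conservative"},
]

for path, table in [("annotations.csv", annotations), ("annotators.csv", annotators)]:
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(table[0].keys()))
        writer.writeheader()
        writer.writerows(table)
```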
Questions to live with (for now)
So far the studies focus on disagreement in subjective tasks. However, whether objectivity (free from personal biases) can ever truly be attained is an old philosophical question that perhaps deserves some attention in NLP. It could therefore be interesting to see studies similar to those outlined above, but in relation to tasks most people would describe as relatively objective. The same goes for the question of how sensitive ‘expert’ annotations are to their own biases in both ‘subjective’ and ‘objective’ tasks.
Take-away points
- Agreement is not always a good measure of data quality. Disagreements are often genuine in subjective NLP tasks and should therefore be seen as signal, not noise.
- Always be cautious about throwing data away, and always keep a record of the original data.
- Be cautious when others throw away data due to disagreement, since this may produce unrealistically high performance scores.
- If possible, check your gathered data for systematic disagreements that may be linked to socio-demographic differences.
- When releasing your crowdsourced annotations: release all annotations rather than just aggregated annotations, and release annotator demographics when possible.
- Collecting more annotations per instance is of course better than collecting only one, but when aggregating annotations, ask yourself this question: Does it make sense for my task to have a single “ground truth” label for each instance?
References
Leonardelli, E., Menini, S., Aprosio, A.P., Guerini, M., & Tonelli, S. (2021). Agreeing to Disagree: Annotating Offensive Language Datasets with Annotators’ Disagreement. Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP). 10.18653/v1/2021.emnlp-main.822
Prabhakaran, V., Davani, A.M., & Díaz, M. (2021). On Releasing Annotator-Level Labels and Information in Datasets. Proceedings of The Joint 15th Linguistic Annotation Workshop (LAW) and 3rd Designing Meaning Representations (DMR) Workshop. 10.18653/v1/2021.law-1.14
Other relevant papers
Pavlick, E., & Kwiatkowski, T. (2019). Inherent Disagreements in Human Textual Inferences. Transactions of the Association for Computational Linguistics, 7, 677–694.
Davani, A.M., Díaz, M., & Prabhakaran, V. (2021). Dealing with Disagreements: Looking Beyond the Majority Vote in Subjective Annotations. ArXiv, abs/2110.05719.