NIPPING IT IN THE BUD: Mitigating Opinion Biases in the Crowdsourced Collection of Subjective Judgments
This post, written by Ujwal Gadiraju, summarizes a research paper authored by Christoph Hube, Besnik Fetahu, and Ujwal Gadiraju from the L3S Research Center, Leibniz University of Hannover, which is available here. The paper will be presented at the ACM CHI Conference on Human Factors in Computing Systems, the premier international conference on Human-Computer Interaction, on May 9, 2019 in Glasgow, Scotland, UK.
Microtask crowdsourcing is widely used to gather human input at scale, to build training data for machine learning models, and to evaluate a variety of systems. Recent research has revealed that tasks requiring human interpretation or analysis are among the most popular types of crowdsourced tasks. In many scenarios, such interpretation tasks may be susceptible to the biases of the human labelers (commonly referred to as crowd workers when recruited through online crowdsourcing marketplaces). These biases can arise from various factors, including the diverse cultural backgrounds of crowd workers, their personal opinions on a topic, and their ideological or other group memberships.
Studies have shown that one’s political or ideological stance can strongly influence one’s perception and interpretation of facts. For instance, a survey in 2007 showed that only 23% of the US population who politically identified with the Republican party believed that humans have an influence on climate change. In the light of the recent election campaigns in the USA and the widespread distribution and consumption of fake news around the world, it has become imperative to build intelligent methods to distinguish fact from fiction. Moreover, detecting whether a statement is neutral or opinionated is of paramount importance in the construction of neutral narratives in news and other media. For example, the largest open encyclopedia in the world, Wikipedia, subjects its editors to the principle of maintaining a ‘neutral point-of-view’ (NPOV). Edits that fail to adhere to such a neutral perspective are deleted. At the core of building automated models that can identify an opinionated statement lies the need for human-generated ground truth. In such interpretation tasks, a worker’s awareness of possible biases that may be introduced due to their personal or ideological stances is crucial in providing noise-free judgments.
Several statistical methods have been proposed to tackle systematic bias in crowdsourced data collections in a post-hoc manner (that is, after the data is collected). In contrast, to tackle the problem of systematic bias seeping into worker judgments due to their opinions, we propose three novel methods that aim to nip such bias in the metaphorical bud. We aim to mitigate bias at the point of origin, that is, during label acquisition from crowd workers.
In our experimental setup for the bias detection task, workers were first asked to self-report their position on a set of stances pertaining to controversial topics. For example, workers were asked to report the extent to which they agreed with the stance that ‘Abortion should be legal’. Next, they were tasked with labeling a given statement as ‘Neutral’ or ‘Opinionated’, or as ‘Unsure’ when in doubt. The selected statements pertained to the controversial topics of abortion, feminism, global warming, gun control, and LGBT rights. We recruited 480 workers from the USA using the Figure-Eight platform, and used a balanced dataset of ‘pro’, ‘contra’, and ‘neutral’ statements; that is, statements that support a stance, those that oppose a stance, and those that neither support nor oppose a stance, respectively. These topics were specifically selected because they are commonly discussed in mainstream media in the USA, making it arguably easy for citizens to have an opinion on them.
Proposed Methods — How?
1). Social Projection: We draw inspiration from prior works on the theory of social projection and the Bayesian Truth Serum to propose methods that elicit honest responses from workers. Instead of asking workers for their own judgment directly, we ask them whether or not other workers would find a given statement to be opinionated. The intended effect is that workers inadvertently and more honestly represent their own beliefs by projecting them onto others.
2). Awareness Reminders: A general consensus across research communities that study the implicit and explicit biases that humans exhibit is that creating an awareness of an existing bias is the first step towards mitigating it. We put this to the test by exploring whether drawing workers’ attention to the possibility that their opinions interfere with the bias detection task can mitigate the systematic bias stemming from those opinions.
3). Personalized Nudges: Prior knowledge of the stances that workers hold with respect to a given topic allows us to present personalized instructions to them. We leverage this previously gathered information to alert opinionated workers to their strong support for or opposition to given topics, asking them to be ‘extra-careful’ while judging the statements related to those topics.
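To make the idea of a personalized nudge concrete, here is a minimal sketch in Python. It is not the implementation used in the study: the 5-point scale, the rule that only the two extreme answers count as 'strongly opinionated', the function name personalized_nudge, and the wording of the message are all assumptions made for illustration.

```python
from typing import Optional

# Assumption: stances are self-reported on a 5-point scale
# (1 = strongly disagree, 5 = strongly agree). The scale and the
# cut-offs are illustrative and not taken from the paper.
STRONGLY_OPPOSE = 1
STRONGLY_SUPPORT = 5


def personalized_nudge(topic: str, stance: int) -> Optional[str]:
    """Return a reminder for strongly opinionated workers, else None."""
    if not STRONGLY_OPPOSE <= stance <= STRONGLY_SUPPORT:
        raise ValueError(f"stance must be between 1 and 5, got {stance}")
    if stance == STRONGLY_SUPPORT:
        position = "strongly support"
    elif stance == STRONGLY_OPPOSE:
        position = "strongly oppose"
    else:
        # Moderate or undecided workers receive no extra instruction.
        return None
    # Hypothetical message text, loosely modeled on the 'extra-careful' wording.
    return (
        f"You reported that you {position} the stance on {topic}. "
        "Please be extra careful when judging statements on this topic."
    )


if __name__ == "__main__":
    print(personalized_nudge("gun control", 5))
    print(personalized_nudge("gun control", 3))
```

In this sketch, a worker who answered 5 for a topic sees the reminder before judging statements on that topic, while a worker who answered 3 sees the task unchanged.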
Main Findings — What did we learn?
We compared these proposed methods to a baseline condition where no intervention was used, a situation that is typical of general crowdsourcing practice in several marketplaces such as Amazon’s Mechanical Turk or Figure-Eight. We introduced a measure of bias based on how often crowd workers misclassified ‘pro’ statements and ‘con’ statements as being ‘neutral’. We found that the total bias measured in the Baseline condition (7.85) was the largest, in comparison to the Personalized Nudges (4.47), Social Projection (4.37), and Awareness Reminders (4.00) conditions. Thus, making workers aware that their opinions can influence their judgments was the most effective method to mitigate systematic bias stemming from worker opinions in this task.
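To put the reported totals in perspective, the short Python sketch below computes how much lower each condition's total bias is relative to the Baseline. The input numbers are the ones given above; the percentages are simple arithmetic on them and are not figures quoted from the paper. The helper name relative_reduction is ours.

```python
# Total bias per condition, as reported in the text above.
total_bias = {
    "Baseline": 7.85,
    "Personalized Nudges": 4.47,
    "Social Projection": 4.37,
    "Awareness Reminders": 4.00,
}


def relative_reduction(baseline: float, value: float) -> float:
    """Percentage by which `value` is lower than `baseline`."""
    return (baseline - value) / baseline * 100.0


baseline = total_bias["Baseline"]
reductions = {
    name: round(relative_reduction(baseline, value), 1)
    for name, value in total_bias.items()
    if name != "Baseline"
}

for name, pct in reductions.items():
    print(f"{name}: {pct}% lower than Baseline")
# Personalized Nudges: 43.1% lower than Baseline
# Social Projection: 44.3% lower than Baseline
# Awareness Reminders: 49.0% lower than Baseline
```

By this derived measure, all three interventions cut the total bias by more than 40%, with Awareness Reminders roughly halving it.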
We found that strongly opinionated workers (as deduced through self-reported stances on the different topics) contributed the most to the total bias measured across all conditions. Workers consistently tended to misclassify a ‘pro’ statement as being ‘neutral’ more often than they misclassified a ‘con’ statement as being ‘neutral’. We aim to investigate this observation further in our future work. Finally, our results suggested that experience in crowd work does not alleviate a worker’s susceptibility to systematic bias stemming from one’s own opinions. Regardless of experience, workers are equally susceptible to their strong opinions resulting in biased judgments.
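The comparison between ‘pro’ and ‘con’ misclassifications can be illustrated with a small Python sketch. It is a simplified illustration and not the paper's exact bias measure: it only computes, for one statement type, the share of judgments labeled ‘Neutral’. The function name and the toy judgments are hypothetical and are not data from the study.

```python
from collections import Counter
from typing import Sequence, Tuple

# Each judgment is (statement_type, worker_label).
Judgment = Tuple[str, str]


def neutral_misclassification_rate(
    judgments: Sequence[Judgment], statement_type: str
) -> float:
    """Share of judgments on `statement_type` statements labeled 'Neutral'."""
    counts = Counter(
        label for kind, label in judgments if kind == statement_type
    )
    total = sum(counts.values())
    if total == 0:
        raise ValueError(f"no judgments for statement type {statement_type!r}")
    return counts["Neutral"] / total


# Hypothetical toy judgments, made up for illustration only.
toy_judgments = [
    ("pro", "Neutral"), ("pro", "Opinionated"),
    ("pro", "Neutral"), ("pro", "Opinionated"),
    ("con", "Opinionated"), ("con", "Neutral"),
    ("con", "Opinionated"), ("con", "Opinionated"),
    ("neutral", "Neutral"),
]

pro_rate = neutral_misclassification_rate(toy_judgments, "pro")
con_rate = neutral_misclassification_rate(toy_judgments, "con")
print(f"pro -> Neutral: {pro_rate:.2f}, con -> Neutral: {con_rate:.2f}")
```

With the toy judgments above, the ‘pro’ rate comes out higher than the ‘con’ rate, which mirrors the direction of the reported observation but says nothing about its actual size.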
Through our work, we showed that it is possible to mitigate biases stemming from worker opinions by using design interventions while gathering labels from the crowd. In the future, we plan to explore how post-hoc methods (applied after data collection) can be combined with the methods proposed in this work to further mitigate biases in crowdsourced judgments.
Full citation: Christoph Hube, Besnik Fetahu, and Ujwal Gadiraju. Understanding and Mitigating Worker Biases in the Crowdsourced Collection of Subjective Judgments. In Proceedings of the 37th Annual ACM Conference on Human Factors in Computing Systems, CHI 2019.