Efficient Elicitation Approaches to Estimate Collective Crowd Answers
This post summarizes a research paper that studied how efficiently different answer elicitation methods estimate the distribution of other people’s answers. This paper will be presented at the ACM Conference on Computer-Supported Cooperative Work and Social Computing on November 11th (as a Best Paper Honorable Mention recipient).
The most common approach to collecting data labels for machine learning (ML) is to find a single “best” answer. This makes it easy to collect data at scale via crowdsourcing, while still being able to do quality control via techniques like majority voting (i.e., selecting the most agreed-upon answer). While this widely-used approach has been effective in helping people collect accurate labels for data with objective answers, it does not work well if the data is ambiguous and subject to interpretation, such as labeling what emotion is displayed by a person’s facial expression or deciding if two words are conceptually related. In such cases, people will disagree on the best labels, and collapsing a diverse set of labels to the single most agreed-upon one would not convey the true information about people’s varying interpretations.
For inherently ambiguous data, an answer distribution (the distribution of people’s collective labels) represents people’s diverse interpretations better than a single best answer. For example, an ML model that classifies a user’s emotion and is trained only on data with single best labels would fail to detect some of the emotional states that people may perceive, while a model trained on answer distributions would more easily recognize all the possibilities.
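The difference between the two targets can be sketched in a few lines of Python. This is only an illustration: the function names and the ten example labels are hypothetical and do not come from the paper's dataset.

```python
from collections import Counter


def majority_vote(labels):
    """Return the single most frequent label (ties go to the label seen first)."""
    return Counter(labels).most_common(1)[0][0]


def answer_distribution(labels):
    """Return the proportion of annotators who chose each label."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}


# Hypothetical labels from ten annotators for one ambiguous facial image.
labels = ["positive"] * 6 + ["negative"] * 4

print(majority_vote(labels))        # positive
print(answer_distribution(labels))  # {'positive': 0.6, 'negative': 0.4}
```

Majority voting keeps only "positive" and drops the fact that four of the ten annotators read the image as negative, which is exactly the information an answer distribution preserves.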
However, one of the bottlenecks in using answer distributions is the cost of collecting such annotations: accurately estimating the proportion of people who would select each label requires more people than finding the single most agreed-upon answer.
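A small simulation can make this cost concrete. The sketch below assumes a binary label, independent workers, and a hypothetical true proportion of 0.7. None of these numbers come from the paper. It compares how often the majority lands on the favored label with how far the estimated proportion is from the true one.

```python
import random


def simulate(p_true, n_workers, trials=2000, seed=0):
    """Simulate crowds of n_workers answering a binary question.

    Returns (mean absolute error of the estimated proportion,
             fraction of trials where the majority matches the favored label).
    """
    rng = random.Random(seed)
    total_abs_err = 0.0
    majority_hits = 0
    for _ in range(trials):
        # Each worker independently answers "positive" with probability p_true.
        positives = sum(rng.random() < p_true for _ in range(n_workers))
        p_hat = positives / n_workers
        total_abs_err += abs(p_hat - p_true)
        if (p_hat > 0.5) == (p_true > 0.5):
            majority_hits += 1
    return total_abs_err / trials, majority_hits / trials


# Odd crowd sizes avoid ties in the majority vote.
for n in (5, 15, 45):
    err, acc = simulate(p_true=0.7, n_workers=n)
    print(f"n={n:2d}  proportion error={err:.3f}  majority accuracy={acc:.3f}")
```

Under these assumptions the majority becomes reliable with a fairly small crowd, while the error in the estimated proportion shrinks only slowly (roughly with the square root of the number of workers).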
We investigated ways to more efficiently elicit answer distributions by asking for richer answers from each crowd worker. In the task of annotating positive/negative aspects of the emotion from facial images, we examined and compared eight approaches along two dimensions: annotation granularity (the amount of information elicited from each worker), and estimation perspective (whether workers are asked to answer from their own perspective or to estimate the answers of others). For annotation granularity, we expected that with more information from each worker, we would be able to estimate the answer distributions with fewer workers and less human labor time. For estimation perspective, we expected that crowd workers would better estimate collective answer distributions by estimating other people’s perspectives.
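To give a feel for what "richer answers" can mean, here is one simple way to turn single-choice and multiple-choice responses into an estimated distribution. This is an illustrative assumption, not necessarily the aggregation used in the paper: each worker carries one unit of weight, split equally across the labels they selected. The category names and responses are hypothetical.

```python
def aggregate(responses, categories):
    """Estimate an answer distribution from per-worker label selections.

    responses: list of non-empty sets, one per worker, of chosen categories.
    Each worker contributes a total weight of 1, split equally over their
    chosen categories (an assumed scheme for illustration only).
    """
    totals = {c: 0.0 for c in categories}
    for chosen in responses:
        share = 1.0 / len(chosen)
        for c in chosen:
            totals[c] += share
    n = len(responses)
    return {c: totals[c] / n for c in categories}


categories = ["positive", "neutral", "negative"]  # hypothetical label set

# Four hypothetical workers; two of them select more than one label.
responses = [
    {"positive"},
    {"positive", "neutral"},
    {"neutral", "negative"},
    {"positive"},
]

print(aggregate(responses, categories))
# {'positive': 0.625, 'neutral': 0.25, 'negative': 0.125}
```

With single-choice answers every set has exactly one element and the function reduces to ordinary label proportions. Multiple-choice answers let one worker spread weight over several plausible labels.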
Surprisingly, the most fine-grained annotations were not the most accurate in estimating answer distributions. Among all approaches, the most accurate was allowing workers to choose multiple answers while estimating other people’s answers. When we set the total human time to be the same across different approaches, only this approach and choosing multiple answers from a worker’s own perspective could significantly outperform the baseline of choosing a single answer from a worker’s own perspective. In fact, the best approach could reduce the number of people required by 40% compared to the baseline and reduce the human time required by 21.4%.
But why do the most fine-grained approaches to eliciting information not result in the most accurate answers? With additional analysis, we found that when crowd workers annotated probabilities, they concentrated the probability on fewer categories than the gold standard distribution does.
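One way to quantify this kind of concentration, offered here as an assumption rather than the paper's own analysis, is Shannon entropy: a distribution that piles its probability onto fewer categories has lower entropy. The two distributions below are hypothetical.

```python
import math


def entropy(dist):
    """Shannon entropy in nats; zero-probability categories contribute nothing."""
    return -sum(p * math.log(p) for p in dist if p > 0)


def effective_categories(dist):
    """Exponential of the entropy: roughly how many categories carry the mass."""
    return math.exp(entropy(dist))


# Hypothetical gold standard distribution over four categories.
gold = [0.4, 0.3, 0.2, 0.1]
# Hypothetical probability annotation from one worker, skewed to two categories.
worker = [0.8, 0.2, 0.0, 0.0]

print(f"gold:   {effective_categories(gold):.2f} effective categories")
print(f"worker: {effective_categories(worker):.2f} effective categories")
```

In this example the worker's annotation spreads its mass over fewer effective categories than the gold standard, which is the pattern described above.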
Our work suggests that it is possible to crowdsource answer distributions more efficiently by receiving richer answers. As a result, this work helps make it more feasible to use answer distributions for training ML algorithms with ambiguous data. The CSCW community should keep pursuing this goal, as it would enable the training of machine learning models that reflect the collective perspectives of annotators rather than the narrow perspective of an individual.
Full citation: John Joon Young Chung, Jean Y. Song, Sindhu Kutty, Sungsoo (Ray) Hong, Juho Kim, and Walter S. Lasecki. 2019. Efficient Elicitation Approaches to Estimate Collective Crowd Answers. Proc. ACM Hum.-Comput. Interact. 3, CSCW, Article 62 (November 2019), 25 pages. https://doi.org/10.1145/3359164