What if you could cut your experiment costs by a factor of 9? What if you could expand your sample pool by thousands? What if you could shorten your experiments by weeks? Would you be able to trust your results? In their article “Crowdsourcing Graphical Perception: Using Mechanical Turk to Assess Visualization Design,” Jeffrey Heer and Michael Bostock assess just that by running visualization experiments on Amazon Mechanical Turk and comparing their results to previous studies.
Crowdsourcing is the practice of drawing on external participants on the internet to create goods or services. Amazon Mechanical Turk is one of the oldest and most established marketplaces for paid crowdsourcing, where so-called Turkers (or crowdworkers) earn micropayments (from around $0.10 to several dollars) for completing specific tasks. The tasks are called “HITs”, which is short for Human Intelligence Tasks. The basic idea behind this work is to use Mechanical Turk to run HCI and visualization user studies, where the HITs are the trials in these studies.
In their recreations of four previous visualization studies, Heer and Bostock were able to show similar results across several thousand responses. They went into their experiment unsure of the validity of the results they would get, but made a concerted effort to create as clean an experimental environment as they could outside of a lab. They accounted for user understanding through a qualification test, with the result that over 95% of participants moved on to the experiments, in contrast to the 10% drop-off they ran into when they didn’t use a qualification step. They collected information about participants in unobtrusive ways. They were even able to identify some ways to account for participants working on different machines, on different operating systems, and in different physical environments.
Their experiments covered proportional judgment, luminance contrast, and rectangular area judgments. Participants returned results that met the expectations set by similar experiments conducted in person. Results for all three suggest that crowdsourced results were as accurate as in-person results, and the rejection rate due to outliers was small. Participants generally finished every HIT in a given set, so each participant’s data could be treated as a single response.
So what did they conclude? Heer and Bostock recommend incorporating crowdsourcing into your experimental design, with some caveats. The approach introduced significant cost savings: even accounting for their attempt to pay at the minimum wage rate, a $0.02 reward costs far less than the typical compensation of $15 per participant. They analyzed accuracy across various reward amounts and found that, although there is some effect, paying more didn’t influence accuracy in a meaningful way. The format also reduced the time needed both to administer the experiments and to handle logistics. That led the team to conclude that crowdsourcing offers a way to expand study designs, since the time savings allow for a greater range of replicated experiments. Mechanical Turk gave access to a wider audience while retaining consistency and validity in the results. On the caveats side, they suggest creating unique criteria for participants. They saw anomalies in time to answer that they couldn’t reliably account for, they didn’t have consistent workers when experiments were conducted over time, and they have some questions about accessibility. Due to these limitations, they recommend a combination of in-person and crowdsourced experiments.
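The cost comparison above can be made concrete with a little arithmetic. The sketch below is illustrative only: the $0.02 reward and the $15 per in-person participant come from the text, while the number of HITs per worker, the number of workers, the platform fee, and the time per HIT are hypothetical placeholders, not figures from the paper. The function names are likewise made up for this example.

```python
from decimal import Decimal


def crowd_cost(reward_per_hit, hits_per_worker, workers, fee_rate=Decimal("0")):
    """Total payout for a crowdsourced study, plus an optional platform fee."""
    base = reward_per_hit * hits_per_worker * workers
    return base * (1 + fee_rate)


def lab_cost(pay_per_participant, participants):
    """Total compensation for an in-person study."""
    return pay_per_participant * participants


def effective_hourly_wage(reward_per_hit, seconds_per_hit):
    """Hourly rate implied by a per-HIT reward and the time one HIT takes."""
    return reward_per_hit * Decimal(3600) / Decimal(seconds_per_hit)


# Figures taken from the text.
REWARD_PER_HIT = Decimal("0.02")
LAB_PAY = Decimal("15")

# Hypothetical study parameters (assumptions, not from the paper).
HITS_PER_WORKER = 50
WORKERS = 100
FEE_RATE = Decimal("0.10")
SECONDS_PER_HIT = 10

crowd_total = crowd_cost(REWARD_PER_HIT, HITS_PER_WORKER, WORKERS, FEE_RATE)
lab_total = lab_cost(LAB_PAY, WORKERS)

print(f"Crowdsourced total: ${crowd_total:.2f}")
print(f"In-person total:    ${lab_total:.2f}")
print(f"Savings factor:     {lab_total / crowd_total:.1f}x")
print(f"Implied wage:       ${effective_hourly_wage(REWARD_PER_HIT, SECONDS_PER_HIT):.2f}/hour")
```

Changing the placeholder values shows how sensitive the savings factor is to the number of HITs each worker completes, and how the time per HIT determines whether a given reward approaches a target hourly wage.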
This article was published in 2010, but its findings remain salient today. A look at Mechanical Turk now shows that it still has a large number of users and still offers cost and time savings. It has also introduced more templates and Masters-qualified workers, and it requires only minimal coding to use.
- Jeffrey Heer and Michael Bostock (2010). Crowdsourcing graphical perception: using Mechanical Turk to assess visualization design. In Proceedings of the ACM Conference on Human Factors in Computing Systems, pp. 203–212.