Group (Design) Think

Sarah V
Sep 12, 2019 · 4 min read


How crowdsourcing can be used to evaluate visualizations on the internet.

A group puts hands in the middle of a table containing several computers. Crowdsourcing.
Photo from Pexels

What if you could cut your experiment costs by a factor of 9? What if you could expand your sample pool by thousands? What if you could shave weeks off your experiment time? Would you be able to trust your results? In their article "Crowdsourcing Graphical Perception: Using Mechanical Turk to Assess Visualization Design," Jeffrey Heer and Michael Bostock assess just that by performing visualization experiments on Amazon Mechanical Turk and comparing their results to previous studies.

Crowdsourcing is the practice of drawing on external participants on the internet to create goods or services. Amazon Mechanical Turk is one of the oldest and most established marketplaces for paid crowdsourcing, where so-called Turkers (or crowdworkers) earn micropayments (from around $0.10 to several dollars) to complete specific tasks. These tasks are called "HITs," short for Human Intelligence Tasks. The basic idea behind this work is to use Mechanical Turk to run HCI and visualization user studies, where the HITs are actually trials in those studies.

In their recreations of four previous visualization studies, Heer and Bostock were able to show similar results across several thousand responses. They went into the experiment unsure of how valid their results would be, but made a concerted effort to create as clean an experimental environment as they could outside a lab. They accounted for user understanding through a qualification test, which resulted in over 95% of participants moving on to the experiments, rather than the 10% drop-off they ran into when they didn't use a qualification step. They collected information about participants in non-obtrusive ways, and were even able to identify ways to account for participants working on different machines, operating systems, and physical environments.
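A qualification step like the one described can be sketched as a simple known-answer filter: only participants whose answers on a few gold-standard trials fall within tolerance move on to the real experiment. This is an illustrative sketch, not the authors' actual code; the function and field names (`passesQualification`, `answers`, `truth`) are hypothetical.

```javascript
// Hypothetical sketch of a qualification gate: participants must answer
// known-answer perceptual trials correctly before their data is accepted.
function passesQualification(response, answerKey, tolerance = 0.05) {
  // A response qualifies if every known-answer estimate is within tolerance.
  return answerKey.every(({ id, truth }) => {
    const estimate = response.answers[id];
    return estimate !== undefined && Math.abs(estimate - truth) <= tolerance;
  });
}

const answerKey = [
  { id: "q1", truth: 0.5 },  // e.g. "bar B is 50% the height of bar A"
  { id: "q2", truth: 0.25 },
];

const responses = [
  { worker: "w1", answers: { q1: 0.52, q2: 0.24 } }, // passes both
  { worker: "w2", answers: { q1: 0.9, q2: 0.25 } },  // fails q1
];

const qualified = responses.filter(r => passesQualification(r, answerKey));
console.log(qualified.map(r => r.worker)); // logs ["w1"]
```

Filtering on known-answer trials up front, rather than rejecting work after the fact, matches the spirit of the qualification test the authors describe.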

Their experiments covered proportional judgment, luminance contrast, and rectangular area judgment. Their assessments showed that participants returned results that matched expectations developed from similar in-person experiments. Results for all three suggest that crowdsourced results were as accurate as in-person results, and the rejection rate due to outliers was small. Participants generally finished every HIT in a given set, so each participant's data could be used as a single response.

Higher rewards led to faster completion of experiment runs, but all projects showed a faster-than-expected time to completion for their HITs.

The team could not always account for individual differences in participants' displays or viewing angles. In their attempts to reduce these effects, they found optimal display parameters for experiments conducted online, and were even able to identify effects introduced by the operating system. They were able to incorporate their own JavaScript into the experiment, and used it to gather information about operating systems and monitors.
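Instrumentation in this spirit can be sketched as a small function that records the operating system and display parameters for each response. The paper does not publish its logging code, so the field names below are illustrative; in a browser you would pass in `navigator.userAgent` and `window.screen`.

```javascript
// Hypothetical sketch of in-experiment environment logging, in the spirit
// of the JavaScript instrumentation the authors describe.
function detectEnvironment(userAgent, screen) {
  // Coarse OS detection from the user-agent string.
  const os =
    /Windows/.test(userAgent) ? "Windows" :
    /Mac OS X|Macintosh/.test(userAgent) ? "macOS" :
    /Linux/.test(userAgent) ? "Linux" :
    "Unknown";
  return {
    os,
    resolution: `${screen.width}x${screen.height}`,
    colorDepth: screen.colorDepth,
  };
}

// In a browser: detectEnvironment(navigator.userAgent, window.screen)
const env = detectEnvironment(
  "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
  { width: 1920, height: 1080, colorDepth: 24 }
);
console.log(env);
```

Logging environment data alongside each trial is what lets per-OS or per-display effects be separated out during analysis, rather than being folded into noise.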

So what did they conclude? Heer and Bostock recommend incorporating crowdsourcing into your experimental design, with some caveats. The approach introduced significant cost savings: even accounting for their attempt to pay at the minimum-wage rate, a $0.02 reward saves considerably over typical compensation of $15 per participant. They analyzed accuracy across various reward amounts and found that, while reward size had some effect, paying more did not influence accuracy in a meaningful way. The format also reduced the time their experiments took, both to administer and to run, which led the team to conclude that crowdsourcing offers a way to expand study designs: the time savings allow for a greater range of replicated experiments. Mechanical Turk gave them access to a wider audience while retaining consistency and validity in their results.

There are caveats, though. They suggest creating unique qualification criteria for participants. They saw anomalies in response times that they couldn't reliably account for, they didn't have a consistent worker pool when experiments were conducted over time, and they have open questions about accessibility. Due to these limitations, they recommend a combination of in-person and crowdsourced experiments.

This article was published in 2010, but its findings remain salient today. A look at Mechanical Turk now shows that it still has a large number of users, still offers cost and time savings, and has since introduced more templates, Masters-qualified participants, and workflows that require only minimal coding.

  • Jeffrey Heer and Michael Bostock (2010). Crowdsourcing Graphical Perception: Using Mechanical Turk to Assess Visualization Design. In Proceedings of the ACM Conference on Human Factors in Computing Systems (CHI 2010), pp. 203–212.


