Crowdsourcing in Practice: Our Findings

Analysis of a Large Dataset from a Crowdsourcing Marketplace

This short blog post provides a sneak peek into our VLDB ’17 paper titled “Understanding Workers, Developing Effective Tasks, and Enhancing Marketplace Dynamics: A Study of a Large Crowdsourcing Marketplace” (journal version, arXiv version). Blog post written by Akash Das Sarma, lightly edited by me. Coauthors: Ayush Jain, Jennifer Widom.

Despite the excitement surrounding artificial intelligence and the ubiquitous need for large volumes of manually labeled training data, the past few years have been a relatively tumultuous period for the crowdsourcing industry. There has been a recent spate of mergers, rebrandings, slowdowns, and moves towards private crowds. So, our goal is to try to step back and study the state of current marketplaces, to understand how these marketplaces are performing, how the requesters are making and can make best use of these marketplaces, and how workers are participating in these marketplaces.

Figure 1. Key participants in crowdsourcing

We conducted an experimental analysis of a dataset comprising over 27 million micro-tasks, issued to a large crowdsourcing marketplace between 2012 and 2016 and performed by over 70,000 workers. Using this data, we shed light on three crucial aspects: (1) Task design: understanding what makes an effective, well-designed task; (2) Marketplace dynamics: understanding the interaction between tasks and workers, and the corresponding marketplace load, as well as the types of tasks prevalent on the marketplace; and (3) Worker behavior: understanding worker attention spans, lifetimes, and general behavior.

We believe that this work serves as a first step towards building a comprehensive benchmark of crowd work, and for laying down guiding insights for the next generation of crowdsourcing marketplaces.

Dataset Overview

Some terminology first: we define a task instance to be the fundamental unit of work, typically posed on one webpage (with one or more questions); an item is a piece of data that each question in a task operates on; and a batch is a set of task instances issued in parallel by a requester, differing only in their items.

The dataset we study comprises 27 million task instances distributed across 12 thousand batches. For each batch, we have the high-level task description and the HTML source for one sample task instance in the batch. For each task instance, we have several important pieces of information, including the start time, the end time, and the worker response. For each worker who has completed a task instance, we have the following information, among others: the location, the source of the worker, and the trust score.

Enriching the Dataset

To augment this data, we added three additional types of task attribute data:

  • Manual labels: we manually annotated each batch based on its (a) task goal, e.g., entity resolution, sentiment analysis, (b) human-operator type, e.g., rating, sorting, labeling, and (c) the data type in the task interface, e.g., text, image, social media.
  • Design parameters: we extracted features from the HTML source as well as other raw attributes of the tasks that reflect task design decisions. For example, we checked whether a task contains instructions, examples, text-boxes, and images.
  • Performance metrics: we computed different metrics to characterize the latency, cost, and confusion of tasks to help us perform quantitative analyses on the “effectiveness” of a task’s design.
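As a rough illustration of the design-parameter extraction described above, here is a minimal Python sketch that scans a task's HTML source for images and text-boxes, and uses simple keyword matching to guess at instructions and examples. The class name, the function name, and the keyword heuristics are assumptions made for this sketch; they are not the feature extractors used in the paper.

```python
from html.parser import HTMLParser


class DesignFeatureParser(HTMLParser):
    """Collects a few coarse design features from a task's HTML source."""

    def __init__(self):
        super().__init__()
        self.num_images = 0
        self.num_text_boxes = 0
        self._text_chunks = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "img":
            self.num_images += 1
        elif tag == "textarea":
            self.num_text_boxes += 1
        elif tag == "input" and (attrs.get("type") or "text").lower() == "text":
            # An <input> without a type attribute defaults to a text field.
            self.num_text_boxes += 1

    def handle_data(self, data):
        self._text_chunks.append(data.lower())

    def features(self):
        text = " ".join(self._text_chunks)
        return {
            "has_images": self.num_images > 0,
            "has_text_boxes": self.num_text_boxes > 0,
            # Keyword heuristics: an assumption for illustration only.
            "has_instructions": "instruction" in text,
            "has_examples": "example" in text,
        }


def extract_design_features(html_source):
    """Return a dict of boolean design features for one task's HTML."""
    parser = DesignFeatureParser()
    parser.feed(html_source)
    parser.close()
    return parser.features()
```

Keyword matching will miss instructions that are not labeled as such, so a real pipeline would likely need richer rules or manual spot checks.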

Selected Marketplace Insights

Let’s first look into the high-level, aggregate workings of the marketplace. This is a small sample of what’s in the paper.

What types of tasks do we see on the marketplace?

We plot the task breakdown by goal in Figure 2a; the breakdown by operator in Figure 2b; and the breakdown of each goal by operator in Figure 2c.

Figure 2a: Goals
Figure 2b: Operators
Figure 2c: Goals broken down by Operators

Looking at Figure 2a, we find that Language Understanding and Transcription are by far the most popular goals, comprising over 4 million and 3 million task instances respectively, with the next closest task goal, Sentiment Analysis, spanning only around 1.5 million instances. We also note that language understanding and transcription tasks are relatively challenging, and often involve complex operators as building blocks (see Figure 2c). These tasks are hard to automate, and therefore it may be worthwhile for marketplace administrators to consider training and maintaining skilled workers for them.

Looking at Figure 2b, we find that the classical operators, filtering and rating, are frequently used, with filtering alone being used in over 8 million task instances (over twice as many as any other operator), and that they are used to achieve nearly every goal (as we see in Figure 2c). Given that these two operators have been heavily studied by the crowdsourcing algorithms community and that they can be algorithmically optimized, it would be very valuable for marketplaces to package known algorithms for task allocation and response aggregation into their platforms.

How does the marketplace load and the resulting latency vary?

Figure 3. Observations about the flux in marketplace load and corresponding latency

We plot the tasks issued and pickup time in Figure 3. Overall, we see that the marketplace sees a high variation in the arrival of task instances in any given week, from 0.0004x to 30x of the median task load. This means that the marketplace needs to be able to handle sudden large influxes of tasks, as well as be prepared for downtimes.
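To make the "multiple of the median load" figure concrete, the sketch below counts task instances per ISO week and divides each weekly count by the median weekly count. The function name and the input format (a list of start timestamps) are assumptions for illustration. Weeks with no tasks at all do not appear in the result, which may differ from how the paper treats idle weeks.

```python
from collections import Counter
from datetime import datetime
from statistics import median


def weekly_load_ratios(start_times):
    """Map each (ISO year, ISO week) to its task count over the median weekly count."""
    counts = Counter()
    for ts in start_times:
        iso = ts.isocalendar()
        counts[(iso[0], iso[1])] += 1
    med = median(counts.values())
    return {week: n / med for week, n in counts.items()}


# Hypothetical data: 1 task in week 1, 2 in week 2, and 6 in week 3 of 2015.
example = (
    [datetime(2015, 1, 1)]
    + [datetime(2015, 1, 5)] * 2
    + [datetime(2015, 1, 12)] * 6
)
print(weekly_load_ratios(example))
# {(2015, 1): 0.5, (2015, 2): 1.0, (2015, 3): 3.0}
```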

At the same time, a high load on the marketplace is usually accompanied by higher levels of activity and lower latencies, and vice versa. Thus, the marketplace is able to attract workers to service periods of high demand, indicating that workers' demand for tasks is often not met by an ample supply of tasks.

Do a few requesters dominate the marketplace?

Figure 4. Heavy hitters

To study this, we group batches into clusters, where a cluster is defined as a set of batches with identical task descriptions and UIs issued at different points in time. We plot the issuing rate for the top clusters in Figure 4. It appears that a huge fraction of tasks and batches come from a few clusters; fine-tuning towards such clusters can lead to rich dividends. These “heavy hitter” task types show a rapid increase to a steady stream of activity followed by a complete shutdown, after which that task type is never issued again.
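The clustering step can be sketched in a few lines of Python: batches are grouped by the exact pair of task description and HTML source, and the clusters are then sorted so that the largest ones come first. The dictionary keys `batch_id`, `description`, and `html` are hypothetical field names chosen for this sketch, and exact string equality is a simplifying assumption about what "identical" means.

```python
from collections import defaultdict


def cluster_batches(batches):
    """Group batch ids whose description and HTML source match exactly.

    Returns a list of clusters (lists of batch ids), largest first.
    """
    clusters = defaultdict(list)
    for batch in batches:
        key = (batch["description"], batch["html"])
        clusters[key].append(batch["batch_id"])
    return sorted(clusters.values(), key=len, reverse=True)
```

Summing the task instances in the first few clusters and dividing by the total would then give the share of the marketplace attributable to heavy hitters.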

Task Design Recommendations

Next, we look at some task design choices, and quantitatively analyze the influence of these features on three quality metrics that measure task “effectiveness”: latency, cost (which can be approximated by task-time if we assume an hourly wage), and task confusion (the degree to which different workers disagree with each other).
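One plausible way to quantify task confusion is the fraction of worker pairs that give different answers to the same item. The sketch below implements that pairwise measure; it is an assumption made for illustration and is not necessarily the exact disagreement metric defined in the paper. The function name and input format are likewise hypothetical.

```python
from itertools import combinations


def pairwise_disagreement(answers_by_item):
    """Fraction of worker pairs, pooled over all items, whose answers differ.

    answers_by_item maps an item id to the list of answers that
    different workers gave for that item.
    """
    differing = 0
    total = 0
    for answers in answers_by_item.values():
        for a, b in combinations(answers, 2):
            total += 1
            differing += a != b
    return differing / total if total else 0.0
```

For instance, an item answered "pos", "pos", "neg" contributes three worker pairs, two of which disagree.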

Table 1: Impact of Images

Images (Table 1): we find that tasks without images have over 3x the latency of tasks with images, and about 40% higher task-times.

Table 2: Impact of Examples

Examples (Table 2): we find that tasks without examples have nearly 4x the latency of tasks with examples, and about 30% higher disagreement.

To the best of our knowledge, prior work offers no quantitative evidence that examples and images help, despite ample anecdotal evidence to that effect, so now you have it!

At the same time, despite the obvious benefits of including images and examples in task interfaces, we were surprised to find that most tasks do not do so.

In our paper, in addition to these results, we also:

  • Examine the effects of other design features on the three metrics of interest;
  • Drill down into specific task types and check for the same correlations, thereby eliminating some hidden variables;
  • Perform a regression analysis to predict the outcome metric values of a task based on its given design feature values; and
  • Find specific examples of otherwise similar tasks that differ from each other on one key feature.

Worker Characteristics

We perform several independent experiments to understand the characteristics of worker sources as well as individual workers.

The 80–10 behavior of crowd workers

Figure 5: Tasks completed by top 10% and bottom 90% of workers

We plotted the distribution of work completed by the top 10% and bottom 90% of workers by tasks completed in Figure 5 alongside. We find that just 10% of the workforce completes 80% of the tasks in the marketplace. Given their experience, it might be worthwhile to collect periodic feedback from the active workers and build lasting relationships with them.
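The share of work done by the most active workers can be computed directly from per-worker task counts, as in the following sketch. The function name and input format are assumptions for illustration; the cutoff is rounded to the nearest whole worker and always includes at least one worker.

```python
def top_share(tasks_per_worker, top_fraction=0.10):
    """Share of all tasks completed by the most active fraction of workers.

    tasks_per_worker maps a worker id to the number of tasks completed.
    """
    counts = sorted(tasks_per_worker.values(), reverse=True)
    if not counts:
        return 0.0
    # Round to the nearest whole worker, keeping at least one.
    k = max(1, round(top_fraction * len(counts)))
    return sum(counts[:k]) / sum(counts)
```

A return value near 0.8 with the default `top_fraction` would correspond to the pattern reported above.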

Worker sources and their impact

Figure 6: Quality of worker sources

This particular marketplace gathers workers from over 100 different sources, including other popular crowdsourcing marketplaces such as Amazon Mechanical Turk (AMT). Interestingly, we note that AMT workers on this marketplace take significantly longer than workers from the other major sources. Workers from AMT (amt in Figure 6) on average take 5 times the median task time for any task. They also have a lower trust score than workers from the other major sources (more in our paper).
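A simplified version of this comparison is sketched below: for each worker source, it takes the median task time and divides it by the median over all records. This pools all tasks together, whereas the figure above is stated relative to the median time for each task, so treat it as a coarse approximation. The function name and the `(source, seconds)` record format are assumptions for illustration.

```python
from collections import defaultdict
from statistics import median


def relative_task_time_by_source(records):
    """Map each worker source to its median task time over the overall median.

    records is an iterable of (source, task_seconds) pairs.
    """
    by_source = defaultdict(list)
    all_times = []
    for source, seconds in records:
        by_source[source].append(seconds)
        all_times.append(seconds)
    overall = median(all_times)
    return {source: median(times) / overall for source, times in by_source.items()}
```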

In our paper, we also delve into the distribution of workers by geography, worker sources, number of tasks, worker accuracies, as well as long-term behavior, such as number of days and hours active, and lifetime of activity.

Overall Takeaways

Our work in gaining a better understanding of crowd work in practice has broad ramifications for academics and practitioners designing crowd-powered algorithms and systems.

As examples, understanding the relative importance of various types of data processing needs can help spur research in under-explored areas; understanding how tasks are picked up can help the academic community develop better models of latency and throughput; understanding the worker perspective and engagement can aid in the design of better models for worker accuracy and worker participation in general; and understanding the impact of task design can help academics and practitioners adopt “best practices” to further optimize cost, accuracy, and latency.

Acknowledgements

We thank the NSF for funding this research; we thank the crowdsourcing marketplace folks for pointing us to this rich and fascinating dataset.