The Fourth Industrial Revolution (Episode 1)

Crowdsourced Data Management: Of Deep, Dark Secrets

Aditya Parameswaran
4 min read · Jan 28, 2016

This is the first of N posts describing findings from our book: “Crowdsourced Data Management: Industry and Academic Perspectives”, Adam Marcus and Aditya Parameswaran, Foundations and Trends in Databases, Vol. 6, No. 1–2, December 2015.

In the nearly a decade since crowdsourcing marketplaces became commonplace, academic researchers and industry users alike have explored various mechanisms for orchestrating crowd work for large-scale data processing.

On the one hand, academic researchers have proposed programming languages, frameworks, systems, and algorithms, and have prototyped creative solutions to crowdsourcing problems. On the other hand, many companies have embraced crowd work as a mechanism for accomplishing what was previously infeasible or inefficient.

However, despite widespread awareness that crowdsourced data management is important, and is an active area of research and practice, there is virtually no documented knowledge of how crowdsourcing is actually leveraged in the real world.

To remedy this gap in our understanding, we conducted a series of interviews with both large-scale industry users of crowdsourcing and crowdsourcing marketplace operators, to identify the status quo and “best-practice” implementations, highlight their chief pain points and concerns, and articulate which areas of development and research have the most potential for impact.

[Image: the companies and teams we spoke to]

Over the next few blog posts, we will describe the results of these interviews, in small, palatable units.

Of Deep, Dark Secrets

A crucial finding from our surveys is that crowdsourcing is an essential ingredient for any company working with large datasets, and many large tech companies are using crowdsourcing at scale. So why do we not hear about this more often in the popular media?

The reason is that companies are sometimes unwilling to talk about how much they use crowdsourcing: either they are ashamed to admit that they rely on crowds instead of sophisticated software or hardware (it’s their “dirty little secret”), or, paradoxically, they consider it to be their “secret sauce.”

Primary Findings from our Industry User Survey: A Teaser

Here are our primary findings from the industry user survey.

Crowdsourcing is super common, and deployments are large-scale.

  • Every company we spoke to said that investments in crowdsourcing have only been scaling up over time.
  • Multiple companies we spoke to reported hundreds of employees using crowdsourcing on a day-to-day basis and hundreds of thousands of tasks per week, with overall spending of several million dollars per year.

Many companies host their own platforms internally, and have several employees running these platforms.

  • This finding was perhaps the most surprising to us. Multiple participants operate their own crowd work platforms; that is, they do not use publicly accessible crowd work platforms like Mechanical Turk, CrowdFlower, Samasource, or UpWork. Instead, they use an intermediary (read: outsourcing) company to hire and host workers, who work exclusively on the internal platform.
  • In this way, these companies get around security and quality concerns by keeping their workforce “in-house.”
  • These participants have several engineers (up to teams of 30) responsible for maintaining these crowd work platforms. The investment in terms of people-power to run these platforms is substantial.
  • So while we as academics have been singularly focused on the public crowd work platforms, we have missed out on all the action that’s happening outside these platforms.

Classification and entity resolution are the most popular uses of crowds.

  • This fact by itself is not surprising, but what was surprising was the sheer diversity of the use cases we encountered over the course of our interviews (more to come on this later). For the unfamiliar, entity resolution is sketched right after this list.
  • There are teams within companies that focus entirely on one type of crowdsourcing. For example, there are teams that only perform categorization, or only perform data extraction.
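For readers outside the database community: entity resolution asks whether two records refer to the same real-world entity, and it is usually posed to crowds as a batch of yes/no questions over record pairs. Here is a minimal, hypothetical Python sketch of how such tasks might be generated; the record contents, field names, and the make_pairwise_tasks helper are ours for illustration, not any interviewee’s actual pipeline.

```python
from itertools import combinations

# Hypothetical product records from two catalogs (contents are ours).
records = [
    {"id": 1, "name": "Apple iPhone 6s 64GB, Space Gray"},
    {"id": 2, "name": "iPhone 6s (64 GB) - space grey"},
    {"id": 3, "name": "Samsung Galaxy S6 64GB"},
]

def make_pairwise_tasks(records):
    """Turn records into yes/no crowd tasks: one question per pair.
    Real deployments prune this quadratic blowup with cheap automated
    similarity heuristics first, since n records yield n*(n-1)/2 pairs,
    and reserve the crowd for the ambiguous remainder."""
    return [
        {
            "question": "Do these describe the same product?",
            "left": a["name"],
            "right": b["name"],
            "pair": (a["id"], b["id"]),
        }
        for a, b in combinations(records, 2)
    ]

for task in make_pairwise_tasks(records):
    print(task["pair"], "->", task["left"], "|", task["right"])
```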

Industry users are rarely aware of, and rarely use, work from academia.

  • Quality management and incentive schemes used by industry are rather primitive; industry users opt for majority vote over Expectation-Maximization, and for fixed payment per task over gamification or leaderboards. (The sketch after this list contrasts the two aggregation schemes.)
  • Most industry users opt for simple workflows as opposed to complex ones, which are popular in academia. Why? More on that in later posts.
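To make the majority vote vs. Expectation-Maximization contrast concrete, here is a minimal Python sketch; the toy responses and both functions are ours for illustration. Note that em_style_vote uses hard label assignments and scalar worker accuracies, a deliberate simplification in the spirit of the Dawid-Skene EM estimator, which properly models per-worker confusion matrices and soft posteriors.

```python
from collections import Counter

# Hypothetical worker responses (task and worker names are ours):
# task_id -> {worker_id: label}.
responses = {
    "t1": {"w1": "cat", "w2": "cat", "w3": "dog"},
    "t2": {"w1": "dog", "w2": "dog", "w3": "dog"},
    "t3": {"w1": "cat", "w2": "dog", "w3": "dog"},
}

def majority_vote(responses):
    """The simple scheme most industry users reported: for each task,
    take the most common label among the workers who answered it."""
    return {task: Counter(labels.values()).most_common(1)[0][0]
            for task, labels in responses.items()}

def em_style_vote(responses, iterations=10):
    """A bare-bones EM-style alternative: alternate between estimating
    each worker's accuracy against the current labels, and re-voting
    with accuracy-weighted ballots."""
    estimates = majority_vote(responses)  # initialize with majority vote
    for _ in range(iterations):
        # "M-step": a worker's accuracy is their agreement rate with
        # the current label estimates.
        agree, total = Counter(), Counter()
        for task, labels in responses.items():
            for worker, label in labels.items():
                total[worker] += 1
                agree[worker] += (label == estimates[task])
        accuracy = {w: agree[w] / total[w] for w in total}
        # "E-step": re-vote, weighting each worker by estimated accuracy.
        for task, labels in responses.items():
            weights = Counter()
            for worker, label in labels.items():
                weights[label] += accuracy[worker]
            estimates[task] = weights.most_common(1)[0][0]
    return estimates

print(majority_vote(responses))   # e.g. {'t1': 'cat', 't2': 'dog', 't3': 'dog'}
print(em_style_vote(responses))
```

The point of the contrast: majority vote weighs every worker equally, while the EM-style loop learns that some workers are more reliable and weighs their votes accordingly. Per our interviews, most industry deployments stop at the first function.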

So What’s Next?

The subsequent blog posts will describe:

  • the different types of crowdsourcing usage patterns that we observed;
  • the use cases and pain points;
  • workflows, quality control, hiring, and incentives (and why industry users rarely use work from academia);
  • and the corresponding interviews from the marketplace operator perspective.

Do let us know if you’d like to hear about some specific aspects that pique your curiosity.
