Do you trust the crowd? 3 ways to improve crowdsourcing at your company

Pinterest Engineering
Pinterest Engineering Blog
5 min read · May 5, 2017

Jay Wang | Pinterest engineer, human computation

Crowdsourcing, or what we refer to as human computation, plays a large role at Pinterest. From measuring the relevance of search results to detecting spam, our use cases are vast. The premise behind human computation is simple: some tasks are just better suited for humans than computers. Additionally, with the emergence of machine learning, reliable training data has quickly become a must-have. All of this drives a need for quality human computation at a non-trivial scale of 100k+ tasks per day.

To meet this demand, we use a number of external platforms including Mechanical Turk, CrowdFlower, Upwork and Mighty AI. While we’ve had success with these platforms, results weren’t always accurate, which led to imprecise measurements and noisy training data. In this post, we’ll describe practical crowdsourcing techniques we employ to combat low quality ratings and deliver meaningful results.

The challenge

Think of the crowd as three worker types. First are the spammers, low quality workers trying to maximize profit and minimize effort. Next are the triers, workers who answer faithfully but don’t always deliver the best work. Last are the producers, workers who provide quality ratings. Our challenge on the Human Computation team is to develop techniques that remove and ban spammers, mitigate the adverse effects of triers and extract signal from the producers.

Sofia

Last year, we built Sofia, our in-house human computation platform. By integrating with the aforementioned platforms (MTurk, CrowdFlower, Upwork, Mighty AI), Sofia provides an intuitive interface for requesting work while abstracting away the idiosyncratic complexities of the external platforms. Moreover, Sofia affords us complete control of the worker experience, from how we assign work to who we allow on our platform. This control allows us to develop unique solutions to our challenges.

(Note: this post won’t discuss how we built Sofia, but will highlight quality control techniques we use in Sofia.)

Technique 1: task design

We use a simple data model in Sofia.

Figure 1: Sofia data model

The core unit of work in Sofia is a question (captured in red). Tasks (captured in yellow) are composed of multiple questions. An assignment (captured in blue) is a collection of tasks we bundle together and assign to a worker.
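
To make the hierarchy concrete, here’s a minimal sketch of that data model as Python dataclasses. These classes are purely illustrative; they aren’t Sofia’s actual schema, and the field names are assumptions.

```
from dataclasses import dataclass, field
from typing import List


@dataclass
class Question:
    """The core unit of work: one item we want human judgments on."""
    question_id: str
    payload: dict                                        # e.g. {"query": "low carb snacks", "pin_id": "..."}
    answers: List[str] = field(default_factory=list)     # judgments collected so far


@dataclass
class Task:
    """A task groups several questions that share the same instructions."""
    task_id: str
    questions: List[Question] = field(default_factory=list)


@dataclass
class Assignment:
    """A bundle of tasks handed to a single worker in one sitting."""
    assignment_id: str
    worker_id: str
    tasks: List[Task] = field(default_factory=list)
```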

Figure 2: Job creation in Sofia

Above is a screenshot from Sofia’s job creation dialogue box.

The first thing we do is manage the number of tasks per assignment, as worker fatigue is a common issue in crowdsourcing. Rather than impose a hard limit on how many tasks should be in an assignment, we make that parameter configurable (shown in red). Next, we incorporate task randomization (shown in yellow), so when a worker opens an assignment, questions are randomly chosen from the pool of available work. We do this to minimize any potential bias in the ordering of our data. Finally, the most valuable thing we’ve learned in designing human computation jobs is the importance of redundancy: requesting multiple judgments for the same task (shown in blue). Later, we’ll explain how a technique called “majority voting” can be used in conjunction with multiple judgments to mitigate worker error.
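
Putting those three ideas together, here’s a hedged sketch of how an assignment could be built on top of the dataclasses above: a configurable assignment size, random sampling from the pool of available work, and a per-question judgment count for redundancy. The function and parameter names (build_assignment, judgments_per_question and so on) are made up for illustration and aren’t Sofia’s real API.

```
import random
import uuid


def build_assignment(worker_id, question_pool,
                     tasks_per_assignment=10,
                     questions_per_task=5,
                     judgments_per_question=3):
    """Sample questions that still need judgments and bundle them for one worker."""
    # Redundancy: only questions that haven't collected enough judgments are eligible.
    eligible = [q for q in question_pool if len(q.answers) < judgments_per_question]

    # Randomization: shuffle so workers don't see the data in its natural order.
    random.shuffle(eligible)

    # Configurable assignment size to limit worker fatigue.
    chosen = eligible[:tasks_per_assignment * questions_per_task]

    tasks = [Task(task_id=str(uuid.uuid4()),
                  questions=chosen[i:i + questions_per_task])
             for i in range(0, len(chosen), questions_per_task)]

    return Assignment(assignment_id=str(uuid.uuid4()),
                      worker_id=worker_id,
                      tasks=tasks)
```

Capping the number of questions handed out in one sitting keeps sessions short, which is what actually helps with worker fatigue.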

Technique 2: blacklist spammers

The next thing we need is a way to determine an individual worker’s quality. To do this, we use test questions. Test questions (also known as gold questions) are questions you already know the answer to but ask anyway to test the quality of a worker; a rough sketch of how they can be mixed into regular work follows the list below.

A good test question has three features:

  1. It’s indistinguishable from a regular question
  2. It’s nontrivial to answer
  3. An incorrect answer strongly suggests a poor quality worker
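
As a rough sketch (again, not Sofia’s actual code), gold questions might be shuffled into regular work and scored like this; the gold_fraction parameter is a hypothetical knob for how much of an assignment is test material.

```
import random


def mix_in_gold(regular_questions, gold_questions, gold_fraction=0.1):
    """Blend a small fraction of gold questions into regular work, then shuffle."""
    n_gold = max(1, int(len(regular_questions) * gold_fraction))
    mixed = regular_questions + random.sample(gold_questions, n_gold)
    random.shuffle(mixed)  # gold items shouldn't be identifiable by their position
    return mixed


def score_gold(worker_answers, gold_labels):
    """Return (attempted, correct) counts for the gold questions a worker answered."""
    attempted = [qid for qid in worker_answers if qid in gold_labels]
    correct = sum(1 for qid in attempted if worker_answers[qid] == gold_labels[qid])
    return len(attempted), correct
```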

The test question in the example below allows us to identify workers who are able to tell the difference between low carb and low calorie.

Figure 3: gold question

With the ability to ascertain a worker’s quality, we then blacklist workers who fail a certain number of test questions. Additionally, we discard answers from blacklisted workers and request more responses for the discarded work. Because our customers’ quality requirements vary, rather than impose a uniform blacklisting policy, we make the strictness of blacklisting configurable to each customer’s needs.

Figure 4: blacklisting controls

As you can see, our blacklisting policy is rather simple: it takes into account the number of test questions a worker has done as well as his or her accuracy on the test questions. Yet this simple policy gives us confidence that spammers are not polluting our results.
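
A minimal sketch of that kind of policy, with illustrative thresholds rather than Sofia’s actual settings, might look like this:

```
def should_blacklist(gold_attempted, gold_correct,
                     min_gold_questions=5, min_accuracy=0.7):
    """Decide whether a worker's test-question record warrants blacklisting."""
    if gold_attempted < min_gold_questions:
        return False  # not enough evidence yet to judge this worker
    return gold_correct / gold_attempted < min_accuracy
```

Requiring a minimum number of test questions before judging a worker avoids blacklisting someone over one or two unlucky answers.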

Technique 3: majority voting

The final technique we employ is majority voting. Leveraging the redundancy in our task design, we use majority voting to aggregate answers, selecting the answer given by more than half of the judgments for a question. We’ve found majority voting, a crude form of outlier detection, is capable of sifting signal from noise. After all, while blacklisting is effective at banning poor workers, we still need a technique to mitigate the effects of lower quality work that makes it through. As long as we’re confident in the overall quality of our crowd (which we are, thanks in large part to blacklisting), majority voting noticeably improves the quality of our work.
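
Here’s a small sketch of majority-vote aggregation under that definition: an answer counts only if it accounts for more than half of the judgments collected for a question, and anything short of a majority is treated as unresolved.

```
from collections import Counter


def majority_vote(judgments):
    """Return the answer chosen by more than half of the judgments, else None."""
    if not judgments:
        return None
    answer, count = Counter(judgments).most_common(1)[0]
    return answer if count > len(judgments) / 2 else None


# e.g. majority_vote(["relevant", "relevant", "not relevant"]) -> "relevant"
#      majority_vote(["a", "b", "c"]) -> None (no majority; collect more judgments)
```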

Here’s a chart comparing the accuracy of various runs of a simple human computation job with and without majority voting.

Figure 5: effects of majority voting

Figure 5 shows the crowd accuracy for the same human computation job run twelve times. To gauge crowd accuracy, we rated all the data internally before launching the job to the crowd. We found that aggregating worker results using majority voting increased the accuracy of the crowd by an average of 8.4 percent.
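
For context, a comparison like the one in Figure 5 can be computed roughly as follows: measure raw per-judgment accuracy against the internal labels, then accuracy after majority voting. The function names and data shapes here are assumptions, and aggregated_accuracy reuses the majority_vote sketch above.

```
def raw_accuracy(judgments_by_question, internal_labels):
    """Fraction of individual judgments that match our internal label."""
    total = sum(len(js) for js in judgments_by_question.values())
    correct = sum(sum(1 for j in js if j == internal_labels[qid])
                  for qid, js in judgments_by_question.items())
    return correct / total


def aggregated_accuracy(judgments_by_question, internal_labels):
    """Fraction of majority-vote answers that match our internal label."""
    # Reuses majority_vote() from the sketch above; unresolved questions are skipped.
    resolved = {qid: majority_vote(js) for qid, js in judgments_by_question.items()}
    resolved = {qid: ans for qid, ans in resolved.items() if ans is not None}
    return sum(1 for qid, ans in resolved.items()
               if ans == internal_labels[qid]) / len(resolved)
```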

Conclusion

Human computation at Pinterest has come a long way with these simple techniques. Currently, Sofia is used by dozens of our teams for a wide range of human computation tasks. However, we’re just scratching the surface of how good human computation can be. The literature in this space is rich, and we’re just beginning to apply academic learnings to Sofia.

Looking forward, with our wealth of worker quality data, we’re planning to build machine learning models to compute individual worker quality scores. This will unlock more advanced task assignment strategies as well as influence how we aggregate answers. If the space of human computation interests you, please join us!

Acknowledgements: Sofia could not have been built without the following individuals: Chung Eun Kim, Arvin Rezvanpour, Garner Chung, James Rubinstein, Veronica Mapes, and Mohammad Shahangian. Huge thanks and appreciation to them!
