Crowdsourcing for data collection — A very responsible task!

Shubham Agarwal
3 min read · Jan 26, 2020


Photo by Franki Chamaki on Unsplash

Data collection is a serious responsibility: you are defining a task that will shape the direction of many researchers' work. Researchers use a variety of data collection platforms, e.g., survey links, Figure Eight, or Amazon Mechanical Turk (AMT, by far the most common). Most of this blog is written from the viewpoint of AMT and Conversational AI (dialog datasets). Dataset papers dominate major conferences and are, in general, among the most cited papers of a research career. Some of my observations:

  1. IMO, both the quality and quantity of the crowdsourced data depend on the crowdsourcing platform. Some platforms (visdial-amt-chat, ParlAI, CoCoA and many more) integrate with MTurk. The basic principle is to redirect AMT turkers to your own platform/website and eventually pay them through AMT. Most of the dialog datasets that I know of have been collected through AMT. Pairing workers is not THAT big an issue on AMT because turkers are readily available.
    - The Visual Dialog dataset consists of a massive 133k dialogs, all collected using their AMT interface!
    - Facebook's conversational datasets (mostly collected through ParlAI)
    - The CoCoA platform mentioned above was used to collect the MutualFriends dataset. From their paper (He et al., 2017): “We were able to crowdsource around 11K human-human dialogues on Amazon Mechanical Turk (AMT) in less than 15 hours. Each worker was paid $0.35 for a successful dialogue within a 5-minute time.”
  2. Recently, researchers have explored collecting conversational data sequentially instead of pairing crowdworkers, e.g., for the Image-Chat and MultiWOZ datasets. Each turn in the dialog is likely to be authored by a different crowdworker, who is asked to continue the conversation as if they were the same speaker. Even so, both papers claim that the resulting conversations are natural.
  3. Pricing is really, really important for attracting crowdworkers. Relatedly, this paper did a systematic analysis of wages: “mean and median hourly wages of workers on AMT are $3.13/h and $1.77/h respectively.”
  4. Equally important is how we gamify the data collection procedure; many dataset papers mention its importance. If the task is too boring, we can't expect returning turkers. Gamification also helps us capture the phenomena we ideally want. Examples: discussed here and here!
  5. Quality control is really important. Use checks such as a minimum number of words (or turns in a dialog), or checks tailored specifically to the task; see the first sketch after this list. Some researchers also run another round of crowdsourcing to evaluate the quality of the collected data. Example: the GuessWhat?! dataset collection.
  6. We can pre-filter and identify qualified AMT workers using a small set of test questions before they start the actual HIT. Qualification tests are not as flexible as the HITs themselves: the questions and answers must be supplied as properly formatted XML files. The idea is to migrate a few questions from your HIT into a QuestionForm by creating XML files for them (and answer keys, if the questions are to be automatically graded); see the second sketch after this list. Reference: this nice blog. (PS. I will release my code in Python after a paper's acceptance :D)
  7. The reputation of the requester account also plays a significant role in attracting turkers. Use your company's/lab's account rather than your personal one.
  8. If you are using crowdsourcing for evaluation, it is helpful to measure Inter-Annotator Agreement (IAA). Follow this recent 2019 paper, which pleads for assessing the reliability of human evaluation. However, the authors don't say which implementation should be considered the gold standard; I mostly follow this EMNLP submission code (in R). The third sketch after this list shows a simple agreement metric in Python.
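
To make the quality-control checks in point 5 concrete, here is a minimal sketch of post-hoc filtering for a dialog dataset. The record structure (`turns`, `text`) and the thresholds are hypothetical; tune them to your task.

```python
# Minimal sketch: post-hoc quality checks on collected dialogs.
# The dialog structure and thresholds below are hypothetical examples.

MIN_TURNS = 4           # discard dialogs that ended too early
MIN_WORDS_PER_TURN = 3  # discard one-word, low-effort turns

def is_valid_dialog(dialog):
    """Keep a dialog only if it has enough turns and every turn has enough words."""
    turns = dialog["turns"]  # assumed: list of {"speaker": ..., "text": ...}
    if len(turns) < MIN_TURNS:
        return False
    return all(len(t["text"].split()) >= MIN_WORDS_PER_TURN for t in turns)

def filter_dialogs(dialogs):
    kept = [d for d in dialogs if is_valid_dialog(d)]
    print(f"Kept {len(kept)}/{len(dialogs)} dialogs after quality checks")
    return kept
```

In practice you would layer task-specific checks (e.g., copy-paste detection or blacklisted phrases) on top of these generic ones.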
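
For the qualification tests in point 6, here is a rough Boto3 sketch of creating an auto-graded qualification type. The XML file names are placeholders; the QuestionForm and AnswerKey contents must follow the MTurk XML schemas.

```python
import boto3

# Sandbox endpoint for testing; drop endpoint_url to target production.
mturk = boto3.client(
    "mturk",
    region_name="us-east-1",
    endpoint_url="https://mturk-requester-sandbox.us-east-1.amazonaws.com",
)

# Hypothetical file names: a few questions migrated from your HIT,
# formatted as QuestionForm/AnswerKey XML per the MTurk schemas.
with open("qualification_questions.xml") as f:
    questions_xml = f.read()
with open("qualification_answers.xml") as f:
    answer_key_xml = f.read()

response = mturk.create_qualification_type(
    Name="Dialog task screening test",
    Keywords="dialog, screening, qualification",
    Description="Short test workers must pass before taking the main HIT.",
    QualificationTypeStatus="Active",
    Test=questions_xml,          # graded automatically against the answer key
    AnswerKey=answer_key_xml,
    TestDurationInSeconds=300,
)
print("QualificationTypeId:", response["QualificationType"]["QualificationTypeId"])
```

Attach the returned QualificationTypeId to your HIT's QualificationRequirements so only workers who pass the test can accept it.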
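
For point 8, the implementation I follow is in R; as a rough Python illustration, here is Cohen's kappa for two annotators computed from scratch (the labels are made up). For more than two annotators or missing ratings, Krippendorff's alpha is a common alternative.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: agreement between two annotators, corrected for chance."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's marginal label distribution.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(
        (counts_a[label] / n) * (counts_b[label] / n)
        for label in set(labels_a) | set(labels_b)
    )
    return (observed - expected) / (1 - expected)

# Hypothetical example: two annotators rating 8 responses as good/bad.
a = ["good", "good", "bad", "good", "bad", "bad", "good", "bad"]
b = ["good", "bad", "bad", "good", "bad", "good", "good", "bad"]
print(f"Cohen's kappa: {cohens_kappa(a, b):.3f}")  # 0.500
```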

Some code/blog pointers:

  1. Boto3 is the AWS SDK for Python, and MTurk is one of the service endpoints it exposes; see the sketch after this list.
    Reference: https://blog.mturk.com/tutorial-a-beginners-guide-to-crowdsourcing-ml-training-data-with-python-and-mturk-d8df4bdf2977
  2. There is also this repo by Justin Johnson with basic functionality to get accustomed to AMT.
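
As a minimal end-to-end sketch with Boto3, the snippet below checks the sandbox balance and posts a HIT that redirects workers to an external task page. The URL, reward, and timing values are placeholders.

```python
import boto3

mturk = boto3.client(
    "mturk",
    region_name="us-east-1",
    endpoint_url="https://mturk-requester-sandbox.us-east-1.amazonaws.com",
)
print("Sandbox balance:", mturk.get_account_balance()["AvailableBalance"])

# ExternalQuestion sends workers to your own platform/website (see point 1 above).
# The URL below is a placeholder for your task page.
external_question = """
<ExternalQuestion xmlns="http://mechanicalturk.amazonaws.com/AWSMechanicalTurkDataSchemas/2006-07-14/ExternalQuestion.xsd">
  <ExternalURL>https://example.com/my-dialog-task</ExternalURL>
  <FrameHeight>600</FrameHeight>
</ExternalQuestion>
"""

response = mturk.create_hit(
    Title="Chat about an image (5 minutes)",
    Description="Have a short conversation with another worker about an image.",
    Keywords="dialog, chat, conversation",
    Reward="0.35",                       # USD, passed as a string
    MaxAssignments=2,                    # one assignment per paired worker
    LifetimeInSeconds=24 * 3600,         # how long the HIT stays visible
    AssignmentDurationInSeconds=5 * 60,  # time limit per worker
    Question=external_question,
)
print("HIT ID:", response["HIT"]["HITId"])
```

To require the screening test from the earlier sketch, pass the qualification type ID via the QualificationRequirements argument of create_hit.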

Thanks for reading! I hope it was helpful. Let me know your feedback.


Shubham Agarwal

Learning never stops! Opinions my own. Grad Student. Multimodal Conv AI. Homepage: https://shubhamagarwal92.github.io