Analyzing data is all the rage, but collecting it is the real challenge

NYU Professor Michael Laver on crowdsourcing and reproducing data

With all of the advancements taking place in data science, one aspect that is often overlooked is the actual collection of data. While machine learning, neural networks, and artificial intelligence have done wonders for data analysis, the problem of data collection still troubles data scientists. In the social sciences, data collection has often relied on expert academics to gauge and contribute data. This method works well, but it is not always feasible, because experts in any field are expensive to hire. Tied to the issue of feasibility is the issue of reproducibility: the ability to replicate data collection and analysis. Data scientists often have difficulty presenting their exact protocols in a way that allows other researchers to replicate the process on another problem.

Crowd-sourced data has the potential to make data collection feasible and to help standardize reproducibility. Certain fields within the social sciences don’t require experts, which gives data scientists the opportunity to develop new and innovative data collection methods. Take political science. The average person might not be an expert in the field, but they could probably tell you the policy differences between Bernie Sanders and Donald Trump. This is where crowd-sourced data collection comes into play: instead of hiring experts, data scientists can use the internet to draw from a wide pool of paid workers, who can gauge a political ideology along an axis. The result is a cheaper and more diverse data set that can be replicated more effectively, because the protocol for data collection is easier to present.

Michael Laver, a Center for Data Science faculty member and a Professor of Politics at NYU, recently co-authored a paper titled “Crowd-Sourced Text Analysis: Reproducible and Agile Production of Political Data.” In this paper, he demonstrated the use of crowdsourcing as a tool for data collection, specifically in political science, although the results bear on almost all of the social sciences.

Can you talk about how this project came to be?

Originally, we were looking at how political discourse can be analyzed through language processing. Our baseline interest was in scaling political documents, that is, locating them on a left-right, or liberal-conservative, scale. But over time, we became increasingly aware of the importance of having our own well-tested system for quality control, and the emphasis shifted more and more towards replicability.

Can you talk about the importance of replicability in the field of data science?

Many of the major datasets in the social sciences — however professionally collected — cannot be easily replicated. In other words, it would be difficult for a third-party researcher to use an identical protocol, and collect the same data.

Why is data replication so difficult?

Data replication is difficult for two reasons: the initial costs and nonspecific protocols. Some of these big datasets cost a fortune to assemble, and nobody is going to pay to reassemble them over and over to demonstrate their replicability. Crowd-sourced datasets address both of these issues.

How does crowdsourcing facilitate data replication?

The low cost of crowdsourcing makes it much more feasible, in terms of resources, to collect “the same” data again. To replicate a crowd-sourced dataset, you just need the crowdsourcing code and a small amount of funding.

Are there any disadvantages to using crowd-sourced data collection methods?

The data collection protocols must be very explicit, since they are written to be understood by crowd workers all over the world. Every crowd-sourced data collection task has to be broken down into very simple and easy-to-understand instructions. Some data collection tasks may be too complex for this.

Is there a difference in the way that political data is collected as opposed to other types of data?

In politics, a politician or political actor is represented as a data point along an axis that gauges their relative political standing (left to right, liberal to conservative). So in political science, all of the data points are relative to each other, as opposed to being gauged against a pre-set variable.
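
As an editorial illustration of what placing data points along an axis can look like in practice, here is a minimal Python sketch. It assumes each crowd worker scores a document on a numbered scale from -2 (left) to +2 (right) and that a document's position is simply the mean of its scores. The scale, the function names, and the toy ratings are all hypothetical and are not taken from the paper, which may aggregate differently.

```python
from collections import defaultdict
from statistics import mean


def aggregate_positions(ratings):
    """Average the crowd scores for each document.

    `ratings` is an iterable of (doc_id, worker_id, score) tuples.
    The -2..+2 scale used below is an assumption for illustration.
    """
    scores_by_doc = defaultdict(list)
    for doc_id, _worker_id, score in ratings:
        scores_by_doc[doc_id].append(score)
    return {doc_id: mean(scores) for doc_id, scores in scores_by_doc.items()}


def rank_left_to_right(positions):
    """Order documents from most left to most right.

    Only the relative order matters here: each position is meaningful
    in comparison with the others, not against a fixed benchmark.
    """
    return sorted(positions, key=positions.get)


# Made-up ratings from three hypothetical workers (w1..w3).
toy_ratings = [
    ("doc_a", "w1", -2), ("doc_a", "w2", -1), ("doc_a", "w3", -2),
    ("doc_b", "w1", 1), ("doc_b", "w2", 2), ("doc_b", "w3", 2),
    ("doc_c", "w1", 0), ("doc_c", "w2", -1), ("doc_c", "w3", 1),
]

positions = aggregate_positions(toy_ratings)
order = rank_left_to_right(positions)
print(order)
```

A plain mean is the simplest possible choice; it treats every worker as equally reliable, which a real quality-control system would likely not assume.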

Was there anything that you had to take for granted before proceeding? Did any of your assumptions change as your research went on?

Some of our early advisors recommended that we only work in the English language, as a way of ensuring consistency. But eventually, we decided to challenge this by replicating our methods in five languages: German, Spanish, Italian, Greek and Polish. Since the crowd workers respond on numbered scales, we didn’t have to translate any answers; we only had to make sure that our questions were properly translated. This expanded the pool we could draw from, and worked much better than we expected.

Generally speaking, what were your findings with this project?

We found that labeling political texts with ideological positions can effectively be achieved through crowd-sourced data, with results that closely match those achieved by highly trained experts, and at a much lower price point.
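
One common way to check how closely crowd results match expert results is to correlate the two sets of scores for the same documents. The sketch below computes a Pearson correlation by hand. It is an editorial illustration under stated assumptions, not the paper's evaluation procedure: the function names are hypothetical and the scores are made-up toy values.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


def crowd_expert_agreement(crowd, expert):
    """Correlate crowd and expert positions over the documents both scored."""
    shared = sorted(set(crowd) & set(expert))
    return pearson([crowd[d] for d in shared], [expert[d] for d in shared])


# Made-up positions on a hypothetical -2..+2 left-right scale.
toy_crowd = {"doc_a": -1.7, "doc_b": 1.6, "doc_c": 0.1}
toy_expert = {"doc_a": -1.5, "doc_b": 1.8, "doc_c": 0.0}

agreement = crowd_expert_agreement(toy_crowd, toy_expert)
print(f"crowd-expert correlation: {agreement:.3f}")
```

A value near 1 means the crowd orders and spaces the documents much as the experts do. The same check can be run between two independent crowd runs to gauge replicability.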

Originally published on May 17, 2016.
