The Problem with Using Social Media Data for Science

Woodrow Hartzog’s Slate post “There is No Such Thing as ‘Public’ Data” reminded me of something I wrote for the defunct TouchVision TV back in November 2015. I’ve reuploaded it here, with some minor edits.

Many of us are the subjects of scientific research online and most of us don’t even know it. And yet, this type of surveillance and data collection for science is quite common. Is being an unwitting research subject just another hazard of social media we need to accept, much like how our tweets run the risk of inclusion in a news article one day?

It shouldn’t have to be.

There was a new paper out last fall, making the media rounds, about the potential for researchers to glean many a thing about the human mind by mining data from social media. One thing missing from the coverage, and the conversation, is that no one has consented to having his or her data analyzed and studied. In fact, scientists have been using data from unknowing participants for years now, much to the chagrin of ethicists.

This particular paper, published on November 11 in the journal Trends in Cognitive Sciences, details how neuroscientists can use your social media postings to figure out things like your personality, emotional state, mental health, and your “level of social conformity,” among other things. Even how fast you scroll through your news feed can be used to determine something about you. The researchers break down five key online behaviors to help in their research, and the scientists themselves seem quite enthusiastic about potential findings.

“Neuroscience research with social media is still in its infancy, and there is great potential for future scientific discovery,” says co-lead author Dar Meshi of Freie Universität in Berlin, in a statement. It is precisely because social media is still in its infancy that we must establish research guidelines or, at the very least, adhere to research guidelines that already exist. And guidelines for this sort of thing do already exist. One that every researcher is aware of is the Nuremberg Code, a set of scientific principles that was created after World War II to address the atrocities carried out by Nazi scientists.

The very first tenet of the Nuremberg Code (and, arguably, its most important part) is the “informed consent” of the individual. It is “absolutely essential” that the research subject is willing, knows he or she is being experimented on, and understands what the research is being used for. It is so important that the Nuremberg Code has eight minimum requirements for what it calls “informed consent.”

Every researcher knows how important informed consent is, with university science departments clearly stating as much and creating their own little manuals detailing research ethics (the University of South Carolina and Ohio State University, for example). Even Berlin’s Freie Universität, where the aforementioned paper was written, has explicit guidelines on informed consent and the most ethical way to go about research. The US Department of Health and Human Services has informed consent guidelines for ethical research, too.

As widespread as research ethics are, the concept of informed consent hasn’t translated well to the digital realm. And this isn’t just hyperbole over one paper; scientists have been disregarding informed consent guidelines for years. At Loyola University’s second annual international digital ethics symposium in 2012, University of Wisconsin researchers Zimmer and Proferes told those in attendance that anyone who posted a tweet between July 2010 and October 2011 was in a data set where their tweets were being analyzed by various researchers. At the time, three years ago, there were already 244 studies under way, of which 194 million users were unknowingly a part.

It’s easy to see why researchers don’t bother to get consent from online subjects: it might affect what the subject says and does. If people knew they were being observed and studied, they might not post as much or as earnestly, which would then skew results. And it’s not like scientists are torturing subjects the way Nazi researchers did during WWII. But as the Foundation for Genomics and Population Health noted, besides being ethical, “a valid consent is beneficial to protect both parties” because “society has become more litigious.” The University of Massachusetts Amherst also pointed out that some data collection could open the subject up to harm.

So what is happening here? Why did all these researchers and scientists stop getting consent from research subjects?

It appears researchers are operating under the assumption that social media is a public space, and that anything posted there is fair game. In many ways, it parallels what is happening with online journalism, where the use of social media postings by journalists has become a contentious debate.

The comparison that gets mentioned a lot regarding use of social media postings is that the Internet, or, in this case, Twitter, is like a public park. But as Anil Dash argues in a Medium post, what is public online is more complicated than that. While it is technically legal to record a private conversation two friends are having in a park, a cafe or even a boutique, and then put it on the Internet, people don’t do this because society views this behavior as unacceptable and inappropriate. We view this as an invasion of privacy, and a disregard for people as human beings. “There are a lot of behaviors that are not entirely illegal that are profoundly destructive to an individual’s life, or to society’s fabric,” writes Dash. “Ultimately, we rely on a set of unspoken social agreements to make it possible to live in public and semi-public spaces.”

One of the main issues with using someone’s words without their consent in news articles is that it opens them up to harassment, whether the article portrays them negatively or positively. The audience for scientific research is inherently limited (sometimes to only a handful of researchers), but that’s not an excuse. There’s a reason why people got angry at Facebook for the psychological experiments it was conducting on its users, and why people are pissed off at the NSA.

No one likes their data being used in a way they don’t agree to, or being collected without their knowledge. Even if the reason for mass data collection is something as lofty as fighting terrorism or trying to understand the human mind, it is still a violation of privacy. People have a right to know what is happening to their data, and if they don’t want to participate in surveys, media articles, or even research for advertising or scientific purposes, they shouldn’t have that decision taken away from them purely due to the public nature of the Internet. Just because the data is there, out in the loosely defined public sphere of social media, doesn’t mean you can take it. At the very least, notify the people whose social media data is being used in some way.

In our quest to understand people, we can’t forget to treat them like human beings.

