Scientists’ ugly fights about data sharing

A bucolic campus, not directly implicated below, that shall stand in for the generic Ivory Tower. Apologies to Princeton.

As a sociologist, I am not at all surprised that sharing is hard. The technology for sharing is often difficult. Researchers found that even when astronomers shared their data and software, link rot and other maladies of technological decay left 44% of these good intentions practically meaningless ten years on. The NSF now requires data management plans, which should help, but data hoarding lingers in the culture of some disciplines.

To me, the data sharing fight is a major professional shift in how to do science, provoked by the increasingly computational nature of scientific practice. Data are digital, many methods are code-based, and it is far easier to duplicate and share digital data than, say, specialized cell lines or large, expensive instruments. The problem, then, is that the act of sharing these intermediate products — code and data — runs counter to other tenets in the culture of science. In the fading ownership model, which dovetails nicely with our broader capitalist ethos, data belong to the scientist or team who gathered them. The paradigm gathering steam is an “open science” or research-trust model in which data belong to science and should be stored collectively in a trust.

The fight between the New England Journal of Medicine and most other top researchers and publishers in medicine made it appear that NEJM cares more about protecting scientists’ professional careers by allowing them to hoard their own data than about any of the efficiency or collective benefits of sharing. It is, then, a relief to see Nature’s editorial board call for “the designers of algorithms to make public the source of the data sets they use to train and feed them”. Nature also celebrated changes to the US Department of Health and Human Services policy on disappointing clinical trials: those failures must now be disclosed, even if they don’t help specific scientists’ careers. If only this applied to all research, not just clinical trials.

One major concern about sharing human-generated data is that privacy protections have to be phenomenal. Gary King, a political scientist at Harvard, released a Private data Sharing Interface (PSI), “to enable researchers in the social sciences and other fields to share and explore privacy-sensitive datasets with the strong privacy protections of differential privacy”. While we’re talking about practical tools and policies for data sharing, here is a framework for citing software, software packages, and datasets. These types of tools are often presented as a sociotechnical response to the professionalization gap. If researchers are cited for their software and data the way they receive citations for publications, perhaps their careers can be boosted, not scooped, by sharing their data, software, and algorithms.
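For readers unfamiliar with differential privacy, the core idea can be illustrated with a minimal sketch of the Laplace mechanism — noise calibrated so that no single person’s record measurably changes a released statistic. This is an illustrative toy, not PSI’s actual implementation, and the function name and parameters here are my own:

```python
import math
import random

def private_mean(values, lower, upper, epsilon):
    """Release the mean of `values` with epsilon-differential privacy.

    Each value is clamped to [lower, upper], so one person's record can
    shift the mean by at most (upper - lower) / n -- the sensitivity.
    Adding Laplace noise with scale = sensitivity / epsilon masks any
    individual's contribution; smaller epsilon means stronger privacy
    and noisier answers.
    """
    clamped = [min(max(v, lower), upper) for v in values]
    sensitivity = (upper - lower) / len(clamped)
    scale = sensitivity / epsilon
    # Sample Laplace(0, scale) noise via inverse transform sampling.
    u = random.random() - 0.5
    noise = -scale * math.copysign(1.0, u) * math.log(1 - 2 * abs(u))
    return sum(clamped) / len(clamped) + noise
```

The privacy/utility trade-off is the whole point: with a large epsilon the released mean is nearly exact, while a small epsilon drowns individual contributions in noise.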

The juiciest long read of the week is from statistician Andrew Gelman, who filets psychologist Susan Fiske for writing about “self-appointed data police” who crash careers by publicly exposing errors or catching researchers in games of p-hacking for professional gain.

This fight about sharing will continue as scientific disciplines reconsider who they are for — their own careers do matter but so does the overall timeliness and beneficial impact of science on society — and how the sociotechnical assemblages in their organizations and fields can reconcile the tension that technically easier duplication has introduced.

Like what you read? Give Laura Noren a round of applause.
