Mandatory data archiving is not (always) caring

Through Twitter, Michael Hoffman shared a link to an interesting 2012 paper: 56% of authors who promise to make the data behind academic papers available on request don’t.

This certainly matches my own anecdotal data. But what can reviewers and editors do? This triggered an interesting discussion, with solutions ranging from retraction (if the data are not made available, the paper is removed — an idea similar to an old BMJ editorial likening not sharing data to malpractice), to mandatory data archiving.

Why do we care? Data are costly to produce. When public funding is involved (which is frequently the case), the better solution is to re-use already available data, rather than collecting new data. This can only be achieved if pre-existing data are archived, accessible, and discoverable. Because this is not always the case, we are currently losing data.

Mandatory archiving of data is already the rule in some journals. The British Ecological Society adopted a policy that requires all data to be made public, as a condition for publication. Before the final decision is made, the authors have to show that the data used in the paper have been deposited in a publicly accessible databank. The logic behind this policy is that data will be archived, and therefore preserved for the future. It also improves the ability to reproduce results, and makes science more transparent overall.

It is a policy that works to the advantage of data users (almost all of my research is based on the re-analysis and combination of previously existing data), but can be viewed as a high cost for data producers. Data are costly — the salary, the equipment, the time; getting the maximum bang for the buck is the rational decision.

Mandatory archiving is a stronger mandate than that of the majority of journals, which only require that the data be made available upon request. That weaker requirement is impossible to enforce, especially since these requests are, I suppose, rarely brought to the attention of the editor. I have had experiences with similar requests where the authors introduced so many unreasonable conditions that I went looking for data elsewhere.

But mandatory data archiving is no silver bullet, either, as it might have the effect of turning some people away from journals that enforce it. And we certainly don’t want this, because it allows data to be lost for good (not to mention that it contributes to the fragmentation of a field, by forming cliques in various journals).

So what can reviewers and editors do?

Ask for a data release plan, and review it.

That’s it? Yes.

In most situations, I expect this data release plan will read something like “All data associated with this publication have been deposited under the DOI xxx”. Reviewing this statement would be easy. In a few situations, the statement might be more nuanced: partial data can be released now, and the full dataset can be released (at a specified time) in the future. In a minority of situations, authors can argue that data cannot be released, for security reasons, or because they come from collaborations with industries or private partners that oppose their release.

This would be an easy-to-grant exception for datasets that have a high value, if the authors make a case for it: it’s fine not to share all data as soon as they are produced or published, as long as there is a solid justification. Editors and reviewers, being involved in the field, would be able to evaluate how strong this justification is, and suggest changes if they are needed.

That’s it? Not quite.

Here is the twist. The full dataset would have to be archived. But not made public. Specifically, it could be archived in such a way that allows editors and reviewers to examine it during the review process (along with the appropriate non-disclosure agreement). This will create a back-up copy that the journal can use if there are requests for access to data, whether for falsification, replication, or anything else that the authors agree to. Journals would become, in addition to their mandate to publish the research, curators of the data on which this research is based.

Asking for a data release plan, as part of the submission of the manuscript, makes sense and is easily actionable. Asking reviewers to treat it as any other part of the manuscript is too, and only requires giving them a copy of the journal’s data policy. It would also level the playing field considerably; researchers who are not comfortable sharing their data right now will be given the opportunity to make a case for it, as opposed to just moving to a different journal altogether.

My personal point of view is that we have a lot to gain from sharing data. We can ask a whole lot of cool, new, integrative questions. And my guesstimate is that funding agencies are not going to wait for long before they address the data loss crisis with very prescriptive policies. But the worst that can happen is to alienate the groups that produce data, by making it appear as if these data are going to be taken away at the first occasion. (Academic) research works on a publish-or-perish basis. If we add a “Release your data to publish” clause, then by the transitive property, the situation effectively becomes “Release your data or perish”. And no one wants that.