How many scientists Open their Data and Open their Source?

Bartosz Paszcza
from science import code
Jan 10, 2017

Research is driven by data and code. The majority of researchers use research software, and more than half of them create their own. Yet less than a third share data and code openly on the Web. This is a wasted opportunity: it lowers the reproducibility of their research and invites duplication, with time and (public) money spent on problems that others have already solved.

The Software Sustainability Institute (SSI, an organisation concerned with scientific code) found in its 2014 study of 417 participants from Russell Group universities that 69% of researchers say their research would be “impractical” without scientific software; only 10% say it would make no difference if there were no software around. At the same time, 56% of respondents declared that they develop their own software (see: It’s impossible to conduct research without software, software.ac.uk).

A recent survey by Bianca Kramer and Jeroen Bosman (Kramer & Bosman 2016) asked researchers which tools they use for specific parts of the scientific endeavour. With more than 15k respondents (from PhD candidate level upwards), it is a valuable source of insight into research workflow practices. It is worth taking a look at what they have already found out.

The survey includes questions on, among other things, literature searching, annotation and notebook sharing. I decided to look at two of them: one on the use of tools for data/text analysis, and one on code/data archiving and sharing. Each multiple-choice question listed seven tools chosen by the survey’s authors and an “other” textbox, where respondents could list additional tools.

Figure: What a question in the survey looked like (source: GitHub and more: sharing data and code, 101innovations.wordpress.com)

How many of the researchers creating data/code share it?

Support for Open Access and Open Science among researchers from the EU (source: Support for Open Science in the EU member states, 101innovations.wordpress.com)

The survey indicated overwhelming support for the Open Science movement, reaching 79% among researchers working in the EU. However, if we look at the response rates for individual questions, it turns out that although a large majority of researchers use tools for data/text analysis, only 20–45% (depending on discipline) indicated that they use tools to share such research outcomes openly.

It is important to note that the data/text analysis tools listed included environments such as Excel, Matlab, SPSS, R, and Python.

Figure 3: Creating code, (not) sharing code: percentage of researchers using data analysis and data/code archiving tools

Researchers in the field of engineering and technology currently lead the polls, but still less than half of those who create data/code in that field share it. The relatively high score in this category may partly result from the inclusion of the computer science community, whose members are probably accustomed to the standards of the Open Source community and hence keener to share their work. Still, the above-mentioned SSI study found that 56% of researchers create their own code, a number significantly higher than the share of those who indicated sharing/archiving it.
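To put a rough number on that comparison, here is a minimal Python sketch using only the figures quoted in this post: the 56% from the SSI study and the 20–45% sharing range from the Kramer & Bosman survey. The two surveys differ in sample and question wording, so the result is an illustration rather than a measurement; the variable and function names are my own.

```python
# Figures quoted in the text. The two surveys differ in sample and wording,
# so this is only a rough, illustrative comparison.
CREATE_OWN_CODE_PCT = 56        # SSI 2014 study: researchers developing their own software
SHARING_RANGE_PCT = (20, 45)    # Kramer & Bosman survey: lowest and highest discipline


def gap_in_percentage_points(creating_pct, sharing_pct):
    """Return how many percentage points separate creating from sharing."""
    return creating_pct - sharing_pct


low_share, high_share = SHARING_RANGE_PCT
smallest_gap = gap_in_percentage_points(CREATE_OWN_CODE_PCT, high_share)
largest_gap = gap_in_percentage_points(CREATE_OWN_CODE_PCT, low_share)

print(f"Gap between creating and sharing: {smallest_gap} to {largest_gap} percentage points")
```

Depending on the discipline, the difference between the two figures is somewhere between 11 and 36 percentage points.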

There is a clear gap of nine percentage points between the top three fields (physical sciences, life sciences, and engineering and technology) and the rest of the seven fields.

Although more than 1/3 of researchers in the physical sciences, life sciences, or engineering and technology archive/share their research data/code, fewer than 1/4 in medicine, social sciences, economics, and arts and humanities do. Why?

It would be interesting to look qualitatively into why researchers in some disciplines are more likely to share data/code than others. Of course, some reasons (e.g. the sensitivity of medical information) may be easily identified, but by investigating the question quantitatively we could uncover the dominant barriers to data/code sharing.

Quantitatively investigating not only disciplinary differences but also the general obstacles to code and data sharing could be fruitful. There may be plenty of them: from black-boxed, tangled “spaghetti code” with no documentation to the fact that 21% of researchers developing their own code have no formal training in software engineering. As Jay, Sanyour and Haines put it in the title of their presentation at the first conference of Research Software Engineers: “Not Everyone Can Use Git”. The tools supporting the Open Source community in computer science are often seen as complicated by researchers for whom writing code is not the main goal.

Steps forward

Jay, Sanyour and Haines went further than qualitatively identifying the issues: they asked Research Software Engineers (software engineers working in academia) what could improve the situation. The majority declared that an easier GUI or automatic change tracking in repositories would help (see “Not Everyone Can Use Git”). Online training and workshops were also welcome.

Importantly, 71% declared that such actions would improve reproducibility of research.

That’s what matters most, but it’s not the only reason. Oxford Computer Consultants was born in order to take software written by scientists, rewrite it and license it commercially. As The Economist put it: “Professors’ unprofessional programs have created a new profession”.

Bibliography:

Kramer, B. & Bosman, J., 2016. Innovations in scholarly communication — global survey on research tool usage. F1000Research, 5, p.692. Available at: http://f1000research.com/articles/5-692#.VxS0UkGluG8.twitter.

Bartosz Paszcza is a PhD student at the Web Science Institute, University of Southampton, and writes for jagielloński24.pl.