Leveling the Scientific Playing Field through Open Research

TU Eindhoven
Dec 27, 2016


Wil van der Aalst | University Professor @ Eindhoven University of Technology

My first paper was published in the proceedings of the European Simulation Multiconference that took place in Nürnberg in June 1990. I can still recall the steps I had to take to get the paper “Modelling Flexible Manufacturing Systems with ExSpect” published. To submit the final version, I had to glue the text and figures onto large paper sheets with the margins and columns marked in non-photographic blue ink. The sheets were sent to me by the conference organizers after acceptance of the paper. The call for papers and the conference program were also sent by ordinary mail. My professor did not have an e-mail account, and scientific communication was mostly based on personal contacts. Recall that this was the period when only a few computers were connected to the internet. In fact, the first web page had yet to be created, and my XT computer did not even have a hard drive.

This is in stark contrast with the situation my PhD students are in today. Millions of scientific papers in our discipline are just a few clicks away in Google. The requirements in our research discipline (computer science) have also changed dramatically. It is no longer enough to present a model and prove some properties through mathematical reasoning. Today, many conferences expect a fully implemented system (a proof of concept) and a range of experiments using real-life data. The new means available to researchers also impose new responsibilities, culminating in the term “open research”.

What is Open Research?
Wernher von Braun once said, “Research is what I’m doing when I don’t know what I’m doing”. This romantic view of research is still appealing, but we can all witness rapid changes in the way we conduct research, publish results, and share artifacts such as data and software. There is a global movement towards making scientific research and related artifacts (data, software, etc.) accessible to everyone. We are gradually moving towards truly “open research”, also referred to as “open science”. Unfortunately, this is not yet a reality. Many journals are still behind a high paywall, journals and conferences do not require software and data to be publicly accessible, and only a small fraction of scientific research can be reproduced by peer reviewers.

Open research has several ingredients

Open research is not limited to open access publications. It also includes open reviewing, open data, and open software. The goal is to create a level playing field for research, no matter where you live, and to make research reliable and easily reproducible.

First issue of the first scientific journal (Henry Oldenburg, Philosophical Transactions, CC BY 4.0)

Open Publication
The first journal in the world exclusively devoted to science was the Philosophical Transactions of the Royal Society. Its first issue was published in London on 6 March 1665 by the Royal Society’s first secretary, Henry Oldenburg. The main purpose of scientific journals was, and still is, to establish the “priority of discovery”, i.e., giving credit to the first person or group to disclose a novel scientific finding. Disclosure is complemented by validation, i.e., other scientists need to assess both the accuracy and the importance of the work. Conference publications play a comparable role, but are linked to meetings where people can challenge each other’s findings. Both types of publications are important, and it is clear that the publication process requires effort from editors, reviewers, and publishers. The printing and distribution of journals and proceedings in paper form was a costly process, which explains the expensive subscription fees. Today, however, researchers mostly access the electronic versions of journals. This raises the question of why publishers should still receive substantial amounts of money for distributing work done by the scientific community (writing and reviewing). Hiding scientific results behind a paywall slows down research, because many researchers have no access to the latest scientific developments. Therefore, many governments and scientists advocate (or even impose) the free availability of scientific publications (“open access”).

Open Reviewing
High-quality reviews are essential for ensuring the quality of scientific research. Critical feedback often generates new ideas on both the author’s and the reviewer’s side. Moreover, incorrect or unclear results should be scrutinized by experts before they are widely distributed. The “publish or perish” culture has unfortunately created a situation where young researchers are encouraged to “write rather than read”. Part of the problem is the absence of space constraints in the world of electronic publications. Anybody can start a new electronic journal, causing information overload and a tsunami of poorly validated scientific results. Highly qualified reviewers are simply outnumbered by less qualified authors spamming journals and conferences. Moreover, review work is hardly visible and insufficiently rewarded in today’s evaluation and promotion processes. A researcher’s curriculum vitae will never reveal that the person avoids peer-review work or delivers superficial reviews. A fully open review process, or completely novel ways of reviewing, is needed to acknowledge the importance of true scientific interaction and to improve the transparency of scientific results.

Open Data
The exponential growth of available data and progress in data science are changing the way we conduct research. In many disciplines we can witness a shift from purely model-driven research to research based on real-life data. Scientific studies in medicine, chemistry, physics, biology, engineering, the social sciences, and the humanities have become much more data-driven. Data often serves as “evidence”, but in several disciplines the sharing of data is still the exception rather than the rule. Some researchers resist sharing their data. This is remarkable, because most scientists are publicly funded, and therefore their data cannot be considered “private property”. (Of course, work on data sets needs to be credited properly.)

Data is also needed to validate research results. Unfortunately, as a recent study in Nature by Monya Baker shows, most of the results described in the literature cannot be reproduced. Based on a survey of 1,576 researchers, the Nature article reveals that 70% of researchers have tried and failed to reproduce another scientist’s experiments, and more than half have failed to reproduce their own experiments. Factors explaining this include the pressure to publish and selective reporting.

It is far from trivial to keep data accessible over a long period. Published papers typically remain available “forever” (assuming a reputable publisher). However, the data used in such papers may only exist on the laptop of a PhD student or on the website of the research group. When projects end or researchers retire, the corresponding data sets often disappear. Fortunately, facilities for sharing data are improving. For example, the 4TU.Centre for Research Data (4TU.ResearchData for short) offers an infrastructure for sharing and safely preserving applied scientific research data. Data sets hosted by 4TU.ResearchData have a DOI (Digital Object Identifier) and are guaranteed to remain available indefinitely. Researchers can click on such a DOI link in a paper and immediately obtain access to the corresponding data.
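As an aside on how such persistent identifiers work: a DOI is resolved through the central doi.org service, which redirects to the data set’s current location. The sketch below (in Python; the DOI shown is a made-up placeholder, not a real identifier) illustrates how a script could follow a DOI link to its landing page, so an analysis never has to depend on a fragile group website.

```python
import urllib.request

# Placeholder DOI for illustration only; substitute a real
# 4TU.ResearchData DOI to run this. The doi.org resolver
# redirects each request to wherever the data set currently lives.
DOI = "10.4121/uuid.example-data-set"  # not a real DOI

def resolve_doi(doi: str) -> str:
    """Follow the doi.org redirects and return the final landing URL."""
    request = urllib.request.Request(f"https://doi.org/{doi}", method="HEAD")
    with urllib.request.urlopen(request) as response:
        return response.geturl()  # URL after all redirects

if __name__ == "__main__":
    print(resolve_doi(DOI))
```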

Open Software
To reproduce scientific experiments, it is not sufficient to have just the data. Often, specific software is needed to process the data. In fact, the software itself may be the main research result. In many research projects, novel software is developed in order to carry out the research. Consider, for example, a machine learning paper presenting a new deep-learning algorithm that is evaluated on several data sets. The paper could not exist without the software and the data; the software and the data, however, can exist without the paper. It is therefore odd that such a paper may be accepted without providing access to the artifacts created and used. The authors may have made a programming error or consciously (or unconsciously) manipulated the results. Therefore, other researchers should be able to reproduce the results with as little effort as possible. Purely theoretical research can be evaluated and replicated based on the paper alone. However, more and more academic work is based on an implemented system and complex experiments that cannot be fully described in an academic paper. Fortunately, it is relatively easy to share software, and more and more research projects develop open source software as an important by-product. On GitHub alone, one can find over 49 million open source projects. Through repositories like GitHub, anyone can inspect reported software artifacts and even modify and improve them. There are, therefore, no good reasons not to share software developed using public funding.
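To make the reproducibility argument concrete, here is a minimal sketch of what a shared experiment script could include; the seed, file name, and hashing step are illustrative assumptions, not a prescription from any specific paper. Fixing the random seed and fingerprinting the input data let another researcher verify that they are rerunning exactly the reported experiment.

```python
import hashlib
import random
import sys

# Illustrative reproducibility preamble for an experiment script.
# The seed and data file name are assumptions for this sketch.
SEED = 42
DATA_FILE = "experiment_data.csv"  # shipped alongside the code

random.seed(SEED)  # deterministic runs: same seed, same results

def fingerprint(path: str) -> str:
    """Hash the input data so reviewers can check they are using
    the exact data set reported in the paper."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

if __name__ == "__main__":
    print(f"python: {sys.version.split()[0]}")
    print(f"seed:   {SEED}")
    print(f"data:   {fingerprint(DATA_FILE)}")
    # ... the actual experiment would run here ...
```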

Timeline towards a level playing field in science
As mentioned, open research builds upon open publication, open reviewing, open data, and open software. Regrettably, I have to be critical about the “openness” of contemporary science. Zooming in on computer science, an area that should be leading in open data and open software, one can witness various kinds of resistance. Opposition to open research can often be linked to varying degrees of “sloppy science”. Many papers report on software systems that only ever existed on the PhD student’s computer. Authors may describe the architecture of a complex system that only partly existed. Functionality suggested in the paper may never have been implemented. For an external party, such results are almost impossible to evaluate without access to the code; the reviewer needs to make guesses based on the reputation of the authors. This is undesirable, because ensuring the reliability and reproducibility of scientific results is one of our main contributions to society. Moreover, most journals are still accessible only to researchers working at universities in well-developed countries. I hope that we will be able to create a “level playing field in research” in the next couple of years. As Bill Gates wrote in his book The Road Ahead (1995): “We always overestimate the change that will occur in the next two years and underestimate the change that will occur in the next ten.” Hence, it may very well be that open research is the norm ten years from now.

Scientific competition
The emergence of scientific journals in the 17th century, and their subsequent growth (today there are over 20,000 journals), can be attributed to the desire to establish the “priority of discovery”. In each scientific discipline, research groups compete and try to be first, fastest, or best. Next to coming up with new problem statements, groups try to crack known problems. If a problem is well-defined, it is even possible to organize competitions; in data science one can find many such competitions. The Kaggle data science competitions (www.kaggle.com) are a nice example. Currently, there are 234 competitions in progress. Typically, a large data set and a well-defined question are posted, and the evaluation criteria are defined upfront (see the sketch after this paragraph). In some competitions, thousands of teams compete, and the prize for the winner may be as much as half a million USD. Kaggle also supports the sharing of data and software (e.g., Python or R scripts). In my own field, the Business Process Intelligence Challenges (BPIC) organized by Boudewijn van Dongen play an important role. The majority of today’s process mining papers use BPIC data sets to evaluate new techniques for process discovery and conformance checking. The fact that groups compete to find better techniques (better in terms of speed and/or quality) has a sanitizing effect on the field. Researchers cannot propose new process mining algorithms without comparing their approach with existing approaches. This stimulates true scientific progress. Young researchers sometimes seem to be more concerned about their publication record than about their actual research achievements. We should not compete for a few slots in arbitrarily chosen journals, but should aim at impact (both in science and in society).
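To illustrate what “evaluation criteria defined upfront” means in such competitions, here is a minimal sketch of a scoring function; RMSE is one common choice of metric, and the numbers below are made up for illustration. Publishing the metric before the competition starts ensures that every team is ranked by the same objective yardstick.

```python
import math

# Minimal sketch of an upfront-defined evaluation criterion, in the
# spirit of competition leaderboards. RMSE (root-mean-square error)
# is a common choice; the example values are invented.
def rmse(predictions: list[float], truth: list[float]) -> float:
    """Root-mean-square error between predictions and ground truth."""
    assert len(predictions) == len(truth)
    squared_errors = [(p - t) ** 2 for p, t in zip(predictions, truth)]
    return math.sqrt(sum(squared_errors) / len(truth))

# Every submission is scored with the same public, fixed criterion.
print(rmse([1.0, 2.5, 3.0], [1.2, 2.0, 3.5]))  # ~0.424
```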

Although the overall effect of competitions is positive, it is important to also reward originality. It would not be good if groups just competed on “known unknowns” and neglected the “unknown unknowns”. Competitions may also lead to tunnel vision and more of the same. Scientific communities should avoid straitjacketing researchers by imposing rules and expectations that kill creativity.

This article aims to stimulate discussion on the way we conduct research. We should embrace advances in technology to open up science by sharing papers, data, and software without any boundaries. As Linus Torvalds (the creator of the Linux operating system) once said: “I often compare open source to science. To where science took this whole notion of developing ideas in the open and improving on other people’s ideas and making it into what science is today and the incredible advances that we have had. And I compare that to witchcraft and alchemy, where openness was something you didn’t do.” Sloppy science, just like witchcraft and alchemy, can be discouraged by open research.

[This article is partly based on the editorial: Wil van der Aalst, Martin Bichler, Armin Heinzl: Open Research in Business and Information Systems Engineering. Business & Information Systems Engineering 58(6): 375–379 (2016).]
