One postdoc’s path to reproducibility
“Everything could have been done much faster.” This was my main reflection, just after finishing my Ph.D. Like many scientists, I relied on previously published works and tried to build upon them. If you have ever tried to reuse somebody else’s research, chances are it was a challenge.
I discovered that this “reuse” problem is a global concern, which many call the “Reproducibility Crisis” in science. Realizing the magnitude of the problem was the first step toward a solution, which would later be named Code Ocean.
During my Ph.D., I worked on exploiting airborne and spaceborne multispectral and hyperspectral images for environmental monitoring. This work was a joint effort with leading researchers from the DLR (German Aerospace Center) and from universities and geological societies around Europe, South Africa and Kyrgyzstan.
As we got deeper into the project, the difficulties multiplied. Even when the code was available on GitHub or another repository, we ran into the common roadblocks anyone can face when trying to reuse someone else’s code, including:
- Obtaining the right operating system, programming language, dependencies and their correct versions, before debugging the code.
- Acquiring hardware to run that code, such as GPUs or a significant amount of memory.
- Contacting the researcher or a co-author to ask for missing files or dependency code, which often leads from one person to the next.
- Chasing down broken links by searching online or contacting the original authors.
- Troubleshooting errors you don’t know how to solve by searching for a solution on Stack Exchange or asking favors of colleagues.
- Plus, any of those tasks can lead down a rabbit hole or to a dead end.
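The first roadblock in the list, pinning down the operating system, language and dependency versions, is one that authors can partly defuse by publishing a record of their environment next to the code. Here is a minimal sketch in Python, assuming the original work was itself written in Python. The function name `snapshot_environment` and the layout of the output are my own illustration, not part of any particular tool.

```python
import json
import platform
import sys
from importlib import metadata


def snapshot_environment():
    """Collect the OS, Python version and installed package versions.

    Hypothetical helper for illustration: it only records what is
    installed, it does not recreate the environment.
    """
    packages = {}
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:  # skip distributions whose metadata lacks a name
            packages[name] = dist.version
    return {
        "os": platform.system(),
        "os_release": platform.release(),
        "machine": platform.machine(),
        "python": platform.python_version(),
        "packages": dict(sorted(packages.items())),
    }


if __name__ == "__main__":
    # Print the snapshot as JSON so it can be saved alongside the code.
    json.dump(snapshot_environment(), sys.stdout, indent=2)
```

A file like this does not solve the hardware or missing-data problems, but it tells the next reader exactly which versions the results were produced with.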
For every article, there could be thousands of readers who will try to reverse engineer the experiment. This is an enormous waste of time for some of the brightest minds in the world. So why not make all the material available and executable, so that users can independently reproduce the findings? The peer review process in scientific publishing is meant to enable that upon publication anyway. Without executable material, reproducing the findings just takes much more time, effort and funding.
Making the code and data available is an important step, but with today’s technology we can also eliminate all the “IT” setup and installation. This will allow researchers to invest their energy in building upon existing findings and moving them forward.
While there is a lot of material about reproducibility and reuse, I couldn’t find a tangible solution that I could apply easily and effectively. As part of the 2014 cohort of the Runway Startup Postdoc Program at the Jacobs Technion-Cornell Institute at Cornell Tech in NYC, I spent two years quietly developing a solution to the problem together with a founding team. Code Ocean was born.
For the first time, authors of scientific articles can upload their data and their code, written in any open source language as well as MATLAB and Stata, and link a working computational environment to the code and the associated article.
Researchers and engineers can change parameters, modify the code, upload their own data, run it again, and see how the results change, all without installing anything on a personal computer. Everything runs in the cloud and is easily citable for academic credit with a DOI.
The mission we set when founding Code Ocean was to make the world’s scientific code more open and reproducible. I hope my fellow researchers will find that our work at Code Ocean helps them streamline their research, share it with their peers, link it to their publications and reuse it in new and exciting research.
Originally published at codeocean.com.