A truly reproducible scientific paper?

The current situation

One pillar of the scientific method is reproducibility, that is, being able to redo an experiment and get the same result. Sometimes this is very hard to accomplish. For example, if you’re doing field work, you might end up collecting some very rare samples or observing a rare natural event. If you work in the lab, tracking all the variables that go into an experiment and might influence it (e.g. the way a person pipettes or mixes reagents) is similar to following your mom’s recipes… most of the time it doesn’t give you the same amazing dish.

Is there a solution?

I’m going to share my current best approach to this problem. It’s far from perfect, so it would be interesting to know if someone has come up with something better. I think it’s a good enough solution for reproducibility and reusability.

Figure: Layers that need to be considered for reproducibility. With the right tools, we can achieve good enough reproducibility and high reusability in the layers inside the bold border.

At the programming language level

Currently I use each programming language’s own package manager to keep track of the specific versions of the libraries used. Most of the time, the best (i.e. least buggy) tools for installing a language’s libraries are the ones the language itself uses (npm for JS, pip for Python, etc.). So relying on another package manager for that job (i.e. the OS package manager) gives you, in my experience, more trouble than it’s worth. However, this can have drawbacks for reproducibility (e.g. the left-pad crisis in Node, where a package was removed from the npm registry and builds that depended on it broke), so we need to be careful.

JavaScript dependencies are listed in a package.json file, and Python dependencies in a requirements.txt. For R, you can use Packrat.
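Pinning versions only helps if the environment actually matches the pins. Here is a minimal sketch, in Python, of how one might check that. It assumes a requirements.txt that uses plain `name==version` lines (no extras or environment markers), and the function names `parse_pins` and `check_pins` are made up for illustration.

```python
from importlib import metadata


def parse_pins(requirements_text):
    """Return {name: version} for lines of the form 'name==version'."""
    pins = {}
    for raw in requirements_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if "==" not in line:
            continue  # skip blank lines and unpinned entries
        name, version = line.split("==", 1)
        pins[name.strip().lower()] = version.strip()
    return pins


def check_pins(pins, installed_version=metadata.version):
    """Return {name: (expected, found)} for every pin that does not match.

    `installed_version` defaults to importlib.metadata.version; it is a
    parameter only so the check can be tried without installing anything.
    """
    mismatches = {}
    for name, expected in pins.items():
        try:
            found = installed_version(name)
        except metadata.PackageNotFoundError:
            found = None  # package is not installed at all
        if found != expected:
            mismatches[name] = (expected, found)
    return mismatches
```

Calling `check_pins(parse_pins(open("requirements.txt").read()))` in the project directory returns an empty dictionary when every pinned package is installed at the pinned version.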

At the Operating System level

Instead of relying on the OS package manager (e.g. apt for Debian, rpm for Fedora) to keep track of the languages themselves and all the other dependencies (e.g. some languages require specific versions of Fortran, the GCC compiler, etc.), I use the Nix package manager.
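To be clear, the sketch below is not Nix and does not pin anything. It only records which interpreter and compiler versions an environment really has, so that two machines can be compared. The tool list (`gcc`, `gfortran`) and the function names are assumptions for illustration, and it assumes each tool answers to `--version`.

```python
import json
import platform
import shutil
import subprocess


def tool_version(command):
    """Return the first line of `<command> --version`, or None if unavailable."""
    if shutil.which(command) is None:
        return None  # tool is not on the PATH
    result = subprocess.run(
        [command, "--version"], capture_output=True, text=True, check=False
    )
    output = (result.stdout or result.stderr).strip()
    return output.splitlines()[0] if output else None


def environment_snapshot(tools=("gcc", "gfortran")):
    """Collect interpreter, OS and external tool versions in one dictionary."""
    return {
        "python": platform.python_version(),
        "system": platform.system(),
        "machine": platform.machine(),
        "tools": {name: tool_version(name) for name in tools},
    }


if __name__ == "__main__":
    # Print the snapshot as JSON so it can be saved next to the paper.
    print(json.dumps(environment_snapshot(), indent=2))
```

Saving this output alongside the paper gives a reader something to diff against when a result does not come out the same.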


How to tie it all in a reproducible paper

So for a reproducible environment, use each language’s dependency management system for modules and libraries. Then use Nix to manage the languages themselves and any other OS-level dependencies. But how do you make the experiment or paper itself reproducible? How do you show the path to a result, a figure or a plot? That’s where interactive notebooks come into play (Jupyter/IPython, Beaker, nteract). There are plenty of blog posts out there showing how to use them. Unfortunately, scientific output still relies heavily on static papers (PDF files), not yet on cool interactive self-updating figures. So what currently works best for me is to use LaTeX (sometimes Markdown) to typeset the text and generate the PDF, and knitr to integrate the R code that generates the plots into the LaTeX file. It also lets you embed other languages in your paper files, such as Python or JavaScript.
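The underlying idea, that numbers in the paper come from code rather than from copy and paste, can be sketched without knitr. The example below is not how knitr works. It is a simpler stand-in that writes computed values as LaTeX macros to a file the paper can `\input`. The macro names, the `results.tex` file name and the sample data are all hypothetical.

```python
import statistics
from pathlib import Path


def latex_macros(values):
    """Build LaTeX \\newcommand lines for summary statistics of `values`."""
    results = {
        "SampleCount": str(len(values)),
        "SampleMean": f"{statistics.mean(values):.2f}",
        "SampleStdev": f"{statistics.stdev(values):.2f}",
    }
    # One macro per line, e.g. \newcommand{\SampleCount}{4}
    return "\n".join(
        f"\\newcommand{{\\{name}}}{{{text}}}" for name, text in results.items()
    ) + "\n"


def write_macros(values, path="results.tex"):
    """Write the macros to `path` so the paper can \\input{} them."""
    Path(path).write_text(latex_macros(values), encoding="utf-8")
    return path


if __name__ == "__main__":
    measurements = [4.1, 3.9, 4.4, 4.0]  # hypothetical data
    print(latex_macros(measurements), end="")
```

In the LaTeX source, `\input{results.tex}` in the preamble makes `\SampleMean` and the other macros available in the text, so rerunning the script and recompiling updates every number at once.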

Co-CTO at Resurgo Genetics; PhD at QMUL; MozillaScience alumni; Founder of Bionode.io