A truly reproducible scientific paper?

The current situation

One pillar of the scientific method is reproducibility, that is, being able to redo an experiment and get the same result. Sometimes this is very hard to accomplish. For example, if you’re doing field work, you might end up collecting some very rare samples or observing a rare natural event. If you work in the lab, tracking all the variables that go into an experiment and might influence it (e.g. the way a person pipettes or mixes reagents) is similar to following your mom’s recipes… most of the time it doesn’t give you the same amazing dish.

However, when analyzing data on a computer you would expect to have control of all the variables and be able to re-run everything you did at the touch of a button. Right? Well… it’s complicated. There are so many software layers, with intricate dependencies that rely on very specific versions to work, that it’s very hard to keep track of everything and unrealistic to expect that old code will run in a new environment (operating system and software stack).

Virtual machines and containers only partially solve the problem. They allow you to re-run an entire software stack easily on another computer, but that does not mean you know exactly what is going on inside that virtual machine image, how it was created, or how the binaries inside were compiled. In many cases, you’re just ignoring the problem and relying on a black box. Then you start relying on multiple VMs and containers to run ONE experiment, with inter-dependencies between very specific image versions, and you’re back to square one (possibly in a worse situation).

For more details, I recommend checking C. Titus Brown’s recent blog post about this problem, “How I learned to stop worrying and love the coming archivability crisis in scientific software”. For a funny and historical perspective on how we got here, check out Joe Armstrong’s talk “The Mess We’re In”.

Is there a solution?

I’m going to share my current best approach to this problem. It’s far from perfect, so it would be interesting to know if someone has come up with something better. I think it’s a good enough solution for reproducibility and reusability.

Layers that need to be considered for reproducibility. With the right tools, we can achieve good enough reproducibility and high reusability in the layers inside the bold border.

At the programming language level

Currently I use each programming language’s own package manager to keep track of the specific versions of the libraries used. Most of the time the best (i.e. least buggy) tools to install libraries for a programming language are the ones the language itself uses (npm for JS, pip for Python, etc). So relying on another package manager to do that job (i.e. the OS package manager) gives you, in my experience, more trouble than it’s worth. However, sometimes this can have drawbacks for reproducibility (e.g. the left-pad crisis in Node) and we need to be careful.

Whenever possible, we should include the source code of our dependencies in our project and not rely on external sources. In a language like R, versioning and snapshotting the dependencies’ source code might be difficult, but there are solutions like Packrat. In Node.JS you can easily manage your dependencies with a package.json file and npm shrinkwrap. In Python you can do pip freeze > requirements.txt. For both, there are ways to host your own local repositories.
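
To make the Python case a bit more concrete, here is a minimal sketch of what pinning exact versions amounts to. It is not how pip itself implements pip freeze; it only uses the standard library’s importlib.metadata to list what is installed. The function names (installed_pairs, format_pins, write_requirements) are made up for this illustration.

```python
from importlib import metadata


def installed_pairs():
    """Yield (name, version) for each distribution visible to this interpreter."""
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        version = dist.version
        if name and version:  # skip distributions with broken metadata
            yield name, version


def format_pins(pairs):
    """Turn (name, version) pairs into sorted 'name==version' lines."""
    pins = {f"{name}=={version}" for name, version in pairs}
    return sorted(pins, key=str.lower)


def write_requirements(path="requirements.txt"):
    """Write the pinned versions to a file, similar in spirit to pip freeze."""
    lines = format_pins(installed_pairs())
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("\n".join(lines) + "\n")
    return lines
```

Calling write_requirements() inside the environment used for the analysis produces a file that can be committed next to the code, so the exact library versions travel with the project.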

JavaScript dependencies listed in a package.json file, and Python dependencies in a requirements.txt. For R, you can use Packrat.

At the Operating System level

Instead of relying on the OS package manager (e.g. apt for Debian, rpm for Fedora) to keep track of the languages themselves and all the other dependencies (e.g. some languages require specific versions of Fortran, the GCC compiler, etc.), I use the Nix package manager.

Nix is a functional package manager that usually stores packages under the /nix directory. Each package has its own unique subdirectory, such as

/nix/store/b6gvzjyb2pg0kjfwrjmg1vfhh54ad73z-firefox-33.1/

which results from hashing the package’s build dependency graph. Don’t worry, you won’t have to deal directly with these paths to use the tools you install. But this will allow you to have deterministic installs and use multiple versions of the same tools without any conflict. And all the dependencies are isolated inside /nix, so whatever you install will not rely on the host system (unless you have some dependency at the Kernel or Firmware level, but you’ll have the same problem with Docker containers).
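
The idea of deriving a directory name from the build dependency graph can be sketched in a few lines. This is a toy illustration only: it is NOT the hashing scheme Nix actually uses, and the function store_path and the package names and versions below are hypothetical.

```python
import hashlib
import json


def store_path(name, version, inputs=(), store="/nix/store"):
    """Toy content-addressed path (assumption: not Nix's real algorithm).

    The hash covers the package's name, its version and the store paths
    of its build inputs, so changing any dependency changes the path.
    """
    recipe = {"name": name, "version": version, "inputs": sorted(inputs)}
    encoded = json.dumps(recipe, sort_keys=True).encode("utf-8")
    digest = hashlib.sha256(encoded).hexdigest()
    return f"{store}/{digest[:32]}-{name}-{version}"


# Hypothetical example: the same R version built against two compilers.
gcc_old = store_path("gcc", "5.4.0")
gcc_new = store_path("gcc", "6.1.0")
r_with_old = store_path("R", "3.3.1", [gcc_old])
r_with_new = store_path("R", "3.3.1", [gcc_new])

# Same inputs always give the same path; different inputs never collide
# in practice, so both builds can live side by side.
print(r_with_old)
print(r_with_new)
```

Because the path of each input is itself a hash of its own inputs, a change anywhere in the graph propagates up to every package that depends on it. That is what makes it possible to keep several versions of the same tool installed at once.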

Nix works on Linux, Mac, and the Bash environment in the latest Windows 10 Insider Preview.

How to tie it all in a reproducible paper

So for a reproducible environment, use each language’s dependency management system for modules and libraries. Then, use Nix to manage the languages themselves and any other OS-level dependencies. But how do you make the experiment/paper itself reproducible? How do you show the way to a result, a figure or a plot? That’s where things like interactive notebooks come into play (Jupyter/iPython, Beaker, nteract). There are plenty of blog posts out there showing how to use them. Unfortunately, scientific output still relies heavily on static papers (PDF files), not yet on cool interactive self-updating figures. So what currently works best for me is to use LaTeX (sometimes Markdown) to typeset the text and generate the PDF, and KnitR to integrate the R code used to generate the plots in the LaTeX file. KnitR also allows you to embed other languages in your paper files, such as Python or JavaScript.

In the next blog post I will show you a practical example of what this looks like, stay tuned!