A Quick Guide to Organizing [Data Science] Projects (updated for 2018)
Noble’s advice transcends computational biology and is broadly applicable to the field of data science.
One of the most influential journal articles that I read as a PhD student actually had nothing to do with science. The title was A Quick Guide to Organizing Computational Biology Projects, authored by William Stafford Noble in 2009, and it had detailed, practical advice starting from the proper use of version control software in computational science, all the way down to how you arrange your files and folders.
I revisited the paper several times and it shaped my day-to-day workflows for every single project thereafter. It led me not only to some incredibly useful resources (Software Carpentry) but also to project management strategies (e.g., Agile/Scrum) and general software engineering principles.
To seasoned programmers, it may seem surprising that this stuff was new to me back then, but the standard bioinformatics training curricula can often gloss over practical fundamentals such as version control that might be taken for granted by those with a CS background.
Several years and many projects later, my take on this has been refined and shaped by a few technologies that weren’t around back then. And even while the field of data science has changed shape dramatically, Noble’s advice stays true, transcends computational biology, and is broadly applicable to the field of data science. Below I’ll propose my own updates to A Quick Guide To Organizing Computational Biology Projects, enhanced with technologies not available when it was written.
The core guiding principle set forth by Noble is:
Someone unfamiliar with your project should be able to look at your computer files and understand in detail what you did and why.
Noble goes on to explain that that person is probably yourself in six months’ time.
Nearly a decade later, however, new technologies allow us to say that someone unfamiliar with your project should be able to re-run every piece of it and obtain exactly the same result. Three underlying technologies drive this new requirement for perfect reproducibility:
- Virtual Machines (VMs) or Docker containers make it simple to capture complex dependencies and save the exact environment used to execute the code.
- Cloud computing has lowered the barrier to computing power, so having compute-heavy steps is no excuse for non-reproducibility of the entire workflow.
- Jupyter and RMarkdown notebooks mix docs with code and make it easy to share and reproduce interactive analyses.
If these tools are used correctly, any project can be made perfectly reproducible with little effort.
Noble’s second guiding principle is that
Everything you do, you will probably have to do over again.
The only thing I would update here is to remove the word “probably,” as he is being too diplomatic. Reproducibility is not simply a “nice-to-have” for sharing work externally; it is an absolute necessity for the iterative nature of the research process. You need to be able to re-run any analysis six months later if you want to be an efficient, productive computational scientist.
File and directory organization
I’ve started a cookiecutter GitHub repo that I use as a starting point for new projects. If you compare it to Fig. 1 of Noble’s paper, you’ll see it’s mostly the same. Essentially, the rule is that everything you need to produce the downstream results is available within the project folder, without any absolute filepaths or symlink references to other folders in the filesystem. You should be able to copy the folder to any server with access to the internet and still have everything it needs to run.
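One way to enforce the “no absolute filepaths” rule in Python is to build every path from the project root. This is only a minimal sketch: the helper names `project_root` and `project_path`, and the assumption that the calling file sits one folder below the root (e.g., in `src/`), are mine and not part of the cookiecutter template.

```python
from pathlib import Path


def project_root(anchor):
    """Return the project root, assumed to be the parent of the anchor file's folder.

    Hypothetical layout: the anchor file lives at <project>/src/some_module.py.
    """
    return Path(anchor).resolve().parent.parent


def project_path(root, *parts):
    """Build a path inside the project and refuse anything that escapes it."""
    root = Path(root).resolve()
    candidate = root.joinpath(*parts).resolve()
    # Accept the root itself or anything below it; reject '..' escapes.
    if candidate != root and root not in candidate.parents:
        raise ValueError(f"{candidate} is outside the project folder")
    return candidate
```

In a module you would call `project_root(__file__)` once and then ask for `project_path(root, "raw", "samples.csv")`, so the same code runs unchanged wherever the folder is copied.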
One difference is that I like to reserve a separate “raw” folder for data that goes untouched, and use my “data” folder for processed data generated by the project code. As a rule, I like all files in “data” to be immutable and completely derived either from files in raw or external (universal) APIs, using code from the project.
Noble’s advice about avoiding descriptive folder and filenames is spot-on, because you can never fit in a single filename all the metadata you need to describe a file. Instead, he recommends naming folders chronologically and including metadata and descriptions in a README or research notebook. I tried this technique but found it hard to manage for a long-running project. I often wanted to find some analysis or code, and it took too long to cross-reference my notes (which were admittedly not always complete).
My recommendation is to organize data and results folders by versions. As with semantic versioning of code, you only update the major version of your analysis if it’s backwards-incompatible, i.e., if the data are not cross-comparable between runs due to changes in algorithm, data source, etc. I don’t really use minor version numbers, but you could use them to capture specific checkpoints (e.g., freezing for publication).
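As a rough illustration of version-named folders, the sketch below creates and looks up directories such as `results/v2.0`. The `v<major>.<minor>` naming scheme and both function names are assumptions made for the example; use whatever convention suits your project.

```python
from pathlib import Path


def versioned_dir(base, major, minor=0):
    """Return (and create) a folder such as <base>/v2.0 for one analysis version."""
    if major < 0 or minor < 0:
        raise ValueError("version numbers must be non-negative")
    path = Path(base) / f"v{major}.{minor}"
    path.mkdir(parents=True, exist_ok=True)
    return path


def latest_version(base):
    """Return the highest (major, minor) found under base, or None if there is none."""
    versions = []
    for child in Path(base).iterdir():
        if child.is_dir() and child.name.startswith("v"):
            major, _, minor = child.name[1:].partition(".")
            if major.isdigit() and minor.isdigit():
                versions.append((int(major), int(minor)))
    # Compare numerically so that v10.0 sorts after v2.0.
    return max(versions, default=None)
```

Bumping the major number when results stop being comparable then amounts to calling `versioned_dir` with the next integer and pointing the workflow at the new folder.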
The Lab notebook
I think notebooks can serve two essential purposes: acting as a running log for day-to-day computational experiments, and sharing the results of an analysis. The two types of notebooks aren’t necessarily one and the same, and it’s best to keep several notebooks for logging daily computational experiments, while reserving a separate notebook to share the cleaned-up final analysis.
Also, for Python projects it should be noted that the Jupyter environment isn’t as good as language-specific IDEs for development and iterative exploration. You’re probably better off using IPython to develop and then pasting code into the notebook after it’s matured a bit. Also, keep an eye on JupyterLab, which is Jupyter’s version of an IDE, with tabs for notebooks, source code, terminal access, and variables. Looks fantastic but is currently in early beta, as of October 2017.
For R projects, I’ve switched to RMarkdown for this reason. Its integration with the RStudio IDE is excellent, and the plain-text format works better with git than Jupyter’s JSON notebook files.
Carrying out a single experiment
You’ve written some code and done a bit of analysis. Now how do you track everything you did in your computational experiment, so someone else can perfectly reproduce the results? Noble suggests recording everything you do in scripts, documenting and automating as much as possible. I have nothing to add to what is recommended here, but I have strong preferences on how to implement it.
Data science computation comes in two flavors: workflows and analyses. Workflows (also called “pipelines”) are predefined, standardized, generalizable operations that can be run in batch and left to run unsupervised. Analyses are interactive, iterative, heavy on visualization and summary statistics, and unique to a project. Analyses can often mature into batch workflows, but usually any data science project has both aspects.
I recommend a workflow manager for managing and automating everything in a project. Make works well, but I use Luigi, and lots of data scientists have also had success with Snakemake. I’ve written and presented about Luigi before, but the basic idea is that it acts like a Makefile with dependencies, tasks, and targets, but with the flexibility of a Python API. Snakemake offers very similar functionality, but uses its own Make-like domain-specific language (DSL).
Everything you do, from the raw data to the final output (even manual steps, if absolutely necessary), can be encoded into Luigi or Snakemake tasks and targets. Then, you designate a single “wrapper” target that builds all upstream dependencies and acts as the master switch to build the entire analysis from scratch.
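To make the “wrapper target” idea concrete without depending on any workflow library, here is a toy sketch in plain Python. It is not Luigi’s or Snakemake’s API: the `Task` class, `build` function, and the two example steps are hypothetical and only mimic the behavior of building upstream dependencies first and skipping anything whose output already exists.

```python
from pathlib import Path


class Task:
    """Toy task (not a real workflow-manager class)."""

    def __init__(self, output, requires=(), action=None):
        self.output = output            # Path of the target, or None for a wrapper
        self.requires = list(requires)  # upstream tasks
        self.action = action            # callable that writes self.output

    def complete(self):
        if self.output is None:
            # A wrapper has no file of its own: it is done when its inputs are.
            return all(task.complete() for task in self.requires)
        return self.output.exists()


def build(task, built=None):
    """Build upstream dependencies first, then the task itself if it is missing."""
    built = [] if built is None else built
    for upstream in task.requires:
        build(upstream, built)
    if task.action is not None and not task.complete():
        task.action(task)
        built.append(task.output.name)
    return built


def make_clean(task):
    task.output.write_text("a,b\n1,2\n")


def make_summary(task):
    rows = task.requires[0].output.read_text().splitlines()[1:]
    task.output.write_text(f"{len(rows)} rows\n")


def pipeline(workdir):
    """Return the wrapper task that stands for the entire analysis."""
    clean = Task(workdir / "clean.csv", action=make_clean)
    summary = Task(workdir / "summary.txt", requires=[clean], action=make_summary)
    return Task(None, requires=[summary])
```

Calling `build(pipeline(some_dir))` a second time does no work, because every target already exists. That is the behavior a real workflow manager gives you, alongside scheduling, parameters, and failure handling.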
Your notebooks can then be used to record interactive analyses of the output data produced by the workflow, intertwining docs, visualizations, and code that turn the data analysis into a human-readable story. Luigi, being a pure Python library rather than a DSL, enables Tasks to be imported directly into the notebook and used as inputs for the analysis. That way, you don’t have to know the filepath, only the Task class and the parameters describing the instance of that Task. For example, if you have a workflow `my_workflow.py` that has a final task:
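    import luigi

    class MyFinalTask(luigi.Task):
        sample_id = luigi.Parameter()

        def requires(self):
            # AnotherTask is an upstream task defined elsewhere in the workflow
            return AnotherTask(sample_id=self.sample_id)

        def run(self):
            do_stuff(self.sample_id, out=self.output())

        def output(self):
            # luigi.Target is abstract; use a concrete target such as LocalTarget
            # (or S3Target from luigi.contrib.s3 for outputs stored on S3)
            return luigi.LocalTarget('/path/to/output.csv')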
Then you can load the workflow results into Jupyter with:
    import pandas
    from my_workflow import MyFinalTask

    task = MyFinalTask(sample_id="sample_1")
    output = task.output().path

    # Do some analysis with df
    df = pandas.read_csv(output)  # works with S3 or local paths!
Handling and Preventing Errors
Most of the error-handling section in the original text is pretty standard programming best practice. Don’t fail silently, check error codes, and use battle-tested libraries that have already thought of — and tested for — a lot of the input edge cases.
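For the “check error codes” advice, a small sketch of calling an external tool from Python so that a non-zero exit status stops the run instead of passing silently. The function name `run_step` is hypothetical; `subprocess.run` with `check=True` is standard-library behavior.

```python
import subprocess


def run_step(args):
    """Run one external command and raise if it exits with a non-zero code."""
    result = subprocess.run(
        args,
        capture_output=True,  # keep stdout/stderr for logging or debugging
        text=True,            # decode bytes to str
        check=True,           # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout
```

If the command fails, the raised `subprocess.CalledProcessError` carries the `returncode` and captured `stderr`, so the failure can be logged with context rather than discovered later as a missing or half-written file.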
If you use a workflow manager like Luigi you’re probably covered for Noble’s final point about avoiding partial results. Atomicity is the technical term for the “all-or-nothing” strategy he advocates, where outputs should be written to a temporary file and only copied over to the true output once they are confirmed to have completed. This feature is built right into workflow managers, so use them!
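If you ever need the same guarantee outside a workflow manager, the usual pattern looks roughly like this: write to a temporary file in the same directory, then rename it into place. The function name `atomic_write_text` is my own, and the sketch assumes the temporary file and the final output live on the same filesystem, which is what makes the rename atomic on POSIX systems.

```python
import os
import tempfile
from pathlib import Path


def atomic_write_text(path, text):
    """Write text to path so readers see either the old file or the complete new one."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    # Create the temp file next to the target so the final rename stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, prefix=path.name + ".", suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as handle:
            handle.write(text)
            handle.flush()
            os.fsync(handle.fileno())  # push the data to disk before renaming
        os.replace(tmp_name, path)     # swap the finished file into place
    except BaseException:
        # Any failure: remove the partial temp file and leave the target untouched.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
```

A downstream step that checks whether the output exists will therefore never pick up a half-written file.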
Command Lines versus Scripts versus Programs
How do you decide if something should be a shell script, a higher-level language script (usually Python or R), or a library?
Here Noble separates “scripts” into 1) drivers, 2) single-use, 3) project-specific, and 4) multi-project. For (1), your driver script, I recommend you save yourself a lot of trouble by using Luigi, Snakemake, or Make to track dependencies.
I think this separation of scope between the other categories makes sense, but would avoid the use of “scripts” in all cases except for (2). If you are really, really, really going to do something once, then it is ok to write a one-off script. But in almost all cases you should be writing your code as functions to be imported. Where you store those functions (i.e., a local or global module) depends on whether you think the code is (3) project-specific, or (4) multi-project.
I would take this one step further and recommend you use Docker to encapsulate all project-specific code and dependencies in a container image. Write a quick command-line interface to your functions, or stand up a minimal REST API, so they can be accessed through the container. If you’re building with Python, you can do this in minutes with click or argparse (for CLIs), or Flask (for a REST API).
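As a small illustration of “functions first, thin CLI second,” here is a sketch using argparse from the standard library. The `normalize` function and its `--precision` flag are made up for the example; the point is only that the logic stays importable while `main` handles the command line.

```python
import argparse


def normalize(values):
    """Scale a list of numbers so they sum to 1 (hypothetical project function)."""
    total = sum(values)
    if total == 0:
        raise ValueError("cannot normalize values that sum to zero")
    return [value / total for value in values]


def build_parser():
    parser = argparse.ArgumentParser(description="Normalize numbers so they sum to 1.")
    parser.add_argument("values", nargs="+", type=float, help="numbers to normalize")
    parser.add_argument("--precision", type=int, default=3, help="decimal places to print")
    return parser


def main(argv=None):
    """Parse arguments (sys.argv by default), call the function, print the result."""
    args = build_parser().parse_args(argv)
    result = normalize(args.values)
    print(" ".join(f"{value:.{args.precision}f}" for value in result))
    return 0

# In a real module, finish with:
#     if __name__ == "__main__":
#         raise SystemExit(main())
```

The same `normalize` function can be imported into a notebook or a workflow task, and the CLI becomes a natural entry point for running the code inside a container.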
The Value of Version Control
There’s no need to belabor this point further. Even biologists-turned-programmers are starting to catch on to version control, thanks to most scientific software now being available on GitHub and the gradual adoption of VCS by the top academic labs. In data science communities outside of bioinformatics, git is standard practice as well. So check in your code!
Stepping back, I think the major takeaway from this article is that data science is still just code, and should be handled using best practices from software engineering. Data science projects should be versioned with a version-control system (git), built with a build management tool (Make, Snakemake, or Luigi), deployed with a deployment tool (Docker), and shared and documented with a browser-based app (Jupyter or RMarkdown).
Reproducibility in “wet” science is hard, but reproducibility in data science should be easy, if we know how to leverage the right tools.