A Quick Guide to Organizing [Data Science] Projects (updated for 2018)

Jake Feala
Jun 12, 2016

Noble’s advice transcends computational biology and is broadly applicable to the field of data science.

One of the most influential journal articles that I read as a PhD student actually had nothing to do with science. The title was A Quick Guide to Organizing Computational Biology Projects, authored by William Stafford Noble in 2009, and it had detailed, practical advice starting from the proper use of version control software in computational science, all the way down to how you arrange your files and folders.

Principles

The core guiding principle set forth by Noble is:

Someone unfamiliar with your project should be able to look at your computer files and understand in detail what you did and why.

Noble goes on to explain that that person is probably yourself in six months’ time. Two developments since 2009 make this standard even easier to meet:

  1. Cloud computing has lowered the barrier to computing power, so compute-heavy steps are no excuse for leaving the entire workflow non-reproducible.
  2. Jupyter and RMarkdown notebooks mix documentation with code and make it easy to share and reproduce interactive analyses.

Everything you do, you will probably have to do over again.

The only thing I would update here is to remove the word “probably,” as he is being too diplomatic. Reproducibility is not simply a “nice-to-have” for sharing work externally; it is an absolute necessity for the iterative nature of the research process. You need to be able to re-run any analysis six months later if you want to be an efficient, productive computational scientist.

File and directory organization

I’ve started a cookiecutter GitHub repo that I use as a starting point for new projects. You can compare this to Fig. 1 of Noble’s paper; it’s mostly the same. Essentially, the rule is that everything you need to produce the downstream results is available within the project folder, without any absolute filepaths or symlink references to other folders in the filesystem. You should be able to copy the folder to any server with internet access and still have everything it needs to run.
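
For reference, here is a sketch of the kind of layout this rule produces. The folder names are illustrative, not the exact contents of the cookiecutter template:

    my-project/
        README.md      (what the project is and how to reproduce it)
        data/          (raw and intermediate data, or scripts to fetch it)
        src/           (reusable code imported by scripts and notebooks)
        scripts/       (the runnable steps of the workflow)
        notebooks/     (Jupyter/RMarkdown notebooks for interactive analysis)
        results/       (generated outputs, regenerable from everything above)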

The Lab notebook

The section about lab notebooks reads like a requirements specification for Jupyter or RMarkdown notebooks. Check them out if you haven’t already.

Carrying out a single experiment

You’ve written some code and done a bit of analysis; now how do you track everything you did in your computational experiment so that someone else can perfectly reproduce the results? Noble suggests recording everything you do in scripts, documenting and automating as much as possible. I have nothing to add to what is recommended here, but I have strong preferences on how to implement it.

For example, here is a single workflow step encoded as a Luigi task (AnotherTask and do_stuff stand in for your own upstream task and processing code):

import luigi

class MyFinalTask(luigi.Task):
    sample_id = luigi.Parameter()

    def requires(self):
        return AnotherTask(sample_id=self.sample_id)

    def run(self):
        do_stuff(self.sample_id, out=self.output())

    def output(self):
        # LocalTarget is the concrete Target class for files on disk
        return luigi.LocalTarget('/path/to/output.csv')

Once the pipeline has run, a notebook can pick up the final output for analysis:

import pandas

from myworkflow import MyFinalTask

task = MyFinalTask(sample_id="sample_1")
output = task.output().path

df = pandas.read_csv(output)  # works with S3 or local paths!
# Do some analysis with df
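
Neither snippet shows how the pipeline actually gets executed. As a minimal sketch (assuming the same hypothetical myworkflow module), Luigi’s build entry point runs a task along with everything it requires:

import luigi

from myworkflow import MyFinalTask

# Runs MyFinalTask and all of its upstream dependencies, skipping any
# task whose output() already exists; local_scheduler avoids needing
# a central scheduler daemon for a quick local run
luigi.build([MyFinalTask(sample_id="sample_1")], local_scheduler=True)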

Handling and Preventing Errors

Most of the error-handling section in the original text is pretty standard programming best practice. Don’t fail silently, check error codes, and use battle-tested libraries that have already anticipated, and tested for, a lot of the input edge cases.
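
As a minimal sketch of what “don’t fail silently, check error codes” looks like in Python (the samtools command and filenames are hypothetical):

import subprocess
import sys

# check=True makes failure loud: a non-zero exit code raises
# CalledProcessError instead of letting the workflow continue silently
try:
    subprocess.run(
        ["samtools", "sort", "-o", "sorted.bam", "input.bam"],
        check=True,
    )
except subprocess.CalledProcessError as err:
    sys.exit(f"sort step failed with exit code {err.returncode}")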

Command Lines versus Scripts versus Programs

How do you decide if something should be a shell script, a higher-level language script (usually Python or R), or a library?

The Value of Version Control

There’s no need to belabor this point further. Even biologists-turned-programmers are starting to catch on to version control, thanks to most scientific software now being available on GitHub and the gradual adoption of VCS by the top academic labs. In data science communities outside of bioinformatics, git is standard practice as well. So check in your code!

Conclusion

Stepping back, I think the major takeaway from this article is that data science is still just code, and should be handled using best practices from software engineering. Data science projects should be versioned with a version-control system (git), built with a build management tool (Make, Snakemake, or Luigi), deployed with a deployment tool (Docker), and shared and documented with a browser-based app (Jupyter or RMarkdown).

Outlier Bio blog
Thoughts about the field of bioinformatics, and how to make it better

Written by Jake Feala
Digitizing biology from hypothesis to experiment and back. Currently Head of Computational Biology at a biotech startup (still in stealth).
