Engineering Reproducible Data Science Projects

Michael Li · Published in 97 Things · Jun 3, 2019

Just like any scientific field, data science is built on reproducibility. A reproducible project is one where someone else (including future you) can recreate your results by running a simple command. On the one hand, this means that you should check your analysis code into a source control tool like Git. On the other, it also means following DevOps best practices like including dependency lists in machine-readable forms (like requirements.txt for pip or environment.yml for conda). You might go one step further and use a Dockerfile. The commands needed to install and run the analysis should also be included. Finally, make sure that you clearly document what to run in a README.md or, preferably, in a job runner like Make.
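To make this concrete, here is a minimal sketch of what such a Makefile might look like; the file names (requirements.txt, analysis.py) are illustrative assumptions, not prescriptions:

```makefile
# Minimal sketch: one command ("make") installs dependencies and reruns
# the full analysis from scratch. File names are hypothetical.
.PHONY: all install run

all: install run

install:
	pip install -r requirements.txt

run:
	python analysis.py
```

With something like this checked in, "recreate my results" really is a single command: make.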

Another important piece of reproducibility is eliminating something we’ll call algorithmic randomness from your pipeline in order to maintain consistency. If your data are being subsetted from a larger dataset or your analysis depends on an initial random condition (many of your favorite ones do), you’re depending on a random number generator. This can cause the same analysis to yield different results. So make sure your generator is tied to a random seed that’s checked into version control. This ensures that your work can be reproduced, and any variation in your results can be attributed to the code or data, not to chance.
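As a minimal sketch in Python (the seed value and the use of NumPy are illustrative assumptions):

```python
# Minimal sketch: pin every random number generator the pipeline touches
# to a fixed seed that is checked into version control.
import random

import numpy as np

SEED = 42  # arbitrary; what matters is that it is fixed and committed

random.seed(SEED)                  # Python's built-in generator
rng = np.random.default_rng(SEED)  # NumPy generator, passed around explicitly

# e.g., a reproducible 100-row subset of a larger dataset
subset_indices = rng.permutation(10_000)[:100]
```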

If you work in Python, Jupyter Notebooks can combine code, visualizations, and explanations in a single document. In the academic world, Nobel Prize winners have used notebooks to demonstrate the existence of gravitational waves. In industry, companies like Netflix use notebook templates to deliver visualizations to stakeholders. Don’t be afraid to check notebooks into Git (Netflix does it and so do we!). We restart the kernel and rerun all the analysis from scratch before saving the output, which avoids out-of-order execution mistakes and helps guarantee that the same results will appear the next time we rerun it.
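One way to automate that restart-and-rerun step is with nbconvert’s ExecutePreprocessor; the sketch below assumes a notebook named analysis.ipynb and is one possible approach, not the exact workflow described above:

```python
# Minimal sketch: re-execute a notebook top to bottom before committing it,
# so the saved output never reflects out-of-order cell execution.
import nbformat
from nbconvert.preprocessors import ExecutePreprocessor

NOTEBOOK = "analysis.ipynb"  # hypothetical file name

nb = nbformat.read(NOTEBOOK, as_version=4)
ep = ExecutePreprocessor(timeout=600, kernel_name="python3")
ep.preprocess(nb, {"metadata": {"path": "."}})  # runs every cell in order
nbformat.write(nb, NOTEBOOK)
```

The same thing is available from the command line via jupyter nbconvert --to notebook --execute --inplace analysis.ipynb.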

Finally, it’s always smart to begin a data science project with some idea of how it will be put into production. For instance, designing a pipeline that uses the same data format during the research and production phases will prevent bugs and data corruption issues later on. For the same reason, it’s also a good idea to sort out how your research code can be put into production before you start, rather than writing separate code for production later.
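For example, research and production code might share a single loader that enforces one schema; the file format, column names, and dtypes below are hypothetical:

```python
# Minimal sketch: one loader, used by both research notebooks and the
# production pipeline, so both phases see identical columns and dtypes.
import pandas as pd

SCHEMA = {"user_id": "int64", "event_time": "datetime64[ns]", "amount": "float64"}

def load_events(path: str) -> pd.DataFrame:
    df = pd.read_parquet(path)
    missing = set(SCHEMA) - set(df.columns)
    if missing:
        raise ValueError(f"Missing columns: {sorted(missing)}")
    return df[list(SCHEMA)].astype(SCHEMA)
```

Because every phase funnels through the same loader, a format change breaks loudly and in one place instead of silently corrupting results downstream.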

When starting off as a data engineer, it’s tempting to dive into what you see as the “cutting edge” of the field. But, based on our experience, it’s a much better investment to focus on the foundations and make sure you can create reproducible, consistent, and productionizable pipelines that are easily accessible to various stakeholders. Though it may not seem as glamorous at first, it will pay dividends over the life of both your project and your career.

Authors:

Tianhui (Michael) Li

Tianhui Li is Pragmatic Institute’s President of Data Sciences. He founded The Data Incubator in 2014 as a platform for training and placing data scientists.

https://www.oreilly.com/people/76a5b-michael-li

Nicholas Cifuentes-Goodbody

Nicholas Cifuentes-Goodbody is a Data Scientist in Residence at The Data Incubator. Before TDI, he worked at Williams College, Hamad bin Khalifa University (Qatar), and the University of Southern California. He holds an MA and PhD from Yale University.
