5 Easy Steps to Make Your Data Science Project Reproducible

by Brooke Kennedy

Opex Analytics
The Opex Analytics Blog
6 min read · Aug 21, 2018


We’ve all been there. After all of your pre-processing and modeling work, you finally have your data science project up and running!

A few weeks later, though, you come back to run it again…this time to no avail.

Although good results are obviously desirable, it’s also important that they be reproducible. You want to be able to conclusively show they are valid by allowing anyone to replicate the steps you took to derive them.

It’s actually not as hard as it may seem. Here are five easy steps you can follow while developing your data science project to make your workflow intuitive to others and to minimize the likelihood of errors when your work is reproduced.

1. A Good Project Structure

An organized directory and project structure is the foundation of a reproducible data science workflow. It should be laid out clearly and logically, so that those unfamiliar with the specific problem still know where to look for certain components, making it easy to quickly understand your work.

One available tool for creating such a project structure is DrivenData’s Cookiecutter Data Science, a Python package that makes it easier to design and organize a data science project by generating a customizable project skeleton, along with some other related features. It’s easily installed from the terminal and describes itself as a “logical, reasonably standardized, but flexible project structure for doing and sharing data science work.”

Although geared towards Python, cookiecutter provides a good directory structure that can be used across languages, along with boilerplate code and templates for common tasks within a data science workflow. Of course, a directory structure like this can be modified to meet your individual project needs. The main idea: each folder has a specific purpose that’s clear to anyone looking at the repository for the first time.

Default cookiecutter directory structure.
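For reference, here’s an abridged sketch of the kind of layout the cookiecutter data science template generates (the exact folder and file names may differ slightly depending on the template version you use):

    ├── README.md          <- Top-level description of the project
    ├── Makefile           <- Commands to automate the workflow (see step 5)
    ├── requirements.txt   <- Pinned package dependencies (see step 2)
    ├── data
    │   ├── raw            <- The original, immutable data
    │   ├── interim        <- Intermediate, transformed data
    │   └── processed      <- Final data sets ready for modeling
    ├── docs               <- Sphinx documentation (see step 4)
    ├── models             <- Trained models and model outputs
    ├── notebooks          <- Jupyter notebooks for exploration
    ├── reports            <- Generated analysis and figures
    └── src                <- Source code for loading data, building features, and modeling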

2. Make Use of Virtual Environments

Another important step in reproducibility? Having a record of all of the software requirements and versions needed for the project. One way to achieve this is to use Python’s virtual environments, provided by the virtualenv package.

Virtual environments allow each of your projects to have its own dependencies, regardless of what any other project requires. Each time you activate a virtual environment, it’s as if you’re working in a fresh installation of Python. That makes it easy to export, and later reinstall, the exact packages and versions used in the solution’s development, without worrying about any other nagging Python issues that have accumulated over time.

Once a virtual environment has been activated and any packages have been installed, you can create a requirements.txt file with the command ‘pip freeze > requirements.txt’. This lists all of the requirements in a format that allows others to install them with ‘pip install -r requirements.txt’. Such an easy step can save a lot of future headaches!
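As a rough sketch, the full loop looks something like this in the terminal (here ‘venv’ is just an arbitrary name for the environment’s folder):

    pip install virtualenv             # install the virtualenv package
    virtualenv venv                    # create a new environment in ./venv
    source venv/bin/activate           # activate it (venv\Scripts\activate on Windows)
    pip install pandas scikit-learn    # install whatever the project actually needs
    pip freeze > requirements.txt      # record the exact packages and versions
    pip install -r requirements.txt    # later, or on another machine, reinstall them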

Example requirements.txt file.
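The resulting file is just a plain-text list of pinned packages, one per line; the packages and version numbers below are purely illustrative:

    numpy==1.15.0
    pandas==0.23.4
    scikit-learn==0.19.2
    matplotlib==2.2.3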

3. Follow Best Practices while Coding

It’s also always a good idea to follow coding best practices when developing a data science project.

Two best practices that help ensure project reproducibility:

  1. Following a coding style guide, such as PEP 8 if you are working in Python.
  2. Including logging, which lets you record information about what your program is doing while it executes.

Following a coding style guide allows you to maintain consistency across the project as well as the programming language as a whole. This can define everything from how you should indent your code and utilize whitespace to variable naming conventions. These recommendations enhance readability and understanding, allowing other users to more easily read and build onto the project in a consistent manner.
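As a small, made-up illustration, the two functions below do the same thing, but the second follows PEP 8 conventions for naming, whitespace, and indentation:

    # Harder to read: a terse camelCase name, no spaces, everything crammed on one line
    def CalcAvg(x,y):return (x+y)/2

    # PEP 8 style: a descriptive snake_case name, spaces around operators, 4-space indentation
    def calc_average(x, y):
        return (x + y) / 2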

In addition, logging is critical if you want to diagnose problems or understand what’s happening in your application. The logging module in Python lets you emit messages at specific points in your program’s execution. For example, you can write to a log during a normal run of your code, or only when an error occurs. By implementing logging, you can cut down on the amount of time spent debugging if an error does arise.
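Here’s a minimal sketch of how that might look with Python’s built-in logging module (the file paths and messages are just placeholders):

    import logging

    # Write INFO-level messages and above to a log file; DEBUG messages are suppressed
    logging.basicConfig(
        filename="pipeline.log",
        level=logging.INFO,
        format="%(asctime)s %(levelname)s %(message)s",
    )

    logging.debug("Only useful while debugging, so not written at INFO level")
    logging.info("Starting data preparation")
    try:
        with open("data/raw/sales.csv") as f:    # placeholder input file
            n_lines = sum(1 for _ in f)
        logging.info("Read %d lines of raw data", n_lines)
    except FileNotFoundError:
        logging.error("Raw input file is missing", exc_info=True)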

4. Documentation, Documentation, Documentation

Documentation is crucial to any project that’s going to have multiple people reading, using, or developing it. It can and should cover everything, including how a particular function inside your script works, or how you run the project as a whole. Sphinx is one tool that can easily create interactive documentation.

Sphinx scans all of the code in your project and turns simple docstrings into HTML, LaTeX, and other forms of stylized documentation. Of course, Sphinx also comes equipped with different themes so you can customize the documentation to your liking. Typical guidance includes a description of what a function does, the parameters it takes, and the output it produces. Written guidance allows others to easily understand and make use of your work without forcing them to dive straight into the details of the code.
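For instance, Sphinx’s autodoc extension can pick up a docstring written in reStructuredText field-list style and render it as a formatted page; the function below is a made-up example of the kind of docstring it expects:

    def split_by_date(df, cutoff_date):
        """Split a DataFrame into training and test sets by date.

        :param df: Input data containing a 'date' column.
        :param cutoff_date: Rows on or before this date go into the training set.
        :returns: A (train, test) tuple of DataFrames.
        """
        train = df[df["date"] <= cutoff_date]
        test = df[df["date"] > cutoff_date]
        return train, test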

5. Automation

After you’ve created all of your scripts to successfully build a data science solution, it’s now time to automate the process of getting it up and running. A typical data science workflow will involve some sort of data collection, data preparation, data analysis and presentation of results. It’s important to run these processes in the correct order with the correct dependencies. A makefile is one tool that can be used for both automation and to enforce dependencies.

A makefile works on the principle that a target file only needs to be rebuilt if any of its dependencies are newer than the target itself.

Each rule in a makefile consists of:

  1. A target (an action to carry out, or the name of a file generated by a program)
  2. Dependencies (files used as input to create the target)
  3. System commands or recipes (the terminal commands used to carry out the task)

Makefiles let you easily chain commands together. If one dependency changes, make re-runs everything downstream of it. This is very useful in data science workflows: instead of executing five different scripts by hand, you can chain them all in a makefile to ensure the proper order, and only execute the parts of the workflow that have been modified or have modified dependencies. One command can streamline your development, save you time, and potentially cut down on computing cost.

Makefile structure.
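As a minimal sketch (the scripts and file names here are hypothetical), a makefile for a two-step workflow might look like the following; note that recipe lines must be indented with a tab character:

    # target: dependencies
    #     recipe (the command that builds the target)

    all: reports/results.csv

    data/processed/clean.csv: data/raw/sales.csv src/clean_data.py
        python src/clean_data.py data/raw/sales.csv data/processed/clean.csv

    reports/results.csv: data/processed/clean.csv src/run_model.py
        python src/run_model.py data/processed/clean.csv reports/results.csv

Running ‘make’ builds reports/results.csv. If only the modeling script changes, make re-runs just the modeling step, because the cleaned data file is already up to date with respect to its own dependencies.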

After following these five steps, you should have a more robust data science workflow, ready for anyone to understand your work, reproduce your results, or continue your project’s development.

If you liked this blog post, check out more of our work, follow us on social media (Twitter, LinkedIn, and Facebook), or join us for our free monthly Academy webinars.
