How to Build a Successful Data Science Workflow

Rongyao Huang
Published in CBI Engineering · Apr 30, 2020 · 4 min read

Tell us a bit about your technical process for building data science workflows. What are some of your favorite tools or technologies that you’ve used, and why?

It’s difficult to talk about workflows without first discussing what constitutes a data science (DS) project. In my experience, a typical DS project goes through (up to) three phases, each with a different primary focus. Accordingly, an effective workflow should aim to speed up each stage with respect to its goal.

The three stages are exploration, refinement, and productionization, each with its own focus and typical tasks, outlined below.

With that in mind, I'd like to share a few tools and practices I find useful for each DS phase.

Exploration Stage

  • Jupyter Notebook: this is the single best environment for the exploration stage, in my opinion. Its interactive nature and the ability to combine code with documentation and visuals make it efficient for fast prototyping as well as collaboration. Fast.ai has pushed it further by releasing nbdev, a tool that lets you develop a library entirely in Jupyter (including documentation, testing, and CI/CD). I've tried it on a small project and it's great.
  • Virtualenv/Docker: isolating your project environment makes it reproducible and portable anywhere else. A must-have if you're combining local and remote development or collaborating with others.
  • Project Log: we've gotten into the habit of keeping a log for each project and find it helpful for clarifying thoughts, tracking decision paths, and sharing ideas and progress (a minimal sketch of a log helper follows this list). After all, who hasn't had the moment of "why on earth did I do x back then"?
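A project log doesn't need tooling to be useful; ours are just shared documents. As a rough illustration only (the file name and format here are hypothetical, not a prescribed standard), a tiny helper for appending timestamped notes might look like this:

```python
from datetime import datetime
from pathlib import Path

LOG_PATH = Path("PROJECT_LOG.md")  # hypothetical location; any shared file works

def log_entry(note: str) -> None:
    """Append a timestamped note to the project log."""
    stamp = datetime.now().strftime("%Y-%m-%d %H:%M")
    with LOG_PATH.open("a") as f:
        f.write(f"\n## {stamp}\n{note}\n")

log_entry("Tried TF-IDF baseline; precision up, recall down. Next: try embeddings.")
```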

Refinement Stage

  • GUI: unless you're one of those do-everything-in-vim/emacs gurus, a good GUI makes refactoring a whole lot easier. Quick jumps to a declaration, bulk reference lookups and refactoring, and automatic linting and docstring generation are all time savers. Our favorites include PyCharm, Atom, VS Code, and Sublime.
  • Git: no explanation needed ¯\_( ͡❛ ͜ʖ ͡❛)_/¯
  • Data / Model Versioning: if you envision having to iterate over multiple data and model versions, it helps to standardize how they're named, annotated, and saved. Simple solutions work here; I personally adopt the following practices, all automated through a couple of short functions (see the sketch after this list).
  1. Seed any random component
  2. Add a UNIX timestamp to standardized file/folder names
  3. Save metadata alongside data/models
  4. Back up data/models to S3 on save
  • Google Sheets for error analysis and performance tracking: a shared Team Drive gives us a centralized place for project documents and makes collaboration easy.
  • TensorBoard: a must-check for deep learning projects.
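To make the data/model versioning practices above concrete, here's a minimal sketch of the kind of short functions involved. The function names, metadata fields, and bucket name are illustrative assumptions, not an exact implementation:

```python
import json
import random
import time
from pathlib import Path

import boto3  # assumes AWS credentials are already configured
import numpy as np

BUCKET = "my-ds-artifacts"  # hypothetical bucket name

def seed_everything(seed: int = 42) -> None:
    """Seed every random component so runs are repeatable."""
    random.seed(seed)
    np.random.seed(seed)

def save_artifact(obj_bytes: bytes, name: str, metadata: dict, out_dir: str = "artifacts") -> Path:
    """Save with a standardized timestamped name, a metadata sidecar, and an S3 backup."""
    stamp = int(time.time())                      # UNIX timestamp in the folder name
    folder = Path(out_dir) / f"{name}_{stamp}"
    folder.mkdir(parents=True, exist_ok=True)

    artifact_path = folder / f"{name}.bin"
    artifact_path.write_bytes(obj_bytes)

    meta_path = folder / "metadata.json"          # metadata saved alongside the artifact
    meta_path.write_text(json.dumps({**metadata, "timestamp": stamp}, indent=2))

    s3 = boto3.client("s3")                       # back up both files to S3 on save
    for path in (artifact_path, meta_path):
        s3.upload_file(str(path), BUCKET, f"{name}_{stamp}/{path.name}")
    return folder
```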

Productionization Stage

This is probably the most heterogeneous stage, because every company has a different system setup and deployment process. But it's also where we see the most opportunity to standardize.

At CBI, we've developed templates for jobs and services. A DS solution simply implements the interfaces for how data should be fetched, prepared, processed, and saved; the remaining work of configuration, logging, and deployment is abstracted away. We're also continuously adding to a utils library where common functionality such as database reads and writes is standardized.
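As a rough illustration (the actual CBI templates are internal, so the class and method names here are hypothetical), a job template might expose an interface like this, with configuration, logging, and deployment handled by the base class:

```python
from abc import ABC, abstractmethod
from typing import Any, Iterable

class BatchJob(ABC):
    """Hypothetical job template: subclasses only fill in the data science steps."""

    @abstractmethod
    def fetch(self) -> Iterable[Any]:
        """Pull raw records from the source of truth."""

    @abstractmethod
    def prepare(self, records: Iterable[Any]) -> Any:
        """Clean and transform records into model-ready inputs."""

    @abstractmethod
    def process(self, inputs: Any) -> Any:
        """Run the model or business logic."""

    @abstractmethod
    def save(self, results: Any) -> None:
        """Persist results to the standard output store."""

    def run(self) -> None:
        # Configuration, logging, retries, and deployment hooks would live here,
        # so individual DS solutions don't have to reimplement them.
        self.save(self.process(self.prepare(self.fetch())))
```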

What processes or best practices have you found most effective in creating workflows that are reproducible and stable?

There are different approaches to reproducibility. Aside from what's already been mentioned (Git, data/model versioning, a utils library, project logs, and so on), I'd like to talk more about frameworks and templates.

A good example of such a framework is PyTorch Lightning, a lightweight ML wrapper that helps to “organize your PyTorch code to decouple the data science from the engineering” and automate the training pipeline. Anyone who uses the Lightning template is essentially following a workflow that implements standardized interfaces for model configuration, data preparation, training, saving, and so on. This makes a Lightning-style project easy to understand and reproduce.
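For illustration, here's a minimal Lightning-style module; the toy data and linear model are placeholders, not a recommended setup. The point is that anyone reading it knows exactly where to find the model, the optimizer, and the training step:

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl

class LitRegressor(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.net = nn.Linear(10, 1)          # model definition lives in one place

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = nn.functional.mse_loss(self.net(x), y)
        return loss                          # Lightning handles backward, stepping, and logging

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)

# Toy data; in a real project this would come from a DataModule or a DataLoader factory.
dataset = TensorDataset(torch.randn(256, 10), torch.randn(256, 1))
trainer = pl.Trainer(max_epochs=2)
trainer.fit(LitRegressor(), DataLoader(dataset, batch_size=32))
```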

However, having attempted to build a generalizable ML pipeline myself, I also think this should be a cautious investment, especially for small teams. The general trade-off when adopting a framework is flexibility vs. automation, and on top of that there's the learning cost for users and the maintenance cost for developers. A pipeline is therefore more likely to gain adoption when the following conditions hold:

  • It helps with a broad range of tasks
  • It automates a heavy/difficult component in the DS workflow
  • You/your team are committed to maintaining it in the long run

What advice do you have for other data scientists looking to improve how they build their workflows?

  • There’s no one workflow that suits everyone and every project. Identify the primary focus of each project phase and pick tools accordingly.
  • Our field evolves fast, so always be on the lookout for new tools as they emerge.
  • Modularization and tests go a long way (a minimal test sketch is below).
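On that last point, even a tiny test suite pays for itself once a workflow is shared or revisited months later. A minimal sketch with pytest, where the preprocessing function is just a stand-in for whatever you've modularized:

```python
import pytest

def normalize_whitespace(text: str) -> str:
    """Stand-in preprocessing step: collapse runs of whitespace into single spaces."""
    return " ".join(text.split())

@pytest.mark.parametrize("raw,expected", [
    ("a   b\tc\n", "a b c"),
    ("", ""),
    ("  leading and trailing  ", "leading and trailing"),
])
def test_normalize_whitespace(raw, expected):
    assert normalize_whitespace(raw) == expected
```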
