Let’s face it: deploying, updating, and generally managing/maintaining machine learning models can be a nightmare! Organizations are struggling to find sustainable and repeatable workflows for ML development.
As a data scientist, it’s difficult to familiarize the rest of an engineering organization with your special suite of ML tools/frameworks. This often leads to one of two things:
- Data scientists manually deploying ML applications on their own, outside the production standards of the rest of the organization. This produces ML applications that don’t scale, fail often, and aren’t trusted.
- Data scientists handing off their applications to other engineers for production implementation. This results in code that must be debugged by engineers who didn’t develop the original analysis, breakdowns in the integrity of the analysis, and general confusion.
Towards something better
To get out of this vicious cycle we need a way for data scientists to deploy, update, and scale their ML applications while maintaining flexibility to use the tooling they need. We also need a way for data scientists and non-data scientists to consistently reproduce production ML behavior, both for debugging and for incremental improvements to modeling.
Data scientists should spend their time focused on improving their ML applications. They shouldn’t have to spend crazy amounts of time manually keeping their ML applications up-to-date with constantly changing production data. They also shouldn’t have to waste their days trying to retroactively identify and track interesting past behavior.
ML in Pachyderm — a sustainable solution
Because Pachyderm is language/framework agnostic and because it easily distributes analyses over large data sets, data scientists can use whatever tooling they like for ML. Even if that tooling isn’t familiar to the rest of an engineering organization, data scientists can autonomously develop and deploy scalable solutions via containers. Moreover, Pachyderm’s pipelining logic, paired with data versioning (think “git for data”), allows any results to be exactly reproduced (e.g., for debugging or during the development of improvements to a model).
All of that is great, and a huge step in the right direction. However, we can do even better! We can combine a model training process, persisted models, and a model utilization process (e.g., making predictions or generating results) into a single Pachyderm pipeline DAG (Directed Acyclic Graph). Such a pipeline is easily defined via a simple JSON pipeline specification that references the various pieces of the DAG, and it allows us to:
- Keep a rigorous historical record of exactly what models were used on what data to produce which results.
- Automatically update online ML models when training data or parameterization changes.
- Easily revert to a previous version of an ML model when a new model underperforms or when “bad data” is introduced into a training data set.
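As a rough sketch, the two stages of such a DAG might be declared with pipeline specs along these lines (the pipeline, repo, and image names here are hypothetical, and exact field names vary by Pachyderm version):

```json
{
  "pipeline": { "name": "model" },
  "transform": {
    "image": "my-registry/train:latest",
    "cmd": ["python", "/train.py"]
  },
  "input": {
    "pfs": { "repo": "training", "glob": "/" }
  }
}
```

A second spec for the utilization stage would take a `cross` of the persisted model repo and the input data repo, so every new datum is paired with the current model:

```json
{
  "pipeline": { "name": "predictions" },
  "transform": {
    "image": "my-registry/predict:latest",
    "cmd": ["python", "/predict.py"]
  },
  "input": {
    "cross": [
      { "pfs": { "repo": "model", "glob": "/" } },
      { "pfs": { "repo": "data", "glob": "/*" } }
    ]
  }
}
```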
This sort of sustainable ML pipeline looks like this:
At any time, a data scientist or engineer could update the training dataset utilized by the model to trigger the creation of a newly persisted model in the versioned collection of data (a.k.a. a Pachyderm data “repository”) called “Persisted model.” This training could utilize any language or framework (Spark, TensorFlow, scikit-learn, etc.) and output any format of persisted model (pickle, XML, POJO, etc.).
Regardless of framework, the model is automatically updated and versioned by Pachyderm. This process explicitly tracks what data was flowing into which model AND exactly what data was used to train that model.
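To make the training stage concrete, here is a minimal, hypothetical `train.py` using only the standard library (a least-squares line fit standing in for a real model). The one Pachyderm-specific convention it relies on is real: each input repo is mounted at `/pfs/<repo>` inside the container, and anything written to `/pfs/out` lands in the pipeline’s output repo; the file names `training.csv` and `model.pkl` are assumptions for this sketch.

```python
import csv
import os
import pickle


def train(input_dir, output_dir):
    """Fit a simple least-squares line y = a*x + b from training.csv
    and persist the result as a pickled dict (our stand-in "model")."""
    xs, ys = [], []
    with open(os.path.join(input_dir, "training.csv")) as f:
        for row in csv.reader(f):
            xs.append(float(row[0]))
            ys.append(float(row[1]))

    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x

    # Anything written under output_dir (/pfs/out in Pachyderm) becomes
    # a versioned commit in the "Persisted model" repo.
    with open(os.path.join(output_dir, "model.pkl"), "wb") as f:
        pickle.dump({"slope": slope, "intercept": intercept}, f)
    return slope, intercept


if __name__ == "__main__" and os.path.isdir("/pfs"):
    # Inside a Pachyderm job, the "training" repo is mounted read-only
    # at /pfs/training and results go to /pfs/out.
    train("/pfs/training", "/pfs/out")
```

Whenever a new commit lands in the training repo, Pachyderm re-runs this container and commits the fresh `model.pkl` automatically.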
In addition, when the model is updated, any new input data coming into the “Input data” repository will be processed with the updated model. Old predictions can be re-computed with the updated model, or new models could be backtested on previously input and versioned data. No more manual updates to historical results or worrying about how to swap out ML models in production!
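The utilization stage can be sketched the same way: a hypothetical `predict.py` that loads the current persisted model from one mounted repo and scores whatever arrives in the input data repo. Again, the `/pfs/<repo>` and `/pfs/out` mount points follow Pachyderm’s convention, while the repo and file names are assumptions matching the training sketch above.

```python
import csv
import os
import pickle


def predict(model_dir, data_dir, output_dir):
    """Load the persisted model and write a prediction for each input row."""
    with open(os.path.join(model_dir, "model.pkl"), "rb") as f:
        model = pickle.load(f)

    in_path = os.path.join(data_dir, "input.csv")
    out_path = os.path.join(output_dir, "predictions.csv")
    with open(in_path) as f_in, open(out_path, "w", newline="") as f_out:
        writer = csv.writer(f_out)
        for row in csv.reader(f_in):
            x = float(row[0])
            # y = slope * x + intercept, using whichever model version
            # Pachyderm mounted for this job.
            writer.writerow([x, model["slope"] * x + model["intercept"]])


if __name__ == "__main__" and os.path.isdir("/pfs"):
    # In a cross input, the model repo and data repo are both mounted.
    predict("/pfs/model", "/pfs/data", "/pfs/out")
```

Because the model repo is just another versioned input, rolling back to an earlier model commit re-runs this same container against old or new data with no code changes.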
Get out of the infinite loop of sadness! Deploy and version your models easily with Pachyderm and scale them to production consistently.