Why Data Scientists Pipeline Their Code
As the data science discipline matures, it’s clear that models can’t exist in a vacuum: end-to-end pipelines are critical for productizing and extracting business value from your machine learning initiatives. This post takes a look at what it means to build ML pipelines, why they are becoming a standard for data teams, and how pipelines can be a huge benefit even for teams that are in the beginning stages of their efforts.
What’s a machine learning pipeline?
Pipelines are a way to structure the workflow code that produces a model into modular, reusable pieces. Your typical workflow might involve a few dependent elements:
- Querying and storing data
- Data cleaning, preprocessing, and feature extraction
- Splitting your data
- Training and fitting your model
- Evaluating your trained model
Many data scientists, especially those working in Jupyter notebooks, put all of these steps together in one file, or maybe split them across a few .py or .r files.
In a pipeline, each part of this workflow is split into cleanly separated modules with clearly defined inputs and outputs. Those modules (we call them “tasks” in our system) interact with each other through defined interfaces, can be thoroughly tracked and versioned, and can be executed independently.
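To make that concrete, here’s a minimal Python sketch of that same workflow broken into tasks with explicit inputs and outputs. The function names, the data.csv file, and the “label” column are purely illustrative; this isn’t tied to any particular pipeline framework.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

def get_data(path: str) -> pd.DataFrame:
    """Task: query/load the raw data."""
    return pd.read_csv(path)

def preprocess(raw: pd.DataFrame) -> pd.DataFrame:
    """Task: clean rows before feature work."""
    return raw.dropna().drop_duplicates()

def split(df: pd.DataFrame, target: str):
    """Task: split features/labels into train and test sets."""
    X, y = df.drop(columns=[target]), df[target]
    return train_test_split(X, y, test_size=0.2, random_state=42)

def train(X_train, y_train):
    """Task: fit the model."""
    return RandomForestClassifier(random_state=42).fit(X_train, y_train)

def evaluate(model, X_test, y_test) -> float:
    """Task: score the trained model on held-out data."""
    return accuracy_score(y_test, model.predict(X_test))

# The "pipeline" is just the composition of the tasks; each one can also
# be run, tested, and versioned on its own.
raw = get_data("data.csv")  # assumed file with a "label" column
X_train, X_test, y_train, y_test = split(preprocess(raw), target="label")
print(evaluate(train(X_train, y_train), X_test, y_test))
```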
Designing things modularly can carry some additional overhead in the beginning, but offers significant and interesting benefits for your overall development cycle.
What’s the point of pipelining my model workflow?
Conceptually, pipelining borrows the same principles as object-oriented programming, or even microservices: developing complex processes as interconnected individual units of work drives reusability, reproducibility, scalability, and the ability to iterate quickly. Hold on through the buzzwords, because the concepts are actually pretty intuitive.
Reuse existing modules instead of constantly rewriting code
Every model you develop will require data access, data cleaning, and model evaluation (among other things) — why rewrite this code every time you build a new model? In a pipelined system, these steps can all be defined as separate, invokable tasks that interact with each other like modular functions.
Let’s say we’re building a model that predicts the quality of different wines based on some characteristics (as if anyone can tell the difference in the first place…). The workflow might have a “get wine quality data” task that returns a cleaned dataset of wines with various attributes and quality levels. Every subsequent model built off this dataset can reuse that task: no more copying, pasting, or rewriting whenever you need a new iteration or build a new notebook.
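A hypothetical version of that shared task might look like the sketch below; the winequality.csv file, the module path, and the cleaning steps are assumptions made for illustration.

```python
import pandas as pd

def get_wine_quality_data(path: str = "winequality.csv") -> pd.DataFrame:
    """Shared task: load the wine dataset and apply the standard cleaning."""
    return pd.read_csv(path).dropna().drop_duplicates()

# Every model script or notebook imports the same task instead of
# re-pasting the query-and-clean code, e.g.:
#   from tasks.wine import get_wine_quality_data
#   wines = get_wine_quality_data()
```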
Experiment and track instead of tweak and hope
Once you’re past getting and cleaning data, modeling is about tinkering: changing your variables, featurization, and hyperparameters until you get the right fit. These tweaks happen across different parts of the model workflow, so it can be really hard to keep track of what you’re changing, what works, and why it works (increased your number of network layers? categorized a numeric variable?).
In a pipelined architecture, you can independently track the changes you make to each task, the differences in data flow, and the impact they have on your end results: carefully documented experimentation instead of random tweaking.
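One lightweight way to get that per-task tracking, sketched under the assumption that each task is a plain function: wrap every task in a runner that logs its parameters and runtime. The run_task helper and the experiment_log.jsonl file are hypothetical names, not part of any specific tool.

```python
import json
import time

def run_task(name, fn, params, **inputs):
    """Run one task, recording its parameters and runtime for later comparison."""
    start = time.time()
    output = fn(**params, **inputs)
    record = {"task": name, "params": params,
              "seconds": round(time.time() - start, 2)}
    with open("experiment_log.jsonl", "a") as log:  # assumed log location
        log.write(json.dumps(record) + "\n")
    return output

# Changing a hyperparameter now leaves a comparable trail, e.g.:
# model = run_task("train", train_fn, {"n_estimators": 200},
#                  X_train=X_train, y_train=y_train)
```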
Reproduce and version like production-grade application code
Using set.seed() doesn’t mean your models are reproducible — deploying and using models in production requires the same sophistication and weight we give to production-grade application code. With your model code organized as independent tasks, you can version everything, test properly by swapping out components and validating that the pipeline works as intended, and make debugging a lot more targeted.
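Because each task has defined inputs and outputs, it can be unit-tested in isolation and swapped for a stub while you validate the rest of the pipeline. Here’s a hypothetical pytest-style check for the cleaning task from the earlier sketch (repeated here so the example is self-contained):

```python
import pandas as pd

def preprocess(raw: pd.DataFrame) -> pd.DataFrame:
    """Same cleaning task as in the earlier sketch."""
    return raw.dropna().drop_duplicates()

def test_preprocess_drops_missing_and_duplicate_rows():
    raw = pd.DataFrame({"alcohol": [9.4, None, 9.4], "quality": [5, 6, 5]})
    clean = preprocess(raw)
    assert clean.isna().sum().sum() == 0  # no missing values survive
    assert len(clean) == 1                # NaN and duplicate rows removed
```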
Scale your models to the right infrastructure
As you train your models on larger and larger datasets and build models for more use cases, you’ll need to start thinking about the infrastructure they run on. Building pipelines allows you to allocate each task to the right compute for the job, saving significant time and money (think GPUs for training, CPUs for data querying, etc.). You can also parallelize tasks that can run asynchronously to further cut down on waiting time.
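As a rough sketch of the parallelization idea, the standard-library ProcessPoolExecutor can fan independent featurization tasks out across CPU workers while the expensive training step is routed to GPU hardware separately. The winequality.csv file and the column groups below are assumptions for illustration.

```python
from concurrent.futures import ProcessPoolExecutor

import pandas as pd

def build_features(df: pd.DataFrame, cols: list) -> pd.DataFrame:
    """CPU-bound featurization for one group of columns (illustrative z-scores)."""
    return pd.DataFrame({f"{c}_zscore": (df[c] - df[c].mean()) / df[c].std()
                         for c in cols})

if __name__ == "__main__":
    wines = pd.read_csv("winequality.csv")         # assumed file
    groups = [["alcohol"], ["pH"], ["sulphates"]]  # assumed columns
    # Independent featurization tasks run in parallel on CPU workers...
    with ProcessPoolExecutor(max_workers=3) as pool:
        parts = list(pool.map(build_features, [wines] * len(groups), groups))
    features = pd.concat(parts, axis=1)
    # ...while the training task can be dispatched to a GPU machine by
    # whatever scheduler or orchestrator you use.
```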
Even more critical at the early stages of building a data science practice
Even before data science teams focus on full-scale machine learning systems, pipelines are a critical part of creating the right foundation. OptimalQ has been using pipelines as part of its customer engagement platform for as long as the company has been building models; pipelines help the team avoid rewriting code, isolate problems for debugging, and experiment quickly and efficiently.
OptimalQ builds ML models that predict customer availability to help organizations reach people more effectively. Building these models requires working with many large tables of event data, and featurizing those tables into good predictors involves a long chain of processing steps; without dividing the work into smaller independent tasks, it simply wouldn’t be manageable.
“When you need to deliver a complete product and you’re starting from disorganized data, creating modular tasks is a must. Pipelining allows us to build a repository of reusable components and decouple models from underlying features.” — Tomer Levy, Data Scientist @ OptimalQ
But hold on! Before you get excited and start splitting up your notebook code, it’s worth noting something important: pipelines are not necessary for everyone. If your data is already organized and clean, if you’re exploring a small project, or if you don’t expect your model to end up in a product, the overhead may not be worth it. Spending time on organization is a waste if it doesn’t serve your project goals. But if you’re building models that are going to be used in real-world products, you can save yourself and your team a lot of time and headaches — now and down the road — by thinking through how to modularize.