pakkr (Part I), One Pipeline to Rule Them All

Kerry Chu
Jan 30 · 7 min read
Original Photo Source: https://www.theverge.com/2017/11/13/16644782/the-lord-of-the-rings-amazon-television-show

Co-Author

This blog is a collaborative effort with Eric Pak.

Prequel

This is the story of how we, the data scientists and machine learning engineers at Zendesk, came out of the dark age of machine learning tooling, built our in-house pipelining tool pakkr, and shared it with the world.

The Dark Age: Life Pre-pakkr and How We Came Here

Nowadays, many companies talk about the importance of having data pipelines. A good machine learning pipeline helps data scientists shorten model development time and productionise machine learning products efficiently. Life pre-pipeline is pretty hard for any data science team: the code is not easily maintainable, deployable, scalable, or reusable. Zendesk was no exception.

Back in 2017, when the Zendesk Data Science team was working on an awesome product called Content Cues, we found ourselves lacking the tooling to write cleaner, faster and more production-ready code. Even though pipelining is not a new concept in the software realm, there were few pipelining tools available at the time. Scikit-learn Pipeline and Airflow are two examples, but they were either too heavyweight or lacked the right level of abstraction for our team's needs.

Scikit-learn Pipeline requires each step to be an estimator object implementing a specific interface, namely the transform and fit methods. This paradigm is adequate for sequential transformation/inference, but too restrictive for situations where a step produces multiple outputs and not all of them are consumed by the step directly afterward.
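For reference, a minimal Scikit-learn Pipeline on the Iris dataset looks like this; every step is an estimator, and exactly one transformed object flows from each step to the next:

    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler

    X, y = load_iris(return_X_y=True)

    # Every intermediate step must implement fit/transform; the final step is an
    # estimator. Only the single transformed output of a step reaches the next one.
    clf = Pipeline([
        ("scale", StandardScaler()),
        ("model", LogisticRegression(max_iter=200)),
    ])
    clf.fit(X, y)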

Take for example the following scenario: imagine you are working on a 4-step pipeline that consists of

  1. Load & Transform
  2. Train Model
  3. Use Model
  4. Save Results

And let’s say you have the data available and a config file to configure your pipeline. You first load and transform the data specified in awesome_config.

Then you split the data into train_data and test_data.

Then you notice that passing train_data and input_data_path to the next step of the Scikit-learn pipeline is easy. However, when you try to pass the test_data produced in Load & Transform (step 1) to Use Model (step 3), moving data around gets tricky. The same goes for passing the result_output_path defined in the config at step 1 to Save Results (step 4). Now you see that you are in a very tough situation.

An interconnected pipeline like this simply does not fit the strictly sequential paradigm; a typical ML workflow needs a less restrictive abstraction and a way to set up and manage indirect dependencies.

Airflow addresses those requirements with a Directed Acyclic Graph (DAG), and the situation above can be expressed in DAG form.

Nonetheless, building a DAG is laborious (see the sketch after this list) because

  • each step needs to be defined first
  • every step needs to be instantiated
  • the relationships between the nodes of the graph have to be managed explicitly
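For illustration, here is a rough sketch of what just the first two steps of the example above might look like as an Airflow DAG (the task bodies and the XCom plumbing are only indicative):

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python import PythonOperator

    def load_and_transform(**context):
        train_data, test_data = [1, 2, 3], [4]  # stand-in for real loading and splitting
        # anything a later task needs has to be pushed to XCom explicitly
        context["ti"].xcom_push(key="test_data", value=test_data)
        return train_data

    def train_model(**context):
        train_data = context["ti"].xcom_pull(task_ids="load_and_transform")
        print("training on", train_data)

    with DAG("awesome_pipeline", start_date=datetime(2020, 1, 1), schedule_interval=None) as dag:
        load_task = PythonOperator(task_id="load_and_transform", python_callable=load_and_transform)
        train_task = PythonOperator(task_id="train_model", python_callable=train_model)
        load_task >> train_task  # every relationship between nodes is wired up by hand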

Unsatisfied with the then-existing solutions, on one magical night when everyone else in the team was out enjoying the Melbourne Jazz Music Festival, our Staff Engineer Eric Pak, alone in the Melbourne office, opened his laptop and started writing what later became the legendary pakkr pipeline. (Eric’s awesome blog on ML engineering is currently in the making. Stay Tuned!)

P.S. Eric tends to work during music festivals. Below is a photo of him pouring his thoughts onto a notepad at the Sidney Myer Music Bowl.

The Origin: Design Thinking & Use Cases

pakkr finds a middle ground between Scikit-learn Pipeline and Airflow in terms of complexity. It combines sequential design thinking (so that it stays in line with the typical machine learning project workflow) with a flexible abstraction and indirect dependency setup and management. pakkr achieves this by implementing three main concepts: invocation, dependency management, and configuration.

Here is a glimpse of how pakkr works with Fisher’s Iris dataset before we dive into the technical explanation.
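The snippet below is a rough reconstruction of that example rather than the notebook verbatim; it assumes the Pipeline/returns conventions described in the following sections (positional outputs passed forward, keyword outputs parked in meta), so treat the exact signatures as indicative and see the repo for the authoritative version.

    from pakkr import Pipeline, returns
    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    @returns(object, object, X_test=object, y_test=object)  # loose types, just for the sketch
    def load_and_split():
        X, y = load_iris(return_X_y=True)
        X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
        # positional values go to the next step; the dict goes into meta
        return X_train, y_train, {"X_test": X_test, "y_test": y_test}

    @returns(LogisticRegression)
    def train(X_train, y_train):
        return LogisticRegression(max_iter=200).fit(X_train, y_train)

    def evaluate(model, X_test, y_test):
        # X_test and y_test are pulled back out of meta by name
        return model.score(X_test, y_test)

    iris_pipeline = Pipeline(load_and_split, train, evaluate)
    print(iris_pipeline())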

pakkr example code using Fisher’s Iris dataset

Invocation

The abstraction of pakkr is very generic: pakkr only requires each step to be a Callable. In other words, a step can simply be a function, or an object that implements the __call__ dunder (double-underscore, a.k.a. magic) method. In this way, pakkr can combine simple transformation functions with transformations that need more complex instantiation.

And if we take a step (pun intended) back, since a pipeline is also a Callable, pakkr can use pipeline instances as steps in other pipelines. This not only enables reusability but also encourages modularization.
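As a rough illustration (assuming that, for a step with no returns annotation, its single return value is handed to the next step as that step's argument):

    from pakkr import Pipeline

    # A step can be a plain function...
    def to_celsius(fahrenheit):
        return (fahrenheit - 32) * 5 / 9

    # ...or an instance of a class implementing __call__, handy when a step
    # needs configuration or state.
    class Round:
        def __init__(self, digits):
            self.digits = digits

        def __call__(self, value):
            return round(value, self.digits)

    convert = Pipeline(to_celsius, Round(1))

    # A pipeline is itself a Callable, so it can be used as a step in another pipeline.
    report = Pipeline(convert, "The temperature is {} degrees Celsius".format)
    print(report(100))  # roughly: "The temperature is 37.8 degrees Celsius"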

Dependency management

Think about the example we presented before: without using any library, how would you implement a pipeline that consists of 4 sequential steps? If you are using Python, you might write it like this:
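Probably something along these lines; the step functions here are hypothetical stand-ins with toy bodies, just to make the plumbing visible:

    def load_and_transform(config):
        data = list(range(10))        # pretend this was read from config["input_data_path"]
        return data[:8], data[8:]     # train_data, test_data

    def train_model(train_data):
        return sum(train_data) / len(train_data)   # a "model" that is just the mean

    def use_model(model, test_data):
        return [x - model for x in test_data]

    def save_results(results, result_output_path):
        print(f"writing {results} to {result_output_path}")

    def run_pipeline(awesome_config):
        train_data, test_data = load_and_transform(awesome_config)     # step 1
        model = train_model(train_data)                                # step 2
        results = use_model(model, test_data)                          # step 3: test_data is hand-carried here
        save_results(results, awesome_config["result_output_path"])    # step 4: so is the output path

    run_pipeline({"input_data_path": "data.csv", "result_output_path": "results.csv"})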

Wouldn’t it be great if we could just define this as below?
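Roughly like this, where the pipeline itself takes care of the routing (a sketch reusing the hypothetical step functions above; how each step declares what goes forward and what goes into meta is covered in the next sections):

    from pakkr import Pipeline

    # load_and_transform would declare, via `returns` (explained below), that
    # test_data and result_output_path belong in meta; use_model and
    # save_results would then receive them by name without any manual plumbing.
    pipeline = Pipeline(load_and_transform, train_model, use_model, save_results)
    pipeline({"input_data_path": "data.csv", "result_output_path": "results.csv"})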

What this would imply is that the pipeline needs to track awesome_config and test_data and only provide them to the relevant steps downstream when needed. pakkr does this by introducing the concept of a meta workspace.

Meta

Meta is the metadata workspace mechanism in pakkr, which works much like a key/value store. Once objects pass the type checks, meta holds on to them until they are needed again by downstream steps of the pipeline. You can think of the meta workspace as a form of memoization or caching. In this way, we can avoid using a DAG and keep the pipeline flexible.

Note that meta values should be contained within the lifetime of their pipeline and not be included in the returned values of the pipeline, except for one scenario: downcasting. We will walk you through the concept of downcasting in Part II of the pakkr blog series. Don’t worry about it for now.

Configuration

We have talked about the “what”, but what about the “how”? How does the pipeline know what needs to be passed on to the subsequent step and what needs to be kept inside the meta workspace for steps further downstream? The answer is to specify how the output of a step should be interpreted, using the returns function.

Returns

returns enables us to inject objects into meta and perform type checking of a function’s return values. It can be used either as a decorator or as a regular function. We demonstrate in the example notebook how pakkr uses returns and meta together to interpret returned values.

returns takes types as positional and keyword arguments to verify the return values of the function; this lets us check both the number of returned values and their types. Positional arguments will be passed on to the next step, and keyword arguments will be inserted into meta.
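Here is a minimal sketch of that convention (assuming, as in our other sketches, that meta values are injected into downstream steps by parameter name):

    from pakkr import Pipeline, returns

    @returns(int, greeting=str)   # one positional value for the next step, plus a
    def first_step():             # "greeting" string destined for meta
        return 3, {"greeting": "hello"}

    def double(count):
        return count * 2

    def announce(doubled, greeting):
        # `greeting` skipped the `double` step and comes back out of meta here
        return f"{greeting} {doubled}"

    pipeline = Pipeline(first_step, double, announce)
    print(pipeline())   # expected: "hello 6"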

Command-Line Arguments

On a GPU box or a remote server? No problem! pakkr also parses command-line arguments and passes them to pipelines, so you can execute pipelines from the CLI. If you need help setting up this additional utility, please visit our repo for some examples.

The End

Over the last two years, pakkr has been helping us with data transformation, model development, and validation (including for the models behind Content Cues). It has been a great aid to our day-to-day work, and we hope pakkr can help you as well. Hence, we have open-sourced pakkr on PyPI.

We still have other features and improvements in mind, such as mypy integration, dependency visualization, and automatic garbage collection in meta, to name a few. Hopefully, we will be able to work on those soon. In the meantime, please feel free to raise issues on GitHub if you find any bugs or have any suggestions. With your help, pakkr will become better and be of more help to everyone.

Special Thank You

We would like to take this opportunity to thank our Data Science Tech Lead Soon-Ee Cheah for the extremely awesome title and poetry. We would also like to extend our gratitude to all the team members and colleagues who were so generous with their time and ideas in making this blog a great piece of work. Go Data Science! Go Zendesk!

Next Up

We are currently writing pakkr (Part II), the Many-Faced God, in which we will dive into the technical details and analyze a variety of pakkr use cases.

Thanks to Eric Pak, Arwen Griffioen, Dana Ma, Paul Gradie, and Adel Smee
