pakkr (Part I), One Pipeline to Rule Them All

Kerry Chu
Jan 30, 2020 · 7 min read


This blog is a collaborative effort with Eric Pak.


This is a story of how we, data scientists and machine learning engineers in Zendesk, came out of the dark age of machine learning tooling, built our in-house tooling pakkr pipeline, and shared it with the world.

The Dark Age: Life Pre-pakkr and How We Came Here

Nowadays, many companies talk about the importance of having data pipelines. A good machine learning pipeline helps data scientists shorten model development time and productionise machine learning products efficiently. Life pre-pipeline is pretty hard for any data science team: the code is not easily maintainable, deployable, scalable, and reusable. Zendesk is no exception.

Back in 2017 when the Zendesk Data Science team was working on an awesome product called Content Cues, we found ourselves lacking the tooling options to write cleaner, faster and more production-ready code. Even though pipelining is not a new concept in the software realm, there were few pipelining tools available at that time. Scikit-learn Pipeline and Airflow are some examples, but they are either too heavy or don’t have the right level of abstraction for the needs of our team.

Scikit-learn Pipeline requires each step to be an estimator object implementing a specific interface, namely the transform and fit methods. This paradigm is adequate for sequential transformation/inference but too restrictive for situations where a step produces multiple outputs and not all of them are consumed by the step directly afterward. Take for example the following diagram.

[Diagram: a four-step pipeline with interconnected data dependencies]

Imagine you are working on a 4-step pipeline that consists of:

1. Load & Transform
2. Train Model
3. Test Model
4. Save Results

And let’s say you have the data available and a config file to configure your pipeline. You first load and transform the data specified in awesome_config.

input_data_path: s3://input-bucket/source-features
result_output_path: my_directory
model_params: {learning_rate: 0.00001, ...}

Then you split the data into train_data and test_data.


Then you notice that passing the train_data and input_data_path to the next step of the Scikit-learn pipeline is easy. However, when you try to pass the test_data produced in Load & Transform (step 1) to Test Model (step 3), moving data around gets tricky. The same goes for passing the result_output_path defined in the config in step 1 to Save Results (step 4). Now you can see that you are in a very tough situation.


The design of an interconnected pipeline simply does not fit the traditional sequential paradigm; it calls for a less restrictive abstraction and indirect dependency setup and management.

Airflow implements a Directed Acyclic Graph (DAG) to address those requirements. The above situation might look like the following in the DAG form.

[Diagram: the example pipeline expressed as a DAG]
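To make the DAG framing concrete, here is a small standard-library sketch (not Airflow code; the step names simply mirror our four-step example) that records the dependencies and orders the steps topologically:

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Each step maps to the set of steps it depends on.
dag = {
    "Train Model": {"Load & Transform"},
    "Test Model": {"Train Model", "Load & Transform"},
    "Save Results": {"Test Model", "Load & Transform"},
}

# A scheduler like Airflow would execute the steps in a topological order.
order = list(TopologicalSorter(dag).static_order())
print(order)
# ['Load & Transform', 'Train Model', 'Test Model', 'Save Results']
```

Note that every edge has to be declared explicitly, which is exactly where the bookkeeping overhead of a DAG-based tool comes from.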

Nonetheless, building and maintaining a DAG for every pipeline is laborious.

Unsatisfied with the then-existing solutions, on one magical night when everyone else in the team was out enjoying the Melbourne Jazz Music Festival, our Staff Engineer Eric Pak, alone in the Melbourne office, opened his laptop and started writing what later became the legendary pakkr pipeline. (Eric’s awesome blog on ML engineering is currently in the making. Stay Tuned!)

P.S. Eric tends to work during music festivals. Below is a photo of him pouring his thoughts onto a notepad at the Sidney Myer Music Bowl.

[Photo: Eric writing on a notepad at the Sidney Myer Music Bowl]

The Origin: Design Thinking & Use Cases

pakkr finds a middle ground between Scikit-learn Pipeline and Airflow Pipeline in terms of complexity. It combines sequential design thinking (so that it is in line with machine learning project workflow) with flexible abstraction and indirect dependency setup and management. pakkr achieves this by implementing three main concepts: invocation, dependency management, and configuration.

Here is a glimpse of how pakkr works with Fisher’s Iris dataset, before we dive into the technical explanation.

[Code: pakkr example using Fisher’s Iris dataset]


The abstraction of pakkr is very generic: pakkr only requires each step to be a Callable. In other words, a step can simply be a function, or an object that implements the __call__ dunder (a.k.a. double-underscore or magic) method. In this way, pakkr can combine simple transformation functions with transformations that need more complex instantiation.

And if we take a step (pun intended) back, since a pipeline is also a Callable, pakkr can use pipeline instances as steps in other pipelines. This not only enables reusability but also encourages modularization.
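To make the Callable requirement concrete, here is a plain-Python sketch (pakkr itself is not needed for the illustration): a function and an object implementing __call__ can both serve as steps and be chained uniformly.

```python
def normalise(x):
    """A step can be a plain function."""
    return [v / max(x) for v in x]

class Scaler:
    """A step can also be an object implementing __call__,
    useful when the transformation needs instantiation or state."""
    def __init__(self, factor):
        self.factor = factor

    def __call__(self, x):
        return [v * self.factor for v in x]

# Both are Callables, so a pipeline can treat them uniformly;
# chaining them is just sequential invocation.
steps = [normalise, Scaler(10)]
data = [1.0, 2.0, 4.0]
for step in steps:
    data = step(data)

print(data)  # [2.5, 5.0, 10.0]
```

Since a pipeline is itself a Callable, the same loop could treat a whole pipeline as one of the steps, which is what makes nesting possible.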

Dependency management

Think about the example we presented before: without using any library, how would you implement a pipeline that consists of 4 sequential steps? If you are using Python, you might write it like this:

def pipeline():
    train_data, test_data, awesome_config = load_and_transform()
    model = train_model(train_data, awesome_config)
    results = test_model(model, test_data)
    save_results(results, awesome_config)

Wouldn’t it be great if we could just define this as below?


What this would imply is that the pipeline needs to track awesome_config and test_data and only provide them to the relevant steps downstream when needed. pakkr does this by introducing the concept of a meta workspace.


Meta is the metadata workspace mechanism in pakkr, similar to a key/value store. Once objects pass the type checks, meta holds them until they are needed by downstream steps of the pipeline (illustrated in the diagrams below). You may think of the meta workspace as a form of memoization or caching. In this way, we can avoid using a DAG and keep the pipeline flexible.

[Diagrams: the meta workspace holding values from earlier steps and supplying them to downstream steps]

Note that meta values should be contained within the lifetime of their pipeline and not be included in the returned values of the pipeline, except for one scenario: downcasting. We will walk you through the concept of downcasting in Part II of the pakkr blog series. Don’t worry about it for now.


We have talked about the “what”, but what about the “how”? How does the pipeline know what needs to be passed on to the subsequent step and what needs to be kept inside the meta workspace for step(s) downstream? The answer is to specify how the output of a step should be interpreted using the returns function.


returns enables us to inject objects into meta and perform type checking on a function’s return values. It can be used either as a decorator or as a regular function. We demonstrate in the example notebook how pakkr uses returns and meta together to interpret returned values.

returns takes types as positional and keyword arguments to verify the return values of the function; this enables us to check both the number of returned values and also their types. Positional arguments will be passed on to the next step and keyword arguments will be inserted into meta.
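A toy re-implementation of this contract (it assumes nothing about pakkr’s internals and is only meant to show the split) might look like the following: positional types describe values passed to the next step, keyword types describe values destined for meta.

```python
def returns(*pos_types, **meta_types):
    """Toy sketch of the idea behind a `returns` contract: verify the
    number and types of a step's return values, then split them into
    pass-along values and meta values."""
    def decorator(fn):
        def wrapper(*args, **kwargs):
            out = fn(*args, **kwargs)
            out = out if isinstance(out, tuple) else (out,)
            expected = len(pos_types) + len(meta_types)
            if len(out) != expected:
                raise TypeError(f"{fn.__name__} should return {expected} values")
            passed = out[:len(pos_types)]
            for value, typ in zip(passed, pos_types):
                if not isinstance(value, typ):
                    raise TypeError(f"expected {typ.__name__}")
            meta = dict(zip(meta_types, out[len(pos_types):]))
            for key, typ in meta_types.items():
                if not isinstance(meta[key], typ):
                    raise TypeError(f"meta '{key}' should be {typ.__name__}")
            return passed, meta
        return wrapper
    return decorator

@returns(int, message=str)
def double(n):
    return n * 2, f"doubled {n}"

passed, meta = double(21)
print(passed, meta)  # (42,) {'message': 'doubled 21'}
```

Here the int is handed to the next step while the message string is stashed in meta; getting the count or a type wrong raises immediately, which is what makes the contract checkable.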

Command Line Argument

On a GPU box or a server? No problem! pakkr also parses command-line arguments to pipelines so that you can execute them from the CLI. If you need help setting up this additional utility, please visit our repo to see some examples.
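One way to wire a pipeline to the command line, sketched here with argparse (illustrative only; the flag names are hypothetical and merely mirror the config keys from our earlier example — see the repo for pakkr’s own CLI handling):

```python
import argparse

def build_parser():
    # Hypothetical flags mirroring the example config keys.
    parser = argparse.ArgumentParser(description="Run the pipeline")
    parser.add_argument("--input-data-path", required=True)
    parser.add_argument("--result-output-path", default="my_directory")
    parser.add_argument("--learning-rate", type=float, default=0.00001)
    return parser

# Parsing an explicit argv list here; a real script would call parse_args()
# with no arguments to read sys.argv.
args = build_parser().parse_args(
    ["--input-data-path", "s3://input-bucket/source-features"]
)
print(args.input_data_path, args.learning_rate)
```

The parsed values would then be handed to the pipeline (for example, seeded into the meta workspace) before invocation.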

The End

In the last two years, pakkr has been helping us with data transformation, model development, and validation (including the models behind Content Cues). It has been a great aid to our day-to-day work and we hope pakkr can help you as well. Hence, we have open-sourced pakkr on PyPI.

We still have other features and improvements in mind, such as mypy integration, dependency visualization, and automatic garbage collection in meta, to name a few. Hopefully, we will be able to work on those soon. In the meantime, please feel free to raise issues on GitHub if you find any bugs or have suggestions. With your help, pakkr will become better and more helpful to everyone.

Special Thank You

We would like to take this opportunity to thank our Data Science Tech Lead Soon-Ee Cheah for the extremely awesome title and poetry. We would also like to extend our gratitude to all the team members and colleagues who were so generous with their time and ideas in making this blog a great piece of work. Go Data Science! Go Zendesk!

Next Up

We are currently writing up pakkr (Part II), the Many-Faced God in which we will dive into the technical detail and analyze a variety of pakkr use cases.

Zendesk Engineering

Engineering @ Zendesk
