Can’t Teach an Old DAG New Tricks

Better CI/CD and Data Pipelines with Trees

Jonathan Marcus
Sep 22 · 5 min read

Pipelines are everywhere in computing. CI/CD has build → test → deploy pipelines. Data science has compute features → train model → backtest pipelines. ETL literally means extract → transform → load. They are ubiquitous, everyone uses DAGs for them, and everyone is wrong.

Pipelines as programs

Let’s say I ask you to build a CI/CD pipeline to deploy a microservice-backed app into AWS. What steps do you take?

– At a high level, 1) stand up some basic AWS infrastructure, 2) build and deploy your containers, 3) test the new environment, and 4) swap it into place using a blue/green or canary strategy.

Wow, good answer. Hang on, I say, those are just high-level descriptions! How does, for example, step 2 work?

– For each microservice, build its Docker image, test the image, then deploy it to AWS.

Okay, but how do you test the image?

– Spin up the Docker container locally, send it requests, then tear it down.

I get your point. You have a high-level idea of the tasks required, and each task is composed of ever smaller tasks, until the bottom level is just individual commands that you can execute. It’s just like programming, with high-level functions calling lower-level functions. If I ask you to draw a picture of this pipeline, you’ll probably draw something like this:

So why does every pipeline tool look like this?

The trouble with DAGs

Common wisdom says that DAGs (directed acyclic graphs: dependency graphs with no cycles) are the correct way to represent a pipeline. Jenkins, CircleCI, Airflow, Oozie, Luigi, Azkaban: there are hundreds of DAG tools, going back decades, that build pipelines with statements like “when A finishes, run B and C” or “only run Z once X and Y finish.” They are the most general and mathematically elegant formulation, no question.

The generality of DAGs comes at a price. When our pipeline is represented as a DAG, it loses all the structure that makes it sensible to humans. Every low-level task is a node in the graph, each is equally significant, and there are no groupings for concepts like “build and deploy your containers”. It is not for lack of trying that every visualization of a DAG is messy. The problem is too hard.

The lack of structure is also a hindrance when defining the DAG. Many tools, like CircleCI and Jenkins, use YAML files that define a big flat list of nodes and the dependencies linking them. With no structure to orient the developer, following the flow is challenging, and without the functions and abstractions that programming allows, the files get very repetitive. Such a YAML file quickly becomes fragile, hampering further development.

If not DAGs, then what?

Go back to the natural illustration of the pipeline:

That looks a lot like a directory tree. Would it be possible to just represent the pipeline as a tree, preserving all the structure?

Very possible, it turns out.

Seeing the forest…

Let’s model Step 2 as a tree.

The root is a Parallel node that runs each microservice at the same time. Each microservice is handled by a Serial node that builds, then tests, then pushes. The tests are run by another nested Serial node. The leaves of the tree are Exec nodes that run a command line or a function.

It makes sense visually. What does it look like in Python?
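As a sketch of the idea (plain Python standing in for Conducto's actual API; the class names `Parallel`, `Serial`, and `Exec` mirror the prose above, and the commands and service names are made-up examples):

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Node:
    """A named pipeline node with ordered children."""
    name: str
    children: List["Node"] = field(default_factory=list)

    def add(self, child: "Node") -> "Node":
        self.children.append(child)
        return child

class Parallel(Node):
    """Run all children at the same time."""

class Serial(Node):
    """Run children one after another."""

@dataclass
class Exec(Node):
    """Leaf node: run a single command."""
    command: str = ""

def build_step2(services):
    """Model Step 2: build, test, and push each microservice in parallel."""
    root = Parallel("build_and_deploy")
    for svc in services:
        s = root.add(Serial(svc))
        s.add(Exec("build", command=f"docker build -t {svc} ."))
        test = s.add(Serial("test"))
        test.add(Exec("start", command=f"docker run -d --name {svc} {svc}"))
        test.add(Exec("request", command="curl localhost:8080/health"))
        test.add(Exec("stop", command=f"docker rm -f {svc}"))
        s.add(Exec("push", command=f"docker push {svc}"))
    return root
```

The whole structure of the prose survives in the code: the `Parallel` root, a `Serial` per microservice, and a nested `Serial` for the test steps.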

The visualization follows naturally: the UI uses the tree explorer familiar to every computer user. It remains compact and understandable whether your pipeline has 5 nodes or 5,000,000 nodes by following one simple rule: each node contains a summary of all the nodes below it. Just expand to find out more.
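That rollup rule is simple to state in code. A sketch (again plain Python, not Conducto's API; the state names are assumptions): a node's summary is just the aggregate of the leaf states beneath it.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    state: str = "pending"   # leaves: "pending" | "done" | "error"
    children: list = field(default_factory=list)

def summarize(node):
    """Each node summarizes all the nodes below it: count leaf states."""
    if not node.children:
        return {node.state: 1}
    total = {}
    for child in node.children:
        for state, count in summarize(child).items():
            total[state] = total.get(state, 0) + count
    return total
```

A collapsed node in the UI can then show, say, "2 done, 1 error" without rendering its subtree at all, which is why the view stays compact at any scale.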

Pipelines as code

Nodes should be normal objects that can be returned by functions.

Let’s generalize further. The entire “Build and Deploy” node can itself be returned by a function and used in a Serial node that runs the whole CI/CD pipeline. It could look something like this:
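A hypothetical sketch of that composition (the node model, stage names, and commands are all assumptions for illustration, not Conducto's actual API): each stage is an ordinary function returning a node, and the root `Serial` node strings them together.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Node:
    name: str
    kind: str = "exec"       # "serial" | "parallel" | "exec"
    command: str = ""
    children: List["Node"] = field(default_factory=list)

def exec_node(name, command):
    return Node(name, "exec", command)

def stand_up_infra():
    return exec_node("infra", "terraform apply -auto-approve")

def build_and_deploy(services):
    """A Parallel node with one Serial child per microservice."""
    kids = [Node(s, "serial", children=[
        exec_node("build", f"docker build -t {s} ."),
        exec_node("push", f"docker push {s}"),
    ]) for s in services]
    return Node("build_and_deploy", "parallel", children=kids)

def test_environment():
    return exec_node("integration_tests", "pytest tests/integration")

def swap_into_place():
    return exec_node("blue_green_swap", "scripts/swap_blue_green.sh")

def cicd_pipeline(services):
    """The whole CI/CD pipeline: just functions composing other functions."""
    return Node("cicd", "serial", children=[
        stand_up_infra(),
        build_and_deploy(services),
        test_environment(),
        swap_into_place(),
    ])
```

Note that no edge or dependency is ever declared: ordering falls out of the `Serial` and `Parallel` structure, exactly as function-call order falls out of a regular program.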

Take a second to think about this. The entire pipeline is just a regular program, following regular programming rules and abstractions. It never mentions a dependency. Instead of thinking about the details of which graph nodes point to which other graph nodes, we just break our problem into small, reusable functions. Each function returns a node that starts, does some work, and finishes when it’s done. Simple as that.

But does it work?

That isn’t pseudocode. With Conducto, it runs as-is once you add a short entry point (roughly, importing the `conducto` package and handing its main function your root node; see the Conducto docs for the exact boilerplate).

Conducto is the pipeline tool—for CI/CD, data science, ETLs, and more—that is based on trees, not DAGs. Define your pipeline naturally in Python (JS and other languages coming soon), and interact with it intuitively in our beautiful UI. Run it for free on your laptop, or at scale in the Conducto cloud. It uses containers to make the transition effortless. No pipeline tool would be complete without making it really easy to debug your errors and polish your new features, and Live Debug is so good it’s almost magical.

Not every problem can be solved with trees. Some problems truly need the flexibility of a DAG, and if you have one, please let us know! Conducto’s co-founders built its predecessor to power one of the world’s top quantitative trading teams, driving billions of dollars in revenue over a decade. We used trees to handle massive machine learning pipelines, to store and index petabytes of data, and to continually deploy trading algorithms optimized to the nanosecond. There were a few times when we wished for DAGs, but the vast majority of problems didn’t need them, and the simplicity of trees let us achieve all that with just a handful of engineers. Now Conducto does its own CI/CD in Conducto, and it’s amazing.

We’ve built an incredible tool and we’re excited to share it. The world has plenty of DAG pipeline tools, and zero tree pipeline tools. Well now it has one. We hope you love it.

Conducto

Supercharge Your Pipelines

Thanks to Matt Jachowski

Jonathan Marcus

Written by

CEO and co-founder at Conducto. Former quant developer @JumpTrading. Likes board games, data science, and HPC infrastructure.

Conducto

Code, Run, See, and Debug Your CI/CD and Data Science Pipelines.
