Can’t Teach an Old DAG New Tricks
Better CI/CD and Data Pipelines with Trees
Pipelines are everywhere in computing. CI/CD has build→test→deploy pipelines. Data science has compute features→train model→backtest pipelines. ETL literally means extract→transform→load. They are ubiquitous, everyone uses DAGs for them, and everyone is wrong.
Pipelines as programs
Let’s say I ask you to build a CI/CD pipeline to deploy a microservice-backed app into AWS. What steps do you take?
– At a high level, 1) stand up some basic AWS infrastructure, 2) build and deploy your containers, 3) test the new environment, and 4) swap it into place using a blue/green or canary strategy.
Wow, good answer. Hang on, I say, those are just high-level descriptions! How does, for example, step 2 work?
– For each microservice, build its Docker image, test the image, then deploy it to AWS.
Okay, but how do you test the image?
– Spin up the Docker container locally, send it requests, then tear it down.
I get your point. You have a high-level idea of the tasks required, and each task is composed of ever smaller tasks, until the bottom level is just individual commands that you can execute. It’s just like programming, with high-level functions calling lower-level functions. If I ask you to draw a picture of this pipeline, you’ll probably draw a nested outline: the high-level steps, each broken down into the smaller tasks it contains.
So why does every pipeline tool draw it as a flat graph of boxes and arrows instead?
The trouble with DAGs
Common wisdom says that DAGs (directed acyclic graphs, basically workflows that have no cycles) are the correct way to represent a pipeline. There are hundreds of DAG tools, going back decades, that build pipelines with statements like “when A finishes, run B and C” or “only run Z once X and Y finish”: Jenkins, CircleCI, Airflow, Oozie, Luigi, Azkaban, and many more. They are the most general and mathematically elegant formulation, no question.
The generality of DAGs comes at a price. When our pipeline is represented as a DAG, it loses all the structure that makes it sensible to humans. Every low-level task is a node in the graph, each is equally significant, and there are no groupings for concepts like “build and deploy your containers”. It is not for lack of trying that every visualization of a DAG is messy. The problem is too hard.
The lack of structure is also a hindrance when defining the DAG. Many tools, CircleCI among them, use YAML files that define a big list of nodes and the dependencies that link them. With no structure to orient the developer, following the flow is challenging. Without the functions and abstractions that programming allows, the files get very repetitive. A YAML file will quickly become fragile, hampering further development.
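To see the flatness concretely, here is a minimal Python sketch of a dependency-list pipeline. The task names are hypothetical, and the standard library's `graphlib.TopologicalSorter` stands in for a DAG scheduler; it is an illustration, not how any particular tool is implemented.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical task names. Each task maps to the set of tasks that must
# finish before it can start, which is all a DAG definition records.
dag = {
    "infra": set(),
    "build_app": {"infra"},
    "start_app": {"build_app"},
    "request_app": {"start_app"},
    "stop_app": {"request_app"},
    "push_app": {"stop_app"},
    "build_backend": {"infra"},
    "start_backend": {"build_backend"},
    "request_backend": {"start_backend"},
    "stop_backend": {"request_backend"},
    "push_backend": {"stop_backend"},
    "test_env": {"push_app", "push_backend"},
    "swap": {"test_env"},
}

# A scheduler only needs some order that respects every dependency.
order = list(TopologicalSorter(dag).static_order())
print(order)
```

Nothing in the dictionary says that the five `*_app` tasks belong together, or that the `backend` entries repeat the same pattern. That grouping lives only in the naming convention.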
If not DAGs, then what?
Go back to the natural illustration of the pipeline, with each high-level step broken down into the smaller tasks it contains.
That looks a lot like a directory tree. Would it be possible to just represent the pipeline as a tree, preserving all the structure?
Very possible, it turns out.
Seeing the forest…
Let’s model Step 2 as a tree.
The root is a Parallel node that runs each microservice at the same time. Each microservice is handled by a Serial node that builds, then tests, then pushes. The tests are run by another nested Serial node. The leaves of the tree are Exec nodes that run a command line or a function.
It makes sense visually. What does it look like in Python?
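As a rough sketch, here is one way that tree could be written. The `Node`, `Serial`, `Parallel`, and `Exec` classes below are small stand-ins defined inline so the example runs on its own; they are not Conducto's implementation. The service names, port, and `/health` path are hypothetical, and the commands are stored as strings, never executed.

```python
class Node:
    """Stand-in base class: a node whose children are stored by name."""

    def __init__(self):
        self.children = {}

    def __setitem__(self, name, child):
        self.children[name] = child

    def __getitem__(self, name):
        return self.children[name]

    def outline(self, name, depth=0):
        """Return the subtree as indented text lines, one per node."""
        lines = [f"{'  ' * depth}{name} [{type(self).__name__}]"]
        for child_name, child in self.children.items():
            lines.extend(child.outline(child_name, depth + 1))
        return lines


class Serial(Node):
    """Children are meant to run one after another."""


class Parallel(Node):
    """Children are meant to run at the same time."""


class Exec(Node):
    """Leaf node holding a single command line."""

    def __init__(self, command):
        super().__init__()
        self.command = command


SERVICES = ["app", "backend", "metrics"]  # hypothetical service names

root = Parallel()
for svc in SERVICES:
    root[svc] = deploy = Serial()
    deploy["build"] = Exec(f"docker build -t {svc} ./{svc}")
    deploy["test"] = checks = Serial()
    checks["start"] = Exec(f"docker run -d --name {svc}-test {svc}")
    checks["request"] = Exec("curl --fail http://localhost:8080/health")
    checks["stop"] = Exec(f"docker rm -f {svc}-test")
    deploy["push"] = Exec(f"docker push {svc}")

print("\n".join(root.outline("build and deploy")))
```

The indentation of the printed outline mirrors the nesting in the code, which is the point: the grouping is part of the definition instead of something to reconstruct from arrows.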
The visualization follows naturally: the UI uses the tree explorer familiar to every computer user. It remains compact and understandable whether your pipeline has 5 nodes or 5,000,000 nodes by following one simple rule: each node contains a summary of all the nodes below it. Just expand to find out more.
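A minimal sketch of that rule, assuming each leaf carries a state string and each parent's summary is a count of the leaf states below it. The nested dictionary, the state names, and the `summarize` and `render` helpers are all hypothetical, chosen only to illustrate the idea.

```python
from collections import Counter


def summarize(node):
    """Count the leaf states in the subtree rooted at `node`."""
    if isinstance(node, str):  # a leaf is just its own state
        return Counter([node])
    total = Counter()
    for child in node.values():
        total += summarize(child)
    return total


def render(name, node, expanded, parents=()):
    """One line per visible node; collapsed nodes show only their summary."""
    path = parents + (name,)
    counts = summarize(node)
    text = ", ".join(f"{n} {state}" for state, n in sorted(counts.items()))
    lines = [f"{'  ' * len(parents)}{name}: {text}"]
    if isinstance(node, dict) and path in expanded:
        for child_name, child in node.items():
            lines += render(child_name, child, expanded, path)
    return lines


# Hypothetical snapshot of a running pipeline.
pipeline = {
    "app": {
        "build": "done",
        "test": {"start": "done", "request": "error", "stop": "pending"},
        "push": "pending",
    },
    "backend": {
        "build": "done",
        "test": {"start": "done", "request": "done", "stop": "done"},
        "push": "done",
    },
}

# Only the root and "app" are expanded; everything else stays one line.
expanded = {("build and deploy",), ("build and deploy", "app")}
view = render("build and deploy", pipeline, expanded)
print("\n".join(view))
```

Because a collapsed node costs one line no matter how much sits beneath it, the view stays the same size until you choose to expand something. The error under `app` is still visible in the root's summary.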
Pipelines as code
Nodes should be normal objects that can be returned by functions.
Let’s generalize further. The entire “Build and Deploy” node can itself be returned by a function and used in a Serial node that runs the whole CI/CD pipeline. It could look something like this:
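Here is a hedged sketch of that composition. The node classes are minimal stand-ins defined inline so the example runs on its own, not Conducto's implementation, and the function names and `./scripts/...` commands are hypothetical placeholders for the four high-level steps.

```python
class Node:
    """Stand-in node: children stored by name, in insertion order."""

    def __init__(self):
        self.children = {}

    def __setitem__(self, name, child):
        self.children[name] = child

    def __getitem__(self, name):
        return self.children[name]

    def leaves(self, path=()):
        """Yield (path, command) for every Exec below this node."""
        for name, child in self.children.items():
            yield from child.leaves(path + (name,))


class Serial(Node):
    """Children are meant to run one after another."""


class Parallel(Node):
    """Children are meant to run at the same time."""


class Exec(Node):
    """Leaf node holding a single command line."""

    def __init__(self, command):
        super().__init__()
        self.command = command

    def leaves(self, path=()):
        yield path, self.command


def smoke_test(svc):
    node = Serial()
    node["start"] = Exec(f"docker run -d --name {svc}-test {svc}")
    node["request"] = Exec(f"./scripts/send_requests.sh {svc}")  # hypothetical script
    node["stop"] = Exec(f"docker rm -f {svc}-test")
    return node


def deploy(svc):
    node = Serial()
    node["build"] = Exec(f"docker build -t {svc} ./{svc}")
    node["test"] = smoke_test(svc)
    node["push"] = Exec(f"docker push {svc}")
    return node


def build_and_deploy(services):
    node = Parallel()
    for svc in services:
        node[svc] = deploy(svc)
    return node


def cicd(services):
    node = Serial()
    node["infra"] = Exec("./scripts/stand_up_infra.sh")  # hypothetical script
    node["build and deploy"] = build_and_deploy(services)
    node["test env"] = Exec("./scripts/test_environment.sh")  # hypothetical script
    node["swap"] = Exec("./scripts/blue_green_swap.sh")  # hypothetical script
    return node


pipeline = cicd(["app", "backend", "metrics"])
for path, command in pipeline.leaves():
    print(" / ".join(path), "->", command)
```

Adding a fourth microservice is a one-word change to the list passed to `cicd`, and `smoke_test` can be reused anywhere a node is expected.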
Take a second to think about this. The entire pipeline is just a regular program, following regular programming rules and abstractions. It never mentions a dependency. Instead of thinking about the details of which graph nodes point to which other graph nodes, we just break our problem into small, reusable functions. Each function returns a node that starts, does some work, and finishes when it’s done. Simple as that.
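To make "starts, does some work, and finishes" concrete, here is a toy runner under simplifying assumptions: `Exec` wraps a Python function, `Serial` runs its children in order, and `Parallel` uses a thread pool. These classes are illustrative stand-ins written for this sketch; a real tool would add containers, retries, and state tracking.

```python
from concurrent.futures import ThreadPoolExecutor


class Exec:
    """Leaf: runs one Python callable."""

    def __init__(self, func):
        self.func = func

    def run(self):
        self.func()


class Serial:
    """Runs children in order; an exception stops the remaining steps."""

    def __init__(self, *children):
        self.children = children

    def run(self):
        for child in self.children:
            child.run()


class Parallel:
    """Runs children concurrently and finishes when all of them have."""

    def __init__(self, *children):
        self.children = children

    def run(self):
        if not self.children:
            return
        with ThreadPoolExecutor(max_workers=len(self.children)) as pool:
            futures = [pool.submit(child.run) for child in self.children]
        for future in futures:
            future.result()  # re-raise a failure from any child


log = []  # records the order in which steps actually ran


def step(name):
    """Hypothetical helper: an Exec that just records its own name."""
    return Exec(lambda: log.append(name))


def deploy(svc):
    return Serial(step(f"{svc}: build"), step(f"{svc}: test"), step(f"{svc}: push"))


pipeline = Serial(
    step("infra"),
    Parallel(deploy("app"), deploy("backend")),
    step("swap"),
)
pipeline.run()
print(log)
```

No node names another node as a dependency. The ordering falls out of where each node sits: "swap" waits for both services only because it follows the `Parallel` node inside a `Serial` one.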
But does it work?
That isn’t pseudocode. Use Conducto to make it work by adding:
from conducto import Parallel, Serial, Exec
Conducto is the pipeline tool—for CI/CD, data science, ETLs, and more—that is based on trees, not DAGs. Define your pipeline naturally in Python (JS and other languages coming soon), and interact with it intuitively in our beautiful UI. Run it for free on your laptop, or at scale in the Conducto cloud. It uses containers to make the transition effortless. No pipeline tool would be complete without making it really easy to debug your errors and polish your new features, and Live Debug is so good it’s almost magical.
Not every problem can be solved with trees. Some problems truly need the flexibility of a DAG, and if you have one please let us know! Conducto’s co-founders built its predecessor to power one of the world’s top quantitative trading teams, driving billions of dollars in revenue over a decade. We used trees to handle massive machine learning pipelines, to store and index petabytes of data, and to continually deploy trading algorithms, optimized to the nanosecond. There were a few times when we wished for DAGs, but the vast majority of problems didn’t need them, and the simplicity of trees let us achieve all that with just a handful of engineers. Now Conducto does its own CI/CD in Conducto, and it’s amazing.
We’ve built an incredible tool and we’re excited to share it. The world has plenty of DAG pipeline tools, and zero tree pipeline tools. Well, now it has one. We hope you love it.