Over the summer, Arthur has been hard at work building our new microservice-based platform. The new platform is a complete overhaul of the Arthur product that includes new Golang-based services, a state-of-the-art OLAP database, a revamped UI, and stream-based inference enrichments (more on the new platform in an upcoming blog post). Today we are going to focus on inference enrichments, a powerful feature that allows customers to add insights such as explanations, anomaly detection, and bias mitigation to their inferences. With each enrichment the customer enables, the platform adds one or more properties to every inference as it arrives. For example, anomaly detection adds an anomaly_score column to every inference, which can be used to trigger alerts and help with investigations in the Arthur UI. Under the hood, each of these enrichments is a separate machine learning pipeline trained per customer model. This workflow shows a simplified version of what these pipelines look like:
Orchestrating and monitoring all of the enrichment machine learning pipelines proved to be a significant challenge, especially since customers can turn enrichments on and off on demand. To complicate things further, Arthur supports a “deploy anywhere installation” that customers can use to run the product in a private network. This meant our orchestration solution also needed to be self-contained, easy to stand up, and low in resource overhead. During our platform rebuild, we faced an important decision: which of the open-source orchestration tools could support our needs?
We evaluated a handful of workflow orchestrators in our search, but for brevity I will only discuss the most competitive contenders: Airflow, the most popular Python-based framework; Argo, a Kubernetes-native workflow orchestrator; and Prefect, a Python-based newcomer with some advanced workflow semantics.
Airflow, Argo, and Prefect
Before diving into the tools, I should note that this post is intended for readers with some knowledge of the problems workflow orchestrators solve. If you’re new to any of the tools and looking to brush up, here are a few resources I thought were good introductions to each:
Airflow: I recommend starting with their docs and specifically, the concepts section.
Argo: Argo’s docs are a bit on the lighter side but their concepts section is a helpful starting point. They also have an active Slack community.
Prefect: Prefect has thorough docs, but they are confusingly split into two sections. The core section describes the workflow definition language and execution. The orchestration section describes the Prefect server and scheduling. Prefect also has an active Slack community.
Ok now diving in…
This table is an extensive comparison of all the features we looked at between the tools. For a walk-through of the highlights, see the description below the table.
With any tool bake-off, it’s important to understand what it can do well, and where it lacks important features. In the case of Airflow, it got many features right that have established it as the community favorite:
- Fault-tolerant scheduling and running time aware jobs
- Python-based DAG definition
- Built-in library of operators
On the flip side, Airflow’s design choices give it some distinct drawbacks:
- No dynamic workflows
- No parameterized workflows
- Minimal REST API
- Hard to run containerized workflows with Kubernetes
*Note, Airflow 2.0 is on the horizon and may alleviate some of these pain points, see here for the latest.
Compared to Airflow, Argo is a relatively new project (7k stars on GitHub vs Airflow’s 19.4k), but it already has a large community following. It is currently a Cloud Native Computing Foundation incubating project and has an ecosystem of related tools such as ArgoCD, Argo Rollouts, and Argo Events. Argo’s slogan is to “get stuff done with Kubernetes”, and it has some important features to back up that claim:
- Running containers
- Highly parallel
- Low-latency scheduler
- Can pass data and artifacts between steps
- Dynamic DAGs
- Interacting with Kubernetes resources
- Kubernetes native state storage
- Easy to deploy
- Event-based ecosystem
Clearly, Argo is all-in on Kubernetes, which means it falls short for a number of use cases:
- No operators
- No easy way to execute Python functions directly
- YAML Workflow definition
- Requires Kubernetes
- No fault-tolerant scheduling
Lastly, Prefect, the newest of the three (5.5k stars on GitHub), combines some of the best features of both Airflow and Argo:
- Highly available scheduler
- Lightweight Python-based DAG definition language
- Dynamic DAGs
- Running containers
- Built-in library of operators
Even though Prefect has a lot of the best features of both Airflow and Argo, there are some limitations:
- No open source Kubernetes deployment
- Made up of a complex group of services
- Must execute workflow definitions to register them
Making the Right Decision for Arthur
Airflow and Prefect both do a lot of things well. Airflow outshines both Argo and Prefect for use cases in non-containerized environments with strict, fault-tolerant execution schedules. With the upcoming release of Airflow 2.0 and improved Kubernetes integrations, it will be better equipped to compete with Argo in containerized environments. On the other hand, Prefect’s DAG definition language is very well designed and intuitive to insert into existing code. Had our use case not required a lightweight tool or been fully containerized, Airflow or Prefect likely would have been our choice. In the end, we chose Argo because it is container-first, lightweight, and has out-of-the-box Golang clients.
Being container-first was the most important factor because we run our platform entirely in Kubernetes, so our jobs were already containerized. We needed a workflow orchestrator that excelled at running and managing containers, and Argo was the best tool we found. It has out-of-the-box support for almost any Kubernetes container setting you could ask for. We were easily able to configure Service Accounts for dynamic Spark clusters (more on these in a later post), Annotations for Kube2IAM AWS access, and environment variable mounts from Secrets, as well as options for init containers and sidecars. In the future, we look forward to trying out Argo’s Daemon Containers, which let you spin up long-lived services to use throughout the workflow.
After being container-first, Argo’s small footprint was its next biggest advantage. The core workflow functionality runs in a single pod, and the optional UI requires only a second pod. Since we package and ship our platform to customers’ VPCs, minimizing the number of pods is a must. Running Airflow with a similar setup required a web server, a scheduler, a handful of workers each with sidecars, and a database. Similarly, Prefect has a database, a GraphQL API, management services, agents, and flow storage processes. Argo allows us to accomplish more with a significantly smaller footprint than the other tools.
Lastly, one of the more powerful features when using Argo with Go-based applications is the ability to use the open-source Argo Golang Kubernetes client. (Note that this isn’t specific to Go; someone could implement a client in any language.) The client is the same one Argo uses internally to edit and respond to Workflow objects submitted to Kubernetes. As a result, our application layer can dynamically define Workflow structs and submit them on the fly. This allows us to move a lot of the complex logic out of our workflows and into the application code. As an example, in the Arthur platform customers can toggle enrichments for their model, and the backend will train a model for each enabled enrichment. Normally, with a static, predefined workflow, we would need to write logic into the DAG to handle every possible combination of customer-enabled features so that the DAG works for all customers. With Argo and the Kubernetes client, we can instead construct the DAG dynamically as a function of the customer’s chosen enrichments. This allows us to submit unique workflows for each customer, and even different workflows across invocations for the same customer over time. See here for an example setting up and submitting a workflow with the Go Argo Kubernetes client.
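To make the idea of building a DAG as a function of a customer’s enabled enrichments more concrete, here is a minimal sketch in Go. It is illustrative only: the Task and Workflow types are simplified, hypothetical stand-ins rather than the Workflow structs from the Argo client, and the task names, template names, and the shared prepare-data step are assumptions, not a description of our actual pipelines.

```go
package main

import (
	"fmt"
	"sort"
)

// Task and Workflow are hypothetical, simplified stand-ins for the
// Workflow structs in the Argo Go client. They are NOT Argo's real types.
type Task struct {
	Name         string
	Template     string
	Dependencies []string
}

type Workflow struct {
	GenerateName string
	Tasks        []Task
}

// buildEnrichmentWorkflow constructs a DAG from the enrichments a customer
// has enabled: one shared data-preparation task (an assumption made for
// this sketch), followed by one training task per enabled enrichment.
func buildEnrichmentWorkflow(modelID string, enabled map[string]bool) Workflow {
	wf := Workflow{
		GenerateName: "enrichments-" + modelID + "-",
		Tasks:        []Task{{Name: "prepare-data", Template: "prepare-data"}},
	}

	// Collect and sort the enabled names so the same set of toggles
	// always produces the same task order.
	names := make([]string, 0, len(enabled))
	for name, on := range enabled {
		if on {
			names = append(names, name)
		}
	}
	sort.Strings(names)

	for _, name := range names {
		wf.Tasks = append(wf.Tasks, Task{
			Name:         "train-" + name,
			Template:     "train-" + name,
			Dependencies: []string{"prepare-data"},
		})
	}
	return wf
}

func main() {
	// Hypothetical toggles for a single customer model.
	wf := buildEnrichmentWorkflow("model-123", map[string]bool{
		"explainability":    false,
		"anomaly-detection": true,
		"bias-mitigation":   true,
	})
	for _, t := range wf.Tasks {
		fmt.Printf("%s depends on %v\n", t.Name, t.Dependencies)
	}
}
```

In a real integration, the resulting structure would be translated into the Argo client’s own Workflow object before being submitted to Kubernetes. The point of the sketch is only that disabled enrichments never appear in the DAG, so the workflow itself needs no branching logic.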
Argo is a powerful Kubernetes workflow orchestration tool. It is container-first, lightweight, and easy to integrate with external systems, especially Go-based services. Although seemingly minor, the ability to construct workflows programmatically and specifically for each customer has been a huge win for our platform. We have already begun building our machine learning pipelines using this method. The pipelines are event-driven, tailor-built for each customer, and fully handle training, saving, building, deploying, and configuring dynamic pipelines on demand. In the few short weeks we’ve had Argo running in our platform, its power and flexibility have excited our team and streamlined our machine learning orchestration. We’re looking forward to seeing what features we can build next with Argo.
Keep an eye out for more Argo tips and tricks in upcoming blog posts!