Orchestrator for ML Pipelines — Vertex AI Pipelines (Kubeflow) vs. Apache Airflow

Saeed Hajebi
4 min read · Nov 15, 2022

--

Is Vertex AI Pipelines (Serverless Kubeflow) a good choice for orchestrating ML Pipelines?

Google Vertex AI is a new and comprehensive set of tools supporting the end-to-end ML & MLOps lifecycle. It consists of many tools, including Workbench (notebooks), Feature Store, Pipelines (serverless Kubeflow Pipelines), Training Jobs, Experiments and Metadata tracking, Model Registry, Endpoints (for online model deployment), Batch Predictions, etc.

While some of the components are very good, the framework as a whole is new and immature in many respects. For example, the Experiments and Metadata tracking capabilities are not on par with MLflow's, and there is no built-in scheduler. Additionally, the documentation omits many important details, and it is not easy to find good examples for many of the capabilities.

In this article, the focus is on one important area: orchestration. While there are many orchestrators available for pipelines (Airflow, Kedro, Luigi, Kubeflow Pipelines, Argo, MLRun, Prefect, etc.), I will narrow the choices down to the two most important ones for ML pipelines, especially on GCP: Vertex AI Pipelines and Apache Airflow.

Vertex AI Pipelines

Vertex AI Pipelines is a tool to automate, monitor, and govern ML systems by orchestrating ML workflows in a serverless manner and storing each workflow's artifacts in Vertex ML Metadata. It is, in fact, a managed (serverless) version of Kubeflow Pipelines.

Pros

  • Serverless — little to no infrastructure or deployment overhead
  • Designed & optimised for ML and comes integrated with a lot of ML-related features, like ML Metadata (requires programming)
  • Caching — previously executed steps can be skipped on re-runs
  • Not expensive (according to Google, the cost is $0.03 per pipeline run + Google Cloud resources cost)

Cons

  • There is no scheduler in Vertex Pipelines — needs other tools like Cloud Scheduler + Cloud Function, Jenkins, Airflow, or custom-built tools for scheduling.
  • No support for micropipelines — it is hard, time-consuming, and costly to develop large end-to-end pipelines.
  • No support for cross-DAG interdependency.
  • No support for data-aware pipelines.
  • No support for dynamic DAG and task creation.
  • No CLI — it’s not possible to test a specific task, backfill, catchup, etc.
  • Not possible to locally run/test pipelines.
  • Although being Serverless is an advantage, there is little control and observability compared to Airflow + Kubernetes.
  • Being serverless means it runs on Google's infrastructure (Google-owned GCP projects). This can cause a lot of extra complexity for network, VPN, and firewall setup. In the majority of cases, pipeline components need to interact with your infrastructure, for example reading data from a database or calling endpoints. With Vertex AI's serverless model, there is no static IP to shortlist.
  • Although Kubeflow is vendor-neutral, Vertex Pipelines is entirely GCP based; so there will be vendor lock-in.
  • Not well tested and proven — adoption is still very limited.
  • A much smaller community to provide support and share examples.
  • Not many operators are available as of now.
  • The documentation is not good.
  • Most ML teams have very little knowledge of or experience with Vertex AI Pipelines.
  • A rather steep learning curve compared to Airflow (based on personal experience).

Please note that, apart from the serverless aspect, almost all of these points apply to Kubeflow Pipelines as well.

Apache Airflow

Apache Airflow is by far the most widely used orchestrator in the industry. It’s open source, general purpose (not only for ML pipelines), very mature, scalable, reliable, and feature-rich. There is a large community of developers using and supporting Airflow. There is also a large number of Airflow Operators for different tasks and requirements. Additionally, Airflow is easily extendable, so, if there is a need for which there is no good operator available, it’s relatively easy to build one. There is also a good number of people in many companies with good knowledge and experience with Airflow.

While Airflow can be used as a workflow engine that performs computation as well, the best practice is to offload the actual computation to Kubernetes nodes using KubernetesPodOperator (or to Dataproc or EMR clusters in the case of a Spark job) and use Airflow for orchestration and scheduling only.

Pros

  • Airflow has a very mature scheduler.
  • Support for dynamic task and DAG creation.
  • Support for data-aware pipelines (the Dataset concept introduced in Airflow 2.4).
  • Support for micropipelines, so one can divide and conquer large, complex DAGs.
  • Support for cross-DAG interdependency.
  • Airflow CLI for managing pipelines (re-run, backfill, catchup, test, etc.).
  • Possible to run Airflow locally for testing and development purposes.
  • Can handle almost any batch pipeline (data pipeline, ML pipeline, etc.).
  • Airflow is Cloud agnostic — so, no vendor lock-in.
  • Airflow is free & open source.
  • Airflow has a huge community (documentation, examples, support for issues, etc.).
  • Airflow has a huge set of operators.
  • Most Data & ML teams are already using Airflow and have good knowledge of and experience with it.
  • There are Airflow operators for a lot of important Vertex AI capabilities (Datasets, Custom Training Jobs, AutoML, HPT, Model Deployment, Batch Prediction, Endpoints, etc.). As all Vertex AI capabilities are accessible through the Python client library and/or the gcloud CLI, if there is a capability with no existing operator, it is easy to build one using PythonOperator or BashOperator.
  • Airflow is relatively easy to learn and use (compared to Vertex AI Pipelines).

Cons

  • Setup and management overhead.
  • Running cost (scheduler node & database must run all the time).

While management overhead and cost are reasonable considerations in general, with a good design it is possible to reduce the cost of an Airflow deployment to a negligible amount: just the scheduler and database, which can be tiny pods running on a Kubernetes cluster. Worker nodes can be scaled down to zero, making Airflow effectively serverless.

Conclusion

Based on my personal experience with Vertex AI Pipelines, I believe it is still not ready for real production environments. Apache Airflow is better than Vertex AI Pipelines (and Kubeflow Pipelines) in almost every respect.

