Comparison of Python pipeline packages: Airflow, Luigi, Gokart, Metaflow, Kedro, PipelineX

The same content is available at: https://github.com/Minyus/Python_Packages_for_Pipeline_Workflow

This article compares open-source Python packages for pipeline/workflow development: Airflow, Luigi, Gokart, Metaflow, Kedro, PipelineX.

In this article, the terms “pipeline”, “workflow”, and “DAG” are used almost interchangeably.

Summary

[Summary comparison table; legend: 👍👍 = better]

Airflow

Released in 2015 by Airbnb.

Airflow enables you to define your DAG (workflow) of tasks in Python code (an independent Python module).

(Optionally, unofficial plugins such as dag-factory enable you to define DAGs in YAML.)
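As an illustration, here is a minimal sketch of an Airflow DAG definition (assuming the Airflow 1.x import path for PythonOperator; the task names are hypothetical):

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python_operator import PythonOperator

    def extract():
        print("extracting")

    def transform():
        print("transforming")

    dag = DAG("example_dag", start_date=datetime(2020, 1, 1), schedule_interval=None)

    extract_task = PythonOperator(task_id="extract", python_callable=extract, dag=dag)
    transform_task = PythonOperator(task_id="transform", python_callable=transform, dag=dag)

    # Declare the dependency: transform runs after extract
    extract_task >> transform_task

Note that extract and transform here only print; reading and writing actual data files would be your own code (see Cons below).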

Pros:

  • Provides a distributed computing option (using Celery).
  • DAG definition is modular, independent of processing functions.
  • Workflows can be nested using SubDagOperator.
  • Supports Slack notifications.

Cons:

  • You need to write file access (read/write) code.
  • Does not support automatically resuming a pipeline from intermediate data files or databases.

Luigi

Released in 2012 by Spotify.

Luigi enables you to define your pipeline with subclasses of Task, each implementing three methods (requires, output, run), in Python code.
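For example, a minimal sketch of a two-task Luigi pipeline (task names and file paths are hypothetical):

    import luigi

    class Extract(luigi.Task):
        def output(self):
            return luigi.LocalTarget("data/extract.txt")

        def run(self):
            with self.output().open("w") as f:
                f.write("raw data")

    class Transform(luigi.Task):
        def requires(self):
            return Extract()  # dependency: run Extract first

        def output(self):
            return luigi.LocalTarget("data/transform.txt")

        def run(self):
            # self.input() refers to the output of the required task
            with self.input().open() as fin, self.output().open("w") as fout:
                fout.write(fin.read().upper())

    if __name__ == "__main__":
        luigi.build([Transform()], local_scheduler=True)

Luigi skips any task whose output target already exists, which is how a pipeline resumes from intermediate files.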

Pros:

  • You can write code so that any data can be passed between dependent tasks.
  • Provides a GUI with features including DAG visualization and execution progress monitoring.

Cons:

  • Pipeline definition, task processing (Transform of ETL), and data access (Extract & Load of ETL) are tightly coupled and not modular. You need to modify the task classes to reuse them in future projects.

Gokart

Released in Dec 2018 by M3.

Gokart works on top of Luigi.
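As a hedged sketch of what a gokart pipeline looks like (assuming a recent gokart version that provides gokart.build; task names are hypothetical):

    import gokart

    class Produce(gokart.TaskOnKart):
        def run(self):
            # dump() writes the output (pickled by default) to a path
            # managed by gokart; the file name includes a hash of the
            # parameter set, so changed parameters trigger a rerun
            self.dump("raw data")

    class Transform(gokart.TaskOnKart):
        def requires(self):
            return Produce()

        def run(self):
            data = self.load()  # load() reads the required task's output
            self.dump(data.upper())

    print(gokart.build(Transform()))

Unlike plain Luigi, no output() method or file path needs to be written for each task.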

Pros:

  • Can split task processing (Transform of ETL) from pipeline definition using TaskInstanceParameter, so tasks can easily be reused in future projects.
  • Provides built-in file access (read/write) wrappers as FileProcessor classes for pickle, npz, gz, txt, csv, tsv, json, xml.
  • Saves the parameters for each experiment to ensure reproducibility. A viewer called thunderbolt can be used.
  • Reruns tasks when parameters change, by including a hash string unique to the parameter set in each intermediate file name. This feature is useful for experimenting with various parameter sets.
  • Provides syntactic sugar for Luigi’s requires method via a class decorator.
  • Supports Slack notifications.

Cons:

Metaflow

Released in Dec 2019 by Netflix.

Metaflow enables you to define your pipeline as a subclass of FlowSpec whose methods are decorated with @step, in Python code.
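A minimal sketch (the flow name and the step names other than the mandatory start and end are hypothetical):

    from metaflow import FlowSpec, step

    class ExampleFlow(FlowSpec):

        @step
        def start(self):
            # Attributes assigned to self are persisted between steps
            self.data = "raw data"
            self.next(self.transform)

        @step
        def transform(self):
            self.result = self.data.upper()
            self.next(self.end)

        @step
        def end(self):
            print(self.result)

    if __name__ == "__main__":
        ExampleFlow()

Running python example_flow.py run executes the flow from start to end.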

Pros:

Cons:

  • Pipeline definition, task processing (Transform of ETL), and data access (Extract & Load of ETL) are tightly coupled and not modular. You need to modify the task classes to reuse them in future projects.
  • Does not provide a GUI.
  • Not much support for GCP and Azure (the focus is on AWS).
  • Does not support automatically resuming a pipeline from intermediate data files or databases.

Kedro

Released in May 2019 by QuantumBlack, part of McKinsey & Company.

Kedro enables you to define pipelines as lists of node functions in Python code (an independent Python module). Each node takes three arguments: func (the task processing function), inputs (the input data name; a list or dict if multiple), and outputs (the output data name; a list or dict if multiple).
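For instance, a minimal sketch of a Kedro pipeline (function and dataset names are hypothetical):

    from kedro.pipeline import Pipeline, node

    def extract():
        return "raw data"

    def transform(data):
        return data.upper()

    pipeline = Pipeline(
        [
            node(func=extract, inputs=None, outputs="raw_data"),
            node(func=transform, inputs="raw_data", outputs="transformed_data"),
        ]
    )

The dataset names ("raw_data", "transformed_data") are mapped to files or databases in a separate DataCatalog (e.g. catalog.yml), which is what keeps data access independent from task processing.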

Pros:

  • Support for any data format can be added by users.
  • Pipeline definition, task processing (Transform of ETL), and data access (Extract & Load of ETL) are independent and modular. You can easily reuse them in future projects.
  • Pipelines can be nested. (A pipeline can be used as a sub-pipeline of another pipeline.)
  • GUI (kedro-viz) provides a DAG visualization feature.

Cons:

  • GUI (kedro-viz) does not provide an execution progress monitoring feature.
  • Package dependencies that are unused in many cases (e.g. pyarrow) are included in requirements.txt.

PipelineX

Released in Nov 2019 by a Kedro user (me).

PipelineX works on top of Kedro and MLflow.

PipelineX enables you to define your pipeline in YAML (an independent YAML file).

Pros:

  • Supports automatically resuming a pipeline from intermediate data files or databases.
  • Optional syntactic sugar for Kedro Pipeline (e.g. a Sequential API similar to PyTorch’s torch.nn.Sequential and Keras’ tf.keras.Sequential).
  • Optional syntactic sugar for the Kedro DataSet catalog (e.g. using the file name in the file path as the dataset instance name).
  • Backward-compatible with pure Kedro.
  • Integration with MLflow to save parameters, metrics, and other output artifacts such as models for each experiment.
  • Integration with common packages for Data Science: PyTorch, Ignite, pandas, OpenCV.
  • Additional DataSets, including an image-set DataSet (a folder of images) useful for computer vision applications.
  • Lean project template compared with pure Kedro.

Cons:

  • Package dependencies that are unused in many cases (e.g. pyarrow) are included in Kedro’s requirements.txt.
  • PipelineX is developed and maintained by an individual (me) at the moment.

Platform-specific packages

Argo

Uses Kubernetes to run pipelines.

Kubeflow Pipelines

Works on top of Argo.

Oozie

Manages Hadoop jobs.

Azkaban

Manages Hadoop jobs.

Inaccuracies

Pull requests for https://github.com/Minyus/Python_Packages_for_Pipeline_Workflow/blob/master/README.md are welcome.

