A simple introduction to Apache Airflow and Luigi

Duy Nguyen
4 min read · May 24, 2020

A simple introduction to Luigi and Airflow from an Airflow user's perspective.

I have been working with Apache Airflow for more than 3 years, building and maintaining Big Data pipelines. By chance, I got the opportunity to use Luigi in my next project, so I spent a few hours learning it and comparing it with my Airflow experience. In this post, I will introduce Luigi and Airflow, highlight some of their features, and make a few small comparisons.

TL;DR

  • Apache Airflow offers you a full stack for building scalable, flexible data pipelines. It has an awesome, easy-to-use UI, and it also handles scheduling and triggering of your data pipelines. But its deployment is not easy, precisely because it is so powerful.
  • Luigi focuses only on building complex pipelines. It has only the simplest UI and visualization. Since it lacks a scheduler and triggering, you need cron or another scheduling solution to schedule your pipelines. It comes with Hadoop support built in, and its deployment is very easy and straightforward.

What is a data pipeline?

A data pipeline is simply a set of steps and their dependencies. Usually we have many steps, each depending on data from previous steps. Please check the illustration below to get a better understanding.

This is a very simple pipeline; in real life the dependencies are much more complicated, with dozens to hundreds of steps.

What makes Airflow the best choice for data pipeline management?

Airflow was developed at Airbnb and open-sourced in 2015.

In Airflow, a DAG (Directed Acyclic Graph) is a collection of all the tasks you want to run, organized in a way that reflects their relationships and dependencies.
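To make this concrete, below is a minimal sketch of a DAG for the customer/order pipeline used as the running example in this post. I am assuming Airflow 1.10-style imports, and the dag_id, schedule, and task ids are made up for illustration:

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.dummy_operator import DummyOperator

    dag = DAG(
        dag_id="customer_order_pipeline",  # hypothetical name
        start_date=datetime(2020, 5, 1),
        schedule_interval="@daily",
    )

    # Placeholder (Dummy) tasks standing in for the real pipeline steps.
    dump_customer = DummyOperator(task_id="dump_customer", dag=dag)
    dump_order = DummyOperator(task_id="dump_order", dag=dag)
    aggregate = DummyOperator(task_id="customer_order_aggregation", dag=dag)
    merge = DummyOperator(task_id="merge_customer", dag=dag)

    # The aggregation waits for both dumps; the merge waits for the aggregation.
    [dump_customer, dump_order] >> aggregate >> merge

The >> syntax is how Airflow expresses the dependency edges of the graph.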

Airflow contains three main components:

  • Web server: where you can see your tasks, visualizations, and configuration.
  • Scheduler: responsible for scheduling your tasks according to the frequency you define.
  • Worker: responsible for actually running the tasks.

Besides these components, you need a database to store Airflow's metadata. It can be any database supported by SQLAlchemy.

You can deploy these components separately, especially Workers, to increase scalability.

An Airflow pipeline is built around Sensors and Operators (a minimal sketch follows the list):

  • Sensor: an action of waiting for something, e.g. a file appearing in S3/HDFS, or a record appearing in a database or cache.
  • Operator: an action of doing something, e.g. triggering an API to dump data, pulling data from PostgreSQL, or creating a new EMR cluster.
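Here is a minimal sketch of how a Sensor and an Operator fit together, again assuming Airflow 1.10-style imports; the S3 bucket, key, and callable are hypothetical:

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python_operator import PythonOperator
    from airflow.sensors.s3_key_sensor import S3KeySensor

    dag = DAG(
        dag_id="sensor_operator_example",
        start_date=datetime(2020, 5, 1),
        schedule_interval="@daily",
    )

    # Sensor: wait for a file to appear in S3 before moving on.
    wait_for_orders = S3KeySensor(
        task_id="wait_for_orders_file",
        bucket_name="my-bucket",          # hypothetical bucket
        bucket_key="exports/orders.csv",  # hypothetical key
        poke_interval=60,                 # re-check every minute
        dag=dag,
    )

    def pull_orders():
        # Placeholder for the real extraction logic (e.g. pulling from PostgreSQL).
        print("pulling orders...")

    # Operator: do the actual work once the Sensor succeeds.
    pull = PythonOperator(task_id="pull_orders", python_callable=pull_orders, dag=dag)

    wait_for_orders >> pull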

Airflow has a very strong community that is actively adding more and more features.

Learn more about Airflow: https://airflow.apache.org/docs/stable/

Introduction to Luigi

Luigi was developed internally at Spotify and open-sourced in 2012.

Luigi is a Python module that helps you build complex pipelines of batch jobs. It handles dependency resolution, workflow management, visualization etc.

Tasks are where the execution takes place. Tasks depend on each other and produce output Targets.

Luigi task breakdown

There are three main parts to a Luigi task (a minimal sketch follows the list):

  • requires: the dependencies that need to finish before the task runs. Coming back to my earlier data pipeline example, CustomerOrderAggregationTask would require [DumpCustomerTask(), DumpOrderTask()].
  • output: the target for the next step. The target can be a Luigi built-in or a custom target extending luigi.target.Target. Taking CustomerOrderAggregationTask as an example, a possible output of this step is an S3Target, and that target would be used by the next step, MergeCustomerTask.
  • run: the main logic of the task. For CustomerOrderAggregationTask, it would read the customer and order data, do some aggregations, and save the result to S3.
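Putting the three parts together, below is a minimal runnable sketch of CustomerOrderAggregationTask. I use LocalTarget and fake dump data so the example is self-contained; the real pipeline would use an S3Target and real extraction logic:

    import luigi


    class DumpCustomerTask(luigi.Task):
        def output(self):
            return luigi.LocalTarget("data/customers.csv")

        def run(self):
            with self.output().open("w") as f:
                f.write("customer_id\n1\n2\n")  # stand-in for a real dump


    class DumpOrderTask(luigi.Task):
        def output(self):
            return luigi.LocalTarget("data/orders.csv")

        def run(self):
            with self.output().open("w") as f:
                f.write("order_id,customer_id\n10,1\n11,2\n")


    class CustomerOrderAggregationTask(luigi.Task):
        # requires: both dumps must finish before this task runs.
        def requires(self):
            return [DumpCustomerTask(), DumpOrderTask()]

        # output: the target the next step (MergeCustomerTask) will consume;
        # in the real pipeline this could be an S3Target instead.
        def output(self):
            return luigi.LocalTarget("data/customer_order_aggregation.csv")

        # run: the main logic; here just a placeholder aggregation.
        def run(self):
            customers, orders = self.input()  # targets from requires()
            with customers.open() as c, orders.open() as o:
                n_customers = len(c.readlines()) - 1  # minus header row
                n_orders = len(o.readlines()) - 1
            with self.output().open("w") as out:
                out.write("customers,orders\n%d,%d\n" % (n_customers, n_orders))

Note that Luigi infers the dependency graph from requires(), so there is no explicit wiring like Airflow's >>.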

Unlike Airflow, you can embed Luigi tasks into any of your projects and call them from Python code, a cron job, or the CLI.

You can run Luigi in local mode and it doesn’t require any server.
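For example, here is a minimal sketch of triggering the aggregation task from Python code with the local scheduler; my_pipeline is a hypothetical module holding the task classes sketched above:

    import luigi

    # Hypothetical module containing the task classes sketched earlier.
    from my_pipeline import CustomerOrderAggregationTask

    if __name__ == "__main__":
        # local_scheduler=True runs in-process: no central luigid server needed.
        luigi.build([CustomerOrderAggregationTask()], local_scheduler=True)

The CLI equivalent would be something like: python -m luigi --module my_pipeline CustomerOrderAggregationTask --local-scheduler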

So what features does Airflow have that Luigi doesn't?

  • Scheduler/trigger: you must set up cron, Jenkins, or a similar scheduling solution.
  • A detailed, easy-to-use interface: Luigi provides only a very simple one.
  • Scalability: it is easy to scale Airflow up with more Workers, but hard to achieve the same with Luigi.

Learn more about Luigi: https://luigi.readthedocs.io/en/stable/

Conclusion

If you ask me, I would prefer Airflow over Luigi. As I mentioned at the beginning, Airflow offers something like a full stack: with Airflow you have everything you need to build and maintain your data pipelines. Moreover, with its large community, it supports a large number of features, and you can see more and more features added every release.

But if your data pipelines mostly work with Hadoop and you want a simple workflow-management deployment, I would recommend Luigi for its simplicity and because it is easy to deploy and maintain. That said, Luigi is used internally at Spotify to run thousands of tasks every day, organized in complex dependency graphs, so it is hard to call the tasks at Spotify simple :)

In my next post, I will go into more detail by creating some data pipelines using both solutions; with that we can get a better understanding and more easily compare them and see the differences.

I hope you enjoyed this post. Feel free to leave corrections, comments, or any suggestions.
