Apache Airflow — Programmatic platform for Data Pipelines and ETL/ELT workflows

Jeevan Jamakayala
Analytics Vidhya
Published in
5 min read · Apr 19, 2020

In today's data-driven world, data pipelines and ETL (Extract, Transform, Load) workflows play a major role in collecting and handling data from different sources. Many data scientists and small companies rely on the quality and performance of these workflows to process and analyse data downstream for business insights.

Source: Panoply.io

There are many ETL platforms/tools on the market that make building simple, static workflows quicker and easier. Even though ETL platforms provide easy drag-and-drop UI components for building workflows, a small company or an individual data scientist still faces the question of whether to adopt an ETL platform or build the workflow from scratch, because of cost or platform limitations. One of the biggest limitations of ETL tools is that they are mostly interface-driven, which makes them difficult to debug and navigate, and they also introduce reproducibility problems. Enough on the cons of ETL tools; I'll leave it to you to explore their pros and cons further, and let's get into our main topic.

What is Apache Airflow?

From the website (https://airflow.apache.org/):

Airflow is a platform to programmatically author, schedule and monitor workflows.

Use airflow to author workflows as Directed Acyclic Graphs (DAGs) of tasks. The airflow scheduler executes your tasks on an array of workers while following the specified dependencies. Rich command line utilities make performing complex surgeries on DAGs a snap. The rich user interface makes it easy to visualize pipelines running in production, monitor progress, and troubleshoot issues when needed.

Basically, with Airflow we can create dynamic workflows as Python scripts, then schedule and monitor them. Airflow provides a decent UI for managing DAGs (which are nothing but workflows). Let's look at some key aspects of Airflow:

  1. Airflow is scalable, extensible and elegant, and pipeline configuration is dynamic because pipelines are defined in code
  2. It is open-source and written entirely in Python
  3. It has plenty of integrations, including Google Cloud Platform, Amazon Web Services and Microsoft Azure

How does Airflow work?

The basic concept on which Airflow is built is the Directed Acyclic Graph (DAG). In computer science and mathematics, a DAG is a directed graph with no cycles: following the edge directions, you can never return to a node you have already visited. In an Airflow DAG, each node is a task, representing for example a bash command or a Python function.

Sample Airflow DAG

The DAG starts at the ‘run_this_first’ node and ends at the ‘join’ node. The ‘join’ task depends on all of its upstream tasks, i.e., it is executed only after every node/task before it has completed. This is what makes an Airflow workflow efficient and reliable.
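As a rough sketch of how such dependencies are expressed in code (the task names mirror the figure and are otherwise illustrative; Airflow 2.x import paths assumed):

```python
# Minimal sketch of a fan-out/fan-in DAG like the one in the figure.
# Import paths follow Airflow 2.x; 1.10.x uses airflow.operators.bash_operator.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(dag_id="sample_dag", start_date=datetime(2020, 4, 1),
         schedule_interval=None) as dag:
    run_this_first = BashOperator(task_id="run_this_first", bash_command="echo start")
    branch_a = BashOperator(task_id="branch_a", bash_command="echo a")
    branch_b = BashOperator(task_id="branch_b", bash_command="echo b")
    join = BashOperator(task_id="join", bash_command="echo done")

    # 'join' runs only after every task upstream of it has finished successfully.
    run_this_first >> [branch_a, branch_b]
    [branch_a, branch_b] >> join
```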

Basic concepts:

DAG — a graph of tasks and the dependencies between them.

OPERATOR — an operator describes a single step of the workflow, for example BashOperator or PythonOperator. Operators are the basic building blocks of tasks.

SENSOR — a special kind of operator that polls at a given frequency, with a timeout, until an external condition is met (for example, a file arriving).

EXECUTOR — the component that actually runs task instances, for example the SequentialExecutor, LocalExecutor or CeleryExecutor.

TASK — a task is the main entity of a DAG: an operator instantiated as a node in the graph. A task instance represents a single run of a task at a particular point in time.

HOOK — an interface to an external system, such as a JDBC or HTTP hook.
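A small sketch to make these concepts concrete (Airflow 2.x import paths; the file path and the "my_http_conn" connection ID are placeholders, and the HTTP hook requires the apache-airflow-providers-http package):

```python
# Illustrative sketch of an operator, a sensor, and a hook (Airflow 2.x imports).
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.sensors.filesystem import FileSensor
from airflow.providers.http.hooks.http import HttpHook


def call_external_api():
    # A hook wraps the connection details of an external system;
    # "my_http_conn" is a placeholder connection ID configured in the Airflow UI.
    hook = HttpHook(method="GET", http_conn_id="my_http_conn")
    response = hook.run(endpoint="status")
    print(response.text)


with DAG(dag_id="concepts_demo", start_date=datetime(2020, 4, 1),
         schedule_interval=None) as dag:
    # Sensor: polls every 60 seconds until the file exists, giving up after an hour.
    wait_for_file = FileSensor(task_id="wait_for_file", filepath="/tmp/input.csv",
                               poke_interval=60, timeout=3600)

    # Operator: a single unit of work, here a Python callable.
    process = PythonOperator(task_id="process", python_callable=call_external_api)

    wait_for_file >> process
```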

Airflow is simple enough that all you need to create a DAG is Python programming knowledge. Airflow also supports automatic re-execution of failed tasks, as specified in the DAG's arguments.
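For instance, retry behaviour is usually configured through default_args, which every task in the DAG inherits (the values below are illustrative):

```python
# Illustrative retry settings: a failed task is re-run up to 2 times, 5 minutes apart.
from datetime import datetime, timedelta

from airflow import DAG

default_args = {
    "owner": "airflow",
    "retries": 2,
    "retry_delay": timedelta(minutes=5),
}

dag = DAG(dag_id="retry_demo", default_args=default_args,
          start_date=datetime(2020, 4, 1), schedule_interval="@daily")
```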

Simple Use-case

Let’s take a simple use-case of Airflow:

Problem Statement: We need to fetch data from a NEWS API source, apply certain NLP transformations, and store the results in a database for further analysis.

Solution:

  1. Install and initialize Airflow web-server and scheduler
  2. Write three Python functions: one for connecting to and fetching data from the NEWS API source, a second for transforming the fetched data, and a third for storing the transformed data in the database
  3. Import Airflow and create a DAG with your own schedule interval and arguments
  4. Create tasks using PythonOperator and the three Python functions
  5. Create task dependencies and execute the file
  6. Go to the web UI and start the DAG

That’s it, your DAG will be up and running.
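Here is a minimal sketch of such a DAG; the API endpoint, the "NLP" transformation, and the storage step are placeholders rather than a working NEWS API client (Airflow 2.x imports assumed):

```python
# Sketch of the NEWS API pipeline (URLs, transformations and storage are placeholders).
from datetime import datetime, timedelta

import requests
from airflow import DAG
from airflow.operators.python import PythonOperator

default_args = {"owner": "airflow", "retries": 1, "retry_delay": timedelta(minutes=5)}


def fetch_news(**context):
    # Placeholder endpoint; a real NEWS API would need an API key.
    articles = requests.get("https://example.com/news-api/articles").json()
    context["ti"].xcom_push(key="articles", value=articles)


def transform_news(**context):
    articles = context["ti"].xcom_pull(task_ids="fetch_news", key="articles")
    # Placeholder for NLP transformations (tokenisation, sentiment, etc.).
    transformed = [{"title": a.get("title", "").lower()} for a in articles]
    context["ti"].xcom_push(key="transformed", value=transformed)


def store_news(**context):
    transformed = context["ti"].xcom_pull(task_ids="transform_news", key="transformed")
    # Placeholder for a database insert (e.g. via a database hook).
    print(f"storing {len(transformed)} records")


with DAG(dag_id="news_nlp_pipeline", default_args=default_args,
         start_date=datetime(2020, 4, 1), schedule_interval="@hourly") as dag:
    fetch = PythonOperator(task_id="fetch_news", python_callable=fetch_news)
    transform = PythonOperator(task_id="transform_news", python_callable=transform_news)
    store = PythonOperator(task_id="store_news", python_callable=store_news)

    fetch >> transform >> store
```

In a real pipeline you would typically pass references (file paths or table names) between tasks rather than pushing the full payload through XCom.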

Why Does Apache Airflow Matter?

The main reasons Apache Airflow matters are:

  • The most important advantage is that scheduling of analytics workflows and data warehouse management are brought under a single roof, giving a comprehensive view of each workflow's status in one place.
  • Execution log entries are concentrated in one location.
  • Workflows are configured as code, which makes it possible to automate their development.
  • Within the DAGs, it provides a clear picture of the dependencies.
  • The metadata Airflow keeps about every run makes it straightforward to regenerate specific loads.

Apache Airflow also integrates with PySpark, which allows it to be used in big data pipelines as well. Refer to the following figure:

Source: https://medium.com/@natekupp/getting-started-the-3-stages-of-data-infrastructure-556dac82e825
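For example, a PySpark job can be submitted from a DAG with the SparkSubmitOperator; this sketch assumes the apache-airflow-providers-apache-spark package is installed and a "spark_default" connection is configured, and the script path is a placeholder:

```python
# Sketch: submitting a PySpark job from Airflow (requires the Spark provider
# package and a configured Spark connection; the script path is a placeholder).
from datetime import datetime

from airflow import DAG
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

with DAG(dag_id="spark_pipeline", start_date=datetime(2020, 4, 1),
         schedule_interval="@daily") as dag:
    run_spark_job = SparkSubmitOperator(
        task_id="run_spark_job",
        application="/opt/jobs/etl_job.py",   # placeholder PySpark script
        conn_id="spark_default",
    )
```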

Conclusion

In this post, I discussed what Airflow is, its advantages, and how Apache Airflow helps in building quick, dynamic, yet efficient ETL workflows programmatically in Python. In my next post on Airflow, I will take a real-world example of an ETL workflow and show you how to set up Airflow and develop the workflow in it.
