An Introduction to Apache Airflow

Hemant Vikani
Version 1
Jan 30, 2023

Apache Airflow is an open-source workflow orchestration tool used to programmatically author, schedule, and monitor workflows, and it is commonly used to build ETL pipelines.

Architecture Overview

· A scheduler: triggers scheduled workflows and submits tasks to the executor.

· An executor: handles the running of tasks. In a development or local environment everything runs inside the scheduler; in a scaled production setup, task execution is pushed out to workers.

· A webserver: provides a UI to inspect, trigger and debug DAGs and tasks.

· A DAG directory: contains the DAG files, which are read by the scheduler and executor.

· A metadata database: stores state and is used by the scheduler, executor and webserver.

Airflow architecture (diagram)
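These components come together in Airflow's configuration file (AIRFLOW_HOME/airflow.cfg). The fragment below is only an illustrative sketch, with placeholder paths and connection string, showing where the executor, the DAG directory and the metadata database are configured:

[core]
# Folder the scheduler/executor scans for DAG files
dags_folder = /home/user/airflow/dags
# Which executor runs tasks: SequentialExecutor, LocalExecutor, CeleryExecutor, ...
executor = LocalExecutor

[database]
# Connection string for the metadata database (placeholder value)
sql_alchemy_conn = postgresql+psycopg2://airflow:airflow@localhost:5432/airflow

[webserver]
# Port the Airflow webserver listens on
web_server_port = 8080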

How to install:

1. For Linux or Mac:

· Create a virtual environment, and then install Airflow using the command below in the terminal:

Command: pip install "apache-airflow[celery]==2.5.0" --constraint https://raw.githubusercontent.com/apache/airflow/constraints-2.5.0/constraints-3.7.txt

· Update the Python version in the constraints URL above (3.7 here) to match the Python version on your machine.

· Set the AIRFLOW_HOME environment variable to your current working directory:

Command: export AIRFLOW_HOME=$(pwd)

· Initialize the database

Command: airflow db init

· Start Airflow Webserver

Command: airflow webserver -p 8080

· Go to the browser: http://0.0.0.0:8080

· Create user

Command: airflow users create --username admin --firstname firstname --lastname lastname --role Admin --email admin@domain.com

· You will be prompted to set a password for this user; then go back to the browser and log in.

· Start the Airflow Scheduler

Command: airflow scheduler

· You will see all the example DAGs in the UI.

2. For Windows using Docker:

· Create the directories locally.

Command: mkdir -p ./dags ./logs ./plugins

· Download the docker-compose file.

Command: curl -LfO 'https://airflow.apache.org/docs/apache-airflow/2.5.0/docker-compose.yaml'

· Build a custom image if additional Python requirements need to be installed (this requires a Dockerfile; see the sketch below).

Command: docker build -f Dockerfile --tag airflow-extending:latest .
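A minimal Dockerfile for this step could look like the sketch below (the requirements.txt file is an assumed example, not something defined in this article):

# Start from the official Airflow image and add extra Python packages
FROM apache/airflow:2.5.0
COPY requirements.txt /requirements.txt
RUN pip install --no-cache-dir -r /requirements.txt

If you build such an image, the docker-compose.yaml from the previous step typically picks up a custom image name from the AIRFLOW_IMAGE_NAME environment variable, e.g. AIRFLOW_IMAGE_NAME=airflow-extending:latest placed in a .env file next to the compose file.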

· Initialize the Airflow metadata database and create the first user account.

Command: docker compose up airflow-init

· Start all Airflow services.

Command: docker-compose up -d

· Validate the UI

Go to http://localhost:8080/ to see the scheduled DAGs, or run them manually. The default username and password are both airflow.

· To stop the containers and remove the volumes and images:

Command: docker-compose down --volumes --rmi all

Command: docker system prune -a

Note: Airflow 2.5.0 requires Python 3.7 or later.

What exactly are DAGs?

DAGs (Directed Acyclic Graphs) are workflows written in Python, in which we define the individual tasks, their schedule, and the order in which the tasks execute.

Sample diagram: three tasks in a sample_dag

How DAGs work:

Tasks in a DAG declare dependencies on each other. Example: there are three tasks, T1 (extract data), T2 (process data) and T3 (insights on data); T1 should run first, then T2, and then T3. We can declare this ordering with the syntax below:

T1 >> T2 >> T3 (equivalently: T1.set_downstream(T2) followed by T2.set_downstream(T3))

The same chain written the other way round:

T3 << T2 << T1 (equivalently: T3.set_upstream(T2) followed by T2.set_upstream(T1))
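As a small, self-contained sketch (the dag_id, start date and task ids below are illustrative, not from this article), the same ordering declared inside a DAG definition looks like this:

import pendulum
from airflow import DAG
from airflow.operators.empty import EmptyOperator

with DAG(
    dag_id="dependency_demo",
    start_date=pendulum.datetime(2023, 1, 1),
    schedule=None,
) as dag:
    t1 = EmptyOperator(task_id="extract_data")
    t2 = EmptyOperator(task_id="process_data")
    t3 = EmptyOperator(task_id="insights_on_data")

    # T1 runs first, then T2, then T3
    t1 >> t2 >> t3
    # equivalent to:
    # t1.set_downstream(t2)
    # t2.set_downstream(t3)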

How to create DAGs:

Airflow provides different operators for running different kinds of tasks, such as the BashOperator for bash scripts and the PythonOperator for Python functions. We will go through a sample DAG below:

Sample DAG
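Below is a minimal sketch of such a DAG (the dag_id, schedule and task logic are illustrative assumptions). Saved under the DAG directory, for example as dags/sample_dag.py, it will be picked up by the scheduler:

import pendulum
from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator


def process_data():
    # Placeholder for real processing logic
    print("processing data...")


with DAG(
    dag_id="sample_dag",
    start_date=pendulum.datetime(2023, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    extract = BashOperator(
        task_id="extract_data",
        bash_command="echo 'extracting data'",
    )
    process = PythonOperator(
        task_id="process_data",
        python_callable=process_data,
    )
    insights = BashOperator(
        task_id="insights_on_data",
        bash_command="echo 'generating insights'",
    )

    extract >> process >> insights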

Now go to the browser at http://0.0.0.0:8080 and you will see the DAG scheduled.

How can we deploy Airflow on Azure?

Designing an Airflow deployment on Azure requires mapping each Airflow component to a suitable Azure service.

· For the Airflow webserver and Airflow scheduler, we can use Azure App Service or Azure Container Instances.

· For the Airflow metastore, we can use Azure SQL Database.

· For hosting DAGs, we can use Azure Files, Azure Blob Storage or Azure Data Lake Storage.

· For data and logs, we can use Azure Blob Storage.

In this setup, the Airflow webserver can be accessed remotely by exposing it to the internet. However, other components, such as the Airflow metastore and the Airflow scheduler, should be kept private. We can also add a firewall or an authentication layer (which can be integrated with Azure AD, for example) in front of the webserver using the built-in functionality of App Service, preventing unauthorized users from accessing it.

Key Features of Airflow

  • Dynamic Integration: Airflow pipelines are written in Python, so pipelines can be generated dynamically in code.
  • Extensible: Airflow is an open-source platform, so users can define their own custom operators, executors, and hooks.
  • Elegant User Interface: Airflow provides a web UI for inspecting, triggering and debugging DAGs.
  • Scalable: Airflow is scalable; we can define any number of workflows.
  • Failure Handling: failed tasks can be retried automatically (retry mechanism).
  • Backfill: historical runs can be backfilled via the CLI or the catchup setting (see the sketch below).
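As a hedged illustration of the last two features (the values and dag_id below are placeholders, not from this article), retries can be configured per DAG or per task, and a past date range can be re-run from the CLI:

from datetime import timedelta

# Passed to the DAG as default_args=... so that every task retries on failure
default_args = {
    "retries": 3,
    "retry_delay": timedelta(minutes=5),
}

Command: airflow dags backfill --start-date 2023-01-01 --end-date 2023-01-07 sample_dag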

Disadvantages

· Can’t run locally on Windows without using WSL or Docker.

· Authoring workflows as DAGs can make them challenging to build, test, and keep maintainable over time, especially where parallel execution of tasks and intuitive local testing of workflows are expected.

· Airflow is designed for batch workflows; it cannot be used for streaming.

· There is no versioning of data pipelines: if a task is deleted from the DAG code and redeployed, the metadata related to that task is lost.

About the Author:
Hemant Vikani is a Data Scientist here at Version 1.
