Getting started with Apache Airflow

Oscar García
5 min read · Feb 22, 2022


Apache Airflow

Have you ever wanted a tool that lets you run, manage, and monitor your processes from different platforms in one place? Well… you've come to the right place.

What is Apache Airflow?

Apache Airflow is an open-source project that was created at Airbnb in 2014 by Maxime Beauchemin and open-sourced in June 2015.

Apache Airflow is a workflow management tool that allows us to author, schedule, and monitor workflows, and it is often used as a service orchestrator. It automates jobs programmatically by dividing them into subtasks.

The most common use cases are automating data ingestion, periodic maintenance actions, and administration tasks, but the sky is the limit.

The main principles of Apache Airflow are:

  • Scalable: Airflow uses a message queue to orchestrate an arbitrary number of workers, so it is ready to scale as far as needed.
  • Dynamic: Since pipelines are defined in Python, dynamic pipeline generation is possible (we will cover this topic in another article).
  • Extensible: It allows us to define our own operators and extend its libraries.
  • Elegant: Pipelines are lean and explicit. Parametrization is built into its core.

The main object in Apache Airflow is the DAG (Directed Acyclic Graph), which we will cover in detail in the next section.

Apache Airflow Architecture²

Apache Airflow has four main components:

  • A web server that hosts the API and the user interface and also handles requests.
  • A scheduler in charge of interpreting the DAGs, deciding when their tasks should run, and handing them to an executor that distributes them among the workers and monitors their execution.
  • Workers in charge of executing the tasks sent by the scheduler.
  • A database that acts as the backend, storing metadata, users, and execution history.

Apache Airflow architecture

What is a DAG?

As we previously mentioned, a DAG is an acronym for Directed Acyclic Graph, but what does that really mean?

A DAG is a collection of tasks or jobs, represented as the nodes of a graph and defined in Python code, connected by relationships and dependencies that must fulfill the following:

  • Directed: relationships are one-way, from start to end.
  • Acyclic: they cannot loop; that is, execution cannot return to a node that has already been executed.

Each DAG task, represented as a node, is described with an operator. There are hundreds of predefined operators, but, as mentioned before, it is also possible to extend them and create new operators if necessary. Airflow also provides communication interfaces called Hooks to connect to other platforms and external databases such as Oracle, MySQL, SQL Server, Snowflake, BigQuery, HDFS, Apache Hive, etc.
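
For example, a hook can be used inside a Python function or task to query one of these systems. The following is a minimal sketch, assuming the apache-airflow-providers-mysql package is installed and that an Airflow connection named my_mysql_conn (a hypothetical id, as is the orders table) has been configured:

from airflow.providers.mysql.hooks.mysql import MySqlHook


def count_orders():
    # The connection id and the table name are illustrative placeholders
    hook = MySqlHook(mysql_conn_id="my_mysql_conn")
    rows = hook.get_records("SELECT COUNT(*) FROM orders")
    print(f"Rows in source table: {rows[0][0]}")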

A simple DAG looks like this
Example of a more complex DAG representation

Apache Airflow DAGs: main code structure

A simple DAG is made up of three main sections (see the sketch after this list):

  • Libraries/operators import: all the needed libraries and operators are explicitly imported in this section.
  • DAG and operators definition: the DAG and its operators (tasks) are defined in this section. Take into consideration that they can be defined in any order.
  • Workflow construction: this section defines the order in which the operators (tasks) will be executed.

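A minimal sketch of such a DAG, assuming Airflow 2.x (the dag_id, start_date and schedule_interval values are illustrative):

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.dummy import DummyOperator

# DAG definition (dag_id, start_date and schedule_interval are illustrative)
with DAG(
    dag_id="simple_example_dag",
    start_date=datetime(2022, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:

    # Operators (tasks) definition; they can be declared in any order
    start_dag = DummyOperator(task_id="start_dag")

    task2 = BashOperator(
        task_id="task2",
        bash_command="echo 1",
    )

    # Workflow construction: start_dag runs first, then task2
    start_dag >> task2
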
This code results in a DAG that starts with a “start_dag” task corresponding to a DummyOperator, followed by “task2”, a BashOperator that executes an “echo 1” command.

Personal experience

As you can see, Apache Airflow is a very powerful tool with hundreds of operators (and more in development) to execute tasks on all the cutting-edge platforms. It has operators for AWS S3, BigQuery, Snowflake, SFTP, FTP, Oracle, MySQL, SQL Server, Google Cloud Data Fusion, Tableau, Bash (OS), File, Email, Python (functions) and more.

In my experience, I have used Apache Airflow to:

  • Execute data pipelines in Google Cloud Data Fusion to extract data from on-prem sources such as MySQL, SQL Server, flat files, etc.
  • Load data warehouse tables in BigQuery with custom Data Vault 2.0 patterns, creating dynamic DAGs.
  • Load data warehouse tables in Snowflake with custom Data Vault 2.0 patterns, creating dynamic DAGs.
  • Upload query results from Oracle to AWS S3 buckets in CSV and Parquet format.
  • Upload files and create “folder” structures in AWS S3 buckets from CSV files located on an SFTP server.
  • Retrieve marketing data from Facebook Ads, Google Ads and LinkedIn to load into our data lake.
  • Create data delivery tables in BigQuery from custom SQL queries.
  • Refresh Tableau workbooks/extracts.
  • Schedule all my workflows at specific times.
  • Execute user-defined Python functions.
  • And more.

With everything listed above, I'm sure that Airflow is an excellent tool that allows data engineers like me to have better control over processes throughout the data processing flow, and to reduce the additional costs that would come from acquiring specialized tools.

I do not regret for a moment that I started working with this great tool a year ago, and I will certainly be happy to share my work with you in future articles. I know that if you have different platforms that you need to consume or access for your data processing, you won't regret it either once you try it.

References

[1] Apache Airflow. Apache Airflow. https://airflow.apache.org/

[2] Apache Airflow. Architecture Overview. https://airflow.apache.org/docs/apache-airflow/stable/concepts/overview.html

[3] Aprender BIG DATA (February 16, 2022). ¿Qué es Apache Airflow? Introducción. https://aprenderbigdata.com/apache-airflow/

Until next code!

Os😉


Oscar García

Senior Data Engineer with experience in Business Intelligence/Data Vault/Data Warehousing/GCP/AWS/Airflow/Python