A crash course on Apache Airflow Concepts

Devashish Patil · Published in CodeByte · Mar 5, 2022

Apache Airflow is an open-source workflow orchestration and scheduling platform. Airflow is extensible, pluggable and highly scalable. It is written purely in Python and can therefore be as versatile as the language itself.

Bear in mind, though, that it is not designed to run compute-heavy workloads; rather, it is meant for running lightweight tasks on a schedule while the actual processing happens in external systems.

Let’s dive into the concepts and components that you need to understand if you wish to use Airflow in your projects.

DAG

In Airflow, workflows are defined as DAGs (Directed Acyclic Graphs). A DAG defines the logical relationships among the group of tasks it is made up of.
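
To make this concrete, here is a minimal sketch of what a DAG file might look like. The DAG id, dates and task names are placeholders, and the dependency arrow (>>) is what turns individual tasks into a graph.

```python
# A minimal DAG sketch (ids and dates are placeholders).
from datetime import datetime

from airflow import DAG
from airflow.operators.dummy import DummyOperator  # renamed EmptyOperator in newer versions

with DAG(
    dag_id="example_dag",
    start_date=datetime(2022, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    start = DummyOperator(task_id="start")
    end = DummyOperator(task_id="end")

    start >> end  # ">>" defines the directed, acyclic dependency between tasks
```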

Tasks

A task is the smallest unit of work in Airflow and represents a single action. Tasks are defined at the DAG level.

Operators

Operators are common Airflow utilities that are used to perform some action, and tasks are defined in the form of an operator. For instance, task_1 might use a PythonOperator, which executes a Python function, while task_2 might be a BashOperator, which runs a Bash command directly from Airflow.
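
Here is a rough sketch of that task_1/task_2 example (the DAG id, function and command are made up for illustration):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator


def _say_hello():
    # Placeholder callable; the real work usually happens in an external system.
    print("hello from Python")


with DAG(
    dag_id="operator_examples",
    start_date=datetime(2022, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    task_1 = PythonOperator(task_id="task_1", python_callable=_say_hello)
    task_2 = BashOperator(task_id="task_2", bash_command="echo 'hello from Bash'")

    task_1 >> task_2
```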

Airflow has a wide variety of existing provider packages, which consist of operators for the major cloud providers and technologies that Airflow can leverage in its workflows.

You can even define a custom operator yourself for your specific use case and utilise that in your tasks.
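
A bare-bones custom operator is essentially a class with an execute method, along these lines (the class name and logic here are hypothetical):

```python
from airflow.models.baseoperator import BaseOperator


class MyCustomOperator(BaseOperator):  # hypothetical custom operator
    def __init__(self, target: str, **kwargs):
        super().__init__(**kwargs)
        self.target = target

    def execute(self, context):
        # Whatever action this task should perform goes here.
        self.log.info("Doing something with %s", self.target)
```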

Sensors

Sensors are just like operators, except that they wait for something to happen, either external or internal, for example waiting for a row to be created in a database.

They run in two modes: poke and reschedule.

In poke mode, the sensor keeps running, checking for the target event and sleeping for a predefined interval (poke_interval) between checks.

Even though the sensor sleeps between checks in poke mode, it still occupies compute resources on the worker node the whole time. This is where reschedule mode comes into the picture: the task checks for the target event, then stops running and reschedules itself to check again after some time, freeing up the resources between two consecutive checks.
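
A custom sensor is a small class as well: you implement poke(), which returns True once the event has happened, and pick the mode and poke_interval when you instantiate it. The class and the check below are placeholders:

```python
from airflow.sensors.base import BaseSensorOperator


class RowCreatedSensor(BaseSensorOperator):  # hypothetical sensor
    def poke(self, context) -> bool:
        # Replace with a real check, e.g. querying the database for the new row.
        self.log.info("Row not there yet, will check again later")
        return False  # return True once the target event has happened


# Usage inside a DAG (illustrative values):
# wait_for_row = RowCreatedSensor(
#     task_id="wait_for_row",
#     poke_interval=60,      # seconds between checks
#     mode="reschedule",     # free the worker slot between checks
#     timeout=60 * 60,       # give up after an hour
# )
```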

Schedule

The schedule is essentially a cron expression that defines how often your DAG will run. It is defined for each DAG.

This is helpful if you have jobs that need to run regularly, for example daily or weekly. You can set the schedule to None if you wish to run the DAG only manually.
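
For example (DAG ids are placeholders; newer Airflow versions call the argument schedule instead of schedule_interval):

```python
from datetime import datetime

from airflow import DAG

# Runs every day at 06:00 UTC.
daily_report = DAG(
    dag_id="daily_report",
    start_date=datetime(2022, 1, 1),
    schedule_interval="0 6 * * *",  # cron expression; presets like "@daily" also work
    catchup=False,
)

# Manual-only DAG: never scheduled, triggered from the UI or CLI.
manual_job = DAG(
    dag_id="manual_job",
    start_date=datetime(2022, 1, 1),
    schedule_interval=None,
    catchup=False,
)
```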

Views

The Airflow user interface provides multiple ways for you to visualise your workflows/DAGs.

  • Tree view: visualises the historical runs of your DAGs along with their status.
  • Graph view: my personal favourite, it gives you a nice representation of your tasks and their relationships, and provides good insight into the live status of a DAG run.
  • Calendar view: shows the daily status of the DAG on a calendar, much like the commit graph on GitHub, so you can easily visualise your DAG's runs for the whole year.
  • Gantt view: shows how many tasks are running at any single point in time, which can help you tweak the concurrency of your DAGs.

In Tree, Graph and Gantt views, you can click on any task to perform multiple actions or see the logs.

Connection

Connections are defined for Airflow to authenticate to external systems and APIs. These connections are in turn used by your tasks to perform actions on the external systems.

For example, say you have a task that extracts some data from a database. To perform this action, it needs to connect to the database, which requires information such as the database endpoint, username and password. This is sensitive information and can be stored securely in the form of an Airflow connection.

This is just one type of connection; you can create many types of connections in Airflow. When creating a connection from the UI, you fill in these details in a form.
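
Once a connection exists, your code can look it up by its connection ID. A quick sketch, assuming a connection named my_database has already been created:

```python
from airflow.hooks.base import BaseHook

conn = BaseHook.get_connection("my_database")  # hypothetical connection id
print(conn.host, conn.port, conn.schema)       # endpoint details
print(conn.login)                              # username; conn.password is also available
```

In practice you rarely read a connection directly like this; you usually just pass the connection ID to an operator or hook and let it do the work.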

Hooks

Hooks contain basic functionality for connecting to an external service, and they leverage Airflow connections so that you don't have to write low-level client code in your operators and DAGs.

For example, an S3Hook contains various utilities such as listing S3 buckets and reading and writing files in a bucket. By default, it looks for a default connection ID (aws_default in current versions of the Amazon provider).
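
A rough sketch of what that looks like, assuming the Amazon provider package is installed and an AWS connection is configured (the bucket name and keys are placeholders):

```python
from airflow.providers.amazon.aws.hooks.s3 import S3Hook

hook = S3Hook(aws_conn_id="aws_default")

# List objects under a prefix.
keys = hook.list_keys(bucket_name="my-bucket", prefix="raw/")

# Write and read a small object.
hook.load_string("hello", key="raw/hello.txt", bucket_name="my-bucket", replace=True)
content = hook.read_key(key="raw/hello.txt", bucket_name="my-bucket")
```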

Providers

Providers are additional packages that you can install alongside Airflow. They contain the operators, sensors and hooks that enable you to work with external services such as AWS, GCP, Azure, Snowflake and Postgres.
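
Each provider ships as its own pip package, and once installed its modules are importable like any other part of Airflow. For instance, with the Amazon and Postgres providers:

```python
# pip install apache-airflow-providers-amazon
# pip install apache-airflow-providers-postgres
from airflow.providers.amazon.aws.hooks.s3 import S3Hook
from airflow.providers.postgres.operators.postgres import PostgresOperator
```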

Airflow providers have good community support and can be used for most basic and common operations.

Variable

Variables are a storage option that lets you store information in the form of key-value pairs. You can think of them as a small-scale document store of sorts; the underlying data still lives in the Airflow metadata database.

Best Practice:

  • Do not keep huge amounts of data in the variables.
  • Minimize the variable getting and setting operations.
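
Reading and writing Variables looks like this (the keys and values are made up):

```python
from airflow.models import Variable

# Read a value, with a fallback if the key does not exist yet.
api_endpoint = Variable.get("api_endpoint", default_var="https://example.com/api")

# JSON values can be deserialized directly into Python objects.
job_config = Variable.get("job_config", deserialize_json=True, default_var={})

# Write a value back to the metadata database.
Variable.set("last_processed_date", "2022-03-05")
```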

Plugins

Plugins are custom Python utilities that you define in a directory, just like your DAG files, and they can then be used in any of your DAGs. Generally, you would put commonly used code here, which keeps your DAGs clean.
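
One common pattern is a small helpers module in that directory, which your DAG files can then import, since Airflow adds the plugins folder to the Python path. A hypothetical example:

```python
# plugins/etl_helpers.py (file name and function are made up)
def build_s3_path(dataset: str, ds: str) -> str:
    """Shared utility reused across several DAGs."""
    return f"s3://my-bucket/{dataset}/{ds}/"


# In a DAG file:
# from etl_helpers import build_s3_path
```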

With this, I'll finish the article here. We have covered most of the Airflow concepts at a high level, which is enough for you to get started. If you want more detail on any of the concepts, or want me to write about one in depth, be sure to let me know.
