Apache Airflow 101 📓

A beginner’s guide to the basic concepts of Apache Airflow

Ovaiz Ali
7 min read · Oct 12, 2022

For a moment, let’s flash back to the year 2014. Your supervisor has asked you to run an ETL job every day at midnight so the data can feed the daily reports and visualizations.

You set up what seemed like a dependable CRON job on a forgotten instance, and for a few days your boss was pleased with your work. But one morning, right after the reports were supposed to be presented, your supervisor taps you on the shoulder and irately informs you that they are missing yesterday’s data.

You look through the CRON job on your instance, but nothing is there, leaving you perplexed.🤢

You learn later that your entire pipeline failed because one of the APIs you used to extract the data had a brief outage overnight.

These are just a few of the easily overlooked difficulties in developing an orchestration platform: How do I know it failed, and why? What happens when my tasks have complicated dependencies? What if I want to retry a failed task after a few minutes? Can I run several of these operations at the same time? Does it consistently fail on the same day or at the same time? When will it finish running?

It’s difficult to expect those capabilities from a straightforward CRON job!

However, Airflow goes above and beyond.😎

Here’s what we’ll cover:

  • What’s Apache Airflow?
  • Why specifically Apache Airflow?
  • Downsides of Apache Airflow
  • Use cases
  • Fundamentals of Apache Airflow
  • How does Apache Airflow work?

Let’s get started! 🐱‍🏍

What’s Apache Airflow?

Apache Airflow is an open-source workflow management platform for data engineering pipelines that enables scheduling, reporting, and error handling of workflows.

But, why specifically Apache Airflow?

★ The order of your tasks is configured in a simple Python file, called a DAG, letting you define actions on both failure and success (see the sketch after this list).

★ Retries are also part of this DAG file, allowing you to decide what to do if your process fails!

★ Easily report those failures and successes to your platform of choice! It’s easy to send reports through email or Slack messages.

★ It can manage extremely complicated task dependencies.

★ Airflow is scalable — it makes use of the Celery executor, which enables the use of additional worker machines. It is also compatible with the Kubernetes executor, which creates a new pod for each task instance.

★ And best of all: you have the power of Python in your hands, to dynamically create and modify any of these features.
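
To make the DAG, retry, and reporting points above concrete, here is a minimal sketch of what such a Python file could look like, assuming Airflow 2.x; the dag_id, the schedule, and the notify_failure callback are invented for this example:

from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator


def notify_failure(context):
    # Hypothetical callback: in practice you might send a Slack message or an email here.
    print(f"Task {context['task_instance'].task_id} failed")


default_args = {
    "retries": 3,                            # retry a failed task up to 3 times
    "retry_delay": timedelta(minutes=5),     # wait 5 minutes between attempts
    "on_failure_callback": notify_failure,   # called whenever a task fails
}

with DAG(
    dag_id="daily_etl",                      # made-up name for this sketch
    start_date=datetime(2022, 1, 1),
    schedule_interval="0 0 * * *",           # every day at midnight
    default_args=default_args,
    catchup=False,
) as dag:
    extract = BashOperator(task_id="extract", bash_command="echo 'extracting data'")
    load = BashOperator(task_id="load", bash_command="echo 'loading data'")

    extract >> load

Because the whole definition is plain Python, you can generate tasks in loops, branch on configuration, or attach different callbacks to different tasks.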

Any downsides of Apache Airflow?

  • Python Dependent: While many people think that Airflow’s heavy reliance on Python programming is a good thing, others who are new to using Python may face a longer learning curve.
  • Glitches: Airflow is generally reliable; however, like any product, it occasionally has issues.

Use cases

Airflow can be used for almost any batch data pipeline, and the majority of its known use cases involve big data initiatives. Here are a few examples of use cases found in Airflow’s GitHub repository:

  • Using Airflow with Google BigQuery to power a Data Studio dashboard.
  • Using Airflow to help architect and govern a data lake on AWS.
  • Using Airflow to tackle the upgrading of production while minimizing downtime.

Fundamentals of Apache Airflow

Let’s dive into the fundamentals of this robust platform.

Directed Acyclic Graph (DAG)

Workflows are described as directed acyclic graphs (DAGs), which are made up of the tasks to be done and the dependencies that connect them. Each DAG represents a set of tasks you want to run, and Apache Airflow’s user interface displays the relationships between them. Let’s deconstruct the acronym:

  • Directed: If a task depends on another, the dependency has a direction; each task is declared as being upstream or downstream of the tasks it connects to.
  • Acyclic: Tasks cannot depend on themselves, directly or through a chain of other tasks. This eliminates the possibility of an infinite loop.
  • Graph: Tasks are arranged logically, with the relationships between them made explicit. A DAG can, for instance, describe the connection between three tasks, X, Y, and Z: “execute Y only after X has executed, while Z can run independently at any time.” We can also specify other constraints, such as how many times a failed task should be retried and when it should start.

Note: A DAG specifies how the tasks should be carried out, but not what each individual task does.

A typical DAG might look like the one sketched below: it has four tasks — A, B, C, and D — and specifies their dependencies as well as the order in which they will be completed.
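
Since the original diagram isn’t reproduced here, here is a rough code equivalent, assuming Airflow 2.3+ where EmptyOperator is available as a placeholder task (the dag_id is invented):

from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator

with DAG(
    dag_id="four_task_example",   # made-up name for this sketch
    start_date=datetime(2022, 1, 1),
    schedule_interval="@daily",
) as dag:
    a = EmptyOperator(task_id="A")
    b = EmptyOperator(task_id="B")
    c = EmptyOperator(task_id="C")
    d = EmptyOperator(task_id="D")

    # A runs first, B and C can run in parallel once A finishes, D runs last.
    a >> [b, c]
    [b, c] >> d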

DAG run

A DAG run is a single execution of a DAG. Suppose a DAG is scheduled to run every hour: each instantiation of that DAG creates a DAG run, and a single DAG can have several DAG runs executing concurrently.

Tasks

Tasks are instantiations of operators, and they vary in complexity. They appear as the nodes of a DAG and represent units of work: each task is one step in your workflow, and the operator behind it specifies the actual work it performs.

Operators

While DAGs define the workflow, operators describe the work to be done. An operator acts as a class or template for carrying out a specific activity, and all operators descend from BaseOperator. There are operators for many common jobs, including:

  • PythonOperator
  • MySqlOperator
  • EmailOperator
  • BashOperator

These operators are used to specify actions to execute in Python, MySQL, email, or bash respectively.
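
As a brief sketch of what instantiating a couple of these looks like, assuming Airflow 2.x (the task ids, command, and callable are made up; BashOperator and PythonOperator ship with core Airflow, while MySqlOperator and EmailOperator come from extra packages):

from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator


def greet():
    print("Hello from a Python task")


# Inside a with DAG(...) block like the ones shown earlier:
run_script = BashOperator(
    task_id="run_script",
    bash_command="echo 'running a shell command'",
)

say_hello = PythonOperator(
    task_id="say_hello",
    python_callable=greet,
)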

There are three main types of operators:

  1. Operators that perform an action or ask another system to perform one.
  2. Operators that run until certain conditions are met.
  3. Operators that move data from one system to another.

Sensors

Sensors are special operators that wait for an event to occur before they let the workflow continue. They can run in poke mode (the default) or reschedule mode, and Airflow also offers smart sensors.
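
For example, a FileSensor waits for a file to land before downstream tasks run. A minimal sketch, assuming Airflow 2.x, with a placeholder file path:

from airflow.sensors.filesystem import FileSensor

# Inside a with DAG(...) block, as in the earlier sketches:
wait_for_file = FileSensor(
    task_id="wait_for_file",
    filepath="/data/incoming/report.csv",  # placeholder path
    poke_interval=60,                      # check every 60 seconds
    timeout=60 * 60,                       # give up after an hour
    mode="reschedule",                     # free the worker slot between checks ("poke" is the default)
)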

Hooks

Hooks enable Airflow to communicate with external systems. You can use hooks to connect to external databases and APIs, including MySQL, Hive, GCS, and others, and they serve as building blocks for operators. Hooks themselves hold no confidential information; credentials are kept in Airflow’s encrypted metadata database.
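
As a rough sketch of how a hook might be used inside a task, assuming the MySQL provider package is installed and a connection with the id my_mysql has been defined in Airflow (the table name is also invented):

from airflow.providers.mysql.hooks.mysql import MySqlHook


def count_orders():
    # Credentials are looked up by connection id from Airflow's metadata database,
    # so nothing confidential lives in this code.
    hook = MySqlHook(mysql_conn_id="my_mysql")
    records = hook.get_records("SELECT COUNT(*) FROM orders")
    print(f"orders table has {records[0][0]} rows")

A function like this would typically be wired into a DAG with a PythonOperator.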

Relationships

Airflow excels at defining intricate connections between tasks. Consider the scenario where we want to specify that task t1 runs before task t2. We could define this relationship with any of these four equivalent statements:

1. t2.set_upstream(t1)
2. t1.set_downstream(t2)
3. t1 >> t2
4. t2 << t1
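
Lists of tasks work with the same operators, which makes fan-out and fan-in patterns easy to express; a quick sketch with invented task names:

# t1 runs first, then t2 and t3 in parallel, then t4 once both are done.
t1 >> [t2, t3]
[t2, t3] >> t4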

Worker

The worker is the node or process that runs the actual tasks.

Airflow Pipeline Example
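
The screenshot this caption refers to isn’t reproduced here, so here is a hedged sketch of a small ETL-style pipeline that ties the concepts above together; every name, schedule, and piece of logic is invented for illustration:

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract():
    print("pulling data from an API")        # placeholder logic


def transform():
    print("cleaning and aggregating data")   # placeholder logic


def load():
    print("writing results to a warehouse")  # placeholder logic


with DAG(
    dag_id="example_pipeline",               # made-up name for this sketch
    start_date=datetime(2022, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    t_extract = PythonOperator(task_id="extract", python_callable=extract)
    t_transform = PythonOperator(task_id="transform", python_callable=transform)
    t_load = PythonOperator(task_id="load", python_callable=load)

    t_extract >> t_transform >> t_load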

How does Apache Airflow work?

At its core, Airflow is a queueing system built on a metadata database. The scheduler prioritizes new tasks as they are added to the queue based on the status of the queued tasks stored in the database. Four main components make up this robust and scalable workflow scheduling platform:

1. Web Server

The web server provides the user interface. It also makes it possible to monitor job status and to read logs from remote file storage.

2. Scheduler

The scheduler is in charge of orchestrating the jobs: it decides which tasks to run, when and where to run them, and in what order.

3. Metastore/Database

All of the metadata about Airflow and about our data is stored in a database called the metastore. The other components communicate with one another through it, and it maintains the status of every task.

4. Executor

The executor, a closely related process to the scheduler, chooses the worker process that will carry out the task at hand.

Examples of executors:

  • SequentialExecutor: This executor runs only one task at a time, so no parallel processing is possible. It is useful for testing or debugging.
  • LocalExecutor: This executor runs tasks in parallel on a single machine. It’s excellent for using Airflow on a single node or a local workstation.
  • CeleryExecutor: This executor is the preferred way to run a distributed Airflow cluster.
  • KubernetesExecutor: This executor calls the Kubernetes API to create a temporary pod for each task instance.

So, how does Airflow work?

Airflow periodically reviews each DAG in the background. This interval is determined by the processor poll interval configuration setting, which is one second by default. Once a DAG file has been inspected, DAG runs are created according to its scheduling settings. Task instances are created for the tasks that need to run, and their status in the metadata database is set to SCHEDULED.

The scheduler retrieves tasks in the SCHEDULED state from the database and distributes them to the executor; the tasks’ status then becomes QUEUED. Workers pick those tasks up from the queue and execute them, and the status switches to RUNNING.

The scheduler updates the task’s final state in the metadata database after the worker marks a task as completed or failed.

Next steps for working with Apache Airflow

You are now equipped with Apache Airflow’s fundamentals. Building something with this tool is an excellent way to learn how to use it: after installing Airflow, you can create your own project or contribute to an open-source project.

There’s still so much more to learn about Airflow.

Happy learning! 😊

Let me know if anything is unclear and whether you have any questions by leaving a comment. I’ll be checking the comment section of this post regularly.

If you enjoyed this piece, please make sure to follow my profile, so you don’t miss any of my upcoming articles.

If you want to take it to the next step, make sure to connect with me on LinkedIn, GitHub, and Kaggle.

You can also Subscribe 👍 to my YouTube channel and explore the Data World on a wider scope.

Best Wishes,
Ovaiz Ali❤️
