Airflow and Data Engineers

Bazla Kausar
6 min read · Jun 20, 2023


Airflow was initiated by Airbnb as an open-source project in 2014. Airflow was accepted into the Apache Incubator programme in 2016, and in 2019 it was declared a Top-Level Apache Project. It is now widely regarded as one of the leading workflow management tools available.
Apache Airflow lets you programmatically author, schedule, and monitor your data pipelines using Python.

Users of Airflow can create workflows as task-based Directed Acyclic Graphs (DAGs). With Airflow’s sophisticated user interface, it’s simple to visualise pipelines that are currently in use, keep track of their progress, and address problems as they arise. When a task succeeds or fails, Airflow can send an alert via email or Slack, and it connects to many different data sources. Given its distributed architecture, scalability, and flexibility, Airflow is a good choice for orchestrating complicated business logic.

Who can use Airflow?

Data engineers can leverage Airflow’s support for hundreds of pre-built Airflow operators and thousands of Python modules and frameworks to build the data pipelines they need to collect, transfer, and modify data. Data engineers can create dependable data pipelines by managing job dependencies and recovering from errors with Airflow.

Why Airflow?

  1. Versatility, Adaptability, and Scalability: Airflow can expand from small deployments with a handful of users and data pipelines to enormous deployments with thousands of pipelines and many concurrent users. Being a Python-based tool, it supports all of the libraries, frameworks, and modules that are available for Python.
  2. Ease of use: Airflow makes data pipeline creation easier by allowing users to define their pipelines as Python code. The web-based user interface (UI) of Airflow streamlines task management, scheduling, and monitoring while offering a quick view of the performance and progress of data pipelines.
  3. Open source: The platform is steered, improved, and supported by an extensive group of active maintainers, committers, and contributors who work on Airflow.
  4. Seamless Integrations: It is a tried-and-true option for businesses that need robust, cloud-native workflow management. Asynchronous task support, data-aware scheduling, and tasks that adapt to input conditions are just a few of the useful new features that each new release of Airflow brings. These features give businesses even more freedom to design, implement, and manage their workflows.

What is Airflow used for?

Data pipelines or workflows are scheduled and managed using Apache Airflow. The management, sequencing, coordination, and scheduling of intricate data pipelines from several sources is referred to as orchestration of data pipelines. These data pipelines produce data sets that are prepared for consumption by big data applications, data science, and machine learning models.

These workflows are represented as Directed Acyclic Graphs (DAGs) in Airflow. To further understand what a workflow/DAG is, let’s make a burger.

Workflows typically have an objective, like producing a visualisation (or, in our analogy, a finished burger). The DAG shows how each step depends on a number of prerequisite actions that must be finished first: to assemble the burger you need raw vegetables, bread, and sauces, and the vegetables must be chopped and sliced; likewise, you need the right ingredients before you can prepare the sauces.
Similarly, to create your visualisation, you first need to move your data from relational databases into a data warehouse.
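
As a toy illustration, the burger workflow could be sketched as a DAG like the one below. This is a hypothetical example assuming Airflow 2.x; the task names are invented, and each step is a placeholder EmptyOperator rather than real work.

```python
# A toy sketch of the burger workflow as an Airflow DAG (assumes Airflow 2.x).
# Task names are invented for illustration; each step is a placeholder EmptyOperator.
from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator

with DAG(
    dag_id="make_a_burger",
    start_date=datetime(2023, 6, 1),
    schedule_interval=None,  # no schedule: only triggered manually
) as dag:
    get_ingredients = EmptyOperator(task_id="get_ingredients")
    chop_vegetables = EmptyOperator(task_id="chop_vegetables")
    prepare_sauces = EmptyOperator(task_id="prepare_sauces")
    assemble_burger = EmptyOperator(task_id="assemble_burger")

    # A step can only start once all of the steps it depends on have finished.
    get_ingredients >> [chop_vegetables, prepare_sauces]
    [chop_vegetables, prepare_sauces] >> assemble_burger
```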

Efficient, economical, and well-organised data pipelines also make it possible to build more refined and accurate ML models, because those models can be trained on complete datasets rather than just tiny samples. Since Airflow is natively built to operate with big data platforms such as Hive, Presto, and Spark, it makes an excellent framework for managing jobs that use any of these engines. Businesses are increasingly using Airflow to orchestrate their ETL/ELT tasks.

Unleashing Workflow Automation

Let’s start with a few base concepts of Airflow!

Airflow DAG

Workflows in Airflow are described by DAGs (Directed Acyclic Graphs), which are just Python files. Although it is advised to have one DAG per file, a single DAG file may include multiple DAG definitions.

First, each DAG has a dag_id that must be unique across the whole Airflow deployment. In order to define a DAG, we must also specify the following (a minimal sketch follows the list below):

📍schedule_interval: This specifies how often, and when, the DAG should run. It could be a cron expression (* * * * *) or a timedelta object, like timedelta(days=2).
If it is None, Airflow will not schedule the DAG; however, it may still be triggered manually or externally.

📍start_date: the date (a datetime object) from which the DAG will start running. This also allows runs to be backfilled for earlier dates. The days_ago function is frequently used to specify this value. The DAG can still be triggered manually if the date is in the future.

📍retries: the number of times a failed task should be retried before giving up on it.

📍email_on_retry / email_on_failure: these can be set to True or False depending on whether you want email alerts when a task is retried or when it fails.

📍depends_on_past: when set to True, a task instance will only run if the same task succeeded in the preceding scheduled run, so instances execute one after the other. The task instance for the start_date is always permitted to run.
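
Putting these parameters together, a minimal DAG baseline might look like the sketch below. This is a hypothetical example assuming Airflow 2.x; the dag_id, dates, schedule, and alert address are placeholders rather than values from any real deployment.

```python
# A minimal DAG "baseline" illustrating the parameters above (assumes Airflow 2.x).
# dag_id, dates, schedule, and the alert e-mail address are placeholder values.
from datetime import datetime, timedelta

from airflow import DAG

default_args = {
    "retries": 2,                      # retry a failed task twice before giving up
    "retry_delay": timedelta(minutes=5),
    "email": ["alerts@example.com"],
    "email_on_failure": True,          # alert when a task finally fails
    "email_on_retry": False,           # stay quiet on intermediate retries
    "depends_on_past": False,          # each run is independent of the previous one
}

with DAG(
    dag_id="example_pipeline",
    schedule_interval="0 6 * * *",     # a cron expression: run daily at 06:00
    start_date=datetime(2023, 6, 1),   # the date from which scheduling begins
    catchup=False,                     # do not backfill runs for past dates
    default_args=default_args,
) as dag:
    ...                                # tasks will be added here
```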

We may begin adding tasks to our DAG once we have this baseline:
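
As a hedged illustration of what that might look like (again assuming Airflow 2.x, with BashOperator and echo commands chosen purely as placeholders, and the DAG definition trimmed for brevity), here are three tasks added to the DAG:

```python
# Adding tasks to the DAG baseline (assumes Airflow 2.x). BashOperator and the
# echo commands are placeholders; the DAG definition is trimmed for brevity.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="example_pipeline",
    schedule_interval="@daily",
    start_date=datetime(2023, 6, 1),
    catchup=False,
) as dag:
    task_a = BashOperator(task_id="task_a", bash_command="echo extract")
    task_b = BashOperator(task_id="task_b", bash_command="echo transform")
    task_c = BashOperator(task_id="task_c", bash_command="echo load")

    # task_a runs first; task_b and task_c only run once task_a has finished,
    # i.e. they are downstream of task_a.
    task_a >> [task_b, task_c]
```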

Each task in an Airflow DAG is defined by an operator (we’ll go into more specifics shortly), and each task has a task_id that must be unique among the tasks in the DAG. A set of dependencies for each task establishes how it relates to the other tasks. These consist of:

📌 Tasks that must be completed before a given task are known as its upstream tasks.
📌 Tasks that are completed after a given task are known as its downstream tasks.

In our case, task_a is upstream of both task_b and task_c, which means task_b and task_c are downstream of task_a. The >> operator, which can be applied to individual tasks as well as collections of tasks (such as lists or sets), is a typical way to describe a relationship between tasks.

This is what a graphical representation of this DAG looks like:

graph view of a DAG

Task Instances

Just as a DAG is instantiated into a DAG Run each time it is executed, the tasks that make up a DAG are instantiated into Task Instances.

A Task Instance is a particular execution of a Task for a given DAG Run (and hence for a specific data period). Each Task Instance also carries a state that indicates which stage of its lifecycle it is in.

task lifecycle (the states of a task instance)

Ideally, a task should flow from none, to scheduled, to queued, to running and finally to success.
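
As a rough sketch (assuming Airflow 2.x, where these states are exposed through the TaskInstanceState enum in airflow.utils.state), that happy path can be written out like this:

```python
# The ideal "happy path" of a task instance (assumes Airflow 2.x, where states
# are exposed via the TaskInstanceState enum in airflow.utils.state).
from airflow.utils.state import TaskInstanceState

happy_path = [
    None,                         # no state yet: the task instance has not been scheduled
    TaskInstanceState.SCHEDULED,  # the scheduler has decided the task should run
    TaskInstanceState.QUEUED,     # handed to the executor, waiting for a worker slot
    TaskInstanceState.RUNNING,    # a worker is currently executing the task
    TaskInstanceState.SUCCESS,    # the task finished without errors
]

# Other states (e.g. UP_FOR_RETRY, FAILED, SKIPPED) cover the unhappy paths.
print(" -> ".join("none" if state is None else state.value for state in happy_path))
```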

Next Stop — Operators, Sensors, and Plugins!

In conclusion, this blog has provided a comprehensive overview of Airflow and its significance in modern data engineering workflows. We have explored the fundamental concepts, architecture, and key features of Airflow, highlighting its ability to streamline and automate complex data pipelines.

As we delve deeper into the world of Airflow, it becomes apparent that there are numerous components that contribute to its versatility and functionality. Operators, sensors, and plugins play pivotal roles in enhancing the capabilities of Airflow, allowing for seamless integration with various systems, databases, and services.

In our next blog, we will take a closer look at operators, sensors, and plugins. We will explore their functionalities, use cases, and how they can be leveraged to orchestrate intricate data workflows. Whether you’re an Airflow beginner or an experienced user, this detailed examination will provide valuable insights to help you optimise and customise your Airflow setups.

