Apache Airflow for Workflow Management

Sahil Sharma
5 min read · May 28, 2023

--

Managing complex workflows efficiently is critical in the rapidly evolving field of data engineering. With so many data sources and processing responsibilities, businesses need a powerful platform that can orchestrate, monitor, and automate these workflows seamlessly. Apache Airflow has emerged as a popular open-source platform that meets these needs. In this article, we will look at what Apache Airflow is, how it can help with workflow management, and how its architecture and components work. In addition, we will walk through the process of installing and configuring Airflow and highlight some typical use cases with real code examples.

What is Apache Airflow?

Apache Airflow is an open-source platform for programmatically authoring, scheduling, and monitoring workflows. It offers a flexible and extensible framework for managing and orchestrating complex data pipelines. Data engineers express their workflows as code in Airflow, which makes them easier to collaborate on, version control, and reproduce.

Airflow’s Architecture and Components

At the heart of Apache Airflow is a distributed architecture that provides scalable and fault-tolerant workflow execution. Let's take a closer look at its main components:

Scheduler

The scheduler is in charge of parsing DAGs and scheduling their tasks based on their dependencies and the defined schedule. It ensures that tasks run at the appropriate times and in the correct order.

Metadata Database

Airflow stores workflow configurations, task states, and execution logs in a metadata database (such as MySQL, PostgreSQL, or SQLite). This database serves as the central source of truth for workflow metadata.

Executor

The executor determines how the individual tasks within a workflow are run. Airflow offers several executor options, including the SequentialExecutor (for single-node execution) and the CeleryExecutor (for distributed execution across several workers).

Web Server

The web server provides a user interface for interacting with Airflow. It enables users to monitor workflow status, view logs, and trigger task runs manually.

Worker

Workers carry out the tasks assigned to them by the scheduler. They communicate with the metadata database to retrieve task configuration and report status updates.

Setting up and Configuring Airflow

Installation

  • Ensure that you have Python and pip installed on your system. You can check the Python version by running python --version in your terminal.
  • Install Apache Airflow using pip (a version-pinned example follows this list).
pip install apache-airflow
  • Once the installation is complete, you can verify it by running airflow version in your terminal. It should display the installed Airflow version.
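
In practice, the official Airflow documentation recommends pinning the Airflow version and installing against a constraints file so that dependency versions stay compatible. A minimal sketch, assuming Airflow 2.6.1 on Python 3.8 (adjust both versions to your environment):

AIRFLOW_VERSION=2.6.1
PYTHON_VERSION=3.8
pip install "apache-airflow==${AIRFLOW_VERSION}" \
  --constraint "https://raw.githubusercontent.com/apache/airflow/constraints-${AIRFLOW_VERSION}/constraints-${PYTHON_VERSION}.txt"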

Configuration

  • Initialize the Airflow metadata database by running the following command:
airflow db init

This command initializes the Airflow database and creates the necessary tables. (On older Airflow 1.x installations, the equivalent command was airflow initdb.)
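
If you are running Airflow 2.x with the default authentication backend, you will also need an account to log in to the web UI. A minimal sketch using the users CLI; the username, name, email, and password below are placeholders:

airflow users create \
  --username admin \
  --firstname Admin \
  --lastname User \
  --role Admin \
  --email admin@example.com \
  --password admin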

Navigate to the Airflow configuration file, which is typically located at ~/airflow/airflow.cfg. Open the file in a text editor.

Customize the configuration based on your requirements. Here are some important settings you may need to modify (an illustrative excerpt follows the list):

  • core: Set the dags_folder parameter to the directory where your DAG files will be stored.
  • core: Configure the executor parameter based on your environment. For example, you can use SequentialExecutor for a single-node setup or CeleryExecutor for distributed execution.
  • webserver: Set the web_server_port parameter to the desired port for the Airflow web server.
  • scheduler: Configure the dag_dir_list_interval parameter to specify how often the scheduler should scan for new DAG files.
  • smtp: Configure the email-related parameters if you want to enable email notifications.
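
For reference, here is what a small excerpt of these settings might look like in airflow.cfg. The values shown are illustrative examples, not recommendations:

[core]
dags_folder = /home/user/airflow/dags
executor = SequentialExecutor

[webserver]
web_server_port = 8080

[scheduler]
dag_dir_list_interval = 300

[smtp]
smtp_host = smtp.example.com
smtp_mail_from = airflow@example.com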

Save the changes to the configuration file.

Starting Airflow Services

Start the Airflow web server by running the following command:

airflow webserver --port <port_number>

Replace <port_number> with the port you configured in the Airflow configuration file.

Open a new terminal window or tab and start the scheduler:

airflow scheduler

The scheduler is responsible for triggering task executions based on the defined schedule and dependencies.

By default, the Airflow web server can be accessed at http://localhost:<port_number>, where <port_number> is the port you specified in the configuration file.

Adding DAGs

To define and manage workflows in Airflow, you need to create DAG (Directed Acyclic Graph) files. A DAG file typically contains the tasks and their dependencies.

Here’s a basic example of a DAG file that executes a simple Python function as a task:

# Imports
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def my_python_function():
    # Code to be executed as a task
    pass

with DAG('my_dag', start_date=datetime(2023, 1, 1), schedule_interval='@daily') as dag:
    my_task = PythonOperator(
        task_id='my_task',
        python_callable=my_python_function,
    )

In this example, we define a DAG named 'my_dag' that starts on January 1, 2023, and runs daily. The DAG contains a single task named 'my_task', which executes my_python_function.

Save this DAG file in the dags_folder specified in the Airflow configuration. The scheduler will automatically detect the new DAG file and schedule the task executions accordingly.
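
You can also sanity-check the DAG from the command line before waiting for the scheduler. The commands below assume the DAG file above has been saved into the dags_folder; the date passed to airflow tasks test is just an example:

airflow dags list
airflow tasks test my_dag my_task 2023-01-01

airflow tasks test runs a single task instance without recording its state in the metadata database, which makes it handy for quick local checks.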

Common Use Cases for Airflow in Data Engineering

Apache Airflow finds extensive application in data engineering for managing complex workflows. Here are some common use cases:

Data Pipeline Orchestration

Airflow excels at orchestrating data pipelines by defining dependencies between tasks and ensuring that they run in the correct order. It enables data engineers to create end-to-end pipelines that integrate several data sources, transform data, and load it into target systems.
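
As a rough sketch of what such a pipeline can look like, the example below declares two extraction tasks that both feed a transformation step before loading. The DAG name, task names, and callables are hypothetical, and the function bodies are left as stubs:

# Imports
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract_orders():
    # Placeholder: pull data from one source, e.g. a database
    pass

def extract_clicks():
    # Placeholder: pull data from another source, e.g. an events API
    pass

def transform():
    # Placeholder: join and clean the extracted data
    pass

def load():
    # Placeholder: write the result to the target system
    pass

with DAG('pipeline_orchestration_example',
         start_date=datetime(2023, 1, 1),
         schedule_interval='@daily') as dag:
    extract_orders_task = PythonOperator(task_id='extract_orders', python_callable=extract_orders)
    extract_clicks_task = PythonOperator(task_id='extract_clicks', python_callable=extract_clicks)
    transform_task = PythonOperator(task_id='transform', python_callable=transform)
    load_task = PythonOperator(task_id='load', python_callable=load)

    # Both extractions must finish before the transform runs; the load runs last.
    [extract_orders_task, extract_clicks_task] >> transform_task >> load_task

Passing a list on the left-hand side of >> is how Airflow expresses fan-in dependencies: the downstream task waits for every task in the list.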

Data Processing Automation

Airflow allows you to automate repetitive data processing tasks such as aggregations, transformations, and data validations. By scheduling these tasks and declaring their dependencies, Airflow helps ensure timely and consistent data processing.
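
For instance, a recurring validation job can be scheduled with automatic retries so that transient failures do not require manual intervention. A minimal sketch, where the DAG name, callable, and retry settings are illustrative:

# Imports
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

def validate_data():
    # Placeholder: run data quality checks on the latest batch
    pass

default_args = {
    'retries': 2,                         # retry a failed task twice
    'retry_delay': timedelta(minutes=5),  # wait five minutes between attempts
}

with DAG('daily_data_validation',
         default_args=default_args,
         start_date=datetime(2023, 1, 1),
         schedule_interval='@daily',
         catchup=False) as dag:
    validate_task = PythonOperator(task_id='validate_data', python_callable=validate_data)

Setting catchup=False keeps the scheduler from backfilling runs for past dates, which is usually what you want for routine maintenance-style jobs.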

ETL Workflows

Airflow streamlines ETL workflow management by providing a comprehensive framework for extracting data from numerous sources, performing transformations, and loading it into a data warehouse or analytics platform. It supports parallel processing, error management, and retries, ensuring that ETL operations are robust and reliable.

To demonstrate the power of Airflow, let’s consider a simple example of an ETL workflow that fetches data from an API, performs data transformations, and loads it into a database:

# Imports
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def fetch_data():
    # Code to fetch data from the API
    pass

def transform_data():
    # Code to perform data transformations
    pass

def load_data():
    # Code to load data into a database
    pass

dag = DAG('etl_workflow',
          description='ETL workflow example',
          start_date=datetime(2023, 5, 27),
          schedule_interval='@daily')

fetch_task = PythonOperator(task_id='fetch_data', python_callable=fetch_data, dag=dag)
transform_task = PythonOperator(task_id='transform_data', python_callable=transform_data, dag=dag)
load_task = PythonOperator(task_id='load_data', python_callable=load_data, dag=dag)

# Tasks run in sequence: fetch, then transform, then load.
fetch_task >> transform_task >> load_task

You will need to adapt this skeleton to your specific requirements and data sources.

Wrapping Up

In this post, we saw that Apache Airflow is a robust solution for managing and orchestrating complex data engineering workflows. Its adaptable architecture, broad collection of components, and ability to define workflows as code make it an excellent choice for organisations working with large amounts of data.

Let me know in the comments if you found this post useful, and follow me as I continue my content journey!
