Streamlining Bulk Task Processing with Apache Airflow (Part 1)
A Premier Workflow Orchestration Tool
Apache Airflow has established itself as a leading tool for workflow orchestration in the data engineering and data science space. It stands out for its flexibility, scalability, and a host of unique features that cater to the needs of complex task automation and data pipeline management. In this blog post, we’ll delve into what makes Airflow a standout tech tool and explore the unique features that set it apart from other workflow orchestration tools.
What is Apache Airflow?
Apache Airflow is an open-source platform designed to programmatically author, schedule, and monitor workflows. It was initially developed at Airbnb to manage the company’s complex data pipelines but has since grown into a widely adopted tool across various industries for orchestrating workflows.
Apache Airflow is a powerful workflow orchestration tool, but it’s not the only option available. Several competitors offer similar functionalities, each with its strengths and use cases.
In this blog post, we’ll explore some of Airflow’s main competitors, why Airflow often stands out, and how its task-based execution model can benefit your operations.
Major Competitors to Apache Airflow
- Luigi — Developed by Spotify, Luigi is a Python module for building complex pipelines of batch jobs.
- Prefect — Prefect is a modern workflow orchestration tool that aims to simplify and streamline data engineering tasks.
- Dagster — Dagster is an open-source orchestrator for machine learning, analytics, and ETL.
- Kubeflow — Kubeflow is an open-source platform designed to deploy machine learning workflows on Kubernetes.
- Camunda — Camunda is a workflow and decision automation platform that uses BPMN (Business Process Model and Notation).
In the cutthroat world of telecom, maximizing efficiency is essential. At Deutsche Telekom Digital Labs, we design our systems and processes to handle both individual transactions and large-scale operations reliably. However, we’ve encountered situations where our systems struggled under the pressure of processing a high volume of orders at once. Typical scenarios include:
- Seasonal Promotions: Telekom’s business departments might want to offer a Christmas gift, such as 30 days of Disney+ or free mobile data, to a preselected group of customers.
- Product Lifecycle Management: When retiring a product offering (PO), we need to systematically cancel it across all affected contracts.
- Product Replacement: Replacing a retired PO with a new offering across tens of thousands of contracts.
Manually handling these tasks through systems like MaVi is simply untenable. Imagine the tedium and error-prone nature of canceling or updating orders for 20,000 customers: delays and mistakes are inevitable. This is where Apache Airflow comes to the rescue. As a powerful open-source workflow orchestration tool, it automates and streamlines bulk order processing, helping ensure accuracy, speed, and reliability. By adopting Airflow, we can transform how we manage large-scale order operations, boosting efficiency and elevating customer satisfaction.
In this blog, we’ll explore how Apache Airflow can be utilized to manage and automate bulk order processing at Deutsche Telekom Digital Labs, providing a scalable solution to meet our operational needs.
Getting Started with Apache Airflow: Creating a Simple Workflow for Order Processing
Apache Airflow isn’t just any open-source platform. It empowers you to design, schedule, and monitor workflows with code, ensuring seamless and efficient execution of complex tasks. This blog will be your guide to getting started with Airflow. We’ll walk you through installation, local executor setup, and building a basic order processing workflow. This workflow will cover crucial checks like customer contract details, eligibility, and location availability. We’ll also unveil the edge Airflow has over workflow tools like Camunda.
Installing Apache Airflow
Before we dive into creating workflows, we need to install Apache Airflow. The following steps guide you through the installation process:
Prerequisites
- Python (3.8 or above is recommended for current Airflow releases)
- pip (Python package installer)
Installation Steps
1. Create a Python Virtual Environment:
python3 -m venv airflow_venv
source airflow_venv/bin/activate
2. Install Apache Airflow: Set the AIRFLOW_HOME environment variable so Airflow knows where to keep its configuration and metadata, then install the package. Note that a plain install defaults to SQLite and the SequentialExecutor; switching to the LocalExecutor is covered in the note after these steps.
export AIRFLOW_HOME=~/airflow
pip install apache-airflow
3. Initialize the Airflow Database:
airflow db init
4. Create an Airflow User:
airflow users create \
--username admin \
--firstname Admin \
--lastname User \
--role Admin \
--email admin@example.com
5. Start the Airflow Web Server and Scheduler (run each in a separate terminal session):
airflow webserver --port 8080
airflow scheduler
You can access the Airflow web interface by navigating to http://localhost:8080 in your web browser.
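Note that the commands above leave Airflow on its defaults: a SQLite metadata database and the SequentialExecutor, which runs one task at a time. To use the LocalExecutor and run tasks in parallel, Airflow needs a database backend such as PostgreSQL. The following is a minimal sketch of the relevant settings, assuming a local PostgreSQL instance with placeholder credentials; adjust the connection string to your environment (on older 2.x releases the configuration key is AIRFLOW__CORE__SQL_ALCHEMY_CONN instead):
# Requires the Postgres extra: pip install 'apache-airflow[postgres]'
# Switch from the default SequentialExecutor to the LocalExecutor
export AIRFLOW__CORE__EXECUTOR=LocalExecutor
# Point Airflow at a PostgreSQL metadata database (placeholder credentials)
export AIRFLOW__DATABASE__SQL_ALCHEMY_CONN=postgresql+psycopg2://airflow:airflow@localhost:5432/airflow
# Re-initialize the metadata database against the new backend
airflow db init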
Creating a Simple Workflow
Now that we have Airflow set up, let’s create a simple workflow that processes an order in four steps:
1. Check Customer Contract Details
2. Validate Customer Eligibility
3. Check Location Availability
4. Create and Validate Order
Defining the Workflow
Create a new Python file, order_workflow.py, in the dags directory under AIRFLOW_HOME:
from airflow import DAG
from airflow.operators.python import PythonOperator
from datetime import datetime, timedelta

# Define default arguments applied to every task in the DAG
default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
}

# Initialize the DAG
dag = DAG(
    'order_processing_workflow',
    default_args=default_args,
    description='A simple order processing workflow',
    schedule_interval=timedelta(days=1),
    start_date=datetime(2023, 5, 20),
    catchup=False,
)

# Define the tasks
def check_contract_details(**kwargs):
    # Implement contract details checking logic
    print("Checking customer contract details")

def validate_eligibility(**kwargs):
    # Implement eligibility validation logic
    print("Validating customer eligibility")

def check_location_availability(**kwargs):
    # Implement location availability checking logic
    print("Checking location availability")

def create_and_validate_order(**kwargs):
    # Implement order creation and validation logic
    print("Creating and validating order")

# Create PythonOperator tasks
task1 = PythonOperator(
    task_id='check_contract_details',
    python_callable=check_contract_details,
    dag=dag,
)

task2 = PythonOperator(
    task_id='validate_eligibility',
    python_callable=validate_eligibility,
    dag=dag,
)

task3 = PythonOperator(
    task_id='check_location_availability',
    python_callable=check_location_availability,
    dag=dag,
)

task4 = PythonOperator(
    task_id='create_and_validate_order',
    python_callable=create_and_validate_order,
    dag=dag,
)

# Set task dependencies: the steps run strictly in sequence
task1 >> task2 >> task3 >> task4
This script defines a DAG (Directed Acyclic Graph) with four tasks, each representing a step in the order processing workflow. The tasks are executed in sequence: check_contract_details -> validate_eligibility -> check_location_availability -> create_and_validate_order.
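In practice, each step usually needs the result of the previous one: the eligibility check, for example, depends on the contract details fetched in the first step. Airflow supports this through XComs, which let tasks exchange small pieces of data. Below is a minimal sketch of how the first two callables could share data; the hard-coded contract dictionary is a placeholder for whatever lookup your system actually performs:
def check_contract_details(**kwargs):
    # Placeholder for a real contract lookup against your backend systems
    contract = {"contract_id": "C-12345", "status": "ACTIVE"}
    # Returning a value automatically pushes it to XCom for downstream tasks
    return contract

def validate_eligibility(**kwargs):
    # Pull the contract pushed by the previous task from XCom
    ti = kwargs['ti']
    contract = ti.xcom_pull(task_ids='check_contract_details')
    # Illustrative eligibility rule: only active contracts qualify
    if contract['status'] != 'ACTIVE':
        raise ValueError(f"Contract {contract['contract_id']} is not eligible")
    print(f"Contract {contract['contract_id']} is eligible")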
Why Apache Airflow Stands Out
1. Dynamic Workflow Definition
- Python-based DAGs: Workflows in Airflow are defined as Directed Acyclic Graphs (DAGs) using Python code. This approach allows for dynamic creation of tasks and workflows, providing the flexibility to define workflows programmatically (see the sketch after this list).
2. Extensive Integrations
- Built-in Operators: Airflow comes with a rich set of operators for common tasks, such as executing Bash commands, interacting with databases, and calling APIs. It also supports custom operators for more specific needs.
- Integration with Ecosystems: Airflow integrates seamlessly with various databases, cloud platforms (AWS, GCP, Azure), and big data tools (Hadoop, Spark), making it a versatile choice for diverse environments.
3. Scalability and Performance
- Executors: Airflow supports different executors, such as LocalExecutor, CeleryExecutor, and KubernetesExecutor, which allow tasks to be executed in parallel, on multiple nodes, or within Kubernetes clusters. This scalability is essential for handling large and complex workflows.
4. Robust Scheduling and Monitoring
- Flexible Scheduling: Airflow provides a flexible scheduling system that can handle a wide range of intervals and triggers for executing workflows.
- Monitoring and Alerting: The platform offers extensive monitoring capabilities through its web interface, which includes task status views, logs, and execution details. It also supports alerting mechanisms to notify users of workflow failures or issues.
5. Modularity and Extensibility
- Plugin System: Airflow’s plugin architecture allows for the addition of custom functionalities and integrations, enhancing its extensibility.
- Task Reusability: Tasks can be modular and reusable, allowing for consistent and maintainable workflow definitions.
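To make the dynamic workflow definition from point 1 concrete, here is a minimal sketch of a DAG that generates one cancellation task per contract. The hard-coded contract list and the cancel_contract function are illustrative placeholders; in a real bulk run the contract IDs would come from a database or an uploaded file:
from airflow import DAG
from airflow.operators.python import PythonOperator
from datetime import datetime

# Illustrative input; in practice this would be loaded from a database or file
CONTRACT_IDS = ["C-1001", "C-1002", "C-1003"]

def cancel_contract(contract_id, **kwargs):
    # Placeholder for the real cancellation logic for a single contract
    print(f"Cancelling product offering on contract {contract_id}")

with DAG(
    'bulk_contract_cancellation',
    start_date=datetime(2023, 5, 20),
    schedule_interval=None,  # triggered manually for one-off bulk runs
    catchup=False,
) as dag:
    # One task per contract is generated when the DAG file is parsed
    for contract_id in CONTRACT_IDS:
        PythonOperator(
            task_id=f'cancel_{contract_id}',
            python_callable=cancel_contract,
            op_kwargs={'contract_id': contract_id},
        )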
Unique Features of Apache Airflow
1. Directed Acyclic Graphs (DAGs)
- Clear Task Dependencies: Airflow’s use of DAGs ensures that tasks are executed in a clear and logical order based on defined dependencies. This structure is crucial for managing complex workflows where the execution order is vital.
- Dynamic DAG Generation: The ability to dynamically generate DAGs based on runtime conditions or external inputs provides unparalleled flexibility for adapting workflows to changing requirements.
2. Templating with Jinja
- Dynamic Parameters: Airflow allows the use of Jinja templating to pass dynamic parameters to tasks, making it easier to handle variable data and configurations within workflows (see the sketch after this list).
3. Backfilling and Catchup
- Handling Missed Runs: Airflow’s backfilling feature allows for the automatic execution of past task instances that were not run, ensuring data consistency and completeness.
- Catchup Mechanism: The catchup feature ensures that all scheduled intervals are processed, even if the scheduler was down for a period, maintaining workflow integrity.
4. Task-Level Retry Policies
- Retry Mechanisms: Each task in Airflow can have its own retry policy, specifying the number of retries and delay between retries in case of failures. This granularity allows for robust error handling and increased workflow resilience.
5. Rich User Interface
- Visualization: The Airflow web UI provides a comprehensive visualization of DAGs, task dependencies, and execution statuses. This intuitive interface simplifies monitoring and debugging of workflows.
- Task Logs and Metadata: Detailed logs and metadata for each task instance are accessible through the UI, aiding in troubleshooting and performance analysis.
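As a small illustration of the Jinja templating mentioned above, the snippet below passes the logical run date and a custom parameter into a Bash command via template variables; it is a generic sketch rather than part of the order workflow:
from airflow import DAG
from airflow.operators.bash import BashOperator
from datetime import datetime

with DAG(
    'templating_example',
    start_date=datetime(2023, 5, 20),
    schedule_interval='@daily',
    catchup=False,
) as dag:
    # {{ ds }} is rendered at runtime to the logical date (YYYY-MM-DD),
    # and values from params are also available inside the template
    print_run_date = BashOperator(
        task_id='print_run_date',
        bash_command='echo "Processing orders for {{ ds }} in region {{ params.region }}"',
        params={'region': 'DE'},
    )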
Ready to Streamline Your Order Processing?
By now, you’ve seen how Apache Airflow empowers you to build robust and automated workflows. Its Python-based approach makes it accessible and user-friendly, while its extensive ecosystem provides additional tools and integrations. Airflow goes beyond order processing — it tackles complex workflows across various domains. This flexibility, coupled with its scalability, positions Airflow as a superior choice compared to workflow technologies like Camunda.