Data Engineering concepts: Part 5, Data Orchestration
This is Part 5 of my 10-part series on Data Engineering concepts. In this part, we will discuss Data Orchestration.
Contents:
1. Data Orchestration
2. Apache Airflow
Here is the link to my previous part on Data Pipelines:
What is Data Orchestration?
Data orchestration is the automated process of authoring, scheduling and monitoring data pipelines. Previously, cron jobs were used to schedule data pipelines, but as pipelines grew more complex, we needed ways to orchestrate them programmatically, hence the rise of orchestration tools.
The first generation of orchestration tools focused on decoupling task management from the task's actual work: when one task completes, the next one starts. The orchestrator has no knowledge of what the task itself is doing. eg. Luigi, Airflow
The second generation of orchestration tools was designed to be data-aware: the next task runs when the data it depends on is available or ready, and the orchestrator understands the inputs and outputs that tasks pass between them. eg. Prefect, Flyte
Apache Airflow
Apache Airflow is the most widely used open source data orchestration tool. It lets you author workflows in Python and run, schedule and monitor data pipelines. It is easy to use and has an interactive graphical UI.
The main components of the Airflow system are (a minimal example DAG is sketched after this list):
1. Tasks - the individual units of work to be executed, e.g. a Python function
2. Operators - predefined templates for common tasks, e.g. BashOperator
3. Sensors - a type of operator that waits for an event to occur before proceeding
4. DAGs - a DAG (directed acyclic graph) contains a series of tasks to be executed in a given order, along with their dependencies
5. Scheduler - the scheduler triggers scheduled workflows and submits tasks to the executor
6. Executor - the executor pushes tasks to workers
7. Workers - workers run the tasks, potentially in parallel
8. Plugins - plugins allow custom operators, hooks and other components to be added to Airflow
9. Webserver - the webserver serves the UI used to inspect, trigger and debug DAGs and their runs
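To see how these components fit together, here is a minimal sketch of a DAG that chains three tasks; the dag_id, commands and schedule are illustrative placeholders, not a definitive pipeline.

from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator

def _transform():
    # placeholder for real transformation logic
    print("transforming data")

# The DAG groups the tasks and defines their order; the scheduler picks it up
# and hands runnable tasks to the executor, which pushes them to workers.
with DAG(
    dag_id="example_pipeline",       # hypothetical name
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    extract = BashOperator(task_id="extract", bash_command="echo extracting")
    transform = PythonOperator(task_id="transform", python_callable=_transform)
    load = BashOperator(task_id="load", bash_command="echo loading")

    # dependencies: extract runs first, then transform, then load
    extract >> transform >> load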
What are the key features of Airflow?
The key features of Airflow are:
1. Datasets and data-aware scheduling - DAGs can be scheduled to run automatically when the datasets they depend on are updated by upstream tasks (sketched after the TaskFlow example below).
2. Full REST API - you can manage your Airflow environment and build programmatic services on top of it using the Airflow REST API.
3. Deferrable operators - accommodate long-running tasks with deferrable operators, which free up worker slots while the task waits asynchronously.
4. Highly available scheduler - you can run a large number of parallel tasks, and running multiple scheduler replicas increases task throughput and resilience.
5. Dynamic task mapping - tasks can be created at runtime based on the output of upstream tasks, without the DAG author having to know the number of tasks beforehand (also sketched after the TaskFlow example below).
6. TaskFlow API - instead of using operators, you can define plain Python functions and use them as tasks, as shown below with get_ip() and compose_email(); passing the output of get_ip() into compose_email() automatically declares compose_email() as downstream of get_ip().
from airflow.decorators import task
from airflow.operators.email import EmailOperator

@task
def get_ip():
    # my_ip_service is a placeholder for whatever service returns the server's IP
    return my_ip_service.get_main_ip()

@task(multiple_outputs=True)
def compose_email(external_ip):
    return {
        'subject': f'Server connected from {external_ip}',
        'body': f'Your server executing Airflow is connected from the external IP {external_ip}<br>'
    }

# calling one task with the output of another wires up the dependency automatically
email_info = compose_email(get_ip())

EmailOperator(
    task_id='send_email',
    to='example@example.com',
    subject=email_info['subject'],
    html_content=email_info['body']
)
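Here is a minimal sketch of the data-aware scheduling from feature 1, assuming Airflow 2.4+ and a hypothetical orders dataset URI: one DAG declares the dataset as an outlet, and a second DAG is scheduled on updates to it.

import pendulum
from airflow.datasets import Dataset
from airflow.decorators import dag, task

# A Dataset is a logical URI that Airflow tracks updates to; it does not read the data itself
orders = Dataset("s3://my-bucket/orders.parquet")   # hypothetical URI

@dag(schedule="@daily", start_date=pendulum.datetime(2024, 1, 1), catchup=False)
def produce_orders():
    @task(outlets=[orders])
    def update_orders():
        # write the file here; on success Airflow marks the dataset as updated
        pass
    update_orders()

@dag(schedule=[orders], start_date=pendulum.datetime(2024, 1, 1), catchup=False)
def consume_orders():
    @task
    def transform_orders():
        # runs automatically whenever the orders dataset is updated upstream
        pass
    transform_orders()

produce_orders()
consume_orders()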
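Dynamic task mapping from feature 5 can be sketched in the same hedged way, with a hypothetical list of files standing in for the upstream output:

import pendulum
from airflow.decorators import dag, task

@dag(schedule=None, start_date=pendulum.datetime(2024, 1, 1), catchup=False)
def mapped_example():
    @task
    def list_files():
        # hypothetical upstream output; in practice this might list objects in storage
        return ["a.csv", "b.csv", "c.csv"]

    @task
    def process(file_name):
        print(f"processing {file_name}")

    # one mapped task instance is created per element at runtime
    process.expand(file_name=list_files())

mapped_example()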