Apache Airflow: A Basic Introduction
What is Apache Airflow?
Apache Airflow is a framework used to programmatically schedule and monitor data pipelines. It runs different tasks according to their defined execution dependencies.
Execution Dependencies :- These help to prioritize tasks. Suppose task1 and task2 are two different tasks, and task2 has the higher priority. With the help of Airflow we set the task dependencies so that whenever the scheduler runs, task2 is initiated first and task1 runs after it.
Apache Airflow was originally developed at Airbnb for their internal data engineering pipelines. Nowadays more than 500 reputed organizations use this framework.
Advantages of Apache Airflow Over a Conventional Scheduler
A conventional scheduler works pretty well for a simple ETL pipeline, but complex data pipelines bring certain challenges that it cannot handle, so we need a more robust framework. This is where Airflow comes into the picture.
1. Error Handling :- Suppose you run an ETL task and it fails. Airflow gives us the flexibility to retry the task multiple times when it fails.
2. Execution Dependencies :- Suppose you run two tasks, Task A and Task B, where Task B depends on Task A. Task A will run first, and only then will Task B run.
Refer to the figure above. Suppose in a conventional scheduler you schedule Task A at 8pm (it takes 2 hours to execute), and after a half-hour gap you schedule Task B at 10:30pm. If for some reason Task A takes 2.75 hours (8pm to 10:45pm), Task B will start executing before Task A completes and will fail. This kind of dependency is handled in Airflow: Task B will be executed only once Task A has run successfully.
3. Transparency :- In a conventional scheduler we cannot see the execution logs, but Airflow shows the execution logs/history in the UI itself.
4. Task Tracking :- We can easily trace which task took longer to execute and which task needed several attempts for a single successful execution. We can even trigger an email if a particular task does not run successfully after several attempts.
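The two key behaviours above, retrying a failed task and running Task B only after Task A succeeds, can be sketched in plain Python (this is an illustration of the idea, not the Airflow API; all names are made up):

```python
log = []
attempts = {"count": 0}

def task_a():
    """Simulated Task A: fails on the first attempt, succeeds on the second."""
    attempts["count"] += 1
    if attempts["count"] < 2:
        raise RuntimeError("transient failure")
    log.append("A done")

def task_b():
    log.append("B done")

def run_with_retries(task, retries=3):
    """Re-run a task on failure, like Airflow's retry setting."""
    for attempt in range(1, retries + 1):
        try:
            task()
            return True
        except RuntimeError:
            log.append(f"retry {attempt}")
    return False  # exhausted retries; this is where an alert email could fire

# Task B is triggered by Task A's success, not by a wall-clock time,
# so it can never start while Task A is still running.
if run_with_retries(task_a):
    task_b()

print(log)  # ['retry 1', 'A done', 'B done']
```

Because the trigger is success rather than a fixed time, the 10:30pm overlap problem from the example above simply cannot occur.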
Common Terminology in Airflow
1. DAG/Workflow :- Directed Acyclic Graph
A Directed Acyclic Graph is unidirectional in nature. Referring to the figure above, we can go from Node1 to Node9 but cannot go back to Node1. Here each node represents a task and the edges represent dependencies. With the help of a DAG we can decide which tasks can run in parallel and which should run in sequence. The DAG only describes how to run the workflow; it does not perform the actual computation.
2. Operators :- Operators are responsible for the computational work, like running a bash command or executing a Python function.
3. Tasks :- Once an operator is instantiated, it is referred to as a task.
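The three terms above come together in a DAG definition file. Here is a minimal configuration-style sketch in the Airflow 2.x style (the DAG id, task ids, and commands are illustrative, not from any real pipeline):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator

def transform():
    print("transforming data")

# The DAG object only describes the workflow; the operators do the
# actual computation once the scheduler triggers them.
with DAG(
    dag_id="basic_etl",               # illustrative name
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Instantiating an operator turns it into a task.
    extract = BashOperator(task_id="extract", bash_command="echo extracting")
    transform_task = PythonOperator(task_id="transform", python_callable=transform)
    load = BashOperator(task_id="load", bash_command="echo loading")

    # Edges of the DAG: extract -> transform -> load, run in sequence.
    extract >> transform_task >> load
```

The `>>` operator draws the edges of the graph, so the scheduler knows these three tasks must run one after another rather than in parallel.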
Please stay tuned for more blogs related to Apache Airflow. If you have any suggestions or questions, please comment below.