Apache Airflow (5) Scale Out with Celery Executor
Notes written against Apache Airflow 2.1.2.
In my current work, the reason for running multiple workers is not heavy computation. Rather, the number of data pipelines running every minute keeps growing, and a single worker is too slow to keep up, so tasks get stuck. Running multiple workers spreads the tasks out, and if one worker goes down, the remaining workers keep the pipelines running.
What is Celery Executor?
- Scale out Airflow
- Backed by Celery, an asynchronous distributed task queue
- Distributes tasks among worker nodes
- airflow celery worker: starts a worker process that executes tasks
- Horizontal scaling
- High availability: if one worker goes down, Airflow can still schedule tasks
- Message broker: Redis / RabbitMQ (see the configuration sketch after this list)
- Tasks can be routed to different queues
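The settings behind these points are a small set of configuration values. Below is a minimal sketch, assuming the environment-variable layout used by the official Airflow 2.1.x docker-compose.yaml; the Redis and Postgres hostnames and credentials are placeholders and should match your own setup.

```yaml
x-airflow-common:
  &airflow-common
  image: apache/airflow:2.1.2
  environment:
    # Use CeleryExecutor instead of the default executor
    AIRFLOW__CORE__EXECUTOR: CeleryExecutor
    # Celery broker: the scheduler puts commands here and workers pick them up
    AIRFLOW__CELERY__BROKER_URL: redis://:@redis:6379/0
    # Celery result backend: workers report task status here
    AIRFLOW__CELERY__RESULT_BACKEND: db+postgresql://airflow:airflow@postgres/airflow
    # Airflow metadata database shared by the scheduler, webserver and workers
    AIRFLOW__CORE__SQL_ALCHEMY_CONN: postgresql+psycopg2://airflow:airflow@postgres/airflow
```

Every service that extends the airflow-common anchor (scheduler, webserver, workers) sees the same settings, which is what allows additional worker containers to join the same Celery cluster.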
Architecture
Airflow consists of several components:
- Workers — Execute the assigned tasks
- Scheduler — Responsible for adding the necessary tasks to the queue
- Web server — HTTP Server provides access to DAG/task status information
- Database — Contains information about the status of tasks, DAGs, Variables, connections, etc.
- Celery — Queue mechanism
Please note that the queue mechanism in Celery consists of two components:
- Broker — Stores commands for execution
- Result backend — Stores status of completed commands
The components communicate with each other in many places:
- [1] Web server → Workers — Fetches task execution logs
- [2] Web server → DAG files — Reveal the DAG structure
- [3] Web server → Database — Fetch the status of the tasks
- [4] Workers → DAG files — Reveal the DAG structure and execute the tasks
- [5] Workers → Database — Gets and stores information about connection configuration, variables and XCOM.
- [6] Workers → Celery’s result backend — Saves the status of tasks
- [7] Workers → Celery’s broker — Stores commands for execution
- [8] Scheduler → DAG files — Reveal the DAG structure and execute the tasks
- [9] Scheduler → Database — Store a DAG run and related tasks
- [10] Scheduler → Celery’s result backend — Gets information about the status of completed tasks
- [11] Scheduler → Celery’s broker — Put the commands to be executed
CeleryExecutor and Workers Quick Start
1. First, set up the environment by following Apache Airflow (2) Quick Environment Setup.
2. Create three workers. Copy the worker service block (lines 88~98 of docker-compose.yaml), paste it, and rename the copies to worker-2 and worker-3, as sketched below. Running docker-compose up -d again will bring up the additional workers.
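The duplicated services could look roughly like this. This is a sketch based on the official compose file's airflow-common anchor and celery worker command; the service names worker-2 and worker-3 are just the illustrative names used in this post.

```yaml
services:
  # ... existing services from the quick-start docker-compose.yaml ...

  # Two extra copies of the original worker service block.
  worker-2:
    <<: *airflow-common
    command: celery worker
    restart: always

  worker-3:
    <<: *airflow-common
    command: celery worker
    restart: always
```

After saving the file, docker-compose up -d starts the additional containers, and each one registers itself with the broker as a separate Celery worker.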
3. Open the Flower UI at 127.0.0.1:5555/ to check that Celery and the workers are running properly.
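For reference, the Flower service in the official 2.1.x compose file looks roughly like this (a sketch; the real definition also carries health checks and service dependencies):

```yaml
  flower:
    <<: *airflow-common
    # Flower is Celery's monitoring UI, exposed on port 5555
    command: celery flower
    ports:
      - 5555:5555
    restart: always
```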
In the simplest setup, once CeleryExecutor and multiple workers are enabled, Airflow automatically distributes tasks across the available workers. If you need to route tasks to a specific worker or weight the execution priority of tasks, that will be covered in a follow-up article.
Reference
Udemy — The Complete Hands-On Course to Master Apache Airflow