Apache Airflow (Part 5) Scale out with Celery Executor

Written against Apache Airflow 2.1.2.

In my current work, the reason for running multiple workers is not heavy computation. Rather, the number of pipelines triggered every minute keeps growing, and a single worker processes them too slowly, so tasks pile up and get stuck. Running multiple workers spreads the load, and when one worker dies, the remaining workers keep executing tasks.

What is Celery Executor?

  • Scale out Airflow
  • Backed by Celery: Asynchronous Distributed Task Queue
  • Distribute tasks among worker nodes
  • airflow celery worker: Create a worker to execute tasks
  • Horizontal Scaling
  • High availability: If one worker goes down, Airflow can still schedule tasks
  • Message broker: Redis / RabbitMQ (see the config sketch below)
  • Can assign tasks to specific queues
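All of this is driven by a few settings in airflow.cfg (or the matching AIRFLOW__* environment variables). A minimal sketch, assuming Redis as the broker and the Postgres metadata database as the result backend, with the hostnames used by the official docker-compose setup:

```ini
[core]
executor = CeleryExecutor

[celery]
# Queue the scheduler pushes commands to and workers consume from
broker_url = redis://redis:6379/0
# Where workers report the status of completed tasks
result_backend = db+postgresql://airflow:airflow@postgres/airflow
```

The official docker-compose.yaml sets the same values through environment variables such as AIRFLOW__CORE__EXECUTOR and AIRFLOW__CELERY__BROKER_URL.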

Architecture

Airflow consists of several components:

  • Workers — Execute the assigned tasks
  • Scheduler — Responsible for adding the necessary tasks to the queue
  • Web server — HTTP Server provides access to DAG/task status information
  • Database — Contains information about the status of tasks, DAGs, Variables, connections, etc.
  • Celery — Queue mechanism

Note that the Celery queue consists of two components:

  • Broker — Stores commands for execution
  • Result backend — Stores status of completed commands

The components communicate with each other in many places:

  • [1] Web server → Workers — Fetches task execution logs
  • [2] Web server → DAG files — Reveal the DAG structure
  • [3] Web server → Database — Fetch the status of the tasks
  • [4] Workers → DAG files — Reveal the DAG structure and execute the tasks
  • [5] Workers → Database — Gets and stores information about connection configuration, variables and XCOM
  • [6] Workers → Celery’s result backend — Saves the status of tasks
  • [7] Workers → Celery’s broker — Stores commands for execution
  • [8] Scheduler → DAG files — Reveal the DAG structure and execute the tasks
  • [9] Scheduler → Database — Store a DAG run and related tasks
  • [10] Scheduler → Celery’s result backend — Gets information about the status of completed tasks
  • [11] Scheduler → Celery’s broker — Put the commands to be executed
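Since the broker and result backend are plain Celery components, the Celery CLI can be pointed at Airflow's Celery app to observe links [6], [7], [10] and [11] directly. A hedged sketch, run from inside a worker container of the docker-compose setup:

```bash
# Ask every registered worker to respond via the broker
celery -A airflow.executors.celery_executor inspect ping

# List the tasks each worker is currently executing
celery -A airflow.executors.celery_executor inspect active
```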

CeleryExecutor and Workers Quick Start

1. First, follow Apache Airflow (Part 2) Quick Environment Setup to build the environment.
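For reference, the setup in that post boils down to the official docker-compose workflow, roughly (the version-pinned URL follows the pattern the Airflow docs use):

```bash
# Fetch the official docker-compose.yaml for this Airflow version
curl -LfO 'https://airflow.apache.org/docs/apache-airflow/2.1.2/docker-compose.yaml'

# Initialize the metadata database and create the default admin user
docker-compose up airflow-init

# Start webserver, scheduler, worker, Redis, Postgres and Flower
docker-compose up -d
```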

2. Create 3 workers: copy lines 88~98 of the docker-compose.yaml, paste the block, rename the copies to worker-2 and worker-3, then restart with docker-compose up -d to bring up the additional workers.
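The copied block is just the airflow-worker service from the official compose file, reusing the airflow-common YAML anchor defined at the top of it. A minimal sketch of what the duplicates look like (the worker-2/worker-3 service names are whatever you choose):

```yaml
  airflow-worker-2:
    <<: *airflow-common
    command: celery worker
    restart: always

  airflow-worker-3:
    <<: *airflow-common
    command: celery worker
    restart: always
```

As long as the service does not pin a container_name, the same effect can be had without editing the file: docker-compose up -d --scale airflow-worker=3.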

3. Open the Flower UI at 127.0.0.1:5555/ to check that Celery and the workers are running properly.

That is the simplest, most basic usage: once CeleryExecutor is enabled with multiple workers, Airflow automatically distributes tasks across them at random. Routing tasks to a specific worker, or weighting task execution priority, will be covered in a follow-up post.
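As a quick preview of queue routing: every operator accepts a queue argument, and a worker can be told to consume only specific queues. A minimal sketch, with a made-up DAG id and queue name:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="queue_routing_example",  # hypothetical DAG id
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    # Only workers listening on the "high_memory" queue pick up this task
    heavy_task = BashOperator(
        task_id="heavy_task",
        bash_command="echo heavy work",
        queue="high_memory",
    )
```

A dedicated worker for that queue is then started with airflow celery worker --queues high_memory.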

Reference

Udemy — The Complete Hands-On Course to Master Apache Airflow
