Executors in Apache Airflow

(A High-Level Overview of Executors in Apache Airflow)

Mukesh Kumar
Accredian

--

Written in collaboration with Hiren Rupchandani

Preface

Executors are the mechanism by which Airflow runs tasks. Airflow can have only one executor configured at a time, set in the airflow.cfg file in Airflow's home directory.
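
For example, the relevant excerpt of airflow.cfg might look like the sketch below (the default value is SequentialExecutor; the same setting can also be overridden through the AIRFLOW__CORE__EXECUTOR environment variable):

```ini
[core]
# Which executor Airflow should use. Built-in options include:
# SequentialExecutor, LocalExecutor, CeleryExecutor, KubernetesExecutor,
# CeleryKubernetesExecutor, and DaskExecutor.
executor = SequentialExecutor
```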

Broadly, executors come in two flavors, local and remote, and they can execute tasks serially or in parallel. A local executor runs tasks (as processes) on the same machine as the scheduler. Remote executors are distinct from local executors: they run tasks on machines separate from the scheduler.

Note: If the inner workings of executors seem complicated right now, do not worry. We will cover the implementation details of each executor in the upcoming stories.

Airflow offers several built-in executors, which we can refer to simply by name. They are as follows:

Sequential Executor

It is a lightweight local executor that Airflow uses by default. It runs only one task instance at a time and is therefore not production-ready. It is the only built-in executor that works with SQLite, since SQLite does not support multiple connections. It is prone to single points of failure, so we mainly use it for debugging purposes.
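
As a rough sketch, a Sequential Executor setup in airflow.cfg might look as follows (the SQLite path is a placeholder, and in Airflow 2.3+ the sql_alchemy_conn key lives under a [database] section rather than [core]):

```ini
[core]
executor = SequentialExecutor

# SQLite permits only one connection at a time, which is why it can only be
# paired with the Sequential Executor.
sql_alchemy_conn = sqlite:////home/airflow/airflow.db
```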

Local Executor

Unlike the Sequential Executor, the Local Executor can run multiple task instances at a time. We generally pair it with a MySQL or PostgreSQL database, since those allow multiple connections and therefore parallelism. You can refer to the Airflow documentation on the LocalExecutor to get the most out of it in your pipeline designs.
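
A minimal sketch of a Local Executor setup, assuming a PostgreSQL database named airflow running locally (the credentials, host, and parallelism value are placeholders):

```ini
[core]
executor = LocalExecutor

# A database that supports concurrent connections enables parallelism.
sql_alchemy_conn = postgresql+psycopg2://airflow:airflow@localhost:5432/airflow

# Maximum number of task instances allowed to run concurrently
# across the whole installation.
parallelism = 16
```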

Celery Executor

Celery is a task queue that distributes work across multiple Celery workers. The Celery Executor offloads the workload from the main application onto these workers with the help of a message broker such as RabbitMQ or Redis. The executor publishes a request to execute a task to the queue, and one of several worker nodes picks up the request and runs it as directed.

RabbitMQ is a message broker and provides communication between multiple services by operating message queues. It also offers an API for other services to publish and subscribe to the queues.

A MySQL or PostgreSQL database is required to set up the Celery Executor. It is a remote executor, used for horizontal scaling: the workers are distributed across multiple machines in a pipeline. It also allows for real-time processing and task scheduling, and it is fault-tolerant, unlike the local executors.
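
Putting the pieces together, a Celery Executor setup might be sketched as below (the broker and database URLs are placeholders; in Airflow 2.x each worker machine is then started with the `airflow celery worker` command):

```ini
[core]
executor = CeleryExecutor
sql_alchemy_conn = postgresql+psycopg2://airflow:airflow@db-host:5432/airflow

[celery]
# Message broker the executor publishes task requests to (Redis or RabbitMQ).
broker_url = redis://redis-host:6379/0

# Backend where workers report task state and results.
result_backend = db+postgresql://airflow:airflow@db-host:5432/airflow
```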

Kubernetes Executor

The Kubernetes Executor uses the Kubernetes API to optimize resource usage. It runs within the scheduler (which can itself be a single, fixed Pod) and only requires access to the Kubernetes API; each task instance then runs in its own worker Pod. A Pod is the smallest deployable object in Kubernetes.

When a DAG submits a task, the Kubernetes Executor requests a worker Pod from the Kubernetes API. The worker Pod runs the task, reports the result, and terminates once the job is finished.

A MySQL or PostgreSQL database is also required to set up the Kubernetes Executor. It does not need additional components such as a message broker (as Celery does), but it does need a Kubernetes environment.
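
A rough configuration sketch, assuming an Airflow 2.x release where the section is called [kubernetes] (later releases rename it to [kubernetes_executor]); the namespace and template path are placeholders:

```ini
[core]
executor = KubernetesExecutor
sql_alchemy_conn = postgresql+psycopg2://airflow:airflow@db-host:5432/airflow

[kubernetes]
# Namespace in which the worker Pods are launched.
namespace = airflow

# Optional template that worker Pods are built from.
pod_template_file = /opt/airflow/pod_template.yaml
```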

One advantage of the Kubernetes Executor over the other executors is that Pods run only when tasks need to be executed, which saves resources when there are no jobs to run. With the other executors, the workers are statically configured and run all the time, regardless of the workload.

It can automatically scale up to meet the workload requirements, scale down to zero (if no DAGs or tasks are running), and is fault-tolerant.

CeleryKubernetes Executor

It permits users to run a Celery Executor and a Kubernetes Executor simultaneously. The CeleryKubernetesExecutor documentation suggests a rule of thumb for when to use it:

  1. The number of tasks to be scheduled at peak exceeds the scale your Kubernetes cluster can comfortably handle.
  2. A relatively small portion of your tasks requires runtime isolation.
  3. You have plenty of small tasks that Celery workers can manage, but you also have resource-hungry jobs that are better run in predefined environments (see the configuration sketch after this list).
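
A configuration sketch for this hybrid mode (it reuses the Celery settings shown earlier; the queue name below is the default). Operators whose queue argument is set to this value are handed to the Kubernetes side, while everything else keeps running on Celery workers:

```ini
[core]
executor = CeleryKubernetesExecutor

[celery_kubernetes_executor]
# Tasks submitted with queue="kubernetes" run on the Kubernetes executor;
# all other tasks go to the Celery workers.
kubernetes_queue = kubernetes
```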

Dask Executor

Dask is a parallel computing library in Python whose architecture revolves around a sophisticated distributed task scheduler. The Dask Executor lets you run Airflow tasks on a Dask Distributed cluster, and it acts as an alternative to the Celery Executor for horizontal scaling. You can refer to this blog for the differences between the Celery Executor and the Dask Executor.
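
A minimal sketch, assuming an Airflow 2.x installation with Dask support installed and a Dask scheduler already running (the address below is Dask's default scheduler address):

```ini
[core]
executor = DaskExecutor

[dask]
# Address of the Dask Distributed scheduler that will run the tasks.
cluster_address = 127.0.0.1:8786
```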

These are the pre-defined executors that Airflow supports so far. In the upcoming story, we will see how to configure and use the Local Executor to run a DAG in Airflow.

Final Thoughts and Closing Comments

There are some vital points many people fail to understand while pursuing their Data Science or AI journey. If you are one of them and are looking for a way to fill those gaps, check out the certification programs provided by INSAID on their website. If you liked this story, I recommend the Global Certificate in Data Science, because it covers your foundations plus machine learning algorithms (basic to advanced).

And that’s it. I hope you liked the explanation of executors in Apache Airflow and learned something valuable. Please let me know in the comment section if you have anything to share with me. I would love to know your thoughts.

Follow me for more forthcoming articles based on Python, R, Data Science, Machine Learning, and Artificial Intelligence.

If you find this read helpful, then hit the Clap👏. Your encouragement will catalyze inspiration to keep me going and develop more valuable content.

What’s Next?

In the next story, we will configure the Local Executor and use it to run a DAG in Airflow.

--

Mukesh Kumar
Accredian

Data Scientist, having a robust math background, skilled in predictive modeling, data processing, and mining strategies to solve challenging business problems.