Airflow Architecture

Binaya Kumar Lenka
2 min read · Apr 2, 2023


Picture credit: Aurimas Griciūnas 👏

🔍 Do you know how Apache Airflow works internally? 🤔💻

Airflow is an open-source platform used to orchestrate complex data workflows. It is built around the concept of Directed Acyclic Graphs (DAGs), which define a series of tasks and their dependencies. Airflow is composed of several microservices that work together to execute these tasks. Here's a simple explanation of the key components:
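To make the DAG idea concrete, here is a tiny plain-Python sketch (an illustration, not Airflow's API; the task names are made up) that models tasks and their dependencies, then derives a valid execution order with the standard library's graphlib (Python 3.9+):

```python
from graphlib import TopologicalSorter

# A DAG is just tasks plus "runs-after" edges.
# Hypothetical pipeline: extract -> {transform, validate} -> load
dag = {
    "extract": set(),                    # no upstream dependencies
    "transform": {"extract"},            # runs after extract
    "validate": {"extract"},             # runs after extract
    "load": {"transform", "validate"},   # runs after both
}

# A topological sort yields an order that respects every dependency,
# which is exactly the guarantee Airflow gives when running a DAG.
order = list(TopologicalSorter(dag).static_order())
print(order)
```

The "acyclic" part matters: if you added an edge making "extract" depend on "load", TopologicalSorter would raise a CycleError, just as Airflow refuses to run a DAG with a cycle.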

🌐 Web Server: This is the user interface of Airflow, where you can trigger, monitor, and manage DAGs. It provides an easy-to-use dashboard that helps you visualize your data workflows, check their progress, and troubleshoot any issues.

🕰️ Scheduler: This component is responsible for managing the execution of tasks. It constantly monitors the DAGs you've created and schedules tasks to run based on their dependencies and timing configurations. The Scheduler makes sure that tasks are executed in the right order and at the right time.
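The "at the right time" half of that decision can be sketched in a few lines of plain Python (a simplified illustration, not the Scheduler's actual code): a new run becomes due once a full schedule interval has elapsed since the last run.

```python
from datetime import datetime, timedelta

def is_due(last_run, interval, now):
    """A run is due once a full interval has elapsed since the last run."""
    return now >= last_run + interval

last = datetime(2023, 4, 1)   # last completed run (hypothetical)
daily = timedelta(days=1)     # e.g. a daily schedule

print(is_due(last, daily, datetime(2023, 4, 1, 12)))  # False: only 12h elapsed
print(is_due(last, daily, datetime(2023, 4, 2)))      # True: full day elapsed
```

The real Scheduler loops continuously, combining this timing check with the dependency check ("are all upstream tasks done?") before handing tasks to the Executor.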

🔧 Executor: The Executor is responsible for actually running the tasks. It communicates with the Scheduler to receive information about which tasks to run, and then it launches the necessary processes or containers to execute them. There are different types of Executors in Airflow, such as the LocalExecutor, CeleryExecutor, and KubernetesExecutor, depending on your infrastructure and requirements.
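The executor is chosen in airflow.cfg. A minimal fragment might look like this (CeleryExecutor is shown as one option, not a recommendation):

```ini
[core]
# One of: SequentialExecutor, LocalExecutor, CeleryExecutor, KubernetesExecutor
executor = CeleryExecutor
```

The same setting can also be supplied through the AIRFLOW__CORE__EXECUTOR environment variable, which is common in containerized deployments.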

👷 Worker: The Worker is a component that performs the tasks assigned by the Executor. It can be a separate process or container, depending on the chosen Executor. Workers are responsible for executing the actual code or scripts defined in your tasks and reporting their status back to the Executor.

💾 Metadata Database: This is the central repository where Airflow stores information about the DAGs, tasks, and their execution history. It helps maintain the state of your workflows and provides valuable data for monitoring and troubleshooting. Airflow supports various databases, such as PostgreSQL, MySQL, and SQLite, for this purpose.
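The metadata database is configured with a SQLAlchemy connection string in airflow.cfg. A sketch for PostgreSQL (host, user, and password below are placeholders; in Airflow releases before 2.3 this setting lives under [core] instead of [database]):

```ini
[database]
# SQLAlchemy connection string for the metadata database
sql_alchemy_conn = postgresql+psycopg2://airflow:airflow@localhost:5432/airflow
```

SQLite is fine for trying Airflow out locally, but a server database like PostgreSQL or MySQL is expected for any multi-worker setup.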

📨 Message Broker (optional): In distributed setups where the CeleryExecutor is used, a message broker is needed to manage communication between the Scheduler and the Workers. The message broker, such as RabbitMQ or Redis, passes task information from the Scheduler to the Workers and ensures reliable and efficient execution of tasks in a distributed environment.
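With the CeleryExecutor, the broker is pointed to in the [celery] section of airflow.cfg. A sketch using Redis as the broker (hosts and credentials are placeholders; RabbitMQ works the same way with an amqp:// URL):

```ini
[celery]
# URL of the message broker that carries tasks from Scheduler to Workers
broker_url = redis://localhost:6379/0
# Where Celery records task results, commonly the metadata database
result_backend = db+postgresql://airflow:airflow@localhost:5432/airflow
```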

Airflow is a powerful tool for managing data workflows, and understanding its architecture is key to ensuring its effective use in your organization. So, if you're looking for a reliable platform to manage your data engineering tasks, Airflow is definitely worth considering!


Binaya Kumar Lenka

Enterprise Architect | TOGAF 10 Certified Professional | Cloud & Big Data Solution Architect @ Skyworks