An Airflow Design Pattern to Manage Multiple Airflow Projects

Bhavin
Nov 5 · 3 min read

Airflow is a great tool to schedule, visualize, and execute data pipelines.

But if, like me, you have to manage 100+ different pipelines, you will quickly realize that developing and managing them requires a bit of engineering…

To make the development process simple and manageable, we will look at three aspects of this design pattern:

  1. Keep a separate repository for each project's DAG.
  2. Use CI/CD to deploy code.
  3. Use containers to execute code.

How do you maintain different repositories for different pipelines?

DAGFactory: DAGFactory collects and initializes DAGs from the individual projects under a specific folder; in my case it was airflow/projects.

airflow
├── README.md
├── airflow-scheduler.pid
├── airflow.cfg
├── dags
│   └── DAGFactory.py
├── global_operators
├── logs
├── projects
│   └── test_project
│       └── DAG.py
├── requirements.txt
├── scripts
├── tests
└── unittests.cfg
  • Install it under airflow/dags/DAGFactory.py.

Since Airflow initializes all the DAGs it finds under the airflow/dags directory, we install DAGFactory there.

airflow
├── README.md
├── airflow-scheduler.pid
├── airflow.cfg
└── dags
    └── DAGFactory.py

Code for DAGFactory.py: traverse all the projects under airflow/projects and load the DAGS variable from each project's DAG.py into Airflow's global namespace.
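A minimal sketch of what DAGFactory.py does (the function and variable names here are my illustration, not the author's exact code): walk every project folder, import its DAG.py, and copy each entry of its DAGS list into this module's globals so the scheduler discovers them.

```python
# Hypothetical DAGFactory.py: lives in airflow/dags, pulls DAGs in from
# airflow/projects at parse time.
import importlib.util
import os


def load_project_dags(projects_dir, namespace):
    """Load the DAGS list from every <projects_dir>/<project>/DAG.py into namespace."""
    for project in sorted(os.listdir(projects_dir)):
        dag_file = os.path.join(projects_dir, project, "DAG.py")
        if not os.path.isfile(dag_file):
            continue
        # Import the project's DAG.py as a standalone module.
        spec = importlib.util.spec_from_file_location(f"{project}_DAG", dag_file)
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)
        # Each project exposes a DAGS list; register every DAG under its
        # dag_id so it appears as a separate DAG in the Airflow UI.
        for dag in getattr(module, "DAGS", []):
            namespace[dag.dag_id] = dag


# Airflow only scans files under airflow/dags, so run the loader at parse time.
HERE = os.path.dirname(os.path.abspath(__file__))
PROJECTS_DIR = os.path.join(os.path.dirname(HERE), "projects")
if os.path.isdir(PROJECTS_DIR):
    load_project_dags(PROJECTS_DIR, globals())
```

Because the loaded DAGs only need a `dag_id` attribute to be registered, this file itself has no dependency beyond the standard library; Airflow is only needed by the project-level DAG.py files it loads.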

Now you can keep your code in a separate repository. All you have to do is:

  1. have a DAG.py file at the root level of your repo, and
  2. have a list named DAGS with all the main DAGs in it. (Yes, you can pass multiple DAGs; each will appear as a separate DAG in the UI but can still be maintained from the same code base.)

How do you get a project's code onto the production Airflow server/service?

  1. If you run Airflow on a VM, you can manually check out the project under airflow/projects.
  2. You can use a config-management tool like Chef to deploy new projects to the Airflow VM.
  3. If you run Airflow as a service on EKS (or Kubernetes Engine on GCP), you can use CodeBuild or CodePipeline on AWS to build and deploy your Kubernetes service.

I will write an article on each of these soon…

Create complex DAGs vs. write complex code and execute it in containers

As the number of DAGs and their complexity grow, it takes more resources to execute them. That means if your production Airflow environment is on a VM, you will have to scale up. (Note: if you use the Kubernetes executor, or another distributed executor, this doesn't apply to you.)

To overcome this, I create a container for each project and execute the container in ECS via Airflow.

This lets me keep my Airflow instance size to a minimum. The containers are executed on Fargate, so I only pay for the time they need to run.

And since DAGFactory lets me keep each code base separate, I can have DAG.py and the Dockerfile in the same repository, and still put DAG.py on the Airflow instance and the Docker image in ECR.
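So a project repo can hold both artifacts side by side, with a Dockerfile along these lines (a sketch; the base image, file names, and entrypoint are placeholders):

```dockerfile
# Hypothetical Dockerfile living next to DAG.py in the project repo.
# This image is built and pushed to ECR; DAG.py is deployed to the
# Airflow instance and only schedules the container.
FROM python:3.7-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY src/ ./src/
ENTRYPOINT ["python", "src/etl.py"]
```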

Of course, you cannot use Airflow operators inside the container for your ETL or other processing, but you can still build your own library of modules and use it across different pipelines.

So why keep Airflow?

Well, using Airflow I can track failures, kick off a job (in this case, a container) whenever I want, and look at the logs in Airflow, with detailed logs in CloudWatch.

So, in short: no need to log in to any instance to debug issues.
