
Airflow is a great tool to schedule, visualize, and execute a data pipeline.
But if, like me, you have to manage 100+ different pipelines, you will quickly realize that developing and managing them requires a bit of engineering…
To make the development process simple and manageable, we will look at three aspects of the design pattern:
- a separate repository for each project's DAG
- CI/CD to deploy code
- containers to execute code
How do you maintain different repositories for different pipelines?
DAGFactory: DAGFactory collects and initializes DAGs from the individual projects under a specific folder; in my case it was airflow/projects.
airflow
├── README.md
├── airflow-scheduler.pid
├── airflow.cfg
├── dags
│ └─ DAGFactory.py
├── global_operators
├── logs
├── projects
│ └── test_project
│ └── DAG.py
├── requirements.txt
├── scripts
├── tests
└── unittests.cfg

Since Airflow initializes all the DAGs under the airflow/dags directory, we will install DAGFactory there, at airflow/dags/DAGFactory.py.
airflow
├── README.md
├── airflow-scheduler.pid
├── airflow.cfg
└── dags
    └─ DAGFactory.py

Code for DAGFactory.py: traverse all the projects under airflow/projects and load the DAGS variable from each project's DAG.py into Airflow's global namespace.
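The original code listing did not survive, so here is a minimal sketch of what such a loader could look like, assuming each project exposes a module-level `DAGS` list in a `DAG.py` at its root (the helper name `load_project_dags` is my own):

```python
# DAGFactory.py — sketch of a loader for airflow/projects/<name>/DAG.py files.
import importlib.util
from pathlib import Path


def load_project_dags(projects_dir, namespace):
    """Import every <project>/DAG.py and expose its DAGS in `namespace`.

    Airflow's scheduler scans the globals of files under airflow/dags for
    DAG objects, so copying each DAG there makes it appear in the UI.
    """
    for dag_file in sorted(Path(projects_dir).glob("*/DAG.py")):
        module_name = f"{dag_file.parent.name}_dag"
        spec = importlib.util.spec_from_file_location(module_name, dag_file)
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)
        for dag in getattr(module, "DAGS", []):
            # Key by dag_id so each DAG shows up as a distinct global.
            namespace[dag.dag_id] = dag


# When this file sits at airflow/dags/DAGFactory.py, the projects
# directory is one level up from dags/.
PROJECTS_DIR = Path(__file__).resolve().parent.parent / "projects"
load_project_dags(PROJECTS_DIR, globals())
```

If the projects directory does not exist yet, the glob simply matches nothing and the loader is a no-op.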
Now you can keep your code in a separate repository. All you have to do is:
- have a DAG.py file at the root level of your repo, and
- have a list named DAGS with all the main DAGs in it. (Yes, you can pass multiple DAGs, so each one appears as a separate DAG in the UI but can still be maintained from the same code base.)
How do you get a project's code onto the production Airflow server/service?
- If you run Airflow on a VM, you can manually check out the project under airflow/projects.
- You can use a config management tool like Chef to deploy new projects on the Airflow VM.
- If you run Airflow as a service on EKS (or Kubernetes Engine on GCP), you can use CodeBuild or CodePipeline on AWS to build and deploy your Kubernetes service.
I will write an article on each of these soon…
Create complex DAGs vs. write complex code and execute it in containers
As the number of DAGs and their complexity increase, it takes more resources to execute them. That means if your production Airflow environment is on a VM, you will have to scale up. (Note: if you use the Kubernetes executor or another distributed executor, this doesn't apply to you.)
To overcome this, I create a container for each project and execute the container in ECS via Airflow.
This lets me keep my Airflow instance size to a minimum; the containers get executed on Fargate, and I only pay for the time they need to run.
And since DAGFactory keeps my code bases separate, I can have DAG.py and the Dockerfile in the same repository, put DAG.py on the Airflow instance, and push the Docker image to ECR.
Of course, you cannot use Airflow operators inside the container for your ETL or other processing, but you can still build your own library of modules and use them across different pipelines.
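Such a shared library can stay plain Python with no Airflow dependency; here is a tiny illustrative module (both function names are made up for the example):

```python
# pipeline_utils.py — hypothetical helpers shared across project containers.
def chunk_records(records, size):
    """Split a list of records into fixed-size batches for loading."""
    return [records[i:i + size] for i in range(0, len(records), size)]


def render_s3_key(project, dt):
    """Build a date-partitioned S3 key, e.g. test_project/2023/01/05/data.json."""
    return f"{project}/{dt:%Y/%m/%d}/data.json"
```

Because modules like this import nothing from Airflow, the same code runs unchanged inside any container, in local tests, or on the Airflow instance itself.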
So why keep Airflow?
Well, using Airflow I can track failures, kick off a job (in this case a container) whenever I want, and look at the logs in Airflow, with detailed logs in CloudWatch.
In short, there's no need to log in to any instance to debug issues.
