
Scaling DAG Creation With Apache Airflow

Ed Turner
Towards Data Science
5 min read · Jan 6, 2020
Photo by Campaign Creators on Unsplash

One of the more difficult tasks within the Data Science community is not designing a model for a well-constructed business problem, nor developing a code-base that operates in a scalable environment. Rather, it is arranging the tasks in the ETL or Data Science pipeline, executing the model on a periodic basis, and automating everything in between.

This is where Apache Airflow comes to the rescue! With the Airflow UI displaying tasks in graph form, and with the ability to define your workflow programmatically for better traceability, it becomes much easier to define and configure a Data Science workflow in production.

One difficulty still remains, though. There are circumstances where the same monolithic modelling process is applied to several different data sources. To increase performance, it is better to have each of these processes run concurrently, rather than adding them all to the same DAG.

No problem, let us simply create a DAG for each process, all with similar tasks, and schedule them to run at the same time. But if we want to follow the software development principle DRY (Don't Repeat Yourself), is there a way to create multiple different DAGs with the same type of tasks without having to write each one by hand?




Written by Ed Turner

As a Senior Data Scientist based in Tampa, FL, I am passionate about technology and machine learning. Please visit ed-turner.github.io for more information about me.
