Airflow on Kubernetes at Scale for Data Engineering (Dependencies Simplified)

Himanshu Gaurav
6 min read · Feb 7, 2023


People and companies are continuously becoming more data-driven and are developing processes/data pipelines as part of their daily business requirements. The data volumes involved in these business processes have increased substantially from megabytes per day to gigabytes per second. With the rise of complex data solutions, automating and orchestrating data processes is becoming increasingly imperative.

Workflow orchestration is one of the most important pillars of today’s data architecture. With the advent of modern orchestration tools like Airflow, Dagster, and Prefect, the challenge is very much solvable; still, running these tools at scale requires a good understanding of how they work. We will use Airflow as the example solution to explain the salient features that let you quickly build scheduled data pipelines in a flexible Python framework, while also providing many building blocks for stitching together the different technologies found in modern technology landscapes. Many organizations rely heavily on Airflow for workflow orchestration in the data space because it is open source, flexible in terms of deployment, scalable both horizontally and vertically, runs efficiently in any cloud ecosystem, and supports complex data processes.

Below are some recommendations based on our experiences with Airflow on how to run it at scale.

Containerize your Airflow Deployment

We have learned from experience that the best way to scale your Airflow deployment is by taking advantage of containerization and running Airflow on Kubernetes (K8s).

The container-based deployment pattern gives you built-in reusability: your Airflow tasks run in Kubernetes pods, and DAG code is managed centrally via a sync from GitHub, which simplifies versioning and maintenance. It is a proven way to deploy, maintain, and scale Airflow.
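As an illustration, here is a minimal sketch of a task that runs in its own Kubernetes pod via the KubernetesPodOperator. The image, namespace, and DAG names are hypothetical, and the exact import path depends on your cncf.kubernetes provider version.

from datetime import datetime

from airflow import DAG
# The import path may differ depending on the cncf.kubernetes provider version.
from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import (
    KubernetesPodOperator,
)

with DAG(
    dag_id="containerized_task_example",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Each task runs in its own pod; the image and namespace are hypothetical.
    extract_sales = KubernetesPodOperator(
        task_id="extract_sales",
        name="extract-sales",
        namespace="data-pipelines",
        image="registry.example.com/etl/extract:1.0",
        cmds=["python", "extract.py"],
        get_logs=True,
    )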

Simplified Airflow Architecture
  • A scheduler, which handles both triggering scheduled workflows and submitting tasks to the executor to run.
  • An executor, which handles running tasks. In the default Airflow installation, this runs everything inside the scheduler, but most production-suitable executors push task execution out to workers.
  • A webserver, which presents a handy user interface to inspect, trigger, and debug the behavior of DAGs and tasks.
  • A folder of DAG files, read by the scheduler and executor.
  • A metadata database, used by the scheduler, executor, and webserver to store state.
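To make these pieces concrete, below is a minimal sketch of a DAG file as it would live in the DAG folder that the scheduler and executor read; the DAG id, schedule, and task logic are purely illustrative and assume a recent Airflow 2.x with the TaskFlow API.

from datetime import datetime

from airflow.decorators import dag, task


@dag(schedule_interval="@hourly", start_date=datetime(2023, 1, 1), catchup=False)
def example_pipeline():
    """A trivial DAG the scheduler picks up from the DAG folder."""

    @task
    def extract():
        # Hypothetical extraction step.
        return [1, 2, 3]

    @task
    def load(rows):
        # Hypothetical load step; the state of each run lands in the metadata database.
        print(f"Loaded {len(rows)} rows")

    load(extract())


example_pipeline()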

Modernization Strategy:

Airflow is continuously maturing, and not keeping up with the latest releases results in challenges: scaling becomes increasingly difficult, there is no seamless way to upgrade DAGs to the latest version, security vulnerabilities accumulate, and maintenance grows more expensive over time.

It is always recommended to get current and stay current on Airflow, and to design your orchestration platform so that it remains simple to scale and to adopt new features. For instance, Airflow 2.0 introduced a comprehensive, stable REST API that sets a strong foundation for requirements like event-based dependencies on external systems, triggering DAGs across deployments (useful for ML engineering), and so on.
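For example, an external system or an upstream orchestrator can trigger a DAG run through the stable REST API. This is a minimal sketch assuming basic authentication is enabled for the API; the URL, credentials, and DAG id are hypothetical.

import requests

# Hypothetical deployment URL and credentials; the authentication scheme depends
# on the auth backend configured for the Airflow REST API.
AIRFLOW_URL = "https://airflow.example.com"

response = requests.post(
    f"{AIRFLOW_URL}/api/v1/dags/daily_sales_load/dagRuns",
    auth=("api_user", "api_password"),
    json={"conf": {"triggered_by": "upstream_warehouse_job"}},
    timeout=30,
)
response.raise_for_status()
print("Created run:", response.json()["dag_run_id"])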

Build custom utility operators, standard tasks, and a common DAG structure to promote reusability and standardization with Airflow across your organization.
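As a sketch of what such a utility operator might look like (the operator name, parameters, and load logic are hypothetical):

from airflow.models.baseoperator import BaseOperator


class StandardizedLoadOperator(BaseOperator):
    """Hypothetical utility operator wrapping an organization-wide load pattern."""

    def __init__(self, *, table, source_path, **kwargs):
        super().__init__(**kwargs)
        self.table = table
        self.source_path = source_path

    def execute(self, context):
        # Centralizing logging, retry, and notification conventions here means
        # every team that uses this operator gets them for free.
        self.log.info("Loading %s into %s", self.source_path, self.table)
        # ... call the shared load library here ...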

Flavors of Orchestration with Simplified Dependencies:

As we build data pipelines that serve many clients in the data ecosystem, we may need to orchestrate the flow end to end in a seamless way. Take any large enterprise with a data warehousing setup: it is bound to have workflow management capabilities already, and it will either leverage those to orchestrate data lake ecosystem workflows or use cloud offerings like GCP Cloud Composer or AWS MWAA. When we started our cloud data lake journey, we had to leverage our managed Airflow setup and solve several problems: how to integrate our cloud data lake workflows with the existing traditional orchestrator and on-prem workflow management tools, how to orchestrate workflows that need scheduling beyond cron-style triggers, how to optimize orchestrator slots for workloads, and how to visualize dependencies across workloads. It is vital to think about these challenges upfront rather than later. The sketch below shows one common pattern for wiring a dependency between two Airflow workflows.
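Within a single Airflow deployment, one common way to express such a dependency is an ExternalTaskSensor that makes a downstream workflow wait on an upstream one. This is a sketch with hypothetical DAG and task names, assuming both DAGs run on the same daily schedule; dependencies on external systems would instead go through the REST API or an event-driven layer.

from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator
from airflow.sensors.external_task import ExternalTaskSensor

with DAG(
    dag_id="datalake_consumer",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Wait for a task in another DAG (hypothetical names) before starting.
    wait_for_warehouse_extract = ExternalTaskSensor(
        task_id="wait_for_warehouse_extract",
        external_dag_id="warehouse_extract",
        external_task_id="publish_to_landing_zone",
        mode="reschedule",    # frees the worker slot while waiting
        timeout=6 * 60 * 60,  # give up after six hours
    )

    start_transformations = EmptyOperator(task_id="start_transformations")

    wait_for_warehouse_extract >> start_transformations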

Airflow Deployment Architecture on Kubernetes

In most cases, a time-triggered approach that assumes data availability works. One of its main drawbacks is that we lose a certain amount of time waiting on every integrated dataset, which balloons the end-to-end pipeline run duration. A purely time-driven approach may not be ideal for SLA-driven workloads, so we leverage in-house event-driven systems that integrate with on-prem and traditional orchestrators. This also helped us simplify the micro- and macro-orchestration needs in our ecosystem, enterprise calendar-driven workflow needs, and so on.
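Within Airflow itself, one way to move from time-driven to data-driven triggering is data-aware scheduling, available from Airflow 2.4 onward. The sketch below uses a hypothetical dataset URI and DAG names; our own setup uses an external event-driven layer instead, but the idea is similar.

from datetime import datetime

from airflow import DAG
from airflow.datasets import Dataset
from airflow.operators.python import PythonOperator

# Hypothetical dataset URI identifying the landed sales extract.
SALES_EXTRACT = Dataset("s3://datalake/landing/sales/")

with DAG(
    dag_id="sales_producer",
    start_date=datetime(2023, 1, 1),
    schedule="@hourly",
    catchup=False,
) as producer:
    PythonOperator(
        task_id="land_sales_extract",
        python_callable=lambda: print("extract landed"),
        outlets=[SALES_EXTRACT],  # marks the dataset as updated on success
    )

with DAG(
    dag_id="sales_mart",
    start_date=datetime(2023, 1, 1),
    schedule=[SALES_EXTRACT],  # runs when the dataset is updated, not on a cron
    catchup=False,
) as consumer:
    PythonOperator(
        task_id="build_sales_mart",
        python_callable=lambda: print("mart rebuilt"),
    )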

CI/CD with Airflow:

Regarding the DAG code itself, we can standardize it and drive it through configuration. For very large data integrations, automated generation of DAG code becomes vital to save time and to produce standardized code with all logging and notification callbacks built in. As the data ecosystem matures, this automated code generation makes it easier to adopt new orchestrator capabilities and standards that we may have to embed in our workflows at a later stage. Based on the organization's existing capabilities, we can set up robust CI/CD pipelines tailored to our needs that run linting checks and deliver the final DAG code to the object store or compute node path mounted as the orchestrator's DAG folder.
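A minimal sketch of configuration-driven DAG generation follows; the pipeline configuration is hypothetical and would typically come from versioned config files that the CI/CD pipeline lints and ships alongside the code.

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# Hypothetical configuration; in practice this would be loaded from versioned
# config files validated by the CI/CD pipeline.
PIPELINE_CONFIGS = [
    {"source": "orders", "schedule": "@hourly"},
    {"source": "customers", "schedule": "@daily"},
]

for config in PIPELINE_CONFIGS:
    dag_id = f"ingest_{config['source']}"

    with DAG(
        dag_id=dag_id,
        start_date=datetime(2023, 1, 1),
        schedule_interval=config["schedule"],
        catchup=False,
    ) as dag:
        BashOperator(
            task_id="run_ingestion",
            bash_command=f"echo ingesting {config['source']}",
        )

    # Register each generated DAG at module level so the scheduler discovers it.
    globals()[dag_id] = dag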

Monitoring & Alerting:

Many large enterprises already have, or should have, robust processes in place for ticketing (ServiceNow), alerting (PagerDuty), and notifications (Teams/Slack), and it is vital that data lake operations tie into those systems seamlessly. This integration can also be leveraged to interact with upstream teams, third-party vendors, and end users who may not have direct access to our data lake ecosystem.

Fortunately, many of these orchestration tools offer an API-driven way to integrate with them. We leverage an event-driven approach and have a standardized way for all workloads to notify the concerned teams and, if necessary, open tickets in their queues for workflow failures, SLA breaches, long-running jobs, and data availability. This lightweight layer sits on top of Airflow and many other tools in our ecosystem; it mediates severity, formats and parses workload logs, and summarizes events so that notifications and alerts are much more meaningful and actionable for the operations team. We have decoupled this layer into a separate entity in our tech stack that can be scaled and operated independently, and it can be extended per your enterprise requirements. Airflow and other popular data orchestration tools let us standardize and hook up callback functions to interact with this layer, as shown in the sketch below.
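As a sketch of how such a hook-up might look, the callback below forwards failure context to a hypothetical endpoint of the decoupled alerting layer, which then decides severity and routes the message to Teams/Slack, PagerDuty, or ServiceNow.

from datetime import datetime

import requests

from airflow import DAG
from airflow.operators.bash import BashOperator

# Hypothetical endpoint of the decoupled alerting/notification layer.
ALERTING_ENDPOINT = "https://alerts.example.com/events"


def notify_alerting_layer(context):
    """Forward failure context to the alerting layer for routing and severity."""
    task_instance = context["task_instance"]
    requests.post(
        ALERTING_ENDPOINT,
        json={
            "dag_id": task_instance.dag_id,
            "task_id": task_instance.task_id,
            "logical_date": str(context["logical_date"]),
            "log_url": task_instance.log_url,
        },
        timeout=10,
    )


with DAG(
    dag_id="monitored_pipeline",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
    default_args={"on_failure_callback": notify_alerting_layer},
) as dag:
    BashOperator(task_id="load_step", bash_command="echo load step")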

To summarize, this article explains how to implement an orchestration tool in your ecosystem, using Airflow as the example, and how it fits into running your data ecosystem at scale on an open data architecture. Please refer to the two blogs below for a complete end-to-end picture.

For more details on “Composable Data Architecture,” please refer to our earlier blog. (https://medium.com/@DataEnthusiast/open-data-architecture-at-scale-on-cloud-part-1-3381b411533f)

For more details on “Designing Compute & Storage for Composable Open Data Architecture on Cloud,” please refer to our earlier blog. (https://medium.com/@DataEnthusiast/designing-compute-storage-for-composable-open-data-architecture-on-cloud-61228d0e31)

I hope you found it helpful! Thanks for reading!

Let’s connect on Linkedin!

Authors

Himanshu Gaurav — www.linkedin.com/in/himanshugaurav21

Bala Vignesh S — www.linkedin.com/in/bala-vignesh-s-31101b29

