Apache Airflow and Kubernetes — Pain Points and Plugins to the Rescue

Mason McGough · Published in The Startup · 10 min read · May 20, 2020

Apache Airflow is one of the most popular task management systems for orchestrating data pipeline tasks. It is designed primarily with extract-transform-load (ETL) pipelines in mind and supports cloud and Kubernetes deployments. It also provides a built-in UI that is extensible with Python plugins, allowing you to serve your own web pages with custom back-end logic. Most importantly, from its active Slack channel to its well-documented GitHub page, the rich community backing Airflow is an invaluable resource for any developer looking to use Airflow in a project.
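
To give a rough idea of what such a plugin looks like, here is a minimal sketch that registers a custom page with the Airflow webserver via a Flask blueprint. The plugin name, route, and response are made-up placeholders; the file would live in Airflow's plugins/ directory.

```python
from airflow.plugins_manager import AirflowPlugin
from flask import Blueprint

# A Flask blueprint serving a custom page with its own back-end logic.
# The blueprint name and URL prefix here are illustrative placeholders.
hello_blueprint = Blueprint(
    "hello_plugin_bp",
    __name__,
    url_prefix="/hello",
)


@hello_blueprint.route("/")
def hello():
    # Arbitrary Python logic can run here before returning a response.
    return "Hello from a custom Airflow plugin!"


class HelloPlugin(AirflowPlugin):
    # Registering the blueprint exposes the page through the Airflow webserver.
    name = "hello_plugin"
    flask_blueprints = [hello_blueprint]
```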

However, the power and versatility of Airflow come with great complexity. At the time of writing, the airflow.operators module contains 36 unique operators, covering everything from email and MySQL to Hive and Slack, and that does not even count the operators in airflow.contrib. The project is vast and deep, not to mention its features are under constant development and revision. Its support for Kubernetes in particular is relatively new, and we hit a few snags while trying to employ it for our purposes. In this article, I will explore some of the pain points we struggled with, as well as how (and if) we overcame them in our exploration of Kubernetes for Airflow.
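
For context, here is a minimal sketch of how the Kubernetes integration is typically used: a DAG with a single task that runs inside its own pod via the KubernetesPodOperator. The import path shown is the Airflow 1.10-era contrib location (it later moved to a provider package), and the DAG id, namespace, and image are placeholder values.

```python
from datetime import datetime

from airflow import DAG
from airflow.contrib.operators.kubernetes_pod_operator import KubernetesPodOperator

# A single-task DAG that launches a short Python command in a Kubernetes pod.
with DAG(
    dag_id="k8s_pod_example",
    start_date=datetime(2020, 5, 1),
    schedule_interval=None,
) as dag:
    run_in_pod = KubernetesPodOperator(
        task_id="run_in_pod",
        name="run-in-pod",
        namespace="default",          # placeholder namespace
        image="python:3.7-slim",      # placeholder container image
        cmds=["python", "-c"],
        arguments=["print('hello from a pod')"],
    )
```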

Background
