Airflow Part 2: Lessons learned

Nehil Jain
Super.com
Jun 17, 2018 · 7 min read


At SnapTravel we use Apache Airflow to orchestrate our batch processes. It is one of the key systems we depend on for keeping track of business metrics, building and testing natural language models, and mining for product insights. After almost two years of working with Airflow, we want to consolidate our lessons and share the nuances with other developers.

For an overview of how Airflow was first implemented at SnapTravel, see our prior blog post, where we discussed the benefits of Airflow, the major components of its architecture, and a bunch of resources for those just getting started. In this post we assume the reader is already familiar with concepts like Operators (such as PythonOperator), the Scheduler, DAGs, and Tasks. If not, give the Airflow Tutorial a quick read!
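
If those terms are fuzzy, here is a minimal sketch of how they fit together, written against the Airflow 1.x API that was current at the time of writing: a DAG that runs a single PythonOperator task on a daily schedule. The DAG id, schedule, owner, and callable below are illustrative examples, not code from our pipelines.

from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python_operator import PythonOperator

# Default arguments applied to every task in this DAG.
default_args = {
    'owner': 'data-eng',  # illustrative owner name
    'start_date': datetime(2018, 6, 1),
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
}

def say_hello():
    # The business logic the task executes.
    print('Hello from Airflow!')

# The DAG groups tasks and tells the Scheduler when to run them.
dag = DAG('example_hello_dag',
          default_args=default_args,
          schedule_interval='@daily')

# A Task is an instance of an Operator bound to a DAG.
hello_task = PythonOperator(task_id='say_hello',
                            python_callable=say_hello,
                            dag=dag)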

This post focuses on the Develop part of building data pipelines with Airflow. We’ll explore learnings that each cost us more than a couple of hours while writing DAGs, along with some best practices picked up along the way. These are quirks and unexpected behaviours of Airflow that might otherwise cost you hours of debugging. Airflow is incubating under the Apache umbrella and is being actively improved, so our hope is that some of these problems will no longer be relevant in upcoming releases!

Python Version for my project — py3 or py2?
