Lessons learnt while Airflow-ing
Vendor/Solution Selection Problem
There are various tasks that run as cron jobs on your servers, especially when your company’s core value proposition is built on top of data. Some of these tasks are: batch processes, scheduled periodic data replications, regular metric updates etc. Executing these data pipelines can quickly become irreproducible and messy. I faced a similar set of problems while managing the data assets @ Snaptravel. I wanted to run some batch processes to load data from various sources (static files on S3, APIs) into our application database (Postgres), and add logging and retry logic to current cron jobs; I called them smart cronjobs. While researching possible solutions, I had to choose between Luigi and Airflow. Snaptravel uses Python heavily, so it made sense to use a tool that treats Python as a first-class citizen. After reading Luigi vs Airflow vs Pinball and the Hackernews discussion, I decided to go with Airflow because of its various triggering mechanisms, beautiful UI and it being an Apache project (larger community support).
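To make the "smart cronjob" idea concrete, here is a minimal sketch of the kind of retry-and-logging wrapper I wanted around existing jobs. The decorator name, parameters and `flaky_job` are all invented for illustration, not from any library:

```python
import logging
import time
from functools import wraps

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("smart_cron")

def with_retries(max_attempts=3, delay_seconds=0):
    """Retry a job function, logging every failure and success (sketch)."""
    def decorator(job):
        @wraps(job)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    result = job(*args, **kwargs)
                    log.info("%s succeeded on attempt %d", job.__name__, attempt)
                    return result
                except Exception:
                    log.exception("%s failed on attempt %d", job.__name__, attempt)
                    if attempt == max_attempts:
                        raise  # out of retries: let the scheduler see the failure
                    time.sleep(delay_seconds)
        return wrapper
    return decorator

@with_retries(max_attempts=3)
def flaky_job(state={"calls": 0}):
    # Simulates a job that fails twice before succeeding.
    state["calls"] += 1
    if state["calls"] < 3:
        raise RuntimeError("transient failure")
    return "done"
```

Airflow gives you this (and much more) via per-task `retries` and per-try log files, which is a big part of why it replaced the hand-rolled version.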
Infrastructure, Setup and Automation
It is not very hard to read the documentation tutorial and get a simple Airflow setup going. But running it in production, tailored to your existing cloud architecture and code, can be challenging.
Three services are essential for Airflow to run as expected:
- Airflow Webserver: A Flask server run using Gunicorn. This is responsible for serving the UI dashboard over HTTP.
- Airflow Scheduler: A daemon built using the python-daemon library. Airflow’s scheduler concept is simple but more powerful than cron. The docs are well written and should be read before writing your DAGs.
- Airflow Worker: A wrapper around a Celery worker when using the CeleryExecutor.
This blog post is a quick read and helps with setting up distributed infrastructure (a server-worker cluster). Depending on the size of your data and the number of tasks that need to run at a given time, you need to decide on an executor. I chose to set up the CeleryExecutor using Redis/RabbitMQ, as it will let me scale without any major hassle in the future.
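For reference, the relevant airflow.cfg settings for a CeleryExecutor setup with a Redis broker look roughly like the fragment below. The hostnames and credentials are placeholders, and the exact key names have shifted between Airflow releases, so check the docs for your version:

```ini
[core]
executor = CeleryExecutor
# Metadata DB shared by webserver, scheduler and workers
sql_alchemy_conn = postgresql+psycopg2://airflow:airflow@localhost:5432/airflow

[celery]
# Message broker the scheduler uses to hand tasks to workers
broker_url = redis://localhost:6379/0
# Backend where Celery stores task state/results
celery_result_backend = db+postgresql://airflow:airflow@localhost:5432/airflow
```

All three services (webserver, scheduler, workers) must point at the same metadata database and broker, otherwise tasks get queued but never picked up.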
To set up basic authentication, you can enable password-based auth using the instructions @ (https://airflow.incubator.apache.org/security.html#password). Also, to bootstrap the database with an admin user at setup time, you can use a script to create your first user; some example code @ https://gist.github.com/nehiljain/0775380f327610ae8b025897716535bd.
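At the time of writing (Airflow’s incubator era), turning on password auth came down to installing the extra (`pip install airflow[password]`) and two airflow.cfg lines; the backend module path below matches the 1.x contrib layout and may differ in later releases:

```ini
[webserver]
authenticate = True
auth_backend = airflow.contrib.auth.backends.password_auth
```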
Code and Testing
I highly recommend creating a dummy DAG (some modified version of this) with 2 tasks that print timestamps; this will help you verify that everything works as you expect. Here are a few things to check:
- the webserver can create new DAG runs and updates the UI accordingly,
- the scheduler triggers task runs correctly,
- workers are able to read task code and run it in the correct environment,
- task logs are being published properly,
- and finally, the Flower dashboard shows the number of active tasks running on workers.
(Phew! That’s quite a lot of things to verify before you can proudly say that your setup is complete and functioning.)
While writing data flow and transformation scripts, it’s very important to think about data quality checks along the way. Here are a few things you can start testing: your assumptions about data types, the size of the data at the source and the destination, and data validation rules based on the transformation logic. Every DAG should ideally end in a testing task which runs these checks.
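As a sketch of what such an end-of-DAG testing task might assert, here is a self-contained example using an in-memory sqlite3 table as a stand-in for the Postgres destination. The table name, columns and rules are invented for illustration:

```python
import sqlite3

def run_quality_checks(conn, source_rows):
    """Illustrative end-of-DAG checks: sizes match and value rules hold."""
    failures = []

    # 1. Size check: destination row count should match the source extract.
    dest_count = conn.execute("SELECT COUNT(*) FROM bookings").fetchone()[0]
    if dest_count != len(source_rows):
        failures.append(f"row count mismatch: {dest_count} != {len(source_rows)}")

    # 2. Type/validation check: prices must be positive numbers.
    for rowid, price in conn.execute("SELECT rowid, price FROM bookings"):
        if not isinstance(price, (int, float)) or price <= 0:
            failures.append(f"bad price in row {rowid}: {price!r}")

    return failures

# Toy destination table standing in for the application database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bookings (city TEXT, price REAL)")
source = [("Toronto", 120.0), ("Montreal", 95.5)]
conn.executemany("INSERT INTO bookings VALUES (?, ?)", source)

failures = run_quality_checks(conn, source)
```

In a real DAG the final task would raise if `failures` is non-empty, so a bad load fails loudly instead of silently poisoning downstream tables.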
Some other best practices I follow are that tasks should be idempotent, stateless and contained. Logging statements should be used generously to record successes and failures; this will help you monitor the internals of your tasks with ease. Airflow is a relatively new project under heavy development by Airbnb and the Apache community together, hence it is quite important to keep your version and code up to date with the latest release.
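A common way to keep a load task idempotent is delete-then-insert for the partition being processed, so re-running the task for the same execution date cannot duplicate rows. A minimal sqlite3 sketch (the table and column names are invented):

```python
import sqlite3

def load_daily_metrics(conn, ds, rows):
    """Idempotent load: wipe the target date's partition, then insert.

    Re-running for the same `ds` always yields the same final state.
    """
    with conn:  # one transaction: delete + insert succeed or fail together
        conn.execute("DELETE FROM metrics WHERE ds = ?", (ds,))
        conn.executemany(
            "INSERT INTO metrics (ds, name, value) VALUES (?, ?, ?)",
            [(ds, name, value) for name, value in rows],
        )

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE metrics (ds TEXT, name TEXT, value REAL)")
rows = [("signups", 42), ("bookings", 17)]
load_daily_metrics(conn, "2017-06-01", rows)
load_daily_metrics(conn, "2017-06-01", rows)  # safe to re-run: no duplicates
```

This plays nicely with Airflow’s retry and backfill features: a task that is re-triggered for a past date simply overwrites its own partition.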
Another thing to note is that Airflow is built for static flows. So if your requirements demand dynamic workflow creation, then Airflow will not fit the bill. Let me explain. Airflow is designed to define DAG configurations tied to a dag_id which chains some tasks together, and you can very well define these tasks programmatically (hence dynamic). Sometimes, though, it is required to spawn DAGs representing a graph of tasks custom-made for a particular dataset; this is something Airflow can achieve but is not good at.
Some personal values that were reinforced by this project:
1. Be hungry for failure, don’t fear it
2. KISS: Complexity and over engineering should be avoided as much as possible
3. Don’t be afraid to dive deeper to debug the problem
My list of airflow reads:
- Podcast.__init__ Airflow episode
- Airflow architecture @ Agari
- Building data pipelines with python
- Airflow: Tips, Tricks and pitfalls
- Airflow and Future of Data Engineering
- Intro to Airflow by Airbnb Data Eng (Arthur)