#3 Airflow in Production: reliable and effective monitoring

Anas El Khaloui
Published in hipay-tech
4 min read · Aug 1, 2022

Hey everyone! 👋

Once you’ve welcomed Airflow as your data stack’s orchestrator, it will quickly become a critical piece of infrastructure.
If you run it in production, you need to make sure it’s running smoothly 24/7, as all your pipeline executions depend on it directly.

In this story, we will share the choices we made at HiPay to make sure of that: exhaustive and effective monitoring.

TL;DR:

Running a stable and robust Airflow instance requires several levels of monitoring, with alerting (Slack works 10/10) on top of them:

  1. Host monitoring: use an IT Infrastructure Monitoring tool to make sure the host Airflow runs on is up and healthy at all times (if you’re not using a serverless version of Airflow)
  2. Airflow service monitoring: are all the Airflow services (scheduler and back-end database) running smoothly 24/7?
  3. End-to-end monitoring: is your instance capable of running a dummy DAG from A to Z?

Please note that we are currently running Airflow on premises, as we are going through a cloud migration. This means we have a hybrid architecture (cloud and on-premises resources working together). We decided to host and manage Airflow ourselves because cloud services are easily reachable from any location, whereas our on-premises servers are under strict security policies.

We consider data systems to be production systems. We use a DTAP architecture, and our on-premises servers are replicated across different data centers and locations to ensure backups and maximum availability of our services, as we operate in the payment industry.

We reveal in this story how we monitor our Airflow instances in order to be 99.9999% sure they are running as expected at all times, and to react very quickly in case of emergency. This has been working very well so far.

Please note that you don’t need to implement all these measures, just take what suits your needs and possibilities 😊

You need to know when things in production are not working!

1. Make sure your hosts are doing well

Apache Airflow lives on a host (server or VM) and depends on it. Therefore, you need to monitor the health of your hosts.
Your Ops team most certainly has some kind of IT Infrastructure Monitoring tool like Datadog, Zabbix or Nagios: use it for your Airflow hosts as you would for any other production server.
You need to make sure that everything is OK on the hosts at all times (memory consumption, available storage, CPU load, etc.), and get an alert the moment there’s an issue so you can respond quickly.

2. Make sure all Airflow services are running

Airflow provides a health check API endpoint at /health (official doc) that returns a JSON payload like the one below, with the health status of the two main Airflow services, the back-end database and the scheduler:

{
  "metadatabase": {
    "status": "healthy"
  },
  "scheduler": {
    "status": "healthy",
    "latest_scheduler_heartbeat": "2018-12-26 17:15:11+00:00"
  }
}
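
For illustration, here is a minimal sketch of how this endpoint could be polled from a script (the base URL is an assumption, adjust it to your deployment):

import requests

AIRFLOW_BASE_URL = "http://localhost:8080"  # assumption: adjust to your deployment

def airflow_is_healthy(base_url: str = AIRFLOW_BASE_URL) -> bool:
    """Return True if both the metadatabase and the scheduler report 'healthy'."""
    response = requests.get(f"{base_url}/health", timeout=10)
    response.raise_for_status()
    health = response.json()
    return all(
        health[component]["status"] == "healthy"
        for component in ("metadatabase", "scheduler")
    )

if __name__ == "__main__":
    print("Airflow healthy:", airflow_is_healthy())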

This is convenient, but only works if a web server is up. If the web server (serving the WebUI) is down or unreachable, you will lose the ability to monitor your core Airflow services.

This is why we prefer to do CLI checks from a remote location, using our favorite IT Infrastructure Monitoring tool (again!). Here are the commands:

# This first command checks if the local scheduler is up and running
airflow jobs check --job-type SchedulerJob --hostname "$(hostname)";
# This one checks the status of the Airflow database
airflow db check;

These checks can be run every 10 minutes or so. More information here.
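
If your monitoring tool can run custom scripts, a small wrapper like this one (a sketch, not our actual probe) turns the two commands into a single pass/fail check with an exit code:

import subprocess
import sys

# Sketch: run both CLI checks and exit non-zero if any of them fails,
# so a monitoring agent can alert on the exit code. Assumes the airflow
# CLI can reach the metadata database from wherever this script runs.
CHECKS = [
    ["airflow", "jobs", "check", "--job-type", "SchedulerJob"],
    ["airflow", "db", "check"],
]

def main() -> int:
    for cmd in CHECKS:
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode != 0:
            print(f"CHECK FAILED: {' '.join(cmd)}\n{result.stderr}", file=sys.stderr)
            return 1
    print("All Airflow checks passed")
    return 0

if __name__ == "__main__":
    sys.exit(main())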

You could also adopt a more integrated approach and periodically check whether Airflow is able to run an end-to-end DAG, connect to databases, etc.
This is done using a Canary DAG that doesn’t do any real work: it contains calls to the DummyOperator, PythonOperator (print something) and BashOperator (echo something), and can establish database connections. A minimal sketch is shown below.
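
Here is what such a Canary DAG could look like. This is a sketch with Airflow 2.x imports, not our exact DAG: the DAG id, schedule and task names are assumptions, and the database-connection task is only hinted at in a comment.

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.dummy import DummyOperator
from airflow.operators.python import PythonOperator

# Canary DAG sketch: it does no real work, but if it stops succeeding you
# know that scheduling, task execution or a connection is broken.
with DAG(
    dag_id="canary",
    start_date=datetime(2022, 1, 1),
    schedule_interval="*/30 * * * *",  # assumption: every 30 minutes
    catchup=False,
    tags=["monitoring"],
) as dag:
    start = DummyOperator(task_id="start")

    python_check = PythonOperator(
        task_id="python_check",
        python_callable=lambda: print("canary: python task ran"),
    )

    bash_check = BashOperator(
        task_id="bash_check",
        bash_command="echo 'canary: bash task ran'",
    )

    # A real canary could add a task here that opens a connection to each
    # critical database (e.g. via an Airflow Hook) and runs a SELECT 1.

    start >> [python_check, bash_check]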

3. Get an alert when a task run fails

This one is absolutely essential: get alerted the moment a task run fails. It will greatly improve your response time and ease your analysis.
At HiPay, when any of our tasks fails, we immediately get a notification in a dedicated Slack channel with a link to the logs. This often sparks a conversation thread that stops when the problem is fixed.

An example of an Airflow failure notification via Slack
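
How you wire this up depends on your setup. One common pattern (a sketch, not our exact code) is a task failure callback that posts to a Slack incoming webhook; here the webhook URL is assumed to be stored in an Airflow Variable:

import requests

from airflow.models import Variable

def slack_failure_alert(context):
    """Post a message to a Slack channel when a task instance fails."""
    task_instance = context["task_instance"]
    message = (
        ":red_circle: Airflow task failed\n"
        f"*DAG*: {task_instance.dag_id}\n"
        f"*Task*: {task_instance.task_id}\n"
        f"*Execution date*: {context.get('execution_date')}\n"
        f"*Log*: {task_instance.log_url}"
    )
    webhook_url = Variable.get("slack_webhook_url")  # assumption: webhook stored as an Airflow Variable
    requests.post(webhook_url, json={"text": message}, timeout=10)

# Attach the callback to every task of a DAG through default_args:
default_args = {"on_failure_callback": slack_failure_alert}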

Give us a little 👏 if you found this useful, or leave a comment if you want more focus on specific aspects in the future 😉

Thanks for reading!


Anas El Khaloui, Data Science Manager ~ anaselk.com / I like understanding how stuff works, among other things : )