Airflow in Production: What You Need to Know and How to Do It Right

Sepideh Hosseinian
4 min read · Jun 27, 2023

Airflow is an open-source workflow management platform that enables scheduling and monitoring workflows programmatically. It is widely used by data engineers and analysts to orchestrate complex data pipelines and automate tasks. However, running Airflow in production requires some careful planning and configuration to ensure its reliability, scalability and performance. In this article, we will cover some of the best practices and tips for using Airflow in production.

Database Backend
Airflow comes with an SQLite backend by default. This allows the user to run Airflow without any external database. However, such a setup is meant for testing purposes only; running the default setup in production can lead to data loss in multiple scenarios. If you want to run production-grade Airflow, make sure you configure the backend to be an external database such as PostgreSQL or MySQL.

You can change the backend using the following config:

[database]
sql_alchemy_conn = my_conn_string
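
For example, with PostgreSQL the connection string follows the standard SQLAlchemy URI format (the credentials, host and database name below are placeholders for your own values):

[database]
sql_alchemy_conn = postgresql+psycopg2://airflow_user:airflow_pass@postgres-host:5432/airflow_db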

Once you have changed the backend, Airflow needs to create all the tables required for operation. Create an empty DB and give Airflow’s user permission to CREATE/ALTER it. Once that is done, you can run:

airflow db upgrade

This command keeps track of migrations already applied, so it’s safe to run as often as you need.

Note: Do not use airflow db init, as it creates a lot of default connections, charts, etc. that are not required in a production DB.

Multi-Node Cluster
Airflow uses the SequentialExecutor by default. However, by its nature, it can run at most one task at a time, and it pauses the scheduler while a task is running, so it is not recommended for a production setup. Use the LocalExecutor on a single machine. For a multi-node setup, use the KubernetesExecutor or the CeleryExecutor.
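
For example, switching to the CeleryExecutor is done in the [core] section. The broker and result backend shown here (a Redis broker and the metadata database) are just one common choice, and the hosts and credentials are placeholders:

[core]
executor = CeleryExecutor

[celery]
broker_url = redis://redis-host:6379/0
result_backend = db+postgresql://airflow_user:airflow_pass@postgres-host:5432/airflow_db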

Once you have configured the executor, you need to make sure that every node in the cluster contains the same configuration and DAG files. Airflow only sends simple instructions such as “execute task X of DAG Y”, but does not ship any DAG files or configuration. You can use a simple cronjob or any other mechanism to sync DAGs and configs across your nodes, e.g., check out the DAGs from a Git repository every 5 minutes on all nodes.
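
A minimal crontab entry for such a sync could look like the following, assuming the DAGs folder on each node is a checkout of your DAG repository at /opt/airflow/dags (the paths are illustrative):

# Pull the latest DAGs from the Git repo every 5 minutes
*/5 * * * * cd /opt/airflow/dags && git pull --ff-only >> /var/log/airflow-dag-sync.log 2>&1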

Logging
If you are using disposable nodes in your cluster, configure the log storage to be a distributed file system (DFS) such as S3 or GCS, or an external service such as Stackdriver Logging, Elasticsearch or Amazon CloudWatch. This way, the logs remain available even after a node goes down or gets replaced. See Logging for Tasks for the relevant configurations.

Note: The logs only appear in your DFS after the task has finished. While the task is running, you can still view its logs in the UI itself.
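
For instance, remote logging to an S3 bucket can be enabled in the [logging] section; the bucket name and connection ID below are placeholders you would replace with your own:

[logging]
remote_logging = True
remote_base_log_folder = s3://my-airflow-logs/
remote_log_conn_id = aws_default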

Configuration
Airflow comes bundled with a default airflow.cfg configuration file. You should use environment variables for configurations that change across deployments, e.g., the metadata DB connection, passwords, etc. You can accomplish this using the format AIRFLOW__{SECTION}__{KEY}:

AIRFLOW__DATABASE__SQL_ALCHEMY_CONN=my_conn_string
AIRFLOW__WEBSERVER__BASE_URL=http://host:port

Some configurations, such as the Airflow backend connection URI, can also be derived from bash commands, for example to URL-encode special characters in the database password:

export AIRFLOW__DATABASE__SQL_ALCHEMY_CONN="postgresql+psycopg2://airflow:$(python3 -c 'import urllib.parse; print(urllib.parse.quote_plus("airflow"))')@postgres/airflow"

Monitoring
Like any production application, Airflow jobs, and of course Airflow itself, need to be monitored. Airflow has a resilient, highly scalable architecture built from multiple components: the Scheduler, the Webserver, the Workers, the Executor, and so on. At Gojek, we run a few additional processes as well to add flexibility to our workflows; for example, a separate process syncs our DAGs with GCS/git and another syncs custom Airflow variables. The more components you have, the higher the chances of failure, so a thorough monitoring and alerting system is essential.

Airflow has a built-in StatsD client that can send metrics to a StatsD-compatible server such as Telegraf. You can enable it with the following config:

[metrics]
statsd_on = True
statsd_host = localhost
statsd_port = 8125
statsd_prefix = airflow

The statsd client will send all the metrics to Telegraf over UDP. You can configure Telegraf to output the metrics to a time-series database such as InfluxDB, which can then be used as a data source in Grafana or Kapacitor. You can create dashboards and alerts using these tools to monitor the health and performance of your Airflow cluster.
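As an illustrative sketch (the addresses and database name are assumptions, not a prescribed setup), the Telegraf side of this pipeline can be as small as a statsd input and an InfluxDB output:

[[inputs.statsd]]
  protocol = "udp"
  service_address = ":8125"

[[outputs.influxdb]]
  urls = ["http://localhost:8086"]
  database = "airflow_metrics"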

Some of the metrics that you should monitor are:

• DAG run status: success, failure, running, etc.

• Task instance status: success, failure, running, etc.

• Task duration: how long each task takes to run

• Scheduler heartbeat: how often the scheduler checks for new tasks

• DAG bag size: how many DAGs are in your DAG folder

• DAG parsing time: how long it takes to parse all the DAGs

• Queue size: how many tasks are waiting in the queue

• Worker availability: how many workers are online and ready to execute tasks

You can also monitor custom metrics that are specific to your workflows, such as data quality checks, SLA violations, etc. You can use the Stats class in Airflow to send custom metrics to statsd.
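
A minimal sketch of that could look like the following; the check_row_count function and the metric names are purely illustrative, and the values you send are entirely up to your workflow:

from airflow.stats import Stats

def check_row_count(row_count: int, threshold: int) -> None:
    # Gauge the raw value so it can be graphed over time
    Stats.gauge("data_quality.row_count", row_count)
    # Count violations so they can be alerted on
    if row_count < threshold:
        Stats.incr("data_quality.row_count_violations")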

Conclusion
Airflow is a powerful and flexible tool for managing workflows and data pipelines, but it requires careful planning and configuration to run in production. In this article, we covered some of the best practices and tips for using Airflow in production: choosing a database backend, setting up a multi-node cluster, configuring logging and monitoring, and using environment variables. We hope this article helps you bootstrap your Airflow production setup and enjoy its benefits.
