Airflow Tutorial — Monitoring with Prometheus, StatsD and Grafana

Rauf Onur Cullu · KoçDigital · Sep 15, 2022

Apache Airflow & Grafana

Airflow is an important scheduling tool in the data engineering world: it makes sure your data arrives on time, runs the steps that transform it, and checks the dependencies between processes. The main purpose of monitoring Airflow is to know exactly what happened whenever you face an issue along the way.

Out of the box, Airflow ships with StatsD support: when enabled, it pushes metrics over the StatsD protocol to a configured host and port, and these metrics are what we use for monitoring.

In this tutorial, we run the whole environment inside Docker containers on a local computer. Let's start with the prerequisite, since installing it is necessary before moving on to the next steps:

Prerequisite

Installing Apache Airflow

First of all, we need to install the thing we want to monitor: Airflow. I will not cover the Airflow installation in Docker in this post, but anyone who needs it can refer to the official documentation, Running Airflow in Docker, to install Airflow in Docker using Docker Compose.

Tweak the Airflow installation:

  • Enable the load-examples option to test Airflow by setting AIRFLOW__CORE__LOAD_EXAMPLES=True in the environment section of the docker-compose.yml file. This variable generates example DAGs in your Airflow for testing purposes.
  • Add the configuration below to your docker-compose.yaml under environment. It enables Airflow's StatsD client to send metrics (a sketch of where these variables land in the official file follows below):
AIRFLOW__SCHEDULER__STATSD_ON: 'true'
AIRFLOW__SCHEDULER__STATSD_HOST: statsd-exporter
AIRFLOW__SCHEDULER__STATSD_PORT: 8125
AIRFLOW__SCHEDULER__STATSD_PREFIX: airflow
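
For orientation, in the official docker-compose.yaml these variables belong in the shared environment block under the x-airflow-common anchor. A minimal sketch of where they end up, with the unrelated keys elided and the image tag shown only as an example:

x-airflow-common:
  &airflow-common
  image: ${AIRFLOW_IMAGE_NAME:-apache/airflow:2.3.4}    # example tag; keep whatever your compose file pins
  environment:
    &airflow-common-env
    # ... the keys that ship with the official file stay as they are ...
    AIRFLOW__CORE__LOAD_EXAMPLES: 'true'                # generate example DAGs for testing
    # StatsD settings added for this tutorial:
    AIRFLOW__SCHEDULER__STATSD_ON: 'true'
    AIRFLOW__SCHEDULER__STATSD_HOST: statsd-exporter    # the service we add in the next section
    AIRFLOW__SCHEDULER__STATSD_PORT: 8125
    AIRFLOW__SCHEDULER__STATSD_PREFIX: airflow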

After that, log in to your Airflow webserver at http://localhost:8080 using "airflow" as both username and password. Your Airflow will look like this:

example dags from airflow core load example

Installing Statsd-exporter

After your Airflow installation is done, the next step is installing statsd-exporter. Statsd-exporter re-maps the metrics received from Airflow and exports them as Prometheus metrics, so it acts as a bridge between Airflow and Prometheus.

To install statsd-exporter, add a service to your Airflow docker-compose.yml with the following lines:

statsd-exporter:
  image: prom/statsd-exporter
  container_name: airflow-statsd-exporter
  command: "--statsd.listen-udp=:8125 --web.listen-address=:9102"
  ports:
    - 9102:9102
    - 8125:8125/udp

image: defines where to pull the Docker image from. In my case, it is the public prom/statsd-exporter image from Docker Hub.

volumes: defines your additional StatsD mapping configuration. If you do not use an additional mapping configuration, your metrics are exported to Prometheus exactly as they arrive. This depends on your needs; you can jump to the statsd-exporter documentation to see which options you can use. Save the additional configuration in a yml file and mount it into the service using volumes, as shown in the snippet further below.

command: you can jump to the statsd-exporter documentation to see which flags you can use here. In the basic setup above we use two: --statsd.listen-udp defines the UDP port Airflow pushes its metrics to, and --web.listen-address defines the address Prometheus fetches the metrics from.

(Optional) You can add further flags based on the statsd-exporter documentation. For example, if you want to re-map Airflow metrics to make querying in Prometheus/Grafana easier, you can add an additional mapping with --statsd.mapping-config.

statsd-exporter:
  image: prom/statsd-exporter
  container_name: airflow-statsd-exporter
  command: "--statsd.listen-udp=:8125 --web.listen-address=:9102 --statsd.mapping-config=/tmp/statsd_mapping.yml"
  ports:
    - 9102:9102
    - 8125:8125/udp

You will notice that --statsd.mapping-config consumes a yaml file, which is why you need to define a yaml file containing the additional mapping configuration. Create a new file named statsd_mapping.yml in your project root and add this configuration:

mappings:
  - match: "(.+)\\.operator_successes_(.+)$"
    match_metric_type: counter
    name: "af_agg_operator_successes"
    match_type: regex
    labels:
      airflow_id: "$1"
      operator_name: "$2"
  - match: "*.ti_failures"
    match_metric_type: counter
    name: "af_agg_ti_failures"
    labels:
      airflow_id: "$1"
  - match: "*.ti_successes"
    match_metric_type: counter
    name: "af_agg_ti_successes"
    labels:
      airflow_id: "$1"

This is an example of an additional mapping configuration for Airflow metrics. You can visit the statsd-exporter documentation for the details of metric mapping and some examples of re-mapping Airflow metrics.

After defining the configuration file, you need to mount it into the statsd-exporter service using volumes:

statsd-exporter:
  image: prom/statsd-exporter
  container_name: airflow-statsd-exporter
  volumes:
    - $PWD/statsd_mapping.yml:/tmp/statsd_mapping.yml
  command: "--statsd.listen-udp=:8125 --web.listen-address=:9102 --statsd.mapping-config=/tmp/statsd_mapping.yml"
  ports:
    - 9102:9102
    - 8125:8125/udp

After you add statsd-exporter to the Docker Compose services, you need to bring the new service up:

docker-compose up -d

This command creates the new statsd-exporter container you added before; you can also check Docker and find the container name there. Once it is running, you can go to http://127.0.0.1:9102 (the /metrics endpoint) to check the metrics being sent to statsd-exporter.

Installing Prometheus

Prometheus is an open-source systems monitoring and alerting toolkit that stores metrics as time-series data. Add Prometheus to your docker-compose services with the following lines:

prometheus:
  image: prom/prometheus
  container_name: airflow-prometheus
  ports:
    - 9090:9090
  volumes:
    - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml

You need to create a Prometheus configuration to set the scrape target. In this case, you want to scrape the statsd-exporter we set up earlier. Create a new file named prometheus.yml inside a prometheus folder in your project root:

scrape_configs:
  - job_name: 'statsd-exporter'
    static_configs:
      - targets: ['airflow-statsd-exporter:9102']
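
By default Prometheus scrapes its targets every minute. If you want Airflow metrics to show up faster while experimenting, you can optionally add a global block at the top of the same file; a minimal sketch, where the 30s interval is just an example value:

global:
  scrape_interval: 30s        # how often Prometheus polls its targets (default is 1m)
  evaluation_interval: 30s    # how often recording/alerting rules are evaluated

scrape_configs:
  - job_name: 'statsd-exporter'
    static_configs:
      - targets: ['airflow-statsd-exporter:9102']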

This service creates a container named airflow-prometheus on port 9090. After you bring your docker-compose up, you can access Prometheus in your web browser at http://127.0.0.1:9090. You will see the Prometheus UI like this:

Prometheus UI

In the Prometheus UI, you can query your metrics in the expression box; for example, with the mapping above you can query af_agg_ti_successes to see task instance successes.

Installing Grafana

After you have set up statsd-exporter and Prometheus, it is time for the dashboard. Grafana helps you analyze your metrics with customizable graphs and interactive visualizations.

To install Grafana, add another service to your docker-compose:

grafana:
  image: grafana/grafana:latest
  container_name: airflow-grafana
  environment:
    GF_SECURITY_ADMIN_USER: grafana
    GF_SECURITY_ADMIN_PASSWORD: grafana
  ports:
    - 3000:3000

With the environment parameters above, we set the admin credentials for our Grafana instance.

After you bring it up, you can access Grafana on port 3000 in your local browser. Log in with the credentials configured above (user grafana, password grafana). You will see the Grafana home like this:

Grafana home

Navigate to the left navigation bar and select Explore.

On this page, you can see the query box where you can query your metrics. Right now, however, you still cannot choose a datasource for Grafana. The next step is to add Prometheus as a Grafana datasource.

Create the file grafana/provisioning/datasources/datasources-provision.yaml and define the datasource:

apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://airflow-prometheus:9090
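
Optionally, you can also mark Prometheus as the default datasource so Explore and new panels pick it up automatically; isDefault and editable are standard fields of Grafana's datasource provisioning format:

apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://airflow-prometheus:9090
    isDefault: true    # pre-select this datasource in Explore and new panels
    editable: true     # allow editing it later from the Grafana UI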

The above code tells Grafana to register Prometheus as a datasource. Besides that, add an environment variable to set your provisioning path:

GF_PATHS_PROVISIONING: /grafana/provisioning

and a volume to load your provisioning files into the container:

volumes:
  - ./grafana/provisioning:/grafana/provisioning

After this, your full service configuration for Grafana will look like this:

grafana:
  image: grafana/grafana:latest
  container_name: airflow-grafana
  environment:
    GF_SECURITY_ADMIN_USER: grafana
    GF_SECURITY_ADMIN_PASSWORD: grafana
    GF_PATHS_PROVISIONING: /grafana/provisioning
  ports:
    - 3000:3000
  volumes:
    - ./grafana/provisioning:/grafana/provisioning
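
Optionally, if you want dashboards and users to survive the container being re-created, you could also mount a named volume for Grafana's data directory; /var/lib/grafana is Grafana's default data path, and the volume name grafana-data is just my own choice:

grafana:
  # ... same service definition as above ...
  volumes:
    - ./grafana/provisioning:/grafana/provisioning
    - grafana-data:/var/lib/grafana    # persist dashboards, users and settings

volumes:
  grafana-data: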

Now, with this configuration, you can query metrics from Grafana Explore and you are ready to create the dashboard.

Creating a dashboard

Now you can go to the dashboard page and try building a dashboard for Airflow monitoring.

Add a new empty panel and you will be redirected to the panel configuration.

At the bottom left, you will see the query section; change the datasource to Prometheus and add the query. For example, I will add a visualization for Airflow's DagBag size.

  • Change the title to "Dag Bag Size"
  • Add the query "airflow_dagbag_size" in the query box
  • Change the visualization type in the top right corner to "Stat"
  • Apply the change

Your panel will look like this

So we know that our Airflow DagBag size is currently 34. That is how you add a new panel to a dashboard. You can try adding more panels and using other visualizations; with all the metrics sent from Airflow, you can build a dashboard that monitors your whole Airflow deployment.

What are some metrics you should monitor?

Here is a list of some of the most critical areas to keep an eye on while monitoring Airflow, which can also help you troubleshoot and detect resource bottlenecks:

  • Health checks for the Scheduler, Webserver, Workers, and other custom processes, to see if they are up and running
  • How long they stay online (uptime)
  • How many workers are currently active
  • Whether your custom metrics and settings are actually reflected in the exported metrics
  • DAG parsing time and the number of active DAGs
  • Pool utilization over time
  • Job execution status (started/completed) over time
  • Executor task status (running/queued/open slots) over time
  • Operator-by-operator execution status (success/failure)
  • Status of task instances (successes/failures)
  • Time spent on critical tasks and sensors
  • Average time for DAGs to reach an end state
  • DAGs that regularly miss their schedule
  • Time DAGs spend on dependency checks

It's critical to keep track of these KPIs at several levels: across the whole deployment, per DAG, and per task. You should also keep a closer eye on the individual operators and tasks that you believe are more likely to fail or require more resources.
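
Many of these can be read straight from Airflow's StatsD metrics (for example dagrun.duration.success.<dag_id> or dag_processing.total_parse_time). As a hedged illustration, you could extend the earlier statsd_mapping.yml so DAG run durations arrive in Prometheus with a dag_id label; the metric name af_agg_dagrun_duration_success and the label names are my own choices:

mappings:
  # DAG run duration, emitted by Airflow as <prefix>.dagrun.duration.success.<dag_id>
  - match: "(.+)\\.dagrun\\.duration\\.success\\.(.*)"
    match_type: regex
    name: "af_agg_dagrun_duration_success"
    labels:
      airflow_id: "$1"
      dag_id: "$2"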

Challenges in Airflow Monitoring

Apache Airflow excels at orchestration, which is exactly what it was designed for. However, without some finessing, Airflow Monitoring can be difficult to handle.

The truth is, keeping track of Airflow can be a little time-consuming. When something goes wrong, you are thrown back and forth between Airflow's UI, operational dashboards, Python code, and pages of logs (we recommend having more than one monitor to manage all of this). That's why, in Airflow's 2020 user survey, "logging, monitoring, and alerting" tied for second place among the areas users wanted improved.

It is difficult to keep track of Airflow mostly for three interconnected reasons:

1) No Data Awareness

Airflow is intimately familiar with your data pipelines. It knows everything there is to know about your tasks, including their status and how long they take to complete. It is aware of the execution process. However, it has no knowledge of the data flowing through your DAGs.

There are a lot of things that can go wrong with your data that are not visible in execution metadata.

  • If, for some reason, your data source fails to supply any data, the Airflow webserver's UI would be all green, but your data consumers' warehouse would be full of stale data.
  • If data is supplied but one or more columns are blank, Airflow would claim that everything is fine, while in reality your data consumers would be working with incomplete data.
  • If the data is complete but a transformation behaves unexpectedly, the job would not fail because of it, but erroneous data would still be delivered.

You might be able to set some alerts based on run and task duration to help you notice when something goes wrong. However, you would not have the flexibility needed to cover all of your blind spots, and you would still have to spend time tracking down the source of the problem. This leads us to the next reason.

2) Limited Monitoring and Alerting Capabilities

Airflow is ideal for task orchestration, as previously indicated. That's exactly why it was created and, understandably, developing a full-featured Airflow monitoring and observability solution on top of it is not a priority for the Airflow community. Still, it is not completely devoid of monitoring capabilities, so there are some useful features around pipelines and tasks.

Airflow gives you a high-level view of your operational metadata, such as run and task states, as well as the ability to set up simple alerting and get logs. While this is useful, it lacks the context described in the first point. As a result, you'll need to create operational dashboards that visualize metrics across time to see how your data evolves. To pull metadata about your datasets, you'll need to add data quality listeners to your DAGs (Deequ, Great Expectations, cluster policies, callbacks, and so on). Once you have metadata and trends to work with, you can generate custom alerts. That brings us to our last point.

3) Complex Integration with Operational Workflows

Just to keep track of your Airflow deployments, you now have a lot of moving parts: email alerts, the Airflow UI with its operational metadata and logs, and separate dashboards for metrics reporting. If your Airflow environments are limited in scope, this approach may work for you, but if you are running hundreds of DAGs across several teams, it becomes an issue.

You might not be able to see the health of your Airflow environments thoroughly through a single pane of glass. Different teams might use different dashboards, and notifications that do not reach your organization's preferred recipient may be ignored. As this operational debt builds up, it becomes difficult for your engineers to detect issues early and prioritize fixes before your data SLAs are breached.

Summary

Monitoring Airflow with Grafana has great potential to save time by helping us understand where a failure occurred and quickly find out what is wrong with Airflow. Grafana provides many types of visuals for building extremely user-friendly dashboards. As a bonus, Grafana's powerful notification system lets you know about issues promptly so you can take quick action.
