Better Apache Airflow Observability using OpenTelemetry

Ferruzzi · Published in Apache Airflow · 9 min read · Aug 21, 2023

Introduction

Apache Airflow is an orchestration platform to programmatically author, schedule, and execute workflows. OpenTelemetry is used to generate, collect, and export telemetry data (metrics, logs, and traces) to help you analyze your software’s performance and behavior. The two open source projects seem like a natural fit, and with the launch of Airflow 2.7, users can now start to leverage OpenTelemetry Metrics in Airflow! Airflow has supported emitting metrics via StatsD for some time, and logging has been available through the standard Python logger all along. Full OpenTelemetry integration will see these two features merged into a single open source standard while also adding Traces. OpenTelemetry Traces allow better insight into how a pipeline is being executed in real time and how the various modules interact. Traces are next on the integration plan, but there is no definitive date for that addition at this time.

Configure your Airflow Environment

To enable OpenTelemetry on your existing Airflow environment, you need to install the `otel` extras package and configure a couple of environment variables, as explained in the Airflow docs page. You will also need to configure an OpenTelemetry Collector (otel-collector) and some form of observability platform to visualize and monitor these metrics. In this post I’ll be using Prometheus as the metrics backend to store the data and building a dashboard in Grafana to visualize them.
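As a sketch of what that configuration might look like, the relevant environment variables in Airflow 2.7 are along these lines; the host and port values here are assumptions and must match wherever your collector's OTLP receiver actually listens:

```shell
# Enable the OTel metrics exporter (requires the `otel` extras package).
export AIRFLOW__METRICS__OTEL_ON=True
# Where Airflow sends metrics: your otel-collector's OTLP HTTP receiver.
export AIRFLOW__METRICS__OTEL_HOST=localhost
export AIRFLOW__METRICS__OTEL_PORT=4318
# How often (in milliseconds) metrics are exported.
export AIRFLOW__METRICS__OTEL_INTERVAL_MILLISECONDS=30000
```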

OTel Collector

The OpenTelemetry Collector offers a vendor-agnostic implementation of how to receive, process, and export telemetry data. The collector will gather all Airflow metrics into a central place where Prometheus will fetch them.

For help getting it configured, see The OpenTelemetry Collector Getting Started Guide and check out the Docker Compose file and otel-collector config file which are bundled with Airflow’s developer environment, called Breeze.
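For orientation, a minimal collector config along those lines might look like the sketch below; treat it as a starting point rather than a definitive setup (the Prometheus exporter port 28889 matches the endpoint used later in this post, and the receiver details are assumptions):

```yaml
receivers:
  otlp:
    protocols:
      grpc:
      http:

exporters:
  prometheus:
    # Raw metrics will be browsable at localhost:28889/metrics
    endpoint: "0.0.0.0:28889"

service:
  pipelines:
    metrics:
      receivers: [otlp]
      exporters: [prometheus]
```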

If you used the settings from the Airflow pages above and have Airflow and your OTel Collector running in a local Docker container, you can point your browser to localhost:28889/metrics. Here you will see all available raw metrics in Prometheus format. They will look something like this:

A screenshot of a webpage displaying a list of metrics. The following section goes into detail about what is pictured.
Raw metrics data as displayed by otel-collector

What Are We Looking At?

That is a wall of text, so what are we actually looking at? Each metric that is emitted has three lines on this page:

  1. # HELP is not yet implemented, but will eventually contain a description of the metric.
  2. # TYPE will be one of “counter”, “gauge”, or “timer”. If these terms are new to you, perhaps jump down to Appendix 1 for a very brief summary.
  3. The third line is the metric’s name, any tags if applicable, and the current value.
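For example, a single Counter metric on that page might look something like this (the metric name is real, but the tag and value here are illustrative):

```
# HELP
# TYPE airflow_ti_successes counter
airflow_ti_successes{job="airflow"} 42
```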

You could conceivably pull and parse this data directly, but there are existing solutions to do that. We’ll have a look at one option below.
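If you did want to parse it yourself, a minimal sketch might look like the following; it assumes the simple `name{tags} value` sample format described above and ignores the comment lines (in practice you would read the text from localhost:28889/metrics rather than a hardcoded string):

```python
def parse_metrics(text):
    """Parse Prometheus-format metrics text into {metric_name: value}."""
    metrics = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blanks and the # HELP / # TYPE comment lines.
        if not line or line.startswith("#"):
            continue
        # Each sample line is "name{optional_tags} value".
        name_part, _, value = line.rpartition(" ")
        name = name_part.split("{", 1)[0]
        metrics[name] = float(value)
    return metrics


sample = """\
# HELP
# TYPE airflow_ti_successes counter
airflow_ti_successes{job="airflow"} 42
"""
print(parse_metrics(sample))  # {'airflow_ti_successes': 42.0}
```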

Prometheus

Prometheus is going to be our monitoring and storage solution. Their website has a good Getting Started guide. The Breeze Docker Compose file (linked above) and Prometheus config file may be useful for getting started as well.
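A minimal scrape config for this setup might be sketched as follows; the `otel-collector` hostname assumes the collector runs as a service of that name on the same Docker Compose network, so adjust the target to wherever your collector exposes port 28889:

```yaml
global:
  scrape_interval: 10s

scrape_configs:
  - job_name: otel-collector
    static_configs:
      # The collector's Prometheus exporter endpoint.
      - targets: ['otel-collector:28889']
```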

If you have been successful in launching the metrics page using the recommended configurations, you should be able to view the Targets at localhost:29090/targets and see something like this:

A screenshot of the Prometheus Targets page shows an active connection with otel-collector.
The “Targets” page in Prometheus shows an active connection with otel-collector

Grafana

The final step will be setting up Grafana. With Grafana you can create, explore, and share all of your data through beautiful, flexible dashboards. They offer a paid hosting service, but for the sake of this demo you can use their free open source version in yet another Docker container. The Breeze Docker Compose file (linked above) and Breeze configuration files can help you get set up. Note that for Grafana, the config files are spread between a few directories and include files to provision the datasource and a simple default dashboard.

If everything is running using the suggested settings you can point a browser to localhost:23000 and see your Grafana landing page!

A screenshot of the default Grafana landing page.
The default Grafana landing page

Make Your First Grafana Dashboard

If you have gotten this far, then congratulations! You have Airflow working with a full observability stack! Now, let’s check it out.

Before you explore Grafana, below is a sample demo DAG which runs every minute and performs one task: waiting for a random length of time between 1 and 10 seconds. Drop that into your DAG folder, enable it, and let it run for a number of cycles to generate some metrics data while you poke around. We’ll use the data it generates later, and the longer it runs, the better it will look. So please feel free to let it run and step away for a while before moving on.

import time

from airflow import DAG
from airflow.decorators import task
from airflow.utils.timezone import datetime
from datetime import timedelta
from random import randint


@task
def task1():
    time.sleep(randint(1, 10))


with DAG(
    dag_id='sleep_random',
    start_date=datetime(2021, 1, 1),
    schedule_interval=timedelta(minutes=1),
    catchup=False
) as dag:

    task1()

Once that has run for a bit: switch over to Grafana, create a new dashboard (the plus sign on the far left), and in that new dashboard add a new empty panel.

A Screenshot of a new dashboard displaying the option to add a new panel.
Adding a new panel to a new dashboard

By default you are presented with a lovely-looking Random Walk graph:

A screenshot of the new panel with the default Random Walk graph.
A new panel shows a Random Walk graph by default

Change the Data Source to “Prometheus” and click on the new “Metrics Browser” button. This will present you with the list of all of the available metrics. Take a little time to poke around at what’s available. If you have run any DAGs recently, there will be all sorts of metrics available regarding task run counts and durations, success counts, etc. If you have not run any DAGs you will still see some options such as the dagbag size, the scheduler heartbeat, and other system metrics. Depending on your system, there may also be a whole slew of others that we don’t necessarily care about for the purpose of this post. By default, all metrics emitted by Airflow are prefixed with `airflow_`, so filtering by that can help narrow down your options.

A screenshot of the “add a metric” drop-down displaying a list of metric names to choose from.
The metric browser is useful for seeing what metrics are available

If you have given that DAG half an hour or so to get some metrics built up, use the Metrics Browser to find the metric named `airflow_dagrun_duration_success_sleep_random`. Leave the other fields at their default settings, and click “Use Query”. You should get something like this:

A screenshot showing a graph of the dagrun duration over time. The durations are between 0 and 10 seconds, as expected. Of note, something is off since every point is repeated four times.
A graph of the DAG run duration over time. As expected, the durations are random between 0 and 10 seconds

Give your query a nice name such as “Task Duration” in the “Legend” field. Depending on your configuration values, you may wish to adjust the “resolution”, which lets us display every Nth value. If you see the same value repeated four times, as in the screenshot above, you can either adjust your resolution to ¼ or adjust the OTEL_INTERVAL environment value (then restart Airflow, rerun the DAG, and wait for values to generate again). With the resolution set to ¼, you will see a much cleaner graph:

A screenshot of the same graph as above but with the duplicate data points filtered out.
Adjusting the resolution removes the duplicate data points

Now we can play around with the options panel on the right, which may be collapsed. If you don’t see options to the right, there is an arrow in the top right corner directly under the “Apply” button to display it. Give your panel a name, such as “Random Sleep Duration (1–10s)”, maybe make it a bar graph with a fill opacity of 50, and set the gradient mode to “opacity”. Under “Standard Options” we can set the unit to “Time/seconds (s)”, set the min to 0, and the max to 12. When you are done playing around, click “Apply” in the top right corner. This will bring you back to the dashboard view, and you should see something like this!

A screenshot of the DAG run duration over time as a bar graph using the settings suggested in the above section.
A bar graph of DAG run duration over time using the suggested settings

What you have here is a graph showing the time required to run that DAG for each time it ran. You’ll remember we told it to wait a random length of time between 1 and 10 seconds, so it should look pretty random. You may also notice that some runs are slightly longer than 10 seconds. This is due to system overhead and is exactly one reason you might wish to use these metrics! While the task actually slept up to 10 seconds, there is some system overhead in starting and ending the task that gets tacked on. In the graph above we can see that the total overhead was always under 2 seconds, since the graph never hit 12s. A closer look at the actual metric numbers shows that the overhead averages around 1.2s pretty consistently, and I’ve decided that this is acceptable for my use case. If this were a production environment, that overhead might warrant further investigation, or perhaps an alarm to make sure it does not get worse over time, which could indicate a resource leak.
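As a sketch of what such an alarm could look like, here is a hypothetical Prometheus alerting rule for this demo's metric; the 12s threshold (10s maximum sleep plus a 2s overhead budget) and the 10m window are arbitrary assumptions, not recommended values:

```yaml
groups:
  - name: airflow_overhead
    rules:
      - alert: SleepRandomDagRunTooSlow
        # Fires if the DAG run duration stays above 12s for 10 minutes,
        # i.e. overhead has crept past the ~2s budget observed here.
        expr: airflow_dagrun_duration_success_sleep_random > 12
        for: 10m
        labels:
          severity: warning
```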

Drag that panel larger in either direction and note that Grafana will automatically adjust the scale and labels on both axes! When you’ve found a size you like, click the refresh button in the top right corner (in Grafana, not for your browser tab!) and select a frequency to get it auto-updating. You should now have a dashboard which shows your task duration, and automatically updates itself with a new value every minute or so when the DAG runs!

What’s Next?

What’s next for you?

If you are interested in exploring more about making better use of Grafana to build better dashboards and alerts, the Grafana Fundamentals guide might be a good place to start.

If you are interested in learning more about Airflow or have any questions, join the conversation over on the Airflow community slack server!

What’s next for Airflow and OpenTelemetry?

Next, we’ll be adding support for perhaps the most interesting feature of OTel: Traces! Traces give us the big picture of what is actually happening under the hood when a pipeline is running, and help visualize the full “path” of its task runs. For example, when combined with the duration metrics we’ve already explored, we will be able to automatically generate Gantt charts to help find bottlenecks slowing down your DAGs.

Appendix 1 — A Very Brief Overview of Metrics

There are currently three types of metrics supported by Airflow: Counters, Gauges, and Timers. This appendix will provide a very brief and simplified overview of what those mean in Airflow.

Counters

Counters are integers which are incremented or decremented by a value. As of this writing, all but one are monotonic counters, which means they can only increase. For example: the odometer in your car, or the number of tasks completed since you launched Airflow. If you can say “add one more”, then you are likely dealing with a Counter. See here for a list of Counters available in Airflow.

Gauges

Gauges are floats which can go up or down. The main difference between a Counter and a Gauge is that a Gauge is a momentary reading, not an incremental change. For example, think of your thermometer or the number of DAGs in your dagbag. When you read a thermometer you see the current temperature, you don’t generally see “it is three degrees warmer than the last time you looked”. If you find yourself thinking “what is the current value?” you may be thinking of a Gauge. See here for a list of Gauges available in Airflow.

Timers

Timers are the most obvious type. They report how long something took, just as you would expect. See here for a list of Timers available in Airflow.
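To make those three types concrete, here is a tiny self-contained Python sketch of their semantics; this is an illustration only, not Airflow's actual metrics implementation:

```python
import time


class Counter:
    """Monotonic counter: 'add one more'; it only ever goes up."""
    def __init__(self):
        self.value = 0

    def incr(self, amount=1):
        self.value += amount


class Gauge:
    """Momentary reading: 'what is the current value?'; may go up or down."""
    def __init__(self):
        self.value = 0.0

    def set(self, value):
        self.value = value


class Timer:
    """Reports how long a block of work took."""
    def __enter__(self):
        self._start = time.monotonic()
        return self

    def __exit__(self, *exc):
        self.duration = time.monotonic() - self._start


tasks_completed = Counter()
tasks_completed.incr()   # one more task finished
dagbag_size = Gauge()
dagbag_size.set(3)       # there are currently 3 DAGs
with Timer() as t:
    time.sleep(0.01)     # t.duration records how long this took
```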

Dive deeper

For more information on metrics in Airflow check out their documentation here. For more information on metric types supported by OpenTelemetry see their documentation here.
