Don’t Let your Airflow Metadata just sit there! Start Min(d)ing it.

Jim Todd
5 min read · Jan 4, 2022


Apache Airflow is a popular orchestrator, and many Data Engineers have been running it in production for a long time now. Within a few months of using Airflow in production, I started wondering: is Airflow metadata underrated? Now I have found some time to think and write about it. I use Airflow v1.10.10 with a Postgres backend and a LocalExecutor.

Airflow collects DAG schedules, the history of task runs, user activities and a lot more in its metadata DB. Of course, Airflow itself offers ways to derive metrics from the metadata and to visualize them, such as the Gantt chart and Graph views. But from a data platform perspective, a bespoke, aggregated view of this metadata can prove more insightful. There is a lot more information in the Airflow metadata, and mining it opens up plenty of automation and self-healing ideas. Also, in recent versions of Airflow, some UI features have been withdrawn for security reasons. With this introduction, let me share a few interesting insights from mining the Airflow metadata. Happy reading!

Who touched my DAG? — Understand the critical user activities in the Airflow UI.

It is always discouraged to manually disturb a DAG from its schedule. However, in modern Data Engineering work, it cannot always be avoided. Take the case of pausing/unpausing a DAG: a DAG that pulls data from a source and loads it into a target table has to be paused if we are informed that junk data has landed in the source. Once the source data is fixed and we are notified, the DAG can be unpaused. This avoids the ripple effect and saves time for Data Engineers. The log table in the metadata DB holds this information. There is no way to tell from the Airflow UI why a DAG was paused and unpaused, but we can find out who did it and reach out to understand why. Another case: sometimes we want to clear a DAG run and re-run it. This can be tracked as well.
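As a rough sketch of what this looks like in SQL (assuming the Airflow 1.10.x schema; the event names 'paused' and 'clear' are my assumption and can differ across versions, so verify them against your own log table first):

```sql
-- Sketch only: the log table records UI events along with the user who triggered them.
-- Event names are assumptions; check yours with SELECT DISTINCT event FROM log;
SELECT dttm, owner, event, dag_id, task_id
FROM   log
WHERE  event IN ('paused', 'clear')
  AND  dttm >= NOW() - INTERVAL '7 days'
ORDER  BY dttm DESC;
```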

Manually Paused & Cleared DAGs

This can be extended to plenty of other possibilities: filtering on other events, drilling down to the exact user responsible for an event, or narrowing to a specific time period.

If you are still with me, great! Continue Reading.

For simplicity, I have kept only rough query sketches here instead of the full SQLs used for mining the metadata. Visit this Git Repo to get the SQLs.

Where should I start looking for optimization? — Top 10 longest-running tasks in the past 24 hours.

Monitoring this metric could be key if you are trying to find the top candidates for performance optimization. When I say performance optimization, nothing on (or at least nothing that should be on) the Airflow side, as it is always recommended to use Airflow as a simple orchestrator and not as an ETL tool. The tasks in a DAG are usually some kind of calling function that delegates the actual work, unless the work is simple enough to run on the Airflow server itself. In my project, Airflow tasks submit jobs to Spark clusters. So the task that ran the longest today usually means I should look into the performance of the Spark job it submitted. Once I optimize that long-running Spark job, the task execution time should come down as well. I got the data below from the task_instance table.
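A sketch of the kind of query behind this table (against task_instance; the duration column holds the run time in seconds, and restricting to rows with an end_date in the last 24 hours is my assumption about how "past 24 hours" is measured):

```sql
-- Sketch only: top 10 longest-running task instances that finished in the last 24 hours.
-- duration is stored in seconds in the task_instance table.
SELECT dag_id, task_id, execution_date, ROUND(duration::numeric, 1) AS duration_sec
FROM   task_instance
WHERE  end_date >= NOW() - INTERVAL '24 hours'
  AND  duration IS NOT NULL
ORDER  BY duration DESC
LIMIT  10;
```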

Top 10 long running tasks (dag_id & task_id hashed)

This eases the work of Data Engineers by giving them a jump start in analyzing the issue in the right direction. One could easily counter-argue that the same thing can be identified from the Spark cluster logs. However, the metadata is a good single starting point when Airflow interacts with multiple APIs and web services, which is the case in most organizations.

How occupied is my Executor? — Task runs in the last 24 hours, split into 6-hour windows

Most new-age companies discourage night shifts and instead cover them with an offshore-onshore model that has a decent overlap of time zones. The first and last 6-hour windows of the day, i.e. 0–5 and 18–23, are usually the non-working hours. A DAG failure and its alert can easily go unnoticed in these windows, so it is best to have the master DAGs running during working hours. In my experience that is mostly not the case, as upstream data collection runs as nightly batches in practically every data warehouse. What I want to stress here is that this metric is a good starting point if you realize it is high time to refactor your DAG schedules, especially when your Airflow runs on a simple LocalExecutor. It gives a rough idea of how the Airflow scheduler is occupied over a day (a query sketch follows the list below). It can also help in the following cases:

1. When Airflow runs with a LocalExecutor, there is always a chance of tasks queuing up when a specific time window is heaped with DAG runs. In that case, check whether a few DAGs can be moved to the next available window.

2. We can understand the peak times, when most of the DAGs are running. We can then pick a lean period for major maintenance activities such as an Airflow version upgrade, or any other upgrade/downgrade that could potentially take Airflow down. This makes for a smoother transition.
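A sketch of the windowing query (bucketing on start_date from task_instance, which Airflow stores in UTC by default, so adjust for your time zone if needed):

```sql
-- Sketch only: task runs in the last 24 hours, bucketed into 6-hour windows.
-- Window 0 = hours 00-05, 1 = 06-11, 2 = 12-17, 3 = 18-23 (UTC by default).
SELECT FLOOR(EXTRACT(HOUR FROM start_date) / 6) AS six_hour_window,
       COUNT(*)                                 AS task_runs
FROM   task_instance
WHERE  start_date >= NOW() - INTERVAL '24 hours'
GROUP  BY 1
ORDER  BY 1;
```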

Number of task runs in every 6-hour window

Drilling down to the hour level & visualizing it — to keep downtime to a minimum, I would plan my maintenance activities in any of the red-marked hour slots, where the fewest DAG runs get disturbed.
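The hourly drill-down is the same idea with a finer bucket; looking at a week of history (an arbitrary choice on my part) makes the lean slots stand out:

```sql
-- Sketch only: task runs per hour of day over the last 7 days,
-- to spot the lean hours for planned maintenance.
SELECT EXTRACT(HOUR FROM start_date) AS hour_of_day,
       COUNT(*)                      AS task_runs
FROM   task_instance
WHERE  start_date >= NOW() - INTERVAL '7 days'
GROUP  BY 1
ORDER  BY 1;
```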

Count of tasks run in each hour of the day

Tips:

  • Create a simple plugin in Airflow itself to display these metrics, but add an option so the query is fired from within the UI only when necessary.
A screenshot of the plugin we created to monitor long running tasks.
  • Copy/stream the Airflow metadata to another RDBMS and build a dashboard on top of it; this mitigates the risk of overloading the backend DB.

Note:

  • In recent versions of Airflow, direct exposure of the backend DB in the UI has been limited for security reasons.
  • I should also mention that we have to be mindful when the Airflow backend DB is very lightly provisioned: locking or deadlocking the metadata tables with our own queries can put Airflow's usual activities at risk, as this database plays the core role for Airflow itself.

Useful links:

There is a lot more interesting data being collected in the backend; check it out.

Happy Learning & Sharing!
