Airflow Catchup & Backfill — Demystified

Amit Singh Rathore
Published in Nerd For Tech
3 min read · Jan 26, 2021

In my previous blog, we looked at the basics of Airflow. This blog will cover some advanced topics.

Airflow allows missed DAG Runs to be scheduled again so that pipelines catch up on the schedules that were missed for some reason. It also allows DAGs to be re-run manually for past dates, which is called backfilling. Backfill and Catchup are confusing at first glance. In this blog, we will understand both concepts. But before we start, we need a refresher on “start_date” and “execution_date”.

Start Date & Execution Date

start_date: the date from which the DAG will start being scheduled.

schedule_interval: the interval of time, counted from the minimum start_date, at which we want our DAG to be triggered.

A DAG with a start date of 2021-01-26T05:00:00 UTC and a schedule interval of 1 hr actually gets executed at 2021-01-26T06:00:00, for data from the interval that began at 2021-01-26T05:00:00.

Trigger Point → start_date + { schedule_interval } → till the end.

The execution date is the beginning of the period for which data needs to be processed.
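The relationship between execution date and trigger time can be sketched in plain Python, without Airflow itself. This is only an illustration of the rule described above, not Airflow's scheduler code; the helper name scheduled_runs is made up for this example.

```python
from datetime import datetime, timedelta


def scheduled_runs(start_date, schedule_interval, count):
    """Return (execution_date, trigger_time) pairs for the first `count` runs.

    Each run covers the interval that starts at execution_date and is
    triggered only once that interval has ended.
    """
    runs = []
    for i in range(count):
        execution_date = start_date + i * schedule_interval
        trigger_time = execution_date + schedule_interval
        runs.append((execution_date, trigger_time))
    return runs


# Values from the example above: start_date 05:00 UTC, hourly schedule.
runs = scheduled_runs(datetime(2021, 1, 26, 5), timedelta(hours=1), 3)
for execution_date, trigger_time in runs:
    print(f"execution_date={execution_date:%H:%M}  triggered at {trigger_time:%H:%M}")
```

The first pair is the case from the example: the run with execution date 05:00 is triggered at 06:00.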

Catchup

By default, Airflow will run any past scheduled intervals that have not been run. In order to avoid catchup, we need to explicitly pass the parameter catchup=False in the DAG definition.

Let us understand this with an example where we have catchup=True.

In the above image, we have an initial config where everything is fine and our DAG Run happened at 6. Then we paused the DAG.

Here we see that because the DAG was paused at the next schedule, no DAG Run was triggered, so no start_date is available for that schedule.

At the next schedule the same thing happened (the DAG Run was not triggered). Now we re-enable (unpause) the DAG from the console.

In the above diagram, we see that at the next schedule the previously missed DAG Runs were triggered. Notice that start_date, which here is the time a DAG Run actually started, is the next schedule (9) and is the same for the last three DAG Runs, while the execution dates are the original ones. This denotes backfill. The DAG Run for the execution date of 6 happened first, then the one for 7, and then the one for 8.
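The walkthrough above can be reproduced with a small simulation. This is a simplified sketch of the catchup rule in plain Python, not Airflow's scheduler code. It assumes the run that happened at 6 had an execution date of 5, and the function name runs_to_create is hypothetical.

```python
from datetime import datetime, timedelta


def runs_to_create(last_execution_date, now, schedule_interval, catchup):
    """Execution dates that would get a DAG Run once the DAG is unpaused.

    An interval is eligible only after it has fully elapsed, that is when
    execution_date + schedule_interval <= now.
    """
    pending = []
    execution_date = last_execution_date + schedule_interval
    while execution_date + schedule_interval <= now:
        pending.append(execution_date)
        execution_date += schedule_interval
    if catchup:
        return pending       # every missed interval, oldest first
    return pending[-1:]      # only the most recent completed interval


hour = timedelta(hours=1)
last_run = datetime(2021, 1, 26, 5)   # assumed execution date of the run at 6
unpaused_at = datetime(2021, 1, 26, 9)

with_catchup = runs_to_create(last_run, unpaused_at, hour, catchup=True)
without_catchup = runs_to_create(last_run, unpaused_at, hour, catchup=False)
print([d.hour for d in with_catchup])
print([d.hour for d in without_catchup])
```

With catchup enabled the sketch yields execution dates 6, 7 and 8, all started at 9, which matches the diagram. With catchup disabled only the latest interval (8) gets a run.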

Backfill

If for some reason we want to manually re-run a DAG for certain past schedules, we can use the following CLI command.

airflow backfill -s <START_DATE> -e <END_DATE> --rerun_failed_tasks -B <DAG_NAME>

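This will execute all DAG Runs that were scheduled between START_DATE and END_DATE, irrespective of the DAG's catchup setting (or the catchup_by_default value in airflow.cfg). The syntax shown is the Airflow 1.10 form; in Airflow 2.x the command is airflow dags backfill and the long flag is spelled --rerun-failed-tasks.

Note that -B means we want the DAG Runs to happen in reverse order: the latest date first, then the older dates.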
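The effect of -B on run order can be illustrated with a short sketch. It is plain Python rather than Airflow code. It assumes that both START_DATE and END_DATE are included in the backfill, and the function name backfill_order is made up for this example.

```python
from datetime import datetime, timedelta


def backfill_order(start, end, schedule_interval, run_backwards=False):
    """List execution dates from start to end (both inclusive, an assumption).

    run_backwards mimics the -B flag: latest date first.
    """
    dates = []
    current = start
    while current <= end:
        dates.append(current)
        current += schedule_interval
    return dates[::-1] if run_backwards else dates


day = timedelta(days=1)
forward = backfill_order(datetime(2021, 1, 1), datetime(2021, 1, 3), day)
backward = backfill_order(datetime(2021, 1, 1), datetime(2021, 1, 3), day,
                          run_backwards=True)
print([d.day for d in forward])
print([d.day for d in backward])
```

For a daily DAG backfilled over January 1 to 3, the default order is 1, 2, 3, and the -B order is 3, 2, 1.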

airflow backfill -m -s <START_DATE> -e <END_DATE> <DAG_NAME>

The above command will mark all tasks as “success” for the given interval without actually running them.

Note: DAG Runs created by the above backfill CLI commands will have the backfill_ prefix in their run IDs.
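To see the placeholders filled in, here is a minimal sketch that assembles the two commands above from Python. It uses only the flags shown in this post. The DAG name my_dag, the dates, and the helper backfill_command are hypothetical, and the command is only printed, not executed.

```python
import shlex


def backfill_command(dag_name, start_date, end_date, rerun_failed=False,
                     backwards=False, mark_success=False):
    """Build the backfill command line using only the flags from this post."""
    args = ["airflow", "backfill"]
    if mark_success:
        args.append("-m")                    # mark tasks as success
    args += ["-s", start_date, "-e", end_date]
    if rerun_failed:
        args.append("--rerun_failed_tasks")  # re-run tasks that failed earlier
    if backwards:
        args.append("-B")                    # latest date first
    args.append(dag_name)
    return args


rerun_cmd = backfill_command("my_dag", "2021-01-20", "2021-01-25",
                             rerun_failed=True, backwards=True)
mark_cmd = backfill_command("my_dag", "2021-01-20", "2021-01-25",
                            mark_success=True)
print(shlex.join(rerun_cmd))
print(shlex.join(mark_cmd))
```

The resulting list could be passed to subprocess.run on a machine where the Airflow CLI is installed; building a list rather than a single string avoids shell quoting problems.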

Hope this was helpful!!

Amit Singh Rathore
Staff Data Engineer @ Visa — Writes about Cloud | Big Data | ML