How to clean up logs collected by Apache Airflow

Mihir Samant
3 min read · Jul 31, 2024


As you progress through your data journey with Apache Airflow, you’ll likely encounter this scenario. Suppose your DAG is scheduled to run daily throughout the year. In that case, you’ll accumulate 365 successful DAG runs, and that’s before counting any manually triggered ones. But that’s not all — if your DAG includes 5 tasks, you’ll generate 5 times as many log files in your log directory for each run, which is over 1,800 log files a year from a single DAG (unless you’ve specifically disabled logging in your Airflow configuration).

However, it’s generally advisable to keep these logs enabled. They are crucial for root cause analysis (RCA) should any issues arise.

If you are someone like me who needs to keep their logs around but doesn’t want to spend too much storage space on them, this article is for you. In it, I will create a simple DAG that performs a cleanup on a daily basis, removing any logs older than n days.

Trust me, this is one of the easiest and simplest ways to do this. There are some other examples available on the internet, but they look too complex.

PS: I am just running a bash script inside a DAG, nothing special.

Just to give you an idea of how logs are stored, here is an example.
All the logs associated with your DAGs are stored in the /airflow/logs folder, and scheduler-related information lives inside /airflow/logs/scheduler.
Inside scheduler you will have a separate folder for each day, holding the logs for everything that was and is scheduled on that day.

Your currently running Airflow
└── /airflow
    └── /logs
        └── /scheduler

PS: This is just a sample.
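If you want to see how much space these folders are actually taking, a quick du from your Airflow home directory will tell you. The paths below are just a sketch that assumes the default layout shown above — adjust them to wherever your logs actually live.

# Overall size of the logs folder, then a per-day breakdown of scheduler logs
du -sh /airflow/logs
du -sh /airflow/logs/scheduler/*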

Well, that’s enough information — now let’s do our ‘tech’ thing.
You can use the code below with some modifications.

Also, just a quick heads-up: if you are running Airflow on Docker, don’t forget to check the volume mappings below. They are usually already in your docker-compose.yaml, since Airflow writes its logs according to the same settings.

volumes:
  - ../airflow/dags:/opt/airflow/dags
  - ../airflow/plugins:/opt/airflow/plugins
  - ./logs:/opt/airflow/logs
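One thing to keep in mind with that mapping: the cleanup tasks run inside the container, so the find commands in the DAG should point at the container-side path (/opt/airflow/logs in this example), not the host path. A minimal sketch, assuming the mapping above:

# Inside the container, logs live under /opt/airflow/logs per the volume mapping
find /opt/airflow/logs/scheduler -type f -mtime +7 -print -delete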

With some minor changes, you can use the code below.
I am keeping scheduler logs for 7 days and DAG logs for 30 days, which is why I have created two different tasks in this DAG.
You can update these values as per your needs.

from datetime import timedelta

from airflow.models import DAG
from airflow.operators.bash import BashOperator
from airflow.utils.dates import days_ago

default_args = {
    'owner': 'Airflow',
    'start_date': days_ago(1),  # Use days_ago to handle relative start dates
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
    'email_on_failure': False,
    'email_on_retry': False,
}

# Define the DAG
with DAG(
    dag_id='cleanup',
    default_args=default_args,
    description='A DAG to clean up scheduler and DAG logs',
    schedule_interval='@daily',
    catchup=False,
) as dag:

    # Remove scheduler logs older than 7 days
    clean_scheduler_logs = BashOperator(
        task_id='clean_scheduler_logs',
        bash_command="""
        echo "Cleaning scheduler logs older than 7 days..."
        find YOUR/SERVER/PATH/logs/scheduler -type f -mtime +7 -print -delete
        """,
    )

    # Remove DAG task logs older than 30 days, then prune empty folders
    clean_dag_logs = BashOperator(
        task_id='clean_dag_logs',
        bash_command="""
        BASE_LOG_FOLDER="YOUR/SERVER/PATH/airflow/logs"
        MAX_LOG_AGE_IN_DAYS=30
        echo "Cleaning DAG logs in $BASE_LOG_FOLDER older than $MAX_LOG_AGE_IN_DAYS days..."
        find "$BASE_LOG_FOLDER" -type f -name '*.log' -mtime +$MAX_LOG_AGE_IN_DAYS -print -delete
        find "$BASE_LOG_FOLDER" -type d -empty -print -delete
        """,
    )

    # Task dependencies
    clean_dag_logs >> clean_scheduler_logs
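Once the file is in your dags folder, you can dry-run each task from the command line before letting the daily schedule take over. The date below is just a placeholder execution date.

# Run each cleanup task once, without the scheduler, to verify the paths are correct
airflow tasks test cleanup clean_scheduler_logs 2024-07-31
airflow tasks test cleanup clean_dag_logs 2024-07-31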

You can also visit my GitHub for the same code.

⚙️ Link to GitHub Repo
Also check out my other articles on Airflow.

🔗 Follow / connect with me on LinkedIn for more Airflow-related content

⚙️ Check out my GitHub


-Mihir
