Sending Apache Airflow Logs to S3

Key2Market
3 min read · Aug 18, 2018


I have spent the majority of the day today figuring out a way to make Airflow play nice with AWS S3. Not that I want the two to be best friends, but shipping the logs from Airflow to S3 would be just fine. The adventure began when one of the client machines was filling up with logs at 2 GB per day, and simply adding more storage was not scalable. And we like scalable, because it's a buzzword and everyone is using it.

So, it started fairly innocently with me reading these instructions, then this StackOverflow post, then this, and this, and this. Also these instructions, and this GitHub gist.

Armed with all this knowledge, a bit of free time and a lot of patience, I spent a considerable amount of time playing with these various implementations, only to find that none matched my requirements: I want one single place to edit any config variable, and that place is already taken by environment variables.

We will open the Pandora's Box discussion of using or not using environment variables another time… For now, we needed to define everything through them.

The most accurate article that I have found is this one, and it is the one I will be using as the basis for the step-by-step instructions below:

The magic

  • Set up your environment variables:
export AIRFLOW__CORE__BASE_S3_LOG_FOLDER=s3://${AWS_S3_BUCKET_NAME}/airflow
export AIRFLOW_CONN_S3_URI=s3://${AWS_ACCESS_KEY_ID}:${AWS_SECRET_ACCESS_KEY}@${AWS_S3_BUCKET_NAME}/airflow

It is maybe a good time to mention that we have also set up our S3 bucket name and AWS access keys as environment variables: AWS_S3_BUCKET_NAME, AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY. This works well for Docker containers too.
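If you want to confirm that Airflow can actually reach the bucket through that connection before wiring up the logging, a quick check from a Python shell on the Airflow host looks something like this. This is a hypothetical snippet, not part of the original setup; in Airflow 1.10 the S3Hook takes aws_conn_id, while 1.9 used s3_conn_id.

from airflow.hooks.S3_hook import S3Hook

# 's3_uri' is the connection id Airflow derives from AIRFLOW_CONN_S3_URI
hook = S3Hook(aws_conn_id='s3_uri')

# Replace with your bucket name; prints True if the credentials and bucket work
print(hook.check_for_bucket('your-bucket-name'))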

  • Create a directory $AIRFLOW_HOME/hooks and download the S3 hook into that directory from here. There is also an S3 hook in the Airflow documentation, but I assume it is outdated, as it differs from the one in GitHub.
  • Create a directory $AIRFLOW_HOME/config to store the config files. Add two files to that directory: an empty __init__.py and a log_config.py with the content from this file.
  • Customize the following portions of log_config.py. Note the BASE_S3_LOG_FOLDER which we exported earlier on; it is important not to miss this part. (A condensed sketch of the resulting file follows these snippets.)
Rename DEFAULT_LOGGING_CONFIG to LOGGING_CONFIG:
LOGGING_CONFIG = ...

Add an S3TaskHandler to the 'handlers' block of the LOGGING_CONFIG variable:
's3.task': {
    'class': 'airflow.utils.log.s3_task_handler.S3TaskHandler',
    'formatter': 'airflow.task',
    'base_log_folder': os.path.expanduser(BASE_LOG_FOLDER),
    's3_log_folder': conf.get('core', 'BASE_S3_LOG_FOLDER'),
    'filename_template': FILENAME_TEMPLATE,
},

Update the airflow.task and airflow.task_runner blocks to use 's3.task' instead of 'file.task':
'loggers': {
    'airflow.task': {
        'handlers': ['s3.task'],
        ...
    },
    'airflow.task_runner': {
        'handlers': ['s3.task'],
        ...
    },
    'airflow': {
        'handlers': ['console'],
        ...
    },
}
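Putting it together, here is a condensed sketch of what the customized log_config.py ends up looking like. It is based on the Airflow 1.9/1.10 airflow_local_settings.py template, and it omits the processor handler and logger, so treat it as an outline rather than a drop-in replacement for the full file you copied.

import os

from airflow import configuration as conf

# Values pulled from airflow.cfg / environment overrides, as in the template
LOG_LEVEL = conf.get('core', 'LOGGING_LEVEL').upper()
LOG_FORMAT = conf.get('core', 'log_format')
BASE_LOG_FOLDER = conf.get('core', 'BASE_LOG_FOLDER')
FILENAME_TEMPLATE = '{{ ti.dag_id }}/{{ ti.task_id }}/{{ ts }}/{{ try_number }}.log'

# Renamed from DEFAULT_LOGGING_CONFIG
LOGGING_CONFIG = {
    'version': 1,
    'disable_existing_loggers': False,
    'formatters': {
        'airflow.task': {'format': LOG_FORMAT},
    },
    'handlers': {
        'console': {
            'class': 'logging.StreamHandler',
            'formatter': 'airflow.task',
            'stream': 'ext://sys.stdout',
        },
        # The new handler that ships task logs to S3
        's3.task': {
            'class': 'airflow.utils.log.s3_task_handler.S3TaskHandler',
            'formatter': 'airflow.task',
            'base_log_folder': os.path.expanduser(BASE_LOG_FOLDER),
            's3_log_folder': conf.get('core', 'BASE_S3_LOG_FOLDER'),
            'filename_template': FILENAME_TEMPLATE,
        },
    },
    'loggers': {
        'airflow': {
            'handlers': ['console'],
            'level': LOG_LEVEL,
            'propagate': False,
        },
        # Task logs now go to S3 instead of file.task
        'airflow.task': {
            'handlers': ['s3.task'],
            'level': LOG_LEVEL,
            'propagate': False,
        },
        'airflow.task_runner': {
            'handlers': ['s3.task'],
            'level': LOG_LEVEL,
            'propagate': True,
        },
    },
}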
  • Update $AIRFLOW_HOME/airflow.cfg to contain the following adjustments. I have to be honest: I missed this part and spent 20 minutes wondering why my setup was not working. You do not need to change remote_log_conn_id, because we have already exported our S3 connection in AIRFLOW_CONN_S3_URI:
task_log_reader = s3.task
logging_config_class = log_config.LOGGING_CONFIG
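A quick way to confirm Airflow has picked up these settings is to read them back from a Python shell; this is a hypothetical check that assumes the keys live in the [core] section, as they do in Airflow 1.9/1.10.

from airflow.configuration import conf

print(conf.get('core', 'task_log_reader'))       # expected: s3.task
print(conf.get('core', 'logging_config_class'))  # expected: log_config.LOGGING_CONFIG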
  • Restart the Airflow webserver and scheduler.
  • Manually kick off a small, quick task to ensure the logs are written to S3. They should also be visible in the Airflow DAG task UI. A minimal throwaway DAG for this is sketched below.
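Something like this works as a smoke test (the DAG id and task name are hypothetical; the original article does not include one). Drop it into your DAGs folder and trigger it manually, for example with airflow trigger_dag s3_log_smoke_test.

from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

# A tiny manually-triggered DAG whose only purpose is to produce a task log
dag = DAG(
    dag_id='s3_log_smoke_test',
    start_date=datetime(2018, 8, 1),
    schedule_interval=None,  # manual triggers only
    catchup=False,
)

BashOperator(
    task_id='say_hello',
    bash_command='echo "hello from airflow"',
    dag=dag,
)

If everything is wired up, the log for say_hello should appear under the s3://<your bucket>/airflow prefix we configured earlier and render in the task log view of the web UI.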

That is all, folks! Hope this will save you some time in setting up a practical Airflow implementation. For any BI questions you can always contact the K2M team (who did not build Airflow, nor do we manage it, but we are really good at BI).

Kirill Andriychuk


Key2Market

We help companies set up Business Intelligence best practices.