Batch Job Scheduling Using Apache Airflow in a Docker Container on AWS

Harinath Selvaraj · Published in coding&stuff · Feb 20, 2020 · 4 min read

This article covers (1) what Airflow is, (2) why to choose Airflow over other tools, and (3) simple steps to set up Airflow in Docker and run it in an AWS container (including storing logs in an S3 bucket and storing run details in Aurora or any other MySQL database).

What is Airflow?

Airflow is an open-source tool to schedule and monitor workflows. It was originally developed by Airbnb in 2014, was later made open source, and is now a project of the Apache Software Foundation.

Airflow executes the workflows (jobs) as Directed Acyclic Graphs (DAGs).

Why DAG?

For instance, you might have a scenario where Job 2 and Job 3 have to run once Job 1 has finished, and Job 4 and Job 5 should run after Job 2 finishes. The downstream jobs should be triggered only when their upstream jobs finish successfully. Setting up cron jobs only lets us schedule each job on a timer, whereas the DAG model lets us define dependencies between the jobs in a workflow.

Sample DAG Flow
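
To make this dependency pattern concrete, here is a minimal sketch of such a DAG. The DAG and task names are illustrative only, and the imports follow the Airflow 1.10 style that ships with the Docker image used later in this article:

# A minimal sketch of the dependency pattern described above (illustrative names).
from datetime import datetime

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator

dag = DAG(
    dag_id="sample_dag_flow",
    start_date=datetime(2020, 2, 1),
    schedule_interval="@daily",
)

job_1 = DummyOperator(task_id="job_1", dag=dag)
job_2 = DummyOperator(task_id="job_2", dag=dag)
job_3 = DummyOperator(task_id="job_3", dag=dag)
job_4 = DummyOperator(task_id="job_4", dag=dag)
job_5 = DummyOperator(task_id="job_5", dag=dag)

# Job 2 and Job 3 run only after Job 1 succeeds; Job 4 and Job 5 run after Job 2.
job_1 >> [job_2, job_3]
job_2 >> [job_4, job_5]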

Airflow provides a GUI that makes it easy to monitor jobs, check logs and history, and identify failures. There are a lot of use cases for Airflow; a few of them include running shell scripts, Python jobs, machine learning tasks, DevOps tasks, etc.
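
As a quick illustration of the first two use cases, the sketch below (again Airflow 1.10-style, with illustrative names) runs a shell command and a Python callable as two tasks in one DAG:

from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.operators.python_operator import PythonOperator


def python_job():
    # placeholder for any Python task, e.g. a machine learning step
    print("running a Python job")


dag = DAG(dag_id="use_case_examples", start_date=datetime(2020, 2, 1), schedule_interval="@daily")

run_shell_script = BashOperator(task_id="run_shell_script", bash_command="echo 'running a shell script'", dag=dag)
run_python_job = PythonOperator(task_id="run_python_job", python_callable=python_job, dag=dag)

run_shell_script >> run_python_job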

Simple Steps to Setup Airflow in Docker & Run on AWS

Note: You need to have Docker installed on your machine. Below are the exact steps required to install Airflow and get it running in less than 15 minutes. All the related files are at https://github.com/harinathselvaraj/airflow, which also contains sample template Python scripts for an S3-to-Redshift copy and a Redshift table-to-table load.

Step 1: Pull the latest version of the Airflow Docker image from Docker Hub

docker pull puckel/docker-airflow

Step 2: View the Docker image using the command below

docker images

Step 3: Create and Run the container

docker run -d -p 8080:8080 puckel/docker-airflow webserver

Step 4: Copy the container ID (you will need it for the next steps). In the example below, c06671b5855e is my container ID; please replace c06671b5855e with your own container ID in the following commands.

docker ps -a
# c06671b5855e

Step 5 (Optional): Changes in the airflow.cfg file — point the Airflow metadata database to MySQL so that run details and logs can be checked even when the Docker container running Airflow is not up. I used an AWS Aurora MySQL database since I’m using AWS for all of my tasks.

# Database connection (make this change in your local airflow.cfg file)
sql_alchemy_conn = mysql+mysqldb://<db_user>:<db_password>@<mysql_aurora_instance_endpoint>:3306/<database_name>

Step 6:

Part A (Mandatory) — Create two new folders, ‘dags’ and ‘config’, to store the DAGs (batch jobs) and the configuration files used for saving log files in S3.

Part B (Optional) — Install the libraries boto3 (used to connect to S3 when running Redshift COPY commands that load files from S3 into Redshift tables), psycopg2 (to connect to the Redshift database) and slack-webhook (to send notifications to Slack channels on job success/failure). These libraries are only required if you want to use them; a short sketch of how boto3 and psycopg2 fit together follows the commands below.

docker exec -ti --user root c06671b5855e bash   # open a root shell inside the container
mkdir dags                                      # folder for the DAG (batch job) scripts
mkdir config                                    # folder for the S3 logging config files
chmod 777 dags
chmod 777 config
pip install boto3 psycopg2 slack-webhook        # optional libraries from Part B
airflow initdb                                  # initialise the Airflow metadata database
exit
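
For reference, here is a minimal sketch of how the optional boto3 and psycopg2 libraries from Part B might be used together: boto3 checks the source file in S3, and psycopg2 issues the Redshift COPY command. Every bucket, table, credential and endpoint below is a placeholder, not a value from this article:

import boto3
import psycopg2

# Check that the source file exists in S3 before loading it (optional).
s3 = boto3.client("s3")
listing = s3.list_objects_v2(Bucket="<bucket_name>", Prefix="path/to/file.csv")
assert listing.get("KeyCount", 0) > 0, "source file not found in S3"

# Connect to Redshift and run a COPY command; Redshift pulls the file straight from S3.
conn = psycopg2.connect(
    host="<redshift_cluster_endpoint>",
    port=5439,
    dbname="<database_name>",
    user="<db_user>",
    password="<db_password>",
)
copy_sql = """
    COPY <schema>.<target_table>
    FROM 's3://<bucket_name>/path/to/file.csv'
    CREDENTIALS 'aws_access_key_id=<access_key_id>;aws_secret_access_key=<secret_access_key>'
    CSV IGNOREHEADER 1;
"""
with conn, conn.cursor() as cur:
    cur.execute(copy_sql)
conn.close()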

Step 7: Copy all the new DAGs (that you’ve created) from your local system to the container.

docker cp path/to/dags/folder/ c06671b5855e:/usr/local/airflow/

Step 8: Copy the config file airflow.cfg from your local system to the container.

# Copy config file
docker cp path/to/file/airflow.cfg c06671b5855e:/usr/local/airflow/

Step 9 (Optional): Copy the config folder from your local system to the container. It contains the files that are required if you want to store the logs in an S3 bucket.

# Copy config folder for S3
docker cp path/to/folder/config/ c06671b5855e:/usr/local/airflow/

Step 10 (Optional): Front-end changes for setting up the S3 connection used to store DAG run logs:

Go to localhost:8080 (after your container is running).

Go to Admin → Connections and create a new connection with the details below:

Conn Id: MyS3Conn
Conn Type: S3
Extra: {"aws_access_key_id":"<access_key_id>", "aws_secret_access_key": "<aws_secret_access_key>"}
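
The connection above tells Airflow how to reach S3; Airflow also needs to know where to write the logs. If you prefer to set this directly in airflow.cfg, these are the remote-logging keys under [core] (key names as in Airflow 1.10; the bucket path is a placeholder):

[core]
remote_logging = True
remote_base_log_folder = s3://<bucket_name>/airflow/logs
remote_log_conn_id = MyS3Conn
encrypt_s3_logs = False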

Step 11: After running a few DAG jobs in Airflow, save the modified container as a new image.

docker stop c06671b5855e
docker commit c06671b5855e docker_img_airflow

Step 12: Tag the Docker image with a version and push it to the Docker registry.

docker tag docker_img_airflow <target_docker_hub>/docker_img_airflow:2.42
docker push <target_docker_hub>/docker_img_airflow:2.42

Step 13 (Optional): If you have Rancher (a platform to manage containers) installed, you can create a new service, tag the image to the service and click on the link to view Airflow up and running! 😃

Please comment in case of any issues. I am glad to help!

You may clap if you liked my post! Thanks for reading, and have a fab day! 😄
