Scalable Data Engineering Platform on Cloud
Deploy Apache Airflow on AWS ECS (using EC2). Part 1
Besides finding the right ETL workflow framework to collect, process, analyze, and deliver tons of data, finding the best way to manage, deploy, and scale it is equally important. It is not only the humongous data volume but also the level of adaptability required to address continuously evolving business needs that makes an ETL system complicated.
One of the best ways to implement a scalable data pipeline is using AWS Lambda. Moreover, combining Lambda with AWS Kinesis lets us build sophisticated real-time data pipelines. However, in many scenarios we run into AWS Lambda's limitations, and Airflow is one of the best tools to help us overcome them.
This post is a step-by-step tutorial for deploying Airflow with the Celery executor on AWS ECS.
You can get all the code from https://github.com/fartashh/Airflow-on-ECS
What is & Why Airflow?
Airflow, an Apache project open-sourced by Airbnb, is a platform to author, schedule, and monitor workflows and data pipelines. Airflow jobs are described as directed acyclic graphs (DAGs), which define pipelines by specifying what tasks to run, what dependencies they have, the job priority, how often to run, when to start/stop, what to do on job failures/retries, etc. Typically, Airflow works in a distributed setting, as shown in the diagram below. The Airflow scheduler schedules jobs according to the schedules/dependencies defined in the DAGs, and the Airflow workers pick up and run jobs with their loads properly balanced. All job information is stored in the meta DB, which is updated in a timely manner. Users can monitor their jobs via the Airflow web UI as well as the logs.
Why Airflow? Among the top reasons, Airflow enables/provides:
- Pipelines configured as code (Python), allowing for dynamic pipeline generation
- A rich set of operators and executors to choose from (and you can write your own)
- High scalability in terms of adding or removing workers easily
- Flexible task dependency definitions with subdags and task branching
- Flexible schedule settings and backfilling
- Inherent support for task priority settings and load management
- Various types of connections to use: DB, S3, SSH, HDFS, etc.
- Nice logging and alerting
- Fantastic web UI showing graph view, tree view, task duration, number of retries and more
What is & Why Amazon Elastic Container Service?
Amazon Elastic Container Service (Amazon ECS) is a highly scalable, high-performance container orchestration service that supports Docker containers and allows you to easily run and scale containerized applications on AWS. Amazon ECS eliminates the need for you to install and operate your own container orchestration software, manage and scale a cluster of virtual machines, or schedule containers on those virtual machines.
Why use Amazon ECS?
- Run containers without servers: Amazon ECS features AWS Fargate, so you can deploy and manage containers without having to provision or manage servers. I explain how to deploy Airflow on AWS Fargate in part 2.
- Containerize everything: Amazon ECS lets you easily build all types of containerized applications, from long-running applications and microservices to batch jobs and machine learning applications.
- Secure: Amazon ECS launches your containers in your own Amazon VPC, allowing you to use your VPC security groups and network ACLs. No compute resources are shared with other customers.
- Performance at scale: Amazon ECS is built on technology developed from many years of experience running highly scalable services. You can launch tens or tens of thousands of Docker containers in seconds with no additional complexity.
- Designed for use with other AWS services: Amazon ECS is integrated with Elastic Load Balancing, Amazon VPC, AWS IAM, Amazon ECR, AWS Batch, Amazon CloudWatch, AWS CloudFormation, AWS CodeStar, and AWS CloudTrail, providing a complete solution for running a wide range of containerized applications or services.
How to Deploy Airflow on ECS
Preparation Step
I am not going to discuss this in detail, but you need to properly set up your AWS environment. I suggest setting up a VPC with two private subnets and one public subnet. Then, during the creation of the RDS and ElastiCache instances, you can create a subnet group using the two private subnets.
You also need to create two security groups: the first one for RDS and ElastiCache, which facilitates access to those instances from inside the VPC, and the second one to give public access to ECS.
In addition, you need to install and configure the ECS CLI, the AWS CLI, and Docker in your environment.
Step 1: Setup Backend Database
To use Airflow in a production environment, we need to set up a database. Since Airflow was built to interact with its metadata using the great SQLAlchemy library, we can use any database supported as a SQLAlchemy backend.
As we are in an AWS environment, let's set up our database on Amazon Relational Database Service (RDS). I decided to use PostgreSQL, but feel free to choose MySQL if you like it more.
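Once the RDS instance is up, Airflow is pointed at it in airflow.cfg. A sketch of the relevant lines, with placeholder host, user, and database names standing in for your own RDS values:

```ini
[core]
executor = CeleryExecutor
# SQLAlchemy connection string for the metadata DB; the psycopg2 driver must be installed
sql_alchemy_conn = postgresql+psycopg2://airflow_user:airflow_pass@your-db.xxxx.us-east-1.rds.amazonaws.com:5432/airflow
```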
Step 2: Setup Redis
CeleryExecutor is one of the ways you can scale out the number of workers. For this to work, you need to set up a Celery backend: either Redis or RabbitMQ. I decided to use AWS ElastiCache to set up Redis for the CeleryExecutor.
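With the ElastiCache endpoint in hand, the Celery settings in airflow.cfg look roughly like this (the endpoints, credentials, and database numbers are placeholders):

```ini
[celery]
# Redis (ElastiCache endpoint) as the Celery message broker
broker_url = redis://your-redis.xxxx.cache.amazonaws.com:6379/1
# task results stored in the Postgres metadata DB
result_backend = db+postgresql://airflow_user:airflow_pass@your-db.xxxx.us-east-1.rds.amazonaws.com:5432/airflow
```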
Step 3: Project Structure
Let's start by understanding the project structure:
- airflow_home: contains the Airflow DAGs.
- app: I would like to separate the logic of the ETL system from the Airflow DAGs, so all system logic and code is stored in this folder.
- config: contains the Airflow config file.
- scripts: contains the scripts required to set up the system; I explain them shortly.
Step 4: Dockerfile
I took the Dockerfile from the docker-airflow repository and updated it. Let's look at the changes:
- line 17: adds a new ARG for the application home folder
- line 60: lets you customize your Airflow installation
- lines 73–78: copy the scripts, app, and DAGs
Step 5: Scripts
create-user.py lets us create an Airflow user. Update the username, password, and email.
entrypoint.sh is responsible for setting the required environment variables. You need to update the Redis and PostgreSQL credentials and host names.
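Airflow lets any airflow.cfg option be overridden through AIRFLOW__&lt;SECTION&gt;__&lt;KEY&gt; environment variables, which is the mechanism such an entrypoint script relies on. A sketch with placeholder hosts and credentials (replace them with your own RDS and ElastiCache values):

```shell
# placeholders -- replace with your RDS and ElastiCache endpoints and credentials
export POSTGRES_HOST="your-db.xxxx.us-east-1.rds.amazonaws.com"
export REDIS_HOST="your-redis.xxxx.cache.amazonaws.com"

# each AIRFLOW__<SECTION>__<KEY> variable overrides the matching airflow.cfg entry
export AIRFLOW__CORE__EXECUTOR="CeleryExecutor"
export AIRFLOW__CORE__SQL_ALCHEMY_CONN="postgresql+psycopg2://airflow_user:airflow_pass@${POSTGRES_HOST}:5432/airflow"
export AIRFLOW__CELERY__BROKER_URL="redis://${REDIS_HOST}:6379/1"
```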
Step 6: Build the Image
We now have everything in place to build the image from the Dockerfile.
>>> docker build -t airflow-on-ecs .
Step 7: Run and Test the image
If the PostgreSQL database and Redis are publicly accessible, you can test your image by running:
>>> docker run -d -p 8080:8080 airflow-on-ecs webserver
Step 8: Create an Amazon ECR repository and Push Image
The airflow-on-ecs Docker image needs to be hosted on Docker Hub or in AWS ECR. I chose to use an ECR repository.
>>> aws ecr create-repository --repository-name [repo-name]
Get a login token from AWS ECR:
>>> aws ecr get-login --no-include-email --region us-east-1
Run docker login:
>>> docker login -u AWS -p eyJwYXlsb2FkIjoibXZWMEZjajJ2cEFVTHJXN2QwUzBMUFhiVlJ5cVV5TzJZbEh1... https://*****.dkr.ecr.us-east-1.amazonaws.com
Tag the Docker image:
>>> docker tag airflow-on-ecs ****.dkr.ecr.us-east-1.amazonaws.com/airflow-on-ecs:latest
Push the image to the repository:
>>> docker push ****.dkr.ecr.us-east-1.amazonaws.com/airflow-on-ecs:latest
Step 9: Configure the ECS CLI
Before you can start, you must install and configure the Amazon ECS CLI. For more information, see Installing the Amazon ECS CLI.
The ECS CLI requires credentials in order to make API requests on your behalf. It can pull credentials from environment variables, an AWS profile, or an Amazon ECS profile. For more information see Configuring the Amazon ECS CLI.
Set up a CLI profile
Create a profile with the following command, substituting profile_name with your desired profile name, and the $AWS_ACCESS_KEY_ID and $AWS_SECRET_ACCESS_KEY environment variables with your AWS credentials.
>>> ecs-cli configure profile --profile-name profile_name --access-key $AWS_ACCESS_KEY_ID --secret-key $AWS_SECRET_ACCESS_KEY
Create an ECS CLI Configuration
Complete the configuration with the following command, substituting launch_type with the launch type you want to use by default, region_name with your desired AWS region, cluster_name with the name of an existing Amazon ECS cluster or a new cluster to use, and configuration_name with the name you'd like to give this configuration.
>>> ecs-cli configure --cluster cluster_name --default-launch-type launch_type --region region_name --config-name configuration_name
Step 10: Create Cluster
The first action you should take is to create a cluster of Amazon ECS container instances that you can launch your containers on with the ecs-cli up command. There are many options that you can choose to configure your cluster with this command, but most of them are optional.
>>> ecs-cli up --vpc vpc-c7axxbbc --subnets subnet-a4xxbxf9 --security-group sg-e9bxxxa1 --keypair datalab --capability-iam --size 1 --instance-type t2.large --cluster-config datalab --force
Step 11: Update the docker-compose file with the image URL in ECR
In this step we create the docker-compose file. If you decide not to define the amount of CPU and RAM for each container, you can simply remove cpu_shares and mem_limit from the compose file. If you keep them, make sure your instance type has sufficient CPU and memory; I suggest allocating at least 1 GB per container if you want a stable environment.
Replace the image with the image name you pushed to the ECR repository, and replace FERNET_KEY with the result of:
>>> python -c "from cryptography.fernet import Fernet; FERNET_KEY = Fernet.generate_key().decode(); print(FERNET_KEY)"
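For reference, here is a trimmed-down sketch of what such a compose file can look like. The service names, ports, and resource numbers are illustrative (the real file is in the repository linked above):

```yaml
version: "2"
services:
  webserver:
    image: ****.dkr.ecr.us-east-1.amazonaws.com/airflow-on-ecs:latest
    command: webserver
    ports:
      - "8080:8080"
    cpu_shares: 512
    mem_limit: 1073741824   # 1 GB, as suggested above
  scheduler:
    image: ****.dkr.ecr.us-east-1.amazonaws.com/airflow-on-ecs:latest
    command: scheduler
    cpu_shares: 512
    mem_limit: 1073741824
  worker:
    image: ****.dkr.ecr.us-east-1.amazonaws.com/airflow-on-ecs:latest
    command: worker
    cpu_shares: 1024
    mem_limit: 2147483648   # workers tend to need more memory; 2 GB here
  flower:
    image: ****.dkr.ecr.us-east-1.amazonaws.com/airflow-on-ecs:latest
    command: flower
    ports:
      - "5555:5555"
```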
Step 12: Create an ECS Service from a Compose File
Finally, we can create a service from the compose file with the ecs-cli compose service up command. This command creates a task definition from the latest compose file (if it does not already exist) and creates an ECS service with it, with a desired count of 1.
>>> ecs-cli compose --project-name PROJECT-NAME service up --create-log-groups
Step 13: Check running containers
Let's check the running containers by executing ecs-cli ps. You can copy and paste the Airflow and Flower addresses into your browser.
>>> ecs-cli ps
Step 14: Check container logs
You can easily check any container's logs with the following command; you can find the task IDs by running ecs-cli ps.
>>> ecs-cli logs --task-id f7a715ff-2b5b-4a95-adba-00e95fc926c4 --follow
Step 15: Scale the tasks on the cluster
You can easily scale the tasks by running:
>>> ecs-cli compose --project-name PROJECT-NAME service scale 2 --cluster-config datalab