Scalable Data Engineering Platform on Cloud
Deploy Apache Airflow on AWS ECS (using EC2). Part 1
Besides finding the right ETL workflow framework to collect, process, analyze, and deliver tons of data, finding the best way to manage, deploy, and scale it is equally important. It is not only the humongous data volume but also the level of adaptability required to address continuously evolving business needs that makes an ETL system complicated.
One of the best ways to implement a scalable data pipeline is using AWS Lambda. Moreover, combining Lambda with AWS Kinesis lets us build sophisticated real-time data pipelines. However, in many scenarios we run into AWS Lambda's limitations, and Airflow is one of the best tools to help us overcome them.
This post is a step-by-step tutorial for deploying Airflow with the Celery executor on AWS ECS.
You can get all the code from https://github.com/fartashh/Airflow-on-ECS
What is & Why Airflow?
Airflow, an Apache project open-sourced by Airbnb, is a platform to author, schedule, and monitor workflows and data pipelines. Airflow jobs are described as directed acyclic graphs (DAGs), which define pipelines by specifying what tasks to run, what dependencies they have, the job priority, how often to run, when to start/stop, what to do on job failures/retries, etc. Typically, Airflow works in a distributed setting, as shown in the diagram below. The Airflow scheduler schedules jobs according to the schedules/dependencies defined in the DAGs, and the Airflow workers pick up and run jobs with their loads properly balanced. All job information is stored in the meta DB, which is updated in a timely manner. Users can monitor their jobs via the Airflow web UI as well as the logs.
Why Airflow? Among the top reasons, Airflow enables/provides:
- Pipelines configured as code (Python), allowing for dynamic pipeline generation
- A rich set of operators and executors to choose from (and you can write your own)
- High scalability in terms of adding or removing workers easily
- Flexible task dependency definitions with subdags and task branching
- Flexible schedule settings and backfilling
- Inherent support for task priority settings and load management
- Various types of connections to use: DB, S3, SSH, HDFS, etc.
- Nice logging and alerting
- Fantastic web UI showing graph view, tree view, task duration, number of retries and more
What is & Why Amazon Elastic Container Service?
Amazon Elastic Container Service (Amazon ECS) is a highly scalable, high-performance container orchestration service that supports Docker containers and allows you to easily run and scale containerized applications on AWS. Amazon ECS eliminates the need for you to install and operate your own container orchestration software, manage and scale a cluster of virtual machines, or schedule containers on those virtual machines.
Why use Amazon ECS?
- Run containers without servers: Amazon ECS features AWS Fargate, so you can deploy and manage containers without having to provision or manage servers. I explain how to deploy Airflow on AWS Fargate in part 2.
- Containerize everything: Amazon ECS lets you easily build all types of containerized applications, from long-running applications and microservices to batch jobs and machine learning applications.
- Secure: Amazon ECS launches your containers in your own Amazon VPC, allowing you to use your VPC security groups and network ACLs. No compute resources are shared with other customers.
- Performance at scale: Amazon ECS is built on technology developed from many years of experience running highly scalable services. You can launch tens or tens of thousands of Docker containers in seconds with no additional complexity.
- Designed for use with other AWS services: Amazon ECS is integrated with Elastic Load Balancing, Amazon VPC, AWS IAM, Amazon ECR, AWS Batch, Amazon CloudWatch, AWS CloudFormation, AWS CodeStar, and AWS CloudTrail, providing a complete solution for running a wide range of containerized applications or services.
How to Deploy Airflow on ECS
Preparation Step
I am not going to discuss this in detail, but you need to properly set up your AWS environment. I suggest setting up a VPC with two private subnets and one public subnet. Then, during the creation of the RDS and ElastiCache instances, you can create a subnet group using the two private subnets.
You also need to create two security groups: the first one for RDS and ElastiCache, which facilitates access to those instances from inside the VPC, and the second one to give public access to ECS.
In addition, you need to install and configure the ECS CLI, the AWS CLI, and Docker in your environment.
Step 1: Setup Backend Database
To use Airflow in a production environment, we need to set up a database. Since Airflow was built to interact with its metadata using the great SQLAlchemy library, we can use any database supported as a SQLAlchemy backend.
As we are in an AWS environment, let's set up our database on Amazon Relational Database Service (RDS). I decided to use PostgreSQL, but feel free to choose MySQL if you like it more.
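Once the RDS instance is up, Airflow is pointed at it in airflow.cfg. A sketch of the relevant lines, with placeholder host, user, and database names standing in for your own RDS values:

```ini
[core]
executor = CeleryExecutor
# SQLAlchemy connection string for the metadata DB; the psycopg2 driver must be installed
sql_alchemy_conn = postgresql+psycopg2://airflow_user:airflow_pass@your-db.xxxx.us-east-1.rds.amazonaws.com:5432/airflow
```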
Step 2: Setup Redis
CeleryExecutor is one of the ways you can scale out the number of workers. For this to work, you need to set up a Celery backend: either Redis or RabbitMQ. I decided to use AWS ElastiCache to set up Redis for the CeleryExecutor.
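With the ElastiCache endpoint in hand, the Celery settings in airflow.cfg look roughly like this (the endpoints, credentials, and database numbers are placeholders):

```ini
[celery]
# Redis (ElastiCache endpoint) as the Celery message broker
broker_url = redis://your-redis.xxxx.cache.amazonaws.com:6379/1
# task results stored in the Postgres metadata DB
result_backend = db+postgresql://airflow_user:airflow_pass@your-db.xxxx.us-east-1.rds.amazonaws.com:5432/airflow
```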
Step 3: Project Structure
Let's start by understanding the project structure:
- airflow_home: contains the Airflow DAGs.
- app: I would like to separate the logic of the ETL system from the Airflow DAGs, so all system logic and code is stored in this folder.
- config: contains the Airflow config file.
- scripts: contains the scripts required to set up the system; I explain them shortly.
Step 4: Dockerfile
I took the Dockerfile from the docker-airflow repository and updated it. Let's look at the changes:
- line 17: adds a new ARG for the application home folder
- line 60: lets you customize your Airflow installation
- lines 73–78: copy the scripts, app, and DAGs
Step 5: Scripts
create-user.py lets us create an Airflow user. Update the username, password, and email.
entrypoint.sh is responsible for setting the required environment variables. You need to update the Redis and PostgreSQL credentials and host names.
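Airflow lets any airflow.cfg option be overridden through AIRFLOW__&lt;SECTION&gt;__&lt;KEY&gt; environment variables, which is the mechanism such an entrypoint script relies on. A sketch with placeholder hosts and credentials (replace them with your own RDS and ElastiCache values):

```shell
# placeholders -- replace with your RDS and ElastiCache endpoints and credentials
export POSTGRES_HOST="your-db.xxxx.us-east-1.rds.amazonaws.com"
export REDIS_HOST="your-redis.xxxx.cache.amazonaws.com"

# each AIRFLOW__<SECTION>__<KEY> variable overrides the matching airflow.cfg entry
export AIRFLOW__CORE__EXECUTOR="CeleryExecutor"
export AIRFLOW__CORE__SQL_ALCHEMY_CONN="postgresql+psycopg2://airflow_user:airflow_pass@${POSTGRES_HOST}:5432/airflow"
export AIRFLOW__CELERY__BROKER_URL="redis://${REDIS_HOST}:6379/1"
```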
Step 6: Build the Image
We now have everything in place to build the image from the Dockerfile.
>>> docker build -t airflow-on-ecs .
Step 7: Run and Test the image
If the PostgreSQL database and Redis are publicly accessible, you can test your image by running:
>>> docker run -d -p 8080:8080 airflow-on-ecs webserver
Step 8: Create an Amazon ECR repository and Push Image
The airflow-on-ecs Docker image needs to be hosted on Docker Hub or in AWS ECR. I chose to use an ECR repository.
>>> aws ecr create-repository --repository-name [repo-name]
Get a login token from AWS ECR:
>>> aws ecr get-login --no-include-email --region us-east-1
Run docker login:
>>> docker login -u AWS -p eyJwYXlsb2FkIjoibXZWMEZjajJ2cEFVTHJXN2QwUzBMUFhiVlJ5cVV5TzJZbEh1... https://*****.dkr.ecr.us-east-1.amazonaws.com
Tag the Docker image:
>>> docker tag airflow-on-ecs ****.dkr.ecr.us-east-1.amazonaws.com/airflow-on-ecs:latest
Push the image to the repository:
>>> docker push ****.dkr.ecr.us-east-1.amazonaws.com/airflow-on-ecs:latest
Step 9: Configure the ECS CLI
Before you can start, you must install and configure the Amazon ECS CLI. For more information, see Installing the Amazon ECS CLI.
The ECS CLI requires credentials in order to make API requests on your behalf. It can pull credentials from environment variables, an AWS profile, or an Amazon ECS profile. For more information see Configuring the Amazon ECS CLI.
Set up a CLI profile
Create a profile with the following command, substituting profile_name with your desired profile name, and the $AWS_ACCESS_KEY_ID and $AWS_SECRET_ACCESS_KEY environment variables with your AWS credentials.
>>> ecs-cli configure profile --profile-name profile_name --access-key $AWS_ACCESS_KEY_ID --secret-key $AWS_SECRET_ACCESS_KEY
Create an ECS CLI Configuration
Complete the configuration with the following command, substituting launch_type with the launch type you want to use by default, region_name with your desired AWS region, cluster_name with the name of an existing Amazon ECS cluster or a new cluster to use, and configuration_name with the name you'd like to give this configuration.
>>> ecs-cli configure --cluster cluster_name --default-launch-type launch_type --region region_name --config-name configuration_name
Step 10: Create Cluster
The first action you should take is to create a cluster of Amazon ECS container instances that you can launch your containers on with the ecs-cli up command. There are many options that you can choose to configure your cluster with this command, but most of them are optional.
>>> ecs-cli up --vpc vpc-c7axxbbc --subnets subnet-a4xxbxf9 --security-group sg-e9bxxxa1 --keypair datalab --capability-iam --size 1 --instance-type t2.large --cluster-config datalab --force
Step 11: Update the docker-compose file with the image URL in ECR
In this step we create the docker-compose file. If you decide not to define the amount of CPU and RAM for each container, you can simply remove cpu_shares and mem_limit from the compose file. If you keep them, make sure your instance type has sufficient CPU and memory; I suggest allocating at least 1 GB per container if you want a stable environment.
Replace the image with the image name you pushed to the ECR repository, and replace FERNET_KEY with the result of:
>>> python -c "from cryptography.fernet import Fernet; FERNET_KEY = Fernet.generate_key().decode(); print(FERNET_KEY)"
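For reference, here is a trimmed-down sketch of what such a compose file can look like. The service names, ports, and resource numbers are illustrative (the real file is in the repository linked above):

```yaml
version: "2"
services:
  webserver:
    image: ****.dkr.ecr.us-east-1.amazonaws.com/airflow-on-ecs:latest
    command: webserver
    ports:
      - "8080:8080"
    cpu_shares: 512
    mem_limit: 1073741824   # 1 GB, as suggested above
  scheduler:
    image: ****.dkr.ecr.us-east-1.amazonaws.com/airflow-on-ecs:latest
    command: scheduler
    cpu_shares: 512
    mem_limit: 1073741824
  worker:
    image: ****.dkr.ecr.us-east-1.amazonaws.com/airflow-on-ecs:latest
    command: worker
    cpu_shares: 1024
    mem_limit: 2147483648   # workers tend to need more memory; 2 GB here
  flower:
    image: ****.dkr.ecr.us-east-1.amazonaws.com/airflow-on-ecs:latest
    command: flower
    ports:
      - "5555:5555"
```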
Step 12: Create an ECS Service from a Compose File
Finally, we can create a service from the compose file with the ecs-cli compose service up command. This command creates a task definition from the latest compose file (if it does not already exist) and creates an ECS service with it, with a desired count of 1.
>>> ecs-cli compose --project-name PROJECT-NAME service up --create-log-groups
Step 13: Check running containers
Let's check the running containers by executing ecs-cli ps. You can copy and paste the Airflow and Flower addresses into your browser.
>>> ecs-cli ps
Step 14: Check container logs
You can easily check any container's logs with the following command; you can find the task IDs by running ecs-cli ps.
>>> ecs-cli logs --task-id f7a715ff-2b5b-4a95-adba-00e95fc926c4 --follow
Step 15: Scale the tasks on the cluster
You can easily scale the tasks by running:
>>> ecs-cli compose --project-name PROJECT-NAME service scale 2 --cluster-config datalab