Building a Secure and Scalable ETL Pipeline on AWS using PostgreSQL, S3, and ECS.

Tarun Reddy Mallavarapu
4 min read · Apr 13, 2023

In the first part of this blog series, we discussed how to build an ETL pipeline that processes raw data present in an S3 bucket and transforms it according to the schema of a PostgreSQL database. We covered steps such as creating an S3 bucket to store batch files, setting up a VPC with private and public subnets, creating a Serverless RDS cluster in the private subnet, containerizing the transformation job, setting up an ECS cluster, and dumping the output to the RDS instance.

In this second part of the series, we’ll discuss how to further enhance the security and scalability of the ETL pipeline. Specifically, we’ll cover the following steps:

Step 1: Choose the AWS region

When creating your resources on AWS, it’s important to choose the region that best suits your needs. For this ETL pipeline, we recommend the N. Virginia region (us-east-1), as it offers the most extensive set of services and features; all commands below use --region us-east-1 accordingly.

Step 2: Set up an S3 bucket

To store batch files, create an S3 bucket and configure it to encrypt data at rest. To ensure that only authorized users have access to the bucket, set the bucket’s access control list (ACL) to private. You can use the following commands to create the bucket and enable default encryption (note that for us-east-1 you should not pass a LocationConstraint, and default encryption is configured with a separate call rather than a create-bucket flag):

aws s3api create-bucket --bucket my-bucket --region us-east-1 --acl private

aws s3api put-bucket-encryption --bucket my-bucket --server-side-encryption-configuration '{"Rules": [{"ApplyServerSideEncryptionByDefault": {"SSEAlgorithm": "AES256"}}]}'
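
In addition to the private ACL, you can turn on S3 Block Public Access for the bucket; the settings below are a sensible default and were not part of the original command:

aws s3api put-public-access-block --bucket my-bucket --public-access-block-configuration BlockPublicAcls=true,IgnorePublicAcls=true,BlockPublicPolicy=true,RestrictPublicBuckets=true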

Step 3: Set up a VPC

Create a Virtual Private Cloud (VPC) with a public and a private subnet, an Internet Gateway (IGW), and a NAT gateway. The NAT gateway sits in the public subnet and is associated with an Elastic IP address; the public subnet’s security group should allow incoming traffic on port 22 (SSH) for remote administration. The private subnet’s security group should allow traffic from within the VPC (for example, from the public subnet) but not from the Internet; outbound traffic from the private subnet reaches the Internet through the NAT gateway. You can use the following commands to create the VPC and its routing (the security groups themselves are shown right after these commands):

aws ec2 create-vpc --cidr-block 10.0.0.0/16 --region us-east-1

aws ec2 create-subnet --vpc-id <vpc-id> --cidr-block 10.0.1.0/24 --availability-zone us-east-1a --region us-east-1

aws ec2 create-subnet --vpc-id <vpc-id> --cidr-block 10.0.2.0/24 --availability-zone us-east-1b --region us-east-1

aws ec2 create-internet-gateway --region us-east-1

aws ec2 attach-internet-gateway --internet-gateway-id <igw-id> --vpc-id <vpc-id> --region us-east-1

aws ec2 create-route-table --vpc-id <vpc-id> --region us-east-1

aws ec2 create-route --route-table-id <route-table-id> --destination-cidr-block 0.0.0.0/0 --gateway-id <igw-id> --region us-east-1

aws ec2 associate-route-table --route-table-id <route-table-id> --subnet-id <public-subnet-id> --region us-east-1

aws ec2 modify-subnet-attribute --subnet-id <public-subnet-id> --map-public-ip-on-launch --region us-east-1

aws ec2 allocate-address --domain vpc --region us-east-1

aws ec2 create-nat-gateway --subnet-id <public-subnet-id> --allocation-id <eip-allocation-id> --region us-east-1

aws ec2 create-route-table --vpc-id <vpc-id> --region us-east-1

aws ec2 create-route --route-table-id <route-table-id> --destination-cidr-block 0.0.0.0/0 --nat-gateway-id <nat-gateway-id> --region us-east-1

aws ec2 associate-route-table --route-table-id <route-table-id> --subnet-id <private-subnet-id> --region us-east-1

aws ec2 modify-subnet-attribute --subnet-id <private-subnet-id> --no-map-public-ip-on-launch --region us-east-1
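
The security groups referenced above are not created by these commands. A minimal sketch, assuming the hypothetical names etl-public-sg and etl-private-sg, an administrator IP of <admin-ip>, and PostgreSQL traffic (port 5432) allowed only from inside the VPC:

aws ec2 create-security-group --group-name etl-public-sg --description "Public subnet security group" --vpc-id <vpc-id> --region us-east-1

aws ec2 authorize-security-group-ingress --group-id <public-sg-id> --protocol tcp --port 22 --cidr <admin-ip>/32 --region us-east-1

aws ec2 create-security-group --group-name etl-private-sg --description "Private subnet security group" --vpc-id <vpc-id> --region us-east-1

aws ec2 authorize-security-group-ingress --group-id <private-sg-id> --protocol tcp --port 5432 --cidr 10.0.0.0/16 --region us-east-1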

Step 4: Set up an RDS instance

Next, create a Serverless RDS cluster in the private subnet of your VPC. Configure the database to encrypt data at rest and ensure that it is not publicly accessible. Use a non-root user for all application transactions against the database. You can use the following command to create the cluster:

aws rds create-db-cluster --db-cluster-identifier my-cluster --engine aurora-postgresql --engine-mode serverless --scaling-configuration MinCapacity=2,MaxCapacity=4,AutoPause=true,SecondsUntilAutoPause=300,TimeoutAction=ForceApplyCapacityChange --master-username myuser --master-user-password mypassword --vpc-security-group-ids <security-group-id> --db-subnet-group-name <subnet-group-name> --region us-east-1 --storage-encrypted
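
The --db-subnet-group-name parameter above assumes a DB subnet group already exists. If it does not, you can create one first; note that RDS requires subnets in at least two Availability Zones, so the subnet IDs below are placeholders for two private subnets:

aws rds create-db-subnet-group --db-subnet-group-name <subnet-group-name> --db-subnet-group-description "Private subnets for the ETL database" --subnet-ids <private-subnet-id-1> <private-subnet-id-2> --region us-east-1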

Step 5: Containerize the transformation job

To run the transformation job, containerize it using Docker and upload the container image to Amazon Elastic Container Registry (ECR). The transformation job should take a location parameter that specifies the location of the input data. Use the following commands to create and push the container image:

docker build -t my-image .
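
Before pushing the image anywhere, you can optionally smoke-test it locally. This is only a sketch: it assumes the job reads its configuration from the same environment variables that the task definition in Step 6 passes in, and the private RDS endpoint will not be reachable from outside the VPC, so this mainly verifies that the container starts and parses its parameters:

docker run --rm -e LOCATION=<input-location> -e DATABASE_HOST=<rds-host> -e DATABASE_NAME=<rds-name> -e DATABASE_USER=<rds-user> -e DATABASE_PASSWORD=<rds-password> my-image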

aws ecr create-repository --repository-name my-repo --region us-east-1

aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin <aws-account-id>.dkr.ecr.us-east-1.amazonaws.com

docker tag my-image:latest <aws-account-id>.dkr.ecr.us-east-1.amazonaws.com/my-repo:latest

docker push <aws-account-id>.dkr.ecr.us-east-1.amazonaws.com/my-repo:latest

Step 6: Set up an ECS cluster

Create an ECS cluster in your VPC and register a task definition that uses the container image you pushed to ECR. The task definition should specify the input location parameter and the database connection information as environment variables. You can use the following commands to create the cluster and register the task definition:

aws ecs create-cluster --cluster-name my-cluster --region us-east-1

aws ecs register-task-definition --family my-task --network-mode awsvpc --container-definitions '[{"name": "my-container","image": "<aws-account-id>.dkr.ecr.us-east-1.amazonaws.com/my-repo:latest","memory": 512,"environment": [{"name": "LOCATION", "value": "<input-location>"}, {"name": "DATABASE_HOST", "value": "<rds-host>"}, {"name": "DATABASE_NAME", "value": "<rds-name>"}, {"name": "DATABASE_USER", "value": "<rds-user>"}, {"name": "DATABASE_PASSWORD", "value": "<rds-password>"}],"essential": true}]' --region us-east-1
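
If your ECS cluster runs tasks on Fargate rather than on EC2 container instances, the task definition additionally needs FARGATE compatibility, task-level CPU and memory, and an execution role that lets ECS pull the image from ECR and write logs. The role name ecsTaskExecutionRole below is a common convention, not something created earlier in this series:

aws iam create-role --role-name ecsTaskExecutionRole --assume-role-policy-document '{"Version": "2012-10-17","Statement": [{"Effect": "Allow","Principal": {"Service": "ecs-tasks.amazonaws.com"},"Action": "sts:AssumeRole"}]}'

aws iam attach-role-policy --role-name ecsTaskExecutionRole --policy-arn arn:aws:iam::aws:policy/service-role/AmazonECSTaskExecutionRolePolicy

You would then add --requires-compatibilities FARGATE --cpu 256 --memory 512 --execution-role-arn <execution-role-arn> to the register-task-definition command above, and --launch-type FARGATE to the run-task command in Step 7.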

Step 7: Dump output to RDS instance

Run the task in the ECS cluster to transform the data and dump the output to the RDS instance. The data should then be queryable via the RDS query editor. Use the following command to run the task (tasks using the awsvpc network mode require a network configuration):

aws ecs run-task --cluster my-cluster --task-definition my-task --network-configuration "awsvpcConfiguration={subnets=[<private-subnet-id>],securityGroups=[<security-group-id>]}" --region us-east-1
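
Note that the RDS query editor relies on the cluster’s Data API, which is not enabled by default. If you want to inspect the output that way, you can enable it on the cluster created in Step 4 (shown here as a sketch; adjust the identifier to match yours):

aws rds modify-db-cluster --db-cluster-identifier my-cluster --enable-http-endpoint --region us-east-1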

Step 8: Secure inter-service communication

All inter-service communication should be authorized through AWS Identity and Access Management (IAM) so that only permitted services can talk to each other. Configure the appropriate IAM roles and policies for each service and, where supported, enable IAM authentication for the RDS database.
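
For example, the transformation task needs permission to read the batch files from S3. Rather than embedding credentials in the container, you can attach an IAM task role to the task definition; the role name my-etl-task-role and policy name etl-s3-read below are hypothetical placeholders:

aws iam create-role --role-name my-etl-task-role --assume-role-policy-document '{"Version": "2012-10-17","Statement": [{"Effect": "Allow","Principal": {"Service": "ecs-tasks.amazonaws.com"},"Action": "sts:AssumeRole"}]}'

aws iam put-role-policy --role-name my-etl-task-role --policy-name etl-s3-read --policy-document '{"Version": "2012-10-17","Statement": [{"Effect": "Allow","Action": ["s3:GetObject", "s3:ListBucket"],"Resource": ["arn:aws:s3:::my-bucket", "arn:aws:s3:::my-bucket/*"]}]}'

You would then reference this role via --task-role-arn <task-role-arn> when registering the task definition in Step 6.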

With these steps, you now have an ETL pipeline that processes raw data present in an S3 bucket and transforms it according to the schema of a PostgreSQL database in a secure and scalable manner.
