How to Build Highly Effective ETL Pipelines

Sahil Rai · Elucidata
Feb 21, 2022

Rohit Jagre | Sahil Rai | Ram Sagar

Inconsistency in data inflows can be a huge challenge for engineers who oversee ETL pipelines. Last month, one of our biggest clients brought 32 TB of biomedical molecular data onto our platform. Until then, our cloud infrastructure had been processing at most 15 TB per month. Handling such a surge calls for an architecture that makes the best use of all the available tools, so we rebuilt ours. Today our platform can ingest more than 600 GB per day, up from 60 GB per day a month ago: a 10x throughput improvement. We also cut the cost of handling this data by up to 90%.

Elucidata’s platform, Polly, currently hosts close to 1.5 million datasets. Each dataset goes through several processing layers before it lands on the platform: cleaning and preparation, validation and sanity checks, and finally indexing and loading into our storage infrastructure. The ETL pipeline therefore has to be fast, cheap, reliable, and easy to scale. In this article, we look at ways to get that kind of performance out of ETL pipelines.

What does it take to build an effective ETL pipeline?

1| Solving for the inconsistent flow of data

ECS cluster with custom scaling lambda

An effective ETL infrastructure should be able to add compute resources elastically, from almost zero when the pipeline is idle to many workers when the flow of data increases. Thankfully, in today's world this is a small challenge that can be solved by picking the right tools and using them well.

What you need here is a container orchestration tool. There are many options available: Kubernetes, AWS ECS, AWS EKS, and Nomad, to name a few. For Polly's ETL pipelines, we chose AWS ECS for two reasons: 1) the rest of our infrastructure is already on AWS, and 2) ECS is easier to manage than Kubernetes (AWS EKS).

ECS is a managed container orchestration service that takes care of scheduling workers on ECS instances and, when the cluster is associated with a capacity provider, of auto-scaling as well, which makes it an ideal fit for running our ETL workers. All we have to do is ask ECS to run the required number of tasks; if more tasks are needed, we invoke ECS again and the capacity provider automatically adds instances to accommodate them. For our use case, however, instead of relying on the ECS service's auto-scaling, we use an AWS Lambda function that manages the scaling of the ECS workers.
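To make this concrete, here is a minimal sketch of how a Lambda function might ask ECS to run additional worker tasks with boto3. The cluster, task definition, and capacity provider names are placeholders, not Polly's actual configuration:

```python
import boto3

ecs = boto3.client("ecs")

def run_workers(count: int) -> None:
    """Ask ECS to place `count` ETL worker tasks on the cluster.

    The capacity provider attached to the cluster adds EC2 instances
    when there is not enough room for the requested tasks.
    """
    ecs.run_task(
        cluster="etl-cluster",            # placeholder cluster name
        taskDefinition="etl-worker",      # placeholder task definition
        count=min(count, 10),             # run_task places at most 10 tasks per call
        capacityProviderStrategy=[
            {"capacityProvider": "etl-spot-cp", "weight": 1}  # placeholder provider name
        ],
    )
```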

Why do we use a custom scaling solution?

The primary responsibility of an ECS service is to maintain a set number of replicas. When the load on the cluster drops, it scales in, i.e. it terminates any of the replicas. But in an ETL pipeline, workers are not exact replicas: each one is processing a different job, so we would rather not terminate a worker until its job is complete. Terminating at this stage could also start a chain of events in which new tasks are created and killed at the same time.

That is why we opted to write custom scaling logic for ECS tasks that scales when required without terminating unfinished jobs. We emulate a target-tracking strategy by continuously monitoring how many workers are needed for the available jobs, using a combination of a CloudWatch alarm, a CloudWatch Event Rule, SNS, and a Lambda function.
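A rough sketch of that target-tracking emulation, reusing the placeholder names above and assuming a one-job-per-worker ratio (in the real setup this lambda is the one triggered through the CloudWatch alarm, Event Rule, and SNS chain):

```python
import math
import boto3

sqs = boto3.client("sqs")
ecs = boto3.client("ecs")

QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/etl-jobs"  # placeholder
JOBS_PER_WORKER = 1    # assumption: each worker handles one job at a time
MAX_WORKERS = 50       # assumption: upper bound to keep cost in check

def handler(event, context):
    # Jobs waiting in the queue.
    attrs = sqs.get_queue_attributes(
        QueueUrl=QUEUE_URL, AttributeNames=["ApproximateNumberOfMessages"]
    )["Attributes"]
    pending = int(attrs["ApproximateNumberOfMessages"])

    # Workers already running on the cluster (pagination omitted for brevity).
    running = len(
        ecs.list_tasks(cluster="etl-cluster", desiredStatus="RUNNING")["taskArns"]
    )

    # Emulated target tracking: start only the extra workers that are needed,
    # and never terminate anything from here.
    desired = min(math.ceil(pending / JOBS_PER_WORKER), MAX_WORKERS)
    to_start = max(desired - running, 0)

    while to_start > 0:                   # run_task accepts at most 10 tasks per call
        batch = min(to_start, 10)
        ecs.run_task(cluster="etl-cluster", taskDefinition="etl-worker", count=batch)
        to_start -= batch
```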

2| Building a cost-effective ETL pipeline

When managing and processing big data, infrastructure costs can skyrocket easily, so it is essential to employ cost-saving techniques at every opportunity.

We use two of them for Polly:

Employing AWS EC2 Spot instances

Spot instances are offered at heavy discounts, often up to 90% cheaper than regular on-demand instances. However, they are not completely reliable: AWS can reclaim them with only a two-minute warning. To tackle this, we have built a fault-tolerant system (details below).

Scaling to zero

Scaling to zero is achieved with the scaler lambda, which terminates all running instances when there is no data left to process.
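One plausible way to implement this, assuming the ECS container instances live in an Auto Scaling group (all names below are placeholders), is for the scaler lambda to drop the group's desired capacity to zero once the queue is completely drained:

```python
import boto3

sqs = boto3.client("sqs")
autoscaling = boto3.client("autoscaling")

QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/etl-jobs"  # placeholder
ASG_NAME = "etl-worker-asg"                                              # placeholder

def scale_to_zero_if_idle() -> None:
    attrs = sqs.get_queue_attributes(
        QueueUrl=QUEUE_URL,
        AttributeNames=[
            "ApproximateNumberOfMessages",           # jobs waiting
            "ApproximateNumberOfMessagesNotVisible", # jobs currently being processed
        ],
    )["Attributes"]

    # Only scale in when nothing is waiting and nothing is in flight.
    if (int(attrs["ApproximateNumberOfMessages"]) == 0
            and int(attrs["ApproximateNumberOfMessagesNotVisible"]) == 0):
        autoscaling.set_desired_capacity(
            AutoScalingGroupName=ASG_NAME, DesiredCapacity=0
        )
```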

3| Incorporating checks and balances for pipeline monitoring

When the number of upload jobs in the pipeline grows significantly, it becomes challenging to keep track of each job's status. To solve this, we use a NoSQL database (AWS DynamoDB) to store the status and related metadata of every job entering the system.

Jobs table schema

Eventually, this database can be connected to an analytics dashboard such as AWS QuickSight.
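The exact schema is shown in the image above; as a minimal sketch with hypothetical table and attribute names, recording a job and updating its status could look like this:

```python
import time
import boto3

dynamodb = boto3.resource("dynamodb")
jobs_table = dynamodb.Table("etl-jobs")      # placeholder table name

def record_job(job_id: str, s3_key: str) -> None:
    """Insert a new job with IN-QUEUE status (attribute names are illustrative)."""
    jobs_table.put_item(
        Item={
            "job_id": job_id,                # partition key
            "s3_key": s3_key,                # the file that triggered the job
            "status": "IN-QUEUE",            # IN-QUEUE -> PROCESSING -> SUCCESS / FAILURE
            "created_at": int(time.time()),
        }
    )

def set_status(job_id: str, status: str) -> None:
    """Move a job to its next state ('status' is a DynamoDB reserved word, hence the alias)."""
    jobs_table.update_item(
        Key={"job_id": job_id},
        UpdateExpression="SET #s = :s",
        ExpressionAttributeNames={"#s": "status"},
        ExpressionAttributeValues={":s": status},
    )
```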

4| Building a fault-tolerant architecture

Because we run our ETL workers on Spot instances, it is inevitable that a worker will occasionally be terminated in the middle of a job. In that case, the job should be retried a set number of times before being labelled FAILED. To detect an incomplete job we rely on SQS's visibility timeout: the unacknowledged message reappears on the queue, and a new worker picks up and retries the job.
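As a hedged sketch of how this retry behaviour can be expressed on the queue itself (queue names, timeout, and retry count below are illustrative assumptions): a message that a terminated worker never deletes reappears after the visibility timeout, and a redrive policy caps the number of attempts before the message lands in a dead-letter queue, at which point the job can be marked FAILED.

```python
import json
import boto3

sqs = boto3.client("sqs")

# Assumed queue names and limits, not Polly's actual configuration.
dlq_url = sqs.create_queue(QueueName="etl-jobs-dlq")["QueueUrl"]
dlq_arn = sqs.get_queue_attributes(
    QueueUrl=dlq_url, AttributeNames=["QueueArn"]
)["Attributes"]["QueueArn"]

sqs.create_queue(
    QueueName="etl-jobs",
    Attributes={
        # A message that a terminated worker never deleted becomes visible again
        # after 30 minutes, so another worker picks the job up and retries it.
        "VisibilityTimeout": "1800",
        # After 3 receives without a successful delete, the message moves to the
        # dead-letter queue and the job can be marked FAILED.
        "RedrivePolicy": json.dumps(
            {"deadLetterTargetArn": dlq_arn, "maxReceiveCount": 3}
        ),
    },
)
```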

The Final Architecture

Final Architecture

Here’s a step-by-step walkthrough of how the architecture works:

  1. S3 sends an event to controller lambda when files are created/modified/deleted.
  2. Controller lambda saves the metadata of the job in DynamoDB and marks the status as IN-QUEUE.
  3. Then, controller lambda pushes the job ID onto the queue (SQS) for processing.
  4. When there are new messages in the queue, a Cloudwatch Alarm (ApproximateNumberOfMessagesVisible>0) triggers an SNS topic which in turn triggers the scaling lambda.
  5. Scaling lambda starts spawning new workers based on how many messages are present in the queue. It also does one other thing, which we discuss in step 9.
  6. Workers start fetching job IDs from the queue. Metadata is fetched from the Jobs DynamoDB table and, at the same time, the status is marked as PROCESSING (see the worker-loop sketch after this list).
  7. Workers start transforming and loading the datasets onto the storage layers.
  8. On completion of the job, the status is marked as either SUCCESS or FAILURE. If there are no messages left to process in SQS, the workers start self-terminating.
  9. On its first execution, Scaling lambda also enables a CloudWatch Event Rule that triggers it every 5 minutes. This cron keeps invoking Scaling lambda as long as the rule is enabled. On any execution, if Scaling lambda sees there are no more messages to process, it disables the 5-minute rule and finishes.
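To make steps 6 to 8 concrete, here is a rough sketch of a worker's main loop, reusing the placeholder queue URL and table name from the earlier snippets; process_dataset stands in for the real transform-and-load code:

```python
import boto3

sqs = boto3.client("sqs")
dynamodb = boto3.resource("dynamodb")
jobs_table = dynamodb.Table("etl-jobs")      # placeholder table name

QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/etl-jobs"  # placeholder

def set_status(job_id: str, status: str) -> None:
    jobs_table.update_item(
        Key={"job_id": job_id},
        UpdateExpression="SET #s = :s",
        ExpressionAttributeNames={"#s": "status"},
        ExpressionAttributeValues={":s": status},
    )

def process_dataset(job_id: str) -> None:
    """Placeholder for the real transform-and-load step (step 7)."""
    ...

def main() -> None:
    while True:
        resp = sqs.receive_message(
            QueueUrl=QUEUE_URL, MaxNumberOfMessages=1, WaitTimeSeconds=20
        )
        messages = resp.get("Messages", [])
        if not messages:
            break                            # queue drained: worker self-terminates (step 8)

        msg = messages[0]
        job_id = msg["Body"]                 # the controller pushed the job ID as the message body
        set_status(job_id, "PROCESSING")     # step 6
        try:
            process_dataset(job_id)          # step 7
            set_status(job_id, "SUCCESS")    # step 8
        except Exception:
            set_status(job_id, "FAILURE")

        # The job ran to completion, so acknowledge the message. If the Spot instance
        # had died before this point, the message would never be deleted and would
        # reappear after the visibility timeout for another worker to retry.
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])

if __name__ == "__main__":
    main()
```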

Conclusion

Today, there are many tools and services that can be used to create custom ETL pipelines. In this blog, we have demonstrated a clean and reusable pattern for a highly scalable ETL pipeline, and shown how Polly’s pipelines accommodate an inconsistent flow of data cost-effectively on top of a fault-tolerant architecture.

What do you think of the solutions our engineers built? Do you see better alternatives? Come join us: we at Elucidata are always looking for engineers who are passionate about solving cutting-edge problems.
