How to Build a Scalable, Cost-Effective Event-Driven ETL Solution using Serverless?

Vyas Sarangapani
Mar 8

Problem Statement:

The typical workload on our system is driven by user actions on data: every user action creates anywhere between 100 and 100K inserts, updates, or deletes across different databases and media stores.

It is safe to assume that the user manipulates a large amount of data with most actions.

Given the above,

How do we enable end-users to create and monitor tasks that generate heavy workloads on the system, without making them wait a long time?

We decided to focus on a few main goals from a solution perspective:

  1. Must be performant (fast).
  2. Users must get live feedback on the progress of their operations.
  3. It must be scalable.
  4. Implementation and operational costs must be minimized.

Challenge:

Given the above constraints, how do you design a scalable and cost-effective system with the least amount of effort?

Say, Hello Serverless!

Serverless technology provides teams the ability to compose horizontally scalable applications without having to engineer complex distributed systems.

Especially during the MVP stage of a product, time is the most precious resource; it is best spent developing the core product rather than building systems of scale that serve no one yet.

Keeping the above requirements in mind, WebSockets are a natural fit for the UX. We leveraged the following AWS technologies to provide a completely serverless solution (runtime).

Runtime:

  1. Lambda(s) — Obviously!
  2. EventBridge — Event Bus
  3. SQS — Buffering, Retry, and Lambda Triggers
  4. API Gateway WebSockets
  5. S3 Lambda Triggers
  6. DynamoDB Stream Lambda Triggers
  7. Cognito Authorization Lambda Triggers

Persistence:

  1. DynamoDB
  2. RDS — Postgres
  3. Amazon Elasticsearch Service (managed)

Infrastructure as Code:

  1. Serverless Framework template.

Solutions:

There are several types of long-running workloads, e.g. order fulfillment, payments, etc. These are a good fit for microservice orchestration tools like AWS Step Functions.

More recently, AWS has added support for dynamic parallelism, making scatter-gather patterns easier to implement. The main limitation here is the cap of 25,000 entries in an execution's event history. AWS suggests circumventing this by using a Lambda to start a new execution once the execution history limit is reached.

AWS has also released Step Functions Express Workflows to deal with the execution history limit, but they impose an execution time limit of 5 minutes.

Since we were not sure whether all our workloads could complete in under 5 minutes, we went ahead with a custom event-driven ETL orchestration. However, we plan to test our workloads with Step Functions and possibly adopt it, which calls for a separate blog post.

Constraints:

AWS imposes a timeout on Lambda functions invoked through API Gateway: the integration times out after roughly 30 seconds (29 seconds, to be precise). This means all your APIs need to respond within that window. Ideally, any API implementation should be as fast as possible; the faster, the better. Read how every 100 ms impacts user experience, and consequently revenue, here.

Nevertheless, not all workloads fit this 30-second paradigm in a serverless environment. Several scenarios require multiple system calls to complete a request and are often long-running. Even Lambda functions that are not connected to API Gateway are capped at a maximum execution time of 15 minutes.

Given that, how do we support workloads such as the one described here, in a serverless environment? Let’s explore!

Architecture:

As illustrated above, the first Lambda processes the API request for the user task and immediately returns a success message to the UI, acknowledging that the task will be processed. The same handler also publishes a message on the event bus for downstream services to process.
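
To make the flow concrete, here is a minimal sketch of what such a front-door handler could look like, assuming the aws-sdk v2 runtime; the bus name etl-bus, the source app.tasks, and the event shape are illustrative assumptions, not the exact code from this system.

// Hypothetical sketch: acknowledge the request, then hand the work off via EventBridge
const AWS = require('aws-sdk');
const eventBridge = new AWS.EventBridge();

exports.handler = async (event) => {
  const task = JSON.parse(event.body); // user task submitted through API Gateway

  // Publish the task onto the central event bus for downstream consumers
  await eventBridge.putEvents({
    Entries: [{
      EventBusName: 'etl-bus',   // assumed bus name
      Source: 'app.tasks',       // assumed source identifier
      DetailType: 'TaskRequested',
      Detail: JSON.stringify(task)
    }]
  }).promise();

  // Return immediately; the heavy lifting happens asynchronously downstream
  return { statusCode: 202, body: JSON.stringify({ status: 'accepted' }) };
};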

AWS EventBridge serves as a central event bus, routing all messages between publishers and consumers based on rules. We chose EventBridge over SNS because we route messages based on message content and wanted to keep integration options with external SaaS applications open. However, this piece of the architecture can easily be replaced with SNS if requirements change.
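
For illustration, a content-based rule could look like the sketch below. In practice such rules can be declared in the Serverless template; this SDK version only shows the shape of a pattern that matches on the message body (the rule name, the taskType field, and the queue ARN are assumptions).

// Hypothetical sketch: a content-based rule that routes only "export" tasks to an export queue
const AWS = require('aws-sdk');
const eventBridge = new AWS.EventBridge();

const eventPattern = {
  source: ['app.tasks'],
  'detail-type': ['TaskRequested'],
  detail: { taskType: ['export'] }   // route on message content, not just the topic
};

async function createExportRule() {
  await eventBridge.putRule({
    Name: 'route-export-tasks',
    EventBusName: 'etl-bus',
    EventPattern: JSON.stringify(eventPattern),
    State: 'ENABLED'
  }).promise();

  await eventBridge.putTargets({
    Rule: 'route-export-tasks',
    EventBusName: 'etl-bus',
    Targets: [{ Id: 'export-queue', Arn: process.env.EXPORT_QUEUE_ARN }]
  }).promise();
}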

Based on the type of the task, the event is delivered to the respective queue, which is attached to a Lambda function by means of a Lambda trigger.
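
A worker triggered this way receives a batch of SQS records, each wrapping one EventBridge envelope. A minimal sketch, with processTask standing in for the real application logic:

// Hypothetical sketch of an SQS-triggered worker: each record carries one task event
exports.handler = async (event) => {
  for (const record of event.Records) {
    // The SQS message body is the EventBridge envelope; the task lives in "detail"
    const envelope = JSON.parse(record.body);
    const task = envelope.detail;

    await processTask(task); // assumed application-specific function
  }
};

async function processTask(task) {
  // extract / transform / load work for a single task goes here
}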

Each Lambda function is designed to be idempotent. When a failure occurs, the message is retried a preset number of times according to the queue's RedrivePolicy; once the retry count is exceeded, it is moved to a Dead Letter Queue so developers can analyze and fix the failures.
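
One common way to get idempotency (not necessarily the exact mechanism used here) is to record processed message IDs in DynamoDB with a conditional write, so a retried message becomes a no-op. A sketch, with the table name assumed:

// Hypothetical sketch: record processed message IDs with a conditional write,
// so messages retried by the RedrivePolicy are skipped safely
const AWS = require('aws-sdk');
const dynamo = new AWS.DynamoDB.DocumentClient();

async function processOnce(messageId, work) {
  try {
    await dynamo.put({
      TableName: process.env.IDEMPOTENCY_TABLE,   // assumed table name
      Item: { pk: messageId, processedAt: Date.now() },
      ConditionExpression: 'attribute_not_exists(pk)'
    }).promise();
  } catch (err) {
    if (err.code === 'ConditionalCheckFailedException') {
      return; // already processed on a previous attempt; skip
    }
    throw err; // let SQS retry and eventually dead-letter the message
  }

  await work();
}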

The E, T, and L phases of the pipeline are linear and synchronous. Within each phase, however, dynamic parallelism is implemented either inside the Lambda function itself or as a fan-out Lambda, dividing resource-intensive workloads across multiple Lambda invocations and finally stitching the results back together.
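
A fan-out step can be sketched roughly as follows: the orchestrating Lambda splits the workload into chunks and asynchronously invokes a worker Lambda per chunk. The function name, chunk size, and payload shape are assumptions for illustration.

// Hypothetical fan-out sketch: split a large workload into chunks and invoke a
// worker Lambda asynchronously for each chunk; a later step stitches results together
const AWS = require('aws-sdk');
const lambda = new AWS.Lambda();

async function fanOut(taskId, items, chunkSize = 100) {
  const invocations = [];
  for (let i = 0; i < items.length; i += chunkSize) {
    const chunk = items.slice(i, i + chunkSize);
    invocations.push(
      lambda.invoke({
        FunctionName: process.env.WORKER_FUNCTION,   // assumed worker Lambda name
        InvocationType: 'Event',                     // asynchronous invocation
        Payload: JSON.stringify({ taskId, chunk, part: i / chunkSize })
      }).promise()
    );
  }
  await Promise.all(invocations);
}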

As progress is made through each phase, every Lambda posts regular updates to DynamoDB, whose stream is used as the trigger mechanism to push progress to the client(s) via WebSockets.
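
A rough sketch of that push mechanism, assuming the WebSocket connection ID is stored alongside each task's progress record (the endpoint, attribute names, and event filtering are illustrative):

// Hypothetical sketch: a Lambda triggered by the DynamoDB stream pushes progress
// to connected clients through the API Gateway WebSocket management API
const AWS = require('aws-sdk');
const api = new AWS.ApiGatewayManagementApi({
  endpoint: process.env.WEBSOCKET_ENDPOINT   // assumed "<api-id>.execute-api.<region>.amazonaws.com/<stage>"
});

exports.handler = async (event) => {
  for (const record of event.Records) {
    if (record.eventName !== 'INSERT' && record.eventName !== 'MODIFY') continue;

    // Convert the stream image back into a plain object
    const progress = AWS.DynamoDB.Converter.unmarshall(record.dynamodb.NewImage);

    // connectionId is assumed to be stored with the task's progress record
    await api.postToConnection({
      ConnectionId: progress.connectionId,
      Data: JSON.stringify({ taskId: progress.taskId, percent: progress.percent })
    }).promise();
  }
};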

Challenges Faced and Future Plans:

Most AWS services that send or receive messages have hard limits on payload sizes, ranging anywhere between 64 KB and 256 KB.

The major challenge we faced was making thousands of concurrent connections to Postgres. In a horizontally scalable architecture, the single point of failure is the DB instance (or cluster), which can only serve a predetermined number of connections at a given time.

Also, opening thousands of concurrent connections to the database for each user task is not a good idea.

Serverless DB solutions such as DynamoDB and Aurora Serverless come in handy here. S3 can also be used to temporarily persist data between Lambda executions.
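
One way to stay under the payload limits above, and to persist intermediate data between Lambda executions, is a claim-check style helper: write the large payload to S3 and pass only a small pointer through EventBridge/SQS. A sketch, with the bucket and key layout assumed:

// Hypothetical claim-check sketch: large intermediate payloads go to S3 and only a
// small pointer travels through the event bus and queues
const AWS = require('aws-sdk');
const s3 = new AWS.S3();

async function storePayload(taskId, payload) {
  const key = `intermediate/${taskId}.json`;          // assumed key layout
  await s3.putObject({
    Bucket: process.env.SCRATCH_BUCKET,               // assumed scratch bucket
    Key: key,
    Body: JSON.stringify(payload)
  }).promise();
  return { bucket: process.env.SCRATCH_BUCKET, key }; // pointer passed in the event
}

async function loadPayload(pointer) {
  const obj = await s3.getObject({ Bucket: pointer.bucket, Key: pointer.key }).promise();
  return JSON.parse(obj.Body.toString('utf-8'));
}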

The entire workflow makes very few calls to the DB; the few Lambdas that need Postgres each open one small connection pool of their own.

// Declared outside the handler: the pool is created once per execution environment
// and reused across subsequent (warm) invocations of that same environment
const knex = require('knex')({
  client: 'pg',
  connection: {
    host: process.env.DB_HOST,
    port: process.env.DB_PORT,
    user: process.env.DB_USERNAME,
    password: process.env.DB_PASSWORD,
    database: process.env.DB_NAME
  },
  acquireConnectionTimeout: 10000,
  pool: {
    min: 5,
    max: 10
  },
  searchPath: ['knex', 'public'],
});

exports.handler = async (event, context, callback) => {
  // your app logic
};

Because each concurrent Lambda execution runs in its own execution environment, every execution gets its own small pool: one execution can never drain another's pool, and the pool's max setting keeps the number of Postgres connections per environment bounded.

AWS has since attempted to solve this problem by releasing RDS Proxy. Read more about it here.

Why this approach?

The workloads were previously running on EC2 clusters and Docker containers managed by AWS Batch.

  1. They were slow — startup time for a job was very high (1+ minutes).
  2. Irrespective of the workload, once a job definition is created with certain resources (RAM / CPU / GPU), it is always provisioned with the same amount of resources.
  3. Concurrent requests are queued: resources in the cluster need to be available before a job is started. One job might hog resources it does not need while other user tasks wait for them; the user has to wait until resources become available before the actual execution of the workload even begins.
  4. Handling the retry of atomic failures was difficult and verbose.
  5. Needless to say, cost! Since AWS Batch needs compute provisioned before a job can run, running our workloads in a fully serverless environment reduces the total cost and lets us scale on demand.

You do not always need servers provisioned to perform simple batch tasks. With the wide range of serverless services on offer, there might just be some way for every type of workload to go serverless.

Going forward, I will publish updates on changes made to this architecture, such as adopting Step Functions, RDS Proxy, and other interesting experiments on AWS.

Stay Tuned.

Shout out and thanks to Adrian Hornsby for reviewing this blog post!
