How to process a lot of data with AWS Serverless

Kirill Chufarov
8 min read · Aug 24, 2021


When I looked back at the latest successful project our team delivered, I could not stop smiling. Sometimes you simply enjoy a technological achievement and its results. In this case, we built a data management tool using only AWS serverless technology. The architecture was superbly efficient, scalable, and performed very well. This is the first article in the series: How to process a lot of data with AWS Serverless.

“Every medium needs a meme” — Ancient proverb

As often happens, we started from an existing customer problem: the ETL was slow, the data volume was high and increasing, and new requirements kept coming. The first question was: what type of hardware and sizing should we use? I asked myself: how about using the entire cloud's capacity for the task? Why limit yourself to one, five, or ten servers when resources are available simply on request? Decision made: we would use serverless technology and scale the infrastructure as we go.

AWS provides a wide range of serverless technologies; I will list some of the services below:

  • S3 — object/file storage, a perfect store for the input and output of any serverless process
  • Lambda functions — a compute resource that runs custom functions
  • ECS on Fargate — a compute resource that runs containers without server allocation
  • Step Functions — a workflow orchestration tool to run dependent resources in a controlled fashion
  • EMR — technically a server-based technology, but we used it as an ephemeral set of servers allocated only during runtime; AWS also allows running EMR on Fargate (via EKS, Kubernetes)
  • Glue Catalog — a metadata catalog that stores definitions of databases, tables, columns, schemas, locations, compression, and more, shared across AWS services. A powerful tool for cross-service usage
  • Athena — a serverless, Presto-based engine for querying data stored in S3

For the solution I mentioned at the beginning of the article, we used all of these AWS services. Let me share a couple of patterns we used to develop a scalable solution (one that runs thousands of Lambdas and hundreds of containers per request). In the final article, we will pull every part together into one coherent picture.

A unit of serverless work, timeouts, and the deployment problem

Let’s begin at the bottom of the pipeline. One unit of work in our case was a “job”: a specific piece of data processing code that brings in data from third-party partners, processes it, remaps, decodes, and recompresses it, loads it into the target destination, and updates the metadata store. A typical job is the lowest denominator of activity and cannot be split into smaller jobs. It can run from 5 to 3,000 seconds, depending on the particular type of activity and the input file sizes. Our team decided to run these jobs in Lambda functions whenever we could, so we prepared processing Lambda functions that can run these types of jobs (the simpler your Lambda, the better: it is easier to manage and update). As you know, Lambda functions have a timeout limitation: they can run for up to 15 minutes. We took that constraint into consideration and discovered that in this project, 95% of the jobs finish in less than 15 minutes. What did we do with the remaining 5%?

Well, we ran them in a container service: AWS ECS. In practice, I do not recommend running anything that needs more than 8–10 minutes in a Lambda function. To solve the timeout issue, we organized the service code in a particular way: the same function was deployed both as an AWS Lambda function and as an AWS ECS container. One body of code and logic was used, with a wrapper that allowed it to run either in Lambda or in a container.

You can do that quite simply in Python: use the same Lambda function code that you deploy into AWS Lambda to build the container image. In that case, you use different deployment models but one source code and logic. See an example below:
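Here is a minimal sketch of the pattern; the module, function, and environment variable names (lambda_function, handler, EVENT) are illustrative assumptions, not the project's actual code:

```python
# lambda_function.py — the job code, deployed to AWS Lambda as-is
import json

def handler(event, context):
    # The actual unit of work: fetch, remap, decode, recompress, load,
    # and update the metadata store.
    print(f"processing job: {json.dumps(event)}")
    return {"status": "done"}
```

```python
# container_entrypoint.py — a thin wrapper, baked into the ECS image
import base64
import json
import os

from lambda_function import handler

if __name__ == "__main__":
    # The coordinator passes the Lambda event as a base64-encoded
    # environment variable so the JSON survives shell quoting intact.
    event = json.loads(base64.b64decode(os.environ["EVENT"]))
    handler(event, context=None)
```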

We created a file with a Lambda function that we deploy to AWS Lambda. We also created a Python container wrapper that imports and calls the same Lambda handler. Note that we pass the Lambda inputs as environment variables into the container. To pass the JSON through without any reformatting, we base64-encode the event variable before sending it and decode it after receiving it as an environment variable.

That pattern has one caveat: you have to know beforehand how long a particular job will run. In our case, we had a simple if/then/else rule to decide where to run each process; in your case, the rule may be more complicated. That does not change my recommendation: run almost everything in Lambdas and use ECS for long-running jobs. One more piece of advice: split jobs into even smaller steps if possible. The routing rule can be as simple as the sketch below.
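A hedged illustration of such a rule (the threshold and field name are assumptions, not the project's actual logic):

```python
def choose_runtime(job: dict) -> str:
    # Anything expected to exceed ~10 minutes goes to ECS;
    # everything else fits comfortably within the Lambda timeout.
    return "ecs" if job.get("estimated_seconds", 0) > 600 else "lambda"
```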

How do you scale up easily with serverless?

Now we have our job processing mechanism deployed into AWS Lambda and ECS. You may ask: how do you run it in a controlled fashion at a large scale? Frankly, we had limited time to implement such a “coordinator” service, so we had to weigh our options.

With constraints on the list of tools to select from and limited resources, we decided to use an ECS container as the hosting service for the coordinator. The motivation was quite simple: in a custom Python service, we could better control execution, error handling, consistency, and scalability.

First, to call AWS services from Python, you want to use boto3, a great library with support from AWS and the community. To make a boto3 call, you set up a session that holds its own connection to the AWS API endpoint; then you call the AWS service (both call styles are sketched after this list), and:

  • in the case of Lambda, you invoke the function and wait for a response (a synchronous call), documentation
  • in the case of ECS, you call “run task” and then poll for the status (an asynchronous call), documentation
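A minimal sketch of both call styles, assuming hypothetical resource names (job-processor, job-cluster); note that a real Fargate task would also need launchType and networkConfiguration:

```python
import json
import boto3

session = boto3.session.Session()

# Synchronous Lambda call: invoke() blocks until the function returns.
lambda_client = session.client("lambda")
resp = lambda_client.invoke(
    FunctionName="job-processor",
    InvocationType="RequestResponse",
    Payload=json.dumps({"job_id": 1}),
)
print(json.loads(resp["Payload"].read()))

# Asynchronous ECS call: run_task() returns immediately; poll for status.
ecs_client = session.client("ecs")
task = ecs_client.run_task(cluster="job-cluster", taskDefinition="job-processor")
task_arn = task["tasks"][0]["taskArn"]
status = ecs_client.describe_tasks(cluster="job-cluster", tasks=[task_arn])
print(status["tasks"][0]["lastStatus"])
```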

Where is the scalability in this scenario? It comes from asynchronous coroutine calls. A simple explanation would be:

  1. Create a coroutine executor pool and specify the maximum parallelism (the number of “sessions”), documentation
  2. Create an event loop (the await/async mechanism for passing around and controlling events in Python), documentation
  3. Run the event loop until all async tasks are completed, then collect the results, documentation

A naive representation of the pattern:

The code snippet below provides an example of how to scale with AWS Serverless.
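This is a sketch in the spirit of the original snippet; only do_serverless_tasks is the article's own name, so every other function name, the 600-second routing threshold, and the resource identifiers are illustrative assumptions:

```python
import asyncio
import base64
import json
from concurrent.futures import ThreadPoolExecutor

import boto3


def run_lambda_job(payload):
    # Each call gets its own session (its own connection to AWS),
    # so coroutines do not block each other's runs.
    client = boto3.session.Session().client("lambda")
    resp = client.invoke(
        FunctionName="job-processor",        # hypothetical function name
        InvocationType="RequestResponse",    # synchronous call
        Payload=json.dumps(payload),
    )
    return json.loads(resp["Payload"].read())


def run_ecs_job(payload):
    client = boto3.session.Session().client("ecs")
    # Pass the event as a base64-encoded environment variable (see above).
    event = base64.b64encode(json.dumps(payload).encode()).decode()
    task = client.run_task(
        cluster="job-cluster",               # hypothetical cluster/task names
        taskDefinition="job-processor",
        launchType="FARGATE",
        networkConfiguration={
            "awsvpcConfiguration": {"subnets": ["subnet-0123456789abcdef0"]}  # hypothetical
        },
        overrides={"containerOverrides": [{
            "name": "job-processor",
            "environment": [{"name": "EVENT", "value": event}],
        }]},
    )
    task_arn = task["tasks"][0]["taskArn"]
    # run_task is asynchronous: wait until the task stops, then return.
    client.get_waiter("tasks_stopped").wait(cluster="job-cluster", tasks=[task_arn])
    return task_arn


async def run_jobs(executor, payloads):
    loop = asyncio.get_running_loop()
    tasks = []
    for payload in payloads:                 # loop over the payloads (jobs)
        # Decide where to run each payload: short jobs go to Lambda,
        # long-running jobs go to ECS.
        if payload.get("estimated_seconds", 0) <= 600:
            tasks.append(loop.run_in_executor(executor, run_lambda_job, payload))
        else:
            tasks.append(loop.run_in_executor(executor, run_ecs_job, payload))
    # Wait until ALL executions are finished and collect the results.
    return await asyncio.gather(*tasks)


def do_serverless_tasks(payloads, max_parallelism=100):
    # Parallelism is capped by the number of workers in the executor pool.
    executor = ThreadPoolExecutor(max_workers=max_parallelism)
    loop = asyncio.new_event_loop()
    try:
        return loop.run_until_complete(run_jobs(executor, payloads))
    finally:
        loop.close()


def handle_request(request):
    # Prepare the list of jobs from the incoming request up front.
    payloads = [{"job_id": i, **job} for i, job in enumerate(request["jobs"])]
    return do_serverless_tasks(payloads)
```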

run_lambda_job and run_ecs_job in the sketch above define the boto3 calls to the AWS Lambda and AWS ECS services.

run_jobs defines a coroutine that handles the following:

  1. It gets the running async event loop (created below); later, we add tasks into the executor on that loop;
  2. It loops over the payloads, or jobs;
  3. For each payload, it decides where to run it;
  4. It adds a task into the loop’s executor (one payload passed for execution at a time);
  5. It waits until ALL executions are finished;
  6. It collects the results of all tasks passed into the async event loop. Note that the parallelism of execution is defined when the loop’s executor is created (in do_serverless_tasks);
  7. One important thing to mention: each call should be made in its own session, meaning you should create a separate connection to AWS for each call so that coroutines do not block each other’s async runs.

do_serverless_tasks creates an executor with a specific number of workers, creates an async event loop (a mechanism to control execution), and calls the actual routine to run the tasks.

Finally, handle_request processes the input request: it prepares the list of jobs to run in advance and passes it to do_serverless_tasks.

The coordinator calls hundreds of AWS Lambda functions and AWS ECS tasks in parallel and waits for these calls to complete; it works through the list of tasks until it finishes processing the request. The scalability of the approach depends entirely on:

  • how granular your jobs are: split input tasks into smaller pieces and it will improve performance (for example, one data file to process, or one set of rows to load)
  • how many Lambdas you want to run: there is a default quota on concurrent Lambda executions per account per region (typically 1,000), and you can request an increase from the AWS Support team; I have seen AWS accounts with a quota of more than 10K concurrent Lambdas.

An example of a task that scales to 100–1,000 serverless sessions: imagine that you have a requirement to download data from a third party and recompress it, or convert it from CSV to Parquet. Typically you would write an ETL process that downloads and decodes/recompresses data in a loop, or a cascade of such ETL processes. But with serverless and its practically unlimited scalability, you can write one function (the unit of work) and scale it with these patterns to 1,000 sessions running in parallel, finishing the work in minutes where a traditional approach could easily take 6 hours. You can imagine the impact of such serverless patterns on data processing in a typical organization.
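A minimal sketch of such a unit of work, assuming pandas with s3fs and pyarrow are available in the Lambda image, and that the bucket and key names arrive in the event:

```python
import pandas as pd

def handler(event, context=None):
    # One unit of work: a single CSV object in, a single Parquet object out.
    src = f"s3://{event['bucket']}/{event['key']}"
    dst = src.rsplit(".", 1)[0] + ".parquet"
    df = pd.read_csv(src)                         # reads from S3 via s3fs
    df.to_parquet(dst, compression="snappy")      # writes via pyarrow
    return {"output": dst}
```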

As you can see, there is practically no limit on how many sessions you can run; the largest constraint is how many sessions you can open with the boto3 library to run Lambdas/ECS asynchronously. Our team ran hundreds of parallel executions without any problems. To scale beyond 100–1,000 concurrent jobs, use the two following strategies:

  1. run multiple coordinator sessions (instances of the code that calls the routines)
  2. use the async boto3 library, aioboto3: https://pypi.org/project/aioboto3/ (a short sketch follows below)
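A minimal sketch of the second strategy with aioboto3, again assuming a hypothetical job-processor function; here a single shared client replaces the thread pool entirely:

```python
import asyncio
import json

import aioboto3


async def invoke_all(payloads):
    session = aioboto3.Session()
    async with session.client("lambda") as client:
        async def invoke(payload):
            # Non-blocking invoke: the event loop interleaves all calls.
            resp = await client.invoke(
                FunctionName="job-processor",
                InvocationType="RequestResponse",
                Payload=json.dumps(payload),
            )
            return json.loads(await resp["Payload"].read())

        return await asyncio.gather(*(invoke(p) for p in payloads))


results = asyncio.run(invoke_all([{"job_id": i} for i in range(500)]))
```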

In the next article, we will touch on the first point of this strategy: we will talk about the control mechanism and workflow.

Conclusion

We have looked at patterns that improve data processing rates and performance with no bump in costs. I encourage you to use serverless: process data faster and deliver it to the business. It will make a difference, increase efficiency, and provide a better service to your customers.

In the end, we serve customers; it is their experience that matters.

To be continued.
