AWS Lambda Functions with 11 TB RAM? Yes we can! (sort of)

Using self-terminating EC2 instances for heavy-duty tasks in serverless architectures

Jonas Peeck
Axel Springer Tech

--

In one of our recent projects, parts of our serverless processing pipeline were getting too computationally intensive for what AWS Lambda functions can handle.

So instead of refactoring the whole system, we took an unusual route: we replaced the Lambda function in question with a self-terminating EC2 instance.

Here’s what we’ve learned :)

Exceeding Lambda CPU & Memory limits

While Lambda functions currently have a maximum timeout of 900 seconds (15 minutes), they also have some constraints on computational power.

AWS Lambda functions can be configured to use up to 10 GB of RAM, a setting that also controls how many vCPUs are allocated for each run.

From the AWS Lambda documentation: memory and CPU limits
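In aws-cdk terms (which is what we use for the rest of our infrastructure below), maxing out a function looks roughly like this. This is just a sketch with placeholder names, assuming a compiled Go binary on the provided.al2 custom runtime:

```typescript
import { Duration } from 'aws-cdk-lib';
import * as lambda from 'aws-cdk-lib/aws-lambda';

// Inside a Stack constructor. 10,240 MB is the current memory ceiling,
// and the CPU share scales with memory (there is no separate vCPU setting).
const aggregationFn = new lambda.Function(this, 'AggregationFn', {
  runtime: lambda.Runtime.PROVIDED_AL2,        // custom runtime for a Go binary
  handler: 'bootstrap',
  code: lambda.Code.fromAsset('./bin/lambda'), // placeholder path
  memorySize: 10240,                           // MB, the maximum
  timeout: Duration.minutes(15),               // the maximum timeout
});
```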

For most use cases that's plenty. But it can get you into trouble when your computational needs start to exceed these limits (as happened in our case).

What's important to remember in this scenario is that serverless doesn't just mean zero deployment work and pay-per-invocation; it also means an event-based model for triggering invocations.

So switching away from a serverless architecture for parts of the pipeline would also have meant refactoring the invocation logic from a push model (events trigger the computation) to a pull model (queues are filled and worker nodes listen to them).

Unless you find another way…

Self-terminating EC2 instances to the rescue

So yes, we probably could have refactored our too-big-for-its-shoes Lambda function to run as an ECS task and listen to an SQS queue like a normal person.

We decided to take an unusual route instead.

To preserve our serverless architecture, where the individual parts of our processing pipeline were triggered either by a cron schedule or by a file from the previous stage being written to S3, we decided to try and scale up the Lambda function instead.

The Lambda function in question was an aggregation step anyway, so nothing we'd want to invoke in a highly parallel manner. Plus, we had written our Lambda code in Go and in a way that was locally executable, so it was easy for us to run it on EC2 instead of in a Lambda context.

To make it work, we added three things to our aws-cdk-based infrastructure code:

#1 Upload binaries to S3

First, we had to make sure we had code to run on the EC2 machines. To keep it simple, we used the aws-s3-deployment aws-cdk module to upload our locally compiled Go binaries to S3:
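A minimal sketch of what that can look like in aws-cdk (TypeScript, aws-cdk-lib v2); the construct IDs and the local ./bin path are placeholders, not necessarily our actual names:

```typescript
import { Stack, StackProps, RemovalPolicy } from 'aws-cdk-lib';
import { Construct } from 'constructs';
import * as s3 from 'aws-cdk-lib/aws-s3';
import * as s3deploy from 'aws-cdk-lib/aws-s3-deployment';

export class WorkerStack extends Stack {
  constructor(scope: Construct, id: string, props?: StackProps) {
    super(scope, id, props);

    // Bucket that holds the locally compiled Go binaries (and, later, the logs)
    const binaryBucket = new s3.Bucket(this, 'BinaryBucket', {
      removalPolicy: RemovalPolicy.DESTROY,
    });

    // Upload the contents of the local ./bin folder to that bucket on every deploy
    new s3deploy.BucketDeployment(this, 'DeployBinaries', {
      sources: [s3deploy.Source.asset('./bin')],
      destinationBucket: binaryBucket,
    });
  }
}
```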

#2 Create EC2 launch template with user data script

Next, we had to find a way to run those binaries on our EC2 instances on startup.

Luckily, it is possible to define a startup script for EC2 machines. AWS refers to this as “user data” and, like the rest of our infrastructure setup, it can easily be configured in aws-cdk code.

In this user data script, we accomplished four things: (a) download our binaries from S3 to the local /temp folder, (b) run the binary and pipe its output into a log file, (c) upload that log file to S3 for debugging, and (d) shut the machine down:
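Sketched as aws-cdk user data (the bucket and binary names are placeholders, and we're assuming an Amazon Linux AMI so the AWS CLI is already installed):

```typescript
import * as ec2 from 'aws-cdk-lib/aws-ec2';

const userData = ec2.UserData.forLinux();
userData.addCommands(
  // (a) Download the binary from S3 into the local /temp folder
  'mkdir -p /temp',
  'aws s3 cp s3://my-binary-bucket/aggregator /temp/aggregator',
  'chmod +x /temp/aggregator',
  // (b) Run the binary and pipe its output into a log file
  '/temp/aggregator > /temp/aggregator.log 2>&1',
  // (c) Upload the log file to S3 for debugging
  'aws s3 cp /temp/aggregator.log s3://my-binary-bucket/logs/run-$(date +%s).log',
  // (d) Shut the machine down (this terminates it, see the launch template below)
  'shutdown -h now',
);
```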

All we had to do next was bake that user-data script into an EC2 launch template, which we would later use to launch an EC2 instance from our original Lambda function (the one we're replacing with this much more powerful EC2 machine):
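Continuing in the same stack, a sketch of the launch template (the instance type, role and names are placeholder choices; userData and binaryBucket come from the snippets above):

```typescript
import * as ec2 from 'aws-cdk-lib/aws-ec2';
import * as iam from 'aws-cdk-lib/aws-iam';

// Instance role so the machine can download the binary and upload its log
const instanceRole = new iam.Role(this, 'WorkerInstanceRole', {
  assumedBy: new iam.ServicePrincipal('ec2.amazonaws.com'),
});
binaryBucket.grantReadWrite(instanceRole);

const launchTemplate = new ec2.LaunchTemplate(this, 'WorkerLaunchTemplate', {
  launchTemplateName: 'worker-launch-template',
  machineImage: ec2.MachineImage.latestAmazonLinux2(),
  instanceType: ec2.InstanceType.of(ec2.InstanceClass.M5, ec2.InstanceSize.XLARGE4),
  role: instanceRole,
  userData,
  // TERMINATE (instead of the default STOP) makes the shutdown call in the
  // user-data script get rid of the instance for good.
  instanceInitiatedShutdownBehavior: ec2.InstanceInitiatedShutdownBehavior.TERMINATE,
});
```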

Notice how we're setting the shutdown behavior to TERMINATE. This lets us call shutdown (the standard Unix command) from the user-data script to ensure the machine is terminated once our binary has completed its work.

#3 Launch EC2 instance from lambda function

With all of the pieces in place, all that was left to do was launch the EC2 instance from a Lambda function. We actually used the Lambda function we wanted to replace to accomplish that.

In effect, this meant that the Lambda function we needed to replace was still triggered by the same event (an S3 upload). Instead of doing the work itself, however, it now merely launches an EC2 instance with our previously crafted launch template:
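Our actual handler was written in Go; here is the same idea sketched as a TypeScript Lambda handler using AWS SDK v3. The launch template name matches the placeholder from the CDK sketch above, and the function's execution role needs ec2:RunInstances plus iam:PassRole for the instance role:

```typescript
import { EC2Client, RunInstancesCommand } from '@aws-sdk/client-ec2';
import type { S3Event } from 'aws-lambda';

const ec2 = new EC2Client({});

// Still triggered by the same S3 upload event as before, but instead of doing
// the aggregation itself it just starts one instance from the launch template.
export const handler = async (_event: S3Event): Promise<void> => {
  await ec2.send(
    new RunInstancesCommand({
      LaunchTemplate: { LaunchTemplateName: 'worker-launch-template' },
      MinCount: 1,
      MaxCount: 1,
    }),
  );
};
```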

Since we're using the launch template we defined earlier, everything the machine needs is taken care of: access rights, machine type and the user data (which shuts everything down after the computation is done) were all baked into the launch template, so now we can leave the machine to do its thing.

Learnings & Drawbacks

Overall this approach worked surprisingly well for us. The machines did their job and we didn’t have to worry about timeouts in that particular part of our processing pipeline anymore.

There are some drawbacks to this approach, however:

Spin-up time is definitely longer. If low response latency is critical for your app, this might be a bit too slow for you. This approach is also hard to pull off when you need to pass event data down to the EC2 machine. In our case we just needed to run an aggregation that scanned an entire S3 bucket anyway, so no runtime parameters were needed.

One annoying thing we learned is that user-data scripts are a real pain in the ass to debug. You constantly have to ssh into a machine and read logs to figure out why it didn’t work that time. Things like ECS task definitions are definitely much less headache-inducing. But once it worked, it worked flawlessly.

Pricing was no problem either. Since we shut down the EC2 machines once they're done, we were effectively still charged on a per-use basis (you're still not paying for idle machine time). You do want to make sure you always shut down your development machines, though: in our case we generated a pretty hefty bill by not noticing that we had ten EC2 instances running over the weekend on our development system (we had disabled the shutdown command for debugging purposes).

Lambda Functions with 11 TB RAM?

So no, of course we didn't find a magic hack that circumvents AWS Lambda's restrictions and scales Lambda functions beyond their computational limits.

But we did find a way to make an EC2 instance behave sort of like a Lambda function: it only runs when invoked and shuts itself down after it's done.

Overall, this approach worked pretty well and didn't require much setup. Once we had the user-data script nailed down, it was a pretty minimalistic thing: no autoscaling, no load balancers, no ECS task definitions, just a simple EC2 launch template triggered by a Lambda function.

And we could stick with our event-based serverless pipeline. Neat :)

--

Thanks for reading!

I’m Jonas and I work at Axel Springer — Europe’s largest publisher — where I’m uniting our 3000+ developers worldwide into one global Tech Community.

If you liked this blogpost, feel free to clap 👏🏻 or share it on socials!

https://www.linkedin.com/in/aGuyNamedJonas/
jonas.peeck@axelspringer.com

💼 Check out our open positions in Tech
