AWS Lambda: Unsafe at Any Scale

NOTE: I wrote this months ago and it has sat in my drafts since then. It contains some useful info though, so I’m going to post it as-is.


Let me start out by saying that I ♥ AWS Lambda. Lambda gives me a glimmer of hope that some day soon, rebooting and logging into a server will be a thing of the past.

Lambda leapfrogs containers (Docker) and all those other things that are supposed to make managing servers easier — but instead just add a new thing to worry about — and takes us into a new post-server world.

That’s a day I look forward to like retirement, because fuck servers and all the operational overhead they bring.

Security… Ugh.

Maintenance… Yuck.

Command line… Pass; and to everyone who says emacs/vi > gui… I hate you.

Even though that day is closer than ever, it unfortunately remains distant, because right now Lambda is unusable in production.

What is Lambda?

You probably already know, but in case you don’t… AWS Lambda lets you run small, arbitrary programs/code in a scalable way, without the need to manage the infrastructure yourself.

Or how AWS describes it:

AWS Lambda is a compute service where you can upload your code to AWS Lambda and the service can run the code on your behalf using AWS infrastructure. After you upload your code and create what we call a Lambda function, AWS Lambda takes care of provisioning and managing the servers that you use to run the code.
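To make that concrete, here’s roughly what the code you upload looks like. This is a minimal, hypothetical Python handler (the event shape depends entirely on whatever invokes it):

```python
# handler.py: a minimal, hypothetical Lambda function.
# Lambda calls handler(event, context) on every invocation; `event` is
# whatever payload the trigger sends, `context` carries runtime metadata
# (request id, remaining execution time, etc.).

def handler(event, context):
    name = event.get("name", "world")
    print("invoked with:", event)      # ends up in CloudWatch Logs
    return {"message": "hello, " + name}
```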

The concept is new to AWS but has been around for a while from companies like Iron.io.

How it works

As far as I can tell, it is either built directly on top of Elastic Beanstalk (which is EC2 + AutoScaling + some other sugar) or just designed very similarly.

From your perspective you put in code and it just works.

As for price, you get charged based on execution duration and some other stuff I don’t feel like looking up right now.

It really is a very cool service.

Common use-cases

  • Handle events triggered by other AWS services, e.g. an S3 object is created -> a Lambda function transforms the new object
  • Cron, i.e. a Lambda function that executes on a schedule
  • Manual invocation, i.e. you use the Lambda API to invoke a function yourself (quick sketch below)
  • And more recently, API Gateway, with Lambda as the backend. (“Very cool” doesn’t begin to describe it!)
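For the manual case, invoking a function really is a single API call. A rough boto3 sketch (the function name and payload are made up):

```python
import json
import boto3

lam = boto3.client("lambda")

# Synchronous invocation: blocks until the function returns.
resp = lam.invoke(
    FunctionName="my-function",            # hypothetical function name
    InvocationType="RequestResponse",      # use "Event" for async, fire-and-forget
    Payload=json.dumps({"name": "lambda"}),
)
print(resp["Payload"].read())
```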

It’s a trap.

Lambda sounds great because it is, at least in theory.

Unfortunately though, right now it’s fundamentally broken to the point of being unusable due to one “feature”: Concurrency Limits, a.k.a. Safety Throttles.

From the documentation:

The throttle is applied to the total concurrent executions across all functions within a given region.

It goes on:

If your account exceeds the safety throttle at any time, any of your functions in the region may be throttled.

Purpose

AWS claims these limits are for your safety but in reality they’re for theirs.

Since you’re not billed per “thread”, you have no incentive not to process an entire queue backlog concurrently.

For example, if you have 1M tasks that need work and no per-thread cost… why wait? Just run 1M concurrent executions and be done now!
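A hypothetical sketch of what that looks like: nothing stops you from fanning the whole backlog out as async invocations, one per task, and letting Lambda figure out the concurrency.

```python
import json
import boto3

lam = boto3.client("lambda")

# With no per-thread cost, the "obvious" move is to throw the entire
# backlog at Lambda at once, one async invocation per task.
for task_id in range(1000000):             # pretend 1M-task backlog
    lam.invoke(
        FunctionName="process-task",       # hypothetical worker function
        InvocationType="Event",            # async: queue it and move on
        Payload=json.dumps({"task_id": task_id}),
    )
```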

Lambda wouldn’t be able to cope with that sudden concurrency spike, and that’s the real reason for the limit (IMO).

You got throttled, son.

Once you hit that concurrency limit, your functions will start getting throttled.

Throttling means that an attempt was made to invoke a Lambda function but it was intercepted and rejected.

This rejection takes a few different forms. Again from the Lambda Limits documentation:

If Lambda functions are invoked synchronously […] (error code 429). If Lambda functions are invoked asynchronously […] retried for up to 15–30 minutes
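In boto3 terms (a sketch, not anything official), a synchronous invoke that gets throttled comes back as a TooManyRequestsException carrying that 429:

```python
import json
import boto3
from botocore.exceptions import ClientError

lam = boto3.client("lambda")

try:
    lam.invoke(
        FunctionName="my-function",        # hypothetical
        InvocationType="RequestResponse",  # synchronous
        Payload=json.dumps({"name": "lambda"}),
    )
except ClientError as err:
    # When the region-wide concurrency limit is hit, the Invoke call is
    # rejected with TooManyRequestsException (HTTP 429).
    if err.response["Error"]["Code"] == "TooManyRequestsException":
        print("throttled: back off and retry, or fail the request")
    else:
        raise
```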

Attacks and Outages

Regardless of the reason you hit the concurrency limit, the result will be the same: all of your functions could be throttled.

All your microservices are belong to us.

This throttling means that anything with Lambda upstream or downstream could be impacted.

API Gateway backed by Lambda? End-users get errors.

Kinesis Stream being handled by Lambda? Not anymore.

S3 or DynamoDB events being sent to Lambda? Nope, but luckily they each have a 24h memory. Hopefully things don’t last longer than that; otherwise, have fun recovering from that…

S3 outage turned cascading failure

For some, an S3 outage is an isolated annoyance that does not impact other services.

However, with Lambda, it can turn into a cascading failure that brings down the company.

Consider this scenario with an API Gateway + Lambda implementation that writes to S3:

  1. You ignorantly putObject, because there’s no alternative
  2. Maybe S3 isn’t completely down, but instead has extremely slow transfer speeds
  3. A timeout eventually happens, either on your putObject request, the Lambda execution duration, or the API Gateway -> Lambda 10s timeout. (Meanwhile, your end user is on hold wondering what’s taking so long.)

With this scenario and a busy API, you’ll quickly hit the Lambda Concurrency Limits with only a handful of requests per second: each stuck request holds a concurrency slot for up to 10 seconds, so even 10 requests/second adds up to ~100 concurrent executions.
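About the only thing you can do inside the function is fail fast, so a slow S3 doesn’t pin your concurrency slot for the full timeout. A hedged sketch (the timeout numbers are made up, and this only shrinks the window; requests still stack up if S3 stays slow):

```python
import boto3
from botocore.config import Config

# Hypothetical mitigation: aggressive client timeouts so a struggling S3
# fails the invocation quickly instead of holding a concurrency slot for
# the full Lambda / API Gateway timeout.
s3 = boto3.client(
    "s3",
    config=Config(
        connect_timeout=1,                 # seconds; made-up numbers
        read_timeout=2,
        retries={"max_attempts": 1},
    ),
)

def handler(event, context):
    s3.put_object(
        Bucket="my-bucket",                # hypothetical bucket
        Key=event["key"],
        Body=event["body"].encode("utf-8"),
    )
    return {"ok": True}
```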

Old dog, old trick. Now with more Cloud™.

In the non-cloud world this type of attack is known as an HTTP Flood, which is fancy speak for sending seemingly innocuous traffic to a weak target until the requests stack up and eventually cause death… errrr, a DoS.

As you know though, in our Lambda world, this means every function in your account can be impacted when any one of them exhausts the shared limit.


What can be done?

By you? Very little.

This is really something that needs to be fixed on AWS’ side, either with some kind of QoS/prioritization or with isolation, i.e. concurrency limits that aren’t account-wide.

For comparison’s sake, Iron.io (it seems) never had this problem and only introduced max concurrency later, in a per-worker/function way. need to confirm this TK

You might be able to rig something up with CloudWatch, but I’m not all that familiar with that service. My basic understanding though is: it can’t fix this. At least not without significant work.
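The closest thing I can picture is alarming on Lambda’s region-wide Throttles metric so you at least find out when it starts happening. A rough sketch (alarm name and SNS topic are placeholders); it alerts, it doesn’t prevent anything:

```python
import boto3

cw = boto3.client("cloudwatch")

# Alarm on the aggregate "Throttles" metric for the region so you get
# paged when invocations start getting rejected.
cw.put_metric_alarm(
    AlarmName="lambda-throttles",          # placeholder name
    Namespace="AWS/Lambda",
    MetricName="Throttles",
    Statistic="Sum",
    Period=60,
    EvaluationPeriods=1,
    Threshold=1,
    ComparisonOperator="GreaterThanOrEqualToThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:ops-alerts"],  # placeholder topic ARN
)
```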

As I said… Lambda is unusable in production. Am I still going to use it in production? Probably.