Do’s and Don’ts of AWS Lambda
This post touches on some things I’ve learned using AWS Lambda to build the Apex Ping uptime monitoring tool, which is built almost exclusively with Lambda in all 10 regions supported.
Don’t substitute FaaS for writing good libraries
People often ask me how to test Lambda functions locally. My answer is: don’t! Write and test libraries; integration with Lambda should come after you have an initial suite of libraries containing your application logic.
This applies to working with HTTP servers as well. It’s generally a bad idea to place the bulk of your logic in the HTTP routes themselves: not only are you creating more surface area to test, but you’re rendering that logic unusable outside of HTTP. This is probably one of the most common problems I’ve seen in development; for some reason there’s a tendency to gravitate towards placing this logic in strange places.
Some people argue unit testing is the least useful kind of testing. I would argue it’s the most useful: reduce your logic until it’s easily unit-testable and you’ll have much less to test up the stack.
With this approach your Lambda functions should become only very small wrappers for your library code, and if you really need to see how they behave as a complete pipeline, push them to your staging environment and give them a spin.
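A minimal sketch of that split, assuming a hypothetical `summarize_checks` library function and a made-up event shape (neither comes from Apex Ping). The logic lives in a plain function that can be unit-tested on its own, and the handler only translates the event:

```python
# Library code: plain functions with no knowledge of Lambda.
def summarize_checks(checks):
    """Return counts of passing and failing checks (hypothetical logic)."""
    up = sum(1 for c in checks if c["status"] < 400)
    return {"up": up, "down": len(checks) - up}


# Lambda wrapper: unpack the event, call the library, return the result.
def handler(event, context):
    # The "checks" key is an assumed event shape for illustration.
    return summarize_checks(event.get("checks", []))
```

Almost everything worth testing here is in `summarize_checks`, which needs no Lambda tooling to exercise.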
Libraries are forever.
Utilize SNS for ephemeral work
Sometimes you have ephemeral tasks which can be safely discarded if critical errors occur. This is where SNS comes in handy. It’s not a full-on queue such as SQS; it’s a publish-subscribe model which happens to support retries and back-off as well.
Creating distinct stages between work gives you more flexibility to tweak the intermediary (SQS, SNS, Kinesis, …), gives you more visibility, and eases refactoring.
For example, my use-case with Apex Ping is performing “check” requests, and querying Elasticsearch periodically for alerting. Both of these tasks are important of course, but it doesn’t make sense to hold them in a queue indefinitely as they’ll be picked up the next minute. It’s best to retry a few times and give up in the case of critical errors.
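SNS applies this policy for you, so the following is only an illustration of "retry a few times, then give up". The `retry` helper, its defaults, and the doubling delay are assumptions for the sketch, not SNS's actual implementation:

```python
import time


def retry(task, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call task() up to `attempts` times, doubling the delay between tries.

    Returns the task's result, or None once every attempt has failed,
    which mirrors discarding ephemeral work rather than queueing it forever.
    """
    for attempt in range(attempts):
        try:
            return task()
        except Exception:
            if attempt == attempts - 1:
                return None  # give up; the next scheduled run will try again
            sleep(base_delay * 2 ** attempt)
```

The `sleep` parameter is injectable so the policy can be unit-tested without actually waiting.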
Lambda imposes a concurrency limit (which may be adjusted via support), so it may be critical to use something like SNS in-between calls in order to retry when throttled. Many AWS services rate-limit like this. For example, SES may have a low limit such as 15 emails per second; SNS is a great fit here since the call will simply error and be retried.
It’s worth noting that async invokes behave as if powered by SNS, and will perform up to 3 tries upon error. This may be fine for many cases, however SNS will give you more control over the retry policy such as backoff customization and so on. If this functionality is in fact built on SNS, you may see some function-level configuration for it in the future.
One critical thing missing here is that you cannot limit the number of concurrent calls. For example, if you have 500 “jobs” to perform and you queue those in SNS, Lambda will attempt to invoke all 500 concurrently, and this may place a large burden on other services such as RDS. I’d like to see an option to limit this.
Define distinct pipeline stages
Suppose you’re computing weekly reports and they require expensive queries to be run before the email can be generated. If the task were to fail you’d be starting all over again, which is clearly not ideal. The boundaries between sub-tasks are a great place to introduce a service such as SNS, SQS, or Kinesis to handle retries and back-off.
You’ll also gain additional insight by seeing latency and error metrics for each distinct function. This is where Lambda shines: it’s not a particularly good solution for websites or APIs, but it is fantastic for pipelines and data processing.
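As a rough sketch of the weekly-report example split into two stages: the in-memory `topic` list stands in for SNS, SQS, or Kinesis, and the function names, message shape, and placeholder rows are all hypothetical.

```python
# An in-memory list stands in for SNS, SQS, or Kinesis (assumption).
topic = []


def run_queries(event, context):
    """Stage 1: run the expensive queries once and hand the result on."""
    rows = [("signups", 42), ("churn", 3)]  # placeholder for real queries
    topic.append({"week": event["week"], "rows": rows})


def send_report(message, context, deliver):
    """Stage 2: build and send the email from stage 1's output.

    If `deliver` raises, only this stage is retried; the queries are not rerun.
    """
    body = "\n".join(f"{name}: {value}" for name, value in message["rows"])
    deliver(f"Report for {message['week']}\n{body}")
```

Because the second stage only depends on the message it receives, a failed email can be retried from the intermediary without touching the database again.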
Enable SNS delivery status notifications
When you’re invoking Lambda functions via SNS you may experience some pain if you do not enable delivery status on the topic: if there’s an error invoking the function, SNS will simply swallow it.
Once enabled you can choose to log a percentage of successful invocations, or none, and more importantly log the errors to CloudWatch Logs.
This is a terrible default and I hope AWS changes it, but you’re stuck with it for now!
Utilize CloudWatch events to pre-warm functions
When Lambda functions are brand new or infrequently invoked they’ll become “frozen”, and require “thawing” before a call can be completed. This process typically takes roughly 1–2 seconds, leading to a poor user experience if it’s a user-facing call.
It doesn’t take much to cross the threshold of the function being “always available”, so I wouldn’t worry about this too much. However, you can use CloudWatch Scheduled Events to invoke the function periodically to keep it warm.
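One way to keep warm-up pings from running real work is to have the handler recognize them and return early. This sketch relies on the `source` and `detail-type` fields that CloudWatch Scheduled Events carry; `do_work` and the return values are placeholders:

```python
def is_warmup(event):
    """True for CloudWatch Scheduled Events, which set these two fields."""
    return (
        isinstance(event, dict)
        and event.get("source") == "aws.events"
        and event.get("detail-type") == "Scheduled Event"
    )


def do_work(event):
    # Placeholder for the function's real logic (hypothetical).
    return event.get("value", 0) * 2


def handler(event, context):
    if is_warmup(event):
        return {"warmed": True}  # skip real work on keep-warm pings
    return {"result": do_work(event)}
```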
Keeping track of versioning
Versioning with a suite of several dozen Lambda functions gets a little crazy. Initially I tried versioning each independently, even keeping change-logs. As you can imagine, this becomes unwieldy in no time.
I’ve had good luck with just assuming everything in master is deployed. I use apex(1) for managing Lambda functions which supports rollbacks in case anything goes wrong.
Ideally I think tooling such as ‘Serverless’ or apex(1) would keep track of exactly which Git SHA is deployed, which is especially useful in the case of rollbacks.
Utilize micro-batching for I/O bound work
If your Lambda functions are CPU bound there isn’t much sense in batching; you’re better off using individual invocations. However, if your functions are I/O bound you can save a lot of money by “micro”-batching the work into fewer invocations.
My use-case with uptime monitoring is a perfect example: performing these requests is about as I/O bound as you can get. Instead of each “check” request being performed in a distinct Lambda function, I queue them in small batches and execute each batch in parallel within a single function, saving an easy ~66%. A single Lambda function ends up doing the work that several would normally do, effectively just waiting for responses in this case.
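A small sketch of the idea, not Apex Ping's actual code: `chunk` splits the work into batches, and `run_batch` overlaps the I/O waits inside one invocation using threads. The `urls` event key and the `check` callable are assumptions:

```python
from concurrent.futures import ThreadPoolExecutor


def chunk(items, size):
    """Split items into lists of at most `size` elements."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def run_batch(urls, check):
    """Run one batch inside a single invocation, overlapping the I/O waits."""
    with ThreadPoolExecutor(max_workers=len(urls) or 1) as pool:
        return list(pool.map(check, urls))


def handler(event, context, check):
    # "urls" is an assumed event shape; `check` performs one request.
    return run_batch(event["urls"], check)
```

The producer would publish one message per chunk rather than one per URL, so the number of invocations drops by roughly the batch size while wall-clock time per invocation stays close to that of the slowest request.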
Cheaper isn’t better
Pricing AWS Lambda usage can be a little tricky, but I highly recommend evaluating your work-load with different function sizes. More often than not you’ll find that the more expensive functions offering more RAM and CPU will actually save you money as they require less time.
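The arithmetic is easy to check for your own functions. This sketch assumes billing in 100 ms increments and an illustrative price per GB-second; both are parameters you should replace with current figures, and the two durations are hypothetical measurements:

```python
import math


def invocation_cost(memory_mb, duration_ms, price_per_gb_second, granularity_ms=100):
    """Cost of one invocation: billed seconds x memory in GB x unit price."""
    billed_ms = math.ceil(duration_ms / granularity_ms) * granularity_ms
    return (billed_ms / 1000) * (memory_mb / 1024) * price_per_gb_second


PRICE = 0.00001667  # assumed USD per GB-second; check current pricing

small = invocation_cost(128, 2000, PRICE)  # hypothetical: 2 s at 128 MB
large = invocation_cost(512, 400, PRICE)   # hypothetical: 0.4 s at 512 MB
```

With these made-up numbers the 512 MB function bills 0.2 GB-seconds per call against 0.25 GB-seconds for the 128 MB one, so the "more expensive" size comes out cheaper. Whether that holds for you depends entirely on how much faster your workload actually runs.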
Utilize CloudWatch alarms
When working with Lambda you’ll definitely want to utilize CloudWatch alarms to inform you when durations go beyond your expectations or throttling occurs.
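As a sketch, the alarm definitions can be built as plain dictionaries using the `Throttles` and `Duration` metrics from the `AWS/Lambda` namespace. The `lambda_alarm` helper, the function name "checker", and the thresholds are all hypothetical:

```python
def lambda_alarm(function_name, metric, threshold, statistic):
    """Build keyword arguments for a CloudWatch alarm on one Lambda metric."""
    return {
        "AlarmName": f"{function_name}-{metric.lower()}",
        "Namespace": "AWS/Lambda",
        "MetricName": metric,
        "Dimensions": [{"Name": "FunctionName", "Value": function_name}],
        "Statistic": statistic,
        "Period": 60,  # seconds per datapoint
        "EvaluationPeriods": 1,
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanThreshold",
    }


# "checker" and the thresholds are hypothetical values.
throttled = lambda_alarm("checker", "Throttles", 0, "Sum")
slow = lambda_alarm("checker", "Duration", 3000, "Average")  # milliseconds
```

With boto3 installed, each dictionary could be passed along as `boto3.client("cloudwatch").put_metric_alarm(**throttled)`; you would normally also set `AlarmActions` so the alarm notifies someone.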
Unfortunately AWS does not make the concurrency available to you as a metric, which means you cannot alert when you’re approaching the concurrency limit, so it’s best to ask support for a limit much higher than you require. Ultimately it’s AWS’ fault for not making this observable, so they should be ok with providing a high value. The last thing you want is for your functions to screech to a halt due to a persistent increase in concurrency.
EDIT: The default concurrency limit is now 1,000 instead of 100.
If you can provide a statically linked binary for Amazon Linux then you should be able to ship the binary in your function and run it. This is a bit of a hack, but it can work great for running things like PhantomJS in Lambda. This is what allows apex(1) to run Go in Lambda.
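A rough sketch of such a shim, assuming a one-shot protocol where the binary reads the event as JSON on stdin and writes a JSON result to stdout. That protocol and the `./main` path are assumptions for illustration; this is not a description of how apex(1) does it:

```python
import json
import subprocess


def run_binary(command, event):
    """Send the event to a child process as JSON and parse its JSON reply."""
    result = subprocess.run(
        command,
        input=json.dumps(event).encode(),
        stdout=subprocess.PIPE,
        check=True,  # raise if the binary exits non-zero
    )
    return json.loads(result.stdout.decode())


def handler(event, context):
    # "./main" is a hypothetical path to a binary shipped in the function zip.
    return run_binary(["./main"], event)
```

Spawning a process per invocation is the simplest form; a longer-lived child process that is reused across invocations would avoid the startup cost.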
Running any web server in Lambda
Another trick you can do is to run an HTTP server within the Lambda function. For example, take a regular Node or Go HTTP server and launch it in a sub-process, binding to /tmp/server.sock (or similar) via unix domain sockets. Then proxy JSON representations of a request to the HTTP server, and convert the response to JSON on the way out, thus allowing you to run a “regular” server.
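Here is a simplified sketch of the proxying step. To stay self-contained it swaps in an in-process server thread on a loopback TCP port rather than a sub-process on a unix domain socket, and the event shape (`method`, `path`, `headers`, `body`) is an assumption:

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class App(BaseHTTPRequestHandler):
    """Stand-in for an existing, unmodified HTTP application."""

    def do_GET(self):
        body = f"hello from {self.path}".encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the sketch quiet


# Start the server once per container, on an OS-assigned loopback port.
server = HTTPServer(("127.0.0.1", 0), App)
threading.Thread(target=server.serve_forever, daemon=True).start()


def handler(event, context):
    """Turn a JSON request description into a real request and back."""
    conn = http.client.HTTPConnection("127.0.0.1", server.server_address[1])
    try:
        conn.request(
            event.get("method", "GET"),
            event.get("path", "/"),
            body=event.get("body"),
            headers=event.get("headers", {}),
        )
        res = conn.getresponse()
        return {
            "status": res.status,
            "headers": dict(res.getheaders()),
            "body": res.read().decode(),
        }
    finally:
        conn.close()
```

The server is started at module load so it survives across warm invocations; only the thin `handler` runs per request.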
Once API Gateway supports full passthrough I suspect this will be how a lot of people run existing code in Lambda, though ultimately it would be nice if AWS had a more Lambda-like container offering.
Shameless plug: If you have a website, app, or API, and you have customers, check out Apex Ping for uptime monitoring and keep them 😊.