7 AWS Lambda Tips from the Trenches

Fourthcast has used AWS Lambda to host Alexa skills since the early days, and has learned plenty of pitfalls and valuable lessons along the way

Mitchell Harris
A Cloud Guru
6 min read · May 13, 2017


The Fourthcast development team has been using AWS Lambda to host Alexa skills since the early days of the Alexa Skills Kit. Lambda, at times, can be like your neighbor’s pit bull. Sure, it looks all cute and fluffy, but you know that at any time something vicious could happen. We’ve experienced many of those “Lambda bites”. Here’s what you should do to avoid them yourself.

1. Keep Your Instances Warm

You may already know that Lambda functions, when not used for a while, get recycled. The next invocation has to redeploy the function, which takes extra time and adds latency for your users. We call this a “cold” invocation, as opposed to a “warm” invocation.

What you may not have known is that cold invocations are much worse if your Lambda function uses the network. In our timings of a very simple Lambda skill, invoking a cold non-networked function took 7 times as long as a warm one. But a cold function that uses the network took 15 times longer than a warm one.

Even worse, if your function is inside a VPC, it can take more than 10 seconds to attach the Elastic Network Interface. We’ve had production skills time out without ever really doing a thing besides trying to talk to the network on a cold start.

To address this, Fourthcast uses a warming trigger on all of our Lambda functions. We attach the CloudWatch Events — Schedule trigger with a 5-minute period. Since Lambda functions go cold after around 7 minutes of non-use, this keeps the function warm pretty much continually and significantly improves startup latency.

A warming trigger on a skill

However, be warned that your function won’t stay warm forever. Redeploying code or changing configuration will always cause a recycle. Also, Alexa skills with heavy, concurrent use will require multiple deployments running simultaneously, and that second (or third) deployment will start cold; the warming trigger only keeps one deployment warm. Finally, Lambda functions are recycled periodically no matter what. We see a forced recycle about 7 times a day.

If you use the warming trigger, be sure to ignore events that don’t carry the Alexa request key, and don’t rely on your invocation count being meaningful for analytics anymore. Also, don’t worry about the additional costs: even at the biggest instance size, you’ll only use up about 9,000 of your 266,667 free invocations allowed per month. If you use that much, you probably don’t need to warm your skill anyway.

2. Upgrade to Node.js 6.10

If you’re not a Node.js team, great … move on your merry way! But if you’ve got skills using Node.js 4.3, it’s time to upgrade.

Node.js 4.3 had several annoying bugs, but the worst among them is an OpenSSL bug that you won’t discover until you’re running under load in production. This little doozy will put your entire function into a bad state: SSL connections will fail intermittently without apparent cause, but only if you’re using DynamoDB. There’s a workaround, but mostly, just upgrade to Node.js 6.10.

3. Finish What You Started

In programming models that support async operations (here’s looking at you, Node.js), it’s possible, and sometimes easy, to finish your function and hand a response back to Alexa before everything has finished processing.

Async operations that get caught up in the Lambda freeze/thaw are absolute death. They’ll pop back to life in some later invocation, but will likely have timed out. The tell-tale sign is bizarre timeouts in CloudWatch with request IDs that correspond to requests issued hours earlier. These kinds of errors can often push libraries or OpenSSL into odd failure states that can only be resolved by forcing a redeploy, such as by resizing the function.

Notice that the request IDs of the logged events are not the same as that of the request they were reported in. These are echoes from a failed prior invocation. Very bad.

You don’t want these kinds of Heisenbugs. Look carefully for anything that executes async but is not on the logical path to completing the Lambda function. The usual culprits for us have been caching puts and analytics. Since they’re not critical to the skill logic, you may never notice that they didn’t finish. One such error led us to believe we were logging analytics for 6 months before we realized only about one in three data points were actually stored. Also avoid any async work done at startup, outside of the handler.

Also, while you shouldn’t rely on it, make sure that you’re using the callback rather than any of the functions on the context object to complete the invocation. The callback makes a best effort to complete in-flight async operations before freezing your function.

4. Set Timeouts Shorter

Most libraries for async operations come with long default timeout periods; in Node.js it is normally 2 minutes. With Alexa, if you’re not answering the user within 7.5 seconds, Alexa will respond with a failure message for you. A well-behaved skill should be much faster than that.

It’s much better to fail an async operation early and be able to tell the user that something is wrong in your own words rather than to get the dreaded, “There was a problem with the requested skill’s response” message. Also, debugging long-running calls that have been frozen, as mentioned above, is a huge pain.

In the case of interacting with an AWS service via the SDK, make sure you also set a low value for maxRetries, since each retry gets its own timeout and the delays add up.
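One generic way to fail early, sketched below, is to race the slow call against a timer; the 10 ms and 50 ms budgets in the test are illustrative, not prescribed values.

```javascript
'use strict';

// Sketch of failing an async operation early with Promise.race, so the
// skill can answer in its own words well inside Alexa's 7.5 second window.
function withTimeout(promise, ms) {
  let timer;
  const timeout = new Promise((resolve, reject) => {
    timer = setTimeout(() => reject(new Error('timed out')), ms);
  });
  return Promise.race([promise, timeout]).then(
    (value) => { clearTimeout(timer); return value; },
    (err) => { clearTimeout(timer); throw err; }
  );
}
```

When the slow call is an AWS SDK request, the SDK’s own knobs do the same job: in the v2 JavaScript SDK, `maxRetries` and `httpOptions.timeout` on the client constructor.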

5. Avoid Global State

Because of container reuse, it’s possible to stuff data into global memory and, with good probability, find it there on the next invocation. However, even though we’ve seen major tool libraries for Alexa leverage this, it should be strongly avoided. A deployment can be recycled at any time for many different reasons, and there is no guarantee that requests within the same session will be routed to the same deployment.

I have also seen some debate about whether it’s OK to store data in global state that will be used strictly within the same request. Normally in Node.js (e.g. in an Express-based website) this is a huge red flag, since that state could be clobbered by other interleaving requests. However, while I can’t find any documentation that guarantees it, in practice Lambda will not issue a request to an instance while another is in flight. Because of this, using global state within a single request is possible, but I wouldn’t rely on it; it’ll mean weird bugs if you migrate off of Lambda.

In short, avoid using global memory. At Fourthcast we use global state only as a first-level cache.
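The cache pattern can be sketched like this; `loadConfigFromStore` is a hypothetical stand-in for a slow fetch (say, from DynamoDB), and the key property is that correctness never depends on the cached value surviving.

```javascript
'use strict';

// Sketch of container-global memory used strictly as a best-effort
// first-level cache: a cache miss just falls through to the real fetch.
let configCache = null; // survives warm invocations, vanishes on recycle

function loadConfigFromStore() {
  // Hypothetical slow fetch; in reality a DynamoDB or S3 call.
  return Promise.resolve({ greeting: 'Hello' });
}

function getConfig() {
  if (configCache) {
    return Promise.resolve(configCache); // warm container: cache hit
  }
  return loadConfigFromStore().then((config) => {
    configCache = config; // populate for later invocations, if any
    return config;
  });
}
```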

6. Log Your Own Errors in CloudWatch

Most of our skills at Fourthcast catch any errors and return a user-friendly error message. If you do this, make sure that you log custom metrics to track these “soft errors”, since Lambda’s invocation error metrics won’t be relevant anymore. At Fourthcast we use a custom CloudWatch metric for soft errors, which allows us to attach an alarm and be alerted of high error rates.
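A sketch of what such a metric looks like: the parameter shape is the real CloudWatch `PutMetricData` API, while the namespace and metric name are hypothetical. With the AWS SDK, the object would be passed to `new AWS.CloudWatch().putMetricData(params)`.

```javascript
'use strict';

// Sketch: build the params for a custom "soft error" metric datum.
// Namespace and metric/dimension names are illustrative placeholders.
function softErrorMetric(skillName) {
  return {
    Namespace: 'Fourthcast/Skills',
    MetricData: [{
      MetricName: 'SoftErrors',
      Dimensions: [{ Name: 'Skill', Value: skillName }],
      Unit: 'Count',
      Value: 1
    }]
  };
}
```

An alarm on the `SoftErrors` sum then plays the role that Lambda’s built-in error metric no longer can.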

7. Give It Some Room

Run your skill at one of the higher memory levels. The low-memory instances are also allocated a smaller slice of processor and have very slow file and network IO. Many mysterious errors have cleared up simply because we gave a function more memory. With Lambda’s very generous free tier, you’re not likely to incur costs anyway, so go ahead, set it to 1536 MB.

Lambda is the perfect tool for hosting Alexa skills, but you’ve got to watch out for these pitfalls and “Lambda bites”. At Fourthcast, we’ve hosted all of our skills using Lambda, and don’t miss fiddling with servers at all.

Fourthcast is a service that takes your podcast, and turns it into an Alexa skill. 33 million people will own a voice-first device in 2017. What will they listen to?
Put your podcast on Alexa in just a few clicks! Get Started!

I’m a software developer, inventor, and US ex-pat in Nicaragua. Alexa is my passion, when I’m not hiking a volcano or on the beach. Check out fourthcast.com.