Reviewing the AWS Lambda nodejs10.x runtime

Last week AWS announced official support for Node.js v10 on Lambda, which is great! Or at least, it will be once it stabilizes a bit… Here’s what I found after digging into the code.

Background

I run the docker-lambda project which allows you to execute a docker container that’s a replica of the Lambda environment (also used in the AWS SAM CLI and Serverless framework among others) — so I have a detailed understanding of the Lambda runtimes and the code they run on. I also maintain a custom runtime implementation of Node.js v10 (and v12), which we’ve used for hundreds of millions of invocations at BDG. All this to say when I looked under the hood at the code of the official v10 runtime, I have some idea of what it should look like, and it gives me the impression you might want to hold off using it for now.

The code

The code which calls your Lambda function can be found in /var/runtime on a nodejs10.x Lambda, which is laid out like this:

bootstrap is a shell script that launches node, spawned by /var/rapid/init in the same way that custom runtimes are. It sets up the NODE_PATH env var and is mostly unremarkable except that it currently contains a little bug:

The bug being that the environment variable that it’s launched with on Lambda is actually AWS_LAMBDA_FUNCTION_MEMORY_SIZE, not AWS_LAMBDA_FUNCTION_MEMORY_LIMIT — so that bit of logic to setup the node args of --max_semi_space_size and --max_old_space_size does nothing, and the args aren't passed at all (unlike the other 8.10, 6.10, etc runtimes that do pass them). The AWS team are aware of this and I've been told a fix should be pushed out soon.

Onto index.js!

The log patching done by enableLogPatch() is similar to the log patching done in the previous 8.10, etc runtimes — but not quite the same. As with the other runtimes, it patches the native console.(log|error|warn|info) functions to prepend tab-separated metadata to each call (not a huge fan of patching core functions like this, but Node.js Lambdas have done this since 0.10). The nodejs10.x runtime inserts another field though (for log level, INFO/ ERROR/ WARNING) so any existing parsers will break (and indeed the AWS web console doesn't expand JSON objects in the same way anymore).

The mixed quote styles (some single, some double) put me off a bit — easy to setup a linting rule for that. This continues throughout the codebase with no pattern that I can see — both 'utf-8' and "utf-8" appear in RAPIDClient.js. Unlikely to cause bugs, just rare to see in well-written JavaScript code these days and a sign that perhaps there's no linting tool in use here (or that it's very light on rules — some other inconsistencies in the codebase like !== vs != for non-nulls, typeof(x) vs typeof x, {x} vs { x }, etc).

Given there’s an uncaughtException handler, an unhandledRejection handler could also be added — it would clean up some of the Promise error handling code as we'll see later on.

The main issue I have with this entrypoint though is the final process.on('beforeExit', ... line - that is, it waits for the event loop to be empty before processing the user's handler function. This is intended to deal with some complex callback logic we'll see later, but it means the handler function won't be called until any top-level async work has completed. In certain circumstances it means that the handler function won't be called at all!

Let’s say we had this handler file:

The setInterval call will start running as soon as the handler file is required by the UserFunction.load() call above — but it also means that beforeExit will never fire, because the event loop will never be empty. If you try this out on Lambda, your handler will never be called and will just timeout eventually. AWS has acknowledged this issue and a fix should be on its way.


This is basically the beginning of the awkward async state-machine dance that happens in the runtime, continuing with the runtime.scheduleIteration() call:

The setTimeout(..., 0) is there (presumably) to schedule this function at the start of the next event loop (I must admit it's unclear to me why this is necessary). Arrow functions pass this, so there's no need for the that = this dance. You also don't need to pass functions to then if you don't need them, so you could just use null as the first argument instead of allocating an empty function. Or you could just get rid of the whole thing and let unusual errors like this be handled by unhandledRejection.


We’ve already waited for beforeExit, now we've waited for a setTimeout(), finally we get to call the handler:

It sets up the context and callback, calls the user’s handler function and then deals with the result. At this point I should point out the callbackWaitsForEmptyEventLoop property that exists on the Lambda context. IMO this is a very unfortunate legacy — for some reason when callbacks were initially introduced in Node.js Lambdas (the original API was just context.(done|succeed|fail)) AWS decided that by default, instead of returning from your Lambda function when you call the callback, only return when the event loop is empty. Why? I'm not sure — maybe to deal with poorly written callback code? It's certainly very un-Node.js-like in any case. To turn off this unusual behavior you explicitly need to set context.callbackWaitsForEmptyEventLoop = false.

So with the code above, if the result isn’t a Promise and callbackWaitsForEmptyEventLoop is false, then callbackContext.succeed() is called immediately with the result of the handler() and anything passed to the callback asynchronously (which is the usual way callbacks are called) will be ignored completely.

Putting it in plainer terms, if you had a Lambda function like this:

It would be completely broken. As intended, it calls the callback with the result it wants to return in the Lambda (the handler function itself returns nothing, ie, it implicitly returns undefined). The problem is, handleOnce() will use the handler function's return value as the result and call callbackContext.succeed(undefined) before the callback is even handled — this is eventually stringified into "null" and that will be the result of the Lambda invocation, not { success: true } as we wanted. This has caused some other bugs yet to be acknowledged by AWS.


The way the callback function is created in CallbackContext.build in CallbackContext.js also bears scrutiny:

Looking at that last comment we can immediately see why the buggy beforeExit handler was setup in index.js — it's to cover cases where waitForEmptyEventLoop is true (which maps to context.callbackWaitsForEmptyEventLoop).

But the more worrying thing that jumps out to me about this code is that it appears to be just executing asynchronous commands ( postError(), scheduleNext()) one after the other, as if they were synchronous.

Popping open RAPIDClient.js (where the client.post...() functions reside) shows this (trimming mine):

Lots of alarm bells going off in my head with this code:

  • Errors will be rethrown in a different context, so can’t be handled by any caller of this function.
  • There’s no checking of the HTTP status code of the response (as there is in other runtime implementations), so if there’s a problem handling the response, the user wouldn’t know.
  • There’s no callback and nor does it return a Promise — so there's no way to know when this call has completed — the no-op response handler reveals as much. This means that multiple calls to this function in succession could very well end up overlapping, with HTTP calls potentially ending up out of order if some are sent faster than others.

Now you could argue that this would be less worrisome if it were used for something non-critical, like pushing logs out-of-band or similar — but in this case it’s used to post back the response of your Lambda function. No wonder the scheduling code with setTimeout() etc is complicated! If you ever write an HTTP client in Node.js, please don't write it like this.

How should it be written then?

I honestly think the best way to write easy-to-understand async-heavy code in Node.js these days is to use async/await and Promises.

In the specific case of needing to cover the legacy behavior of callbackWaitsForEmptyEventLoop, context.done(), etc, Promises give you a way of ensuring that only one of these mechanisms “win” each invocation. Coupled with async/await , they also allow you to build up an event processing loop that’s easier to understand than using events and callbacks (or not even using those!).

The entire processing and callback logic from node-custom-lambda, supporting all legacy use cases, can be boiled down to:

By wrapping the user’s handler in a Promise, we can always rely on awaiting it, which vastly simplifies the surrounding logic. The only awkwardness comes from the handling of callbackWaitsForEmptyEventLoop where we need to restart the loop on beforeExit, but there’s little way around that and it’s vastly simpler than many of the dead-ends in the (current) official runtime.

Conclusion

I’m hoping in fixing some of the bugs users have been finding in the runtime, that AWS will use it as a reason to also simplify some of the bootstrap code.

Until then, I’d hold off on updating to this runtime — especially if you currently rely on the legacy callback behavior (I’ve seen fewer issues if you’re only using Promises).

Nothing like thousands of users hammering your code to get it into shape huh? 😉