Lambda Internals — Part 2: Going Deeper

Exploring AWS Lambda Runtime Libraries

Gal Bashan
6 min read · May 14, 2018
Photo by Jim Beaudoin

Serverless development is simply the best. A couple of clicks, upload your code and you are done, right? Most people are more than happy to leave it at that. If you are not most people, and you are up for some Lambda exploration, this article is just for you.

In the previous article we got a shell into the Lambda container, downloaded the Lambda runtime environment and discovered its components:

  • bootstrap.py — the Python code wrapping our handler.
  • awslambda/runtime.so — a Python-compatible shared object that bootstrap.py uses for, well, pretty much everything.
  • liblambda*.so — runtime.so, in turn, uses other shared objects. We will focus on liblambdaruntime.so, which is in charge of the heavy lifting of managing the Lambda logic.

We also had some fun messing around with bootstrap.py. This time we are going to roll up our sleeves and dive into the binary libraries of the Lambda runtime environment. We will explore Lambda’s billing system and (spoiler alert) have some fun messing with Lambda timeouts.

“Oh, the places you’ll go! There is fun to be done! There are points to be scored. There are games to be won.” — Dr. Seuss. Photo by Joshua Earle

Exploring the Libraries

The libraries (liblambda*.so) are compiled with symbols, so you can learn a lot about them just by going over the symbol names. Also, runtime.so exposes many of these functions by importing and wrapping them, so a Python script (bootstrap.py in our case) can use some of them. How convenient!

Partial functions list from liblambdaruntime.so disassembly. Thank god for symbols.

One of the things I really wanted to check out from the start was what goes on behind the scenes of Lambda's billing system, and just by looking at the function names I already had some experiments in mind. But first — let's talk a bit about Lambda billing.

Lambda Billing

Lambda has a time-based pricing model. Without going into all the details, the gist of it is that the longer your Lambda runs, the more you pay (at the time of writing, duration is billed in 100 ms increments). When invoking a Lambda, you can easily spot its beginning and end in CloudWatch Logs, as well as its duration and billed duration.

CloudWatch logs for a Lambda. You can see both the Lambda’s duration and the billed duration

However, there is a more complicated scenario. Consider the following Lambda:
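
The snippet was embedded as a gist in the original post and is not reproduced here; below is a minimal sketch of the idea, with the expensive initialization simulated by a sleep at module load time (the handler name and the two-second delay are illustrative assumptions):

```python
import time

# Simulate expensive initialization (heavy imports, opening connections,
# and so on) that runs once, when the module is first loaded.
time.sleep(2)


def handler(event, context):
    # The handler itself does almost nothing, so on a warm container the
    # billed duration stays at the minimal 100 ms tier.
    return "Hello!"
```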

On a typical run, the duration of this Lambda should be small (the billed duration should almost always be 100 ms). But what happens on the first invocation? Or on cold starts (where the module is re-imported)?

Lambda logs when a cold start occurred. The duration is much higher than a regular invocation

Empirical tests show that the duration of the first Lambda invocation (or of a cold start) includes the initialization duration. But I wanted to check how Lambda implements this.

Importing the Libraries

In bootstrap.py, there are calls to the following functions, imported from the binary libraries (a rough sketch of how they fit together follows the list):

  • lambda_runtime.receive_start() or lambda_runtime.receive_invoke() — when a new trigger is received.
  • lambda_runtime.report_done() — whenever a Lambda is done.
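
Based only on these names and on the behavior described below, the wrapping flow in bootstrap.py can be pictured roughly as in the sketch below. This is a paraphrase, not the actual AWS source; the import, signatures and return values are assumptions for illustration:

```python
# Rough paraphrase of the bootstrap.py flow, reconstructed only from the
# imported function names above. NOT the actual AWS source; the real
# signatures and return values differ.
import runtime as lambda_runtime


def main_loop(handler):
    # Assumption: the first trigger arrives via receive_start(), later
    # ones via receive_invoke(), and both return the invocation payload.
    event = lambda_runtime.receive_start()
    while True:
        handler(event)                            # run the user's handler
        lambda_runtime.report_done()              # tell the slicer this execution finished
        event = lambda_runtime.receive_invoke()   # block until the next trigger arrives
```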

Now might be a good time to give some more details about the slicer I referred to in the previous article. The slicer is the Lambda component in charge of allocating runtime to the different user Lambdas running on the container. The functions above notify the slicer (and other Lambda management components) when an execution is done, or receive information about newly initiated executions.

So after identifying the calls to lambda_runtime and learning what the slicer is, there was something I just HAD to try: importing the runtime library myself and having some fun with it! (These experiments are how I found out most of what I know about the slicer, mostly by reading the disassembly plus some trial and error.) The test I want to share with you is also the first one I attempted: calling lambda_runtime.report_done() from inside my Lambda. This is the code I used:
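
The original gist is not reproduced here; the following is a minimal reconstruction based on the behavior described next. The import path and the no-argument call to report_done() are assumptions borrowed from how bootstrap.py loads the library, and may differ between runtime versions:

```python
import time

# Assumption: /var/runtime/awslambda is on sys.path inside the container,
# the same way bootstrap.py sees it; the exact import path and signature
# may differ between runtime versions.
import runtime as lambda_runtime


def handler(event, context):
    print("Beginning")
    lambda_runtime.report_done()   # tell the slicer we are "done"
    time.sleep(1)                  # give the slicer a chance to pause us here

    print("After first done")
    lambda_runtime.report_done()
    time.sleep(1)

    print("After second done")
    return "Done"
```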

The surprising thing I found was that when running this example, my code stopped after printing only “Beginning”. Then, when I triggered my Lambda again, it resumed its execution exactly where it had left off — and printed “After first done”! (I added the sleep because sometimes my Lambda managed to squeeze in one more print before the slicer paused it.) This happened again and again until the Lambda execution ended.

CloudWatch logs for the Lambda execution. Notice we have several request IDs for the same Lambda!

So this settled it for me — the slicer bills us for as long as our Lambda gets CPU time. That means our billed duration is made up of two parts:

  1. Module initialization time (only on first invocation / cold start)
  2. Our actual function duration

Avoiding Lambda Timeouts

Besides being very cool, this discovery has a practical (well… practical is in the eye of the beholder, but it is definitely interesting) use: handling Lambda timeouts! Consider the following Lambda:
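
As before, the gist itself is not reproduced here; this is a sketch reconstructed from the description that follows, under the same import assumption as the previous example:

```python
import time

import runtime as lambda_runtime  # same import assumption as the previous sketch


def handler(event, context):
    print("Remaining time:", context.get_remaining_time_in_millis())

    # "Finish" the invocation as far as the slicer is concerned; execution
    # pauses right here until the next trigger arrives.
    lambda_runtime.report_done()
    time.sleep(1)

    # We resume here on the next invocation, still holding the *old*
    # context object, so the remaining time it reports is 0 even though
    # the new invocation got a fresh timeout.
    print("Remaining time:", context.get_remaining_time_in_millis())
    return "Done"
```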

I triggered the Lambda once, and it stopped at the report_done() call (line 13 of the original snippet). Then I waited some time and re-triggered it. The result was that the remaining time returned by the context object was 0, but the Lambda did not time out! The Lambda's timeout was reset because this is a different invocation, so we have now doubled our Lambda's timeout (and our AWS bill, of course)! A useful case for this might be, for example, a loop that processes many records and sometimes times out. We can now check whether we are approaching a timeout, and if so call lambda_runtime.report_done() and wait for the next trigger to pick up execution exactly where we paused (a sketch of this idea follows the log below)!

The CloudWatch log from the Lambda invocation. Remaining time: 0
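
To make the record-processing idea concrete, here is a hypothetical sketch; the safety margin, the event's "records" field and the process() helper are all made up for illustration:

```python
import time

import runtime as lambda_runtime  # same import assumption as above

# Hypothetical safety margin, made up for illustration.
TIMEOUT_SAFETY_MARGIN_MS = 10 * 1000


def process(record):
    # Placeholder for the real per-record work.
    pass


def handler(event, context):
    for record in event.get("records", []):
        if context.get_remaining_time_in_millis() < TIMEOUT_SAFETY_MARGIN_MS:
            # Approaching the timeout: "finish" this invocation and let the
            # next trigger resume the loop exactly where we paused.
            # Caveat (see above): after resuming, the old context object
            # reports 0 remaining time, so a real implementation would need
            # to track time differently once it has been suspended.
            lambda_runtime.report_done()
            time.sleep(1)
        process(record)
```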

Another thing that occurred to me while working on this is that AWS could offer a real feature based on this behavior, where a user can suspend a Lambda and resume from the same location on the next invocation. This could be useful for more than just processing large amounts of data and handling timeouts midway. Another use case could be, for example, suspending your Lambda while waiting for expensive I/O or some other task to complete, instead of paying for your Lambda's idle time! Will they do it? Don't know. Is that ultra cool? Defo.

There is a downside to all of this, though. Since this is a hack, the next two invocations of the Lambda will fail with an Amazon internal error. I am sure one could resolve this issue as well with a little effort, but for now, this was good enough for me.

Conclusion

We have learned a lot about AWS Lambda internals. We explored the binary libraries in the runtime environment and the Lambda billing system. We also imported the Lambda runtime library ourselves and used it to handle timeouts! However, there is still much to be discovered, on AWS and other vendors alike. I am looking forward to the next challenges; if you have any requests, let me know!

I have also updated the open source library containing the different experiments I conducted. Hope you will find it useful!

Here at Epsagon we develop a monitoring tool tailor-made for serverless applications. Using serverless and interested in hearing more? Visit our website!
