AWS Lambda Cold Start Time: From Infinity to Zero
Acknowledgments
This article would not be possible without the valuable input and help from Bryan Kimball. Thank you Bryan.
Synopsis
The Serverless Lambda Cold Start issue is a huge pain for many, and the usual mitigation approaches are costly.
Being mostly a .Net developer myself, I was ignorant of the runtime features AWS offers for runtimes other than .Net and Node.
Until one day, I finally looked at Java runtime and discovered the SnapStart feature there. And just a few seconds later this:
I felt very “offended”, “discriminated against”, and “frustrated” because my favorite framework, .Net, could not take advantage of the improvements in the startup performance outlined. 🙁
But, like many engineers out there, the best way to get us moving is to tell us that something is not going to work or it is impossible — and we will try it right away.🙂
After multiple failed attempts using different technologies such as Custom Runtimes, CRIU, Firecracker, and unlimited combinations thereof, the solution was just a few dozen Docker builds away.
Without further delay, let us explore the solution together. (experienced AWS users may consider reading this article backward from the bottom up)
The Prerequisites
To follow this demo one will need to have the following:
- git client
- Docker Engine
- An active AWS account with permissions to deploy and execute Lambdas and access to CloudWatch logs.
- A text editor that can connect to a running docker container and do basic file changes. Visual Studio Code with the Dev Containers extension can be a great choice for many. Or if you are familiar with Vim, it will be pre-installed when we build the image.
At the time of publishing, the demo git repository supports only Amazon Linux 2023 x86_64 and dotnet 8.0 as a build runtime. It can easily be adjusted to support arm64 and other frameworks, e.g. Python, Node, etc. (help from respected interop gurus is very welcome)
The Demo
For simplicity: please follow the steps in copy and paste fashion.
Step 1: Clone The Demo Repo (OS shell)
git clone -b article https://github.com/gitrealname/aws-lambda-snapper.git
The cloned branch is “article”, which is kept free of potential future contributions to main.
cd aws-lambda-snapper
You may consider dropping your AWS credential files into the .aws folder, so that they are available for the following steps, or you can add them later when you connect to the running container. (please, don’t forget to do it)
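If your credentials live in the default location, copying them in could look like the sketch below (the paths are the usual AWS CLI defaults and are an assumption; adjust to your setup):

```shell
# Copy the default AWS credential files into the repo's .aws folder so the
# container can pick them up. Paths are the common defaults; adjust as needed.
mkdir -p .aws
if [ -f "$HOME/.aws/credentials" ]; then
  cp "$HOME/.aws/credentials" .aws/
fi
if [ -f "$HOME/.aws/config" ]; then
  cp "$HOME/.aws/config" .aws/
fi
```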
Step 2: Build Container Image (OS shell)
docker build -t snapper:latest .
Step 3: Run The Container With The Shell (OS shell)
docker run --rm --name snapper-container -it snapper /bin/bash
Step 4: Build and Deploy Lambda (container shell)
cd ~/SnappedNetLambdaTest
sam build
sam deploy
Don’t exit the container shell, we will need to return to it shortly!
Step 5: Test Lambda (web browser)
Open the internet browser of your choice, sign in to the AWS Console, and navigate to the Lambda home page. (The URL may differ depending on your AWS account's region; please adjust as needed, and apply the same to every AWS link in this article.)
- Find and “step into” the deployed lambda
- Navigate to the Test tab and click the Test button
- Click the “logs” link; it should open the CloudWatch log group for the lambda.
- Click on the latest log stream and observe
We “fired” the lambda, it went through the bootstrap (Cold Start), and executed the request handler.
Please note the bootstrapping time (~5 secs), we will need this info for later comparison.
Now, I would like you to step aside, take a cup of coffee, or take a smoke break (pick your poison🙂). It is harder to keep lambdas Hot, but really simple to get them to Cool Down. The only thing needed is a little time, about 5 mins in total.
Are you back? … Great! Let’s continue:
Click Test again on the lambda page, and observe the CloudWatch log for the execution (step into the latest log stream). You should notice the same “Cold Start” pattern.
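If you prefer the terminal over the console, the same logs can be tailed with the AWS CLI (the function name below is a placeholder — substitute the one `sam deploy` printed):

```shell
# Tail the lambda's CloudWatch log group from the terminal (AWS CLI v2).
# FUNC is a placeholder; use the function name reported by `sam deploy`.
FUNC="SnappedNetLambdaTestFunction"
tail_lambda_logs() {
  # Streams recent log events, including the INIT and REPORT lines
  aws logs tail "/aws/lambda/$1" --since 15m
}
# Run only when AWS credentials are actually configured:
if aws sts get-caller-identity >/dev/null 2>&1; then
  tail_lambda_logs "$FUNC"
fi
```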
Step 6: Build The “Magical” Lambda Layer (OS shell)
- Start a new native OS shell, keeping the container shell that we started back in Step 3 running
- cd (navigate) into the repo folder and execute:
docker build --output . --target copytohost .
Once complete, you should see the following in the bin folder.
Step 7: Deploy The Layer (web browser)
- Duplicate Internet browser’s Lambda tab (see step 5)
- Navigate to S3, create or select a bucket of your choice, and upload the generated layer file.
- Select the uploaded file and click Copy S3 URI
- Navigate to Lambda > Layers and create a new layer (pick your own name and description), then paste the S3 URI from the clipboard. The other configuration parameters must match the picture.
- Select and copy into the clipboard the latest Version ARN value (it can be found at the bottom of the screen).
- Switch back to the Lambda tab; on the lambda’s Code tab, in the Layers section, click Add a layer
- Choose Specify an ARN, paste the Version ARN into the respective field, and click Add
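The console clicks above can also be scripted with the AWS CLI. A sketch, assuming placeholder bucket, key, layer, and function names — substitute your own:

```shell
# Publish the layer from S3 and attach it to the function.
# BUCKET, KEY, and FUNC are placeholders - substitute your own values.
BUCKET="my-layer-bucket"
KEY="snapper4net-layer.zip"
FUNC="SnappedNetLambdaTestFunction"
deploy_layer() {
  # Publish a new layer version and capture its Version ARN
  ARN=$(aws lambda publish-layer-version \
    --layer-name snapper4net \
    --content "S3Bucket=$BUCKET,S3Key=$KEY" \
    --compatible-architectures x86_64 \
    --query LayerVersionArn --output text)
  # Attach the freshly published version to the function
  aws lambda update-function-configuration \
    --function-name "$FUNC" --layers "$ARN"
}
# Invoke only when AWS credentials are actually configured:
if aws sts get-caller-identity >/dev/null 2>&1; then
  deploy_layer
fi
```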
Step 8: Let The Fun Begin (web browser)
All the hard work is behind us, and we are about to see its results. Before we continue, there is one more VERY IMPORTANT!!! step while on the lambda’s web page:
Under the Aliases tab, click on the live link
Step 9: Small Adjustments To Lambda’s Deployment Configuration (OS/container shell)
Use Text Editor connected to the running container (or vim inside of the container shell).
For Visual Studio Code users:
Click the connect button (at the bottom left corner), choose “Attach to Running Container…”, and select “snapper-container” (expect a new window to pop up). Once it loads, go to File -> Open Folder… and select “/root”
Make the following adjustments to ~/SnappedNetLambdaTest/template.yaml (the same as /root/SnappedNetLambdaTest/template.yaml):
- Comment out line 18 (Runtime: dotnet8)
- Uncomment lines 19–21
- Uncomment line 28 (AWS_LAMBDA_EXEC_WRAPPER: /opt/snapper4net_wrapper.sh)
- Save the file
The final result should look like the following:
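As a rough sketch (property names and values below are assumptions based on the steps above — this is not the repo’s literal template.yaml), the edited section might resemble:

```yaml
# Sketch only - not the repo's literal template.yaml; values are assumptions.
Resources:
  SnappedNetLambdaTestFunction:
    Type: AWS::Serverless::Function
    Properties:
      # Runtime: dotnet8                  # commented out in Step 9
      Runtime: java21                     # a SnapStart-capable managed runtime (assumption)
      SnapStart:
        ApplyOn: PublishedVersions        # tells AWS to snapshot the VM on publish
      AutoPublishAlias: live              # why the "live" alias from Step 8 exists
      Environment:
        Variables:
          AWS_LAMBDA_EXEC_WRAPPER: /opt/snapper4net_wrapper.sh
```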
Step 10: Build And Deploy (container shell)
Please make sure your current directory is ~/SnappedNetLambdaTest and execute:
sam build && sam deploy
This time the deployment takes a little longer than usual.
Step 11: Observe Cloud Watch Logs (web browser)
Refresh the CloudWatch tab and navigate into the lambda’s log group (in case your browser is still pointing to the previously observed log stream)
Even though we have not explicitly executed the lambda, we will see a new stream record. Let us look inside:
You should see a few new messages with the SNAPPER-WRAPPER prefix (compared to the previous log stream). Otherwise it looks very similar, including the execution time (~5.5 seconds), and the lambda went through a Cold Start (it ran the bootstrapping phase). The “Request” messages are absent as well. (remember, we have not made any requests yet)
It’s time to take another 5–6-minute break. (to let the Lambda cool down)
SURGEON GENERAL’S WARNING: Cigarette Smoke Contains Carbon Monoxide 🙁
Step 12: Magic Is About To Happen (web browser)
Switch to the active Lambda browser tab; refresh the page (you should still be on the live alias !!!); and on the Test tab click the Test button.
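Alternatively, the live alias can be exercised from the terminal (the function name is a placeholder; the key point is the `live` qualifier, which carries the snapshotted published version):

```shell
# Invoke the lambda's "live" alias directly - the alias is what points at
# the published (snapshotted) version. FUNC is a placeholder.
FUNC="SnappedNetLambdaTestFunction"
invoke_live() {
  aws lambda invoke --function-name "$1" --qualifier live /tmp/out.json \
    && cat /tmp/out.json
}
# Invoke only when AWS credentials are actually configured:
if aws sts get-caller-identity >/dev/null 2>&1; then
  invoke_live "$FUNC"
fi
```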
Switch to the Cloud Watch log page; refresh it and you should see a new stream record. Let us see inside:
BINGO! (pay attention to highlighted areas; compare with the previous result ) (re-posted below for your convenience)
Restore time is not constant and can fall anywhere between roughly 300 and 1000+ milliseconds. From my observation, it seems to be lower under heavier loads. I have not yet had a chance to experiment with provisioned lambda memory (I speculate that it may be the biggest factor in restore speed) nor with dynamically allocated memory.
The Technical Discussion
What the heck just happened?
You might have noticed already that there is no (unfortunately 🙁) ”Magic” in the layer. It just contains the desired runtime (.Net in this case) and a few extra pieces to get the runtime running. Please see how that layer is built here.
And also I don’t expect the companies to allow “Magic” on their premises anyway 🙂. Everything must be legitimate, and we do comply 🙂.
The second part is the lambda wrapper script. You can read about wrappers on the official AWS site.
And the whole flow looks like the following:
- AWS fires up the VM (I presume Firecracker in this case) and attempts to execute the JVM.
There is one more intermediate stage/step involved before it executes the JVM etc. I skipped it for simplicity. You can find an example of one for the .Net AWS runtime here.
- Our wrapper intercepts the call and executes the .Net framework executor/host (see snapit() and call_net() functions)
- Which in turn, loads and executes the Runtime Delegator
- Finally, the Runtime Delegator loads the standard and official AWS lambda Runtime, which works all the way thereafter.
Once the lambda bootstraps, the AWS runtime calls Next, and AWS takes a snapshot of the entire VM. It then uses that snapshot every time a lambda instance needs to get Hot, instead of re-running the bootstrap code.
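For illustration, a minimal exec-wrapper could be shaped like the sketch below. This is not the repo’s actual snapper4net_wrapper.sh; the paths, file names, and messages are made up. The mechanism is real, though: Lambda invokes the wrapper named in AWS_LAMBDA_EXEC_WRAPPER with the original runtime command as its arguments, and the wrapper decides what to exec instead.

```shell
#!/bin/bash
# Sketch of an AWS_LAMBDA_EXEC_WRAPPER script. Lambda passes the original
# runtime command (interpreter + args) as "$@"; we intercept it here.
msg="SNAPPER-WRAPPER: intercepted: $*"
echo "$msg"
# A real wrapper would hand control to the .Net host at this point, e.g.
# (hypothetical paths - the actual layer layout will differ):
if [ -x /opt/dotnet/dotnet ]; then
  exec /opt/dotnet/dotnet /opt/RuntimeDelegator.dll "$@"
fi
```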
There is only one small part missing compared to Java-specific lambdas: the Runtime Hooks. They can be easily simulated, but I found that it is not worth doing, as it adds around ~1.5 seconds to the restore time. But just so you know, there is a secret invocation endpoint, restore/next (see around line 68)
This is all there is to it.
Final Words
To My Fellow Long Time Cold Start “Sufferers”
No comments 🙂
To Data Scientists And AI Enthusiasts
This can open a new era for AI model hosting. As long as your model fits within 10 GB and doesn’t require GPU for inference, this approach may save you a lot of money! (I wonder if we could fit GPT-4 into a lambda... what about GPT-3 or at least GPT-2 in a worst-case scenario? 😉).
When I have time, I plan to implement Python runtime support. However, I hope that someone more experienced in Python interop will do it sooner. Contributions are welcome. Thanks.
To Amazon and Microsoft
It would be nice if this feature were available out of the box to save many people the hassle, even if offered under the “experimental” umbrella.