Improving Cloud Function cold start time
Any time you are working with responsive cloud computing systems, cold-boot performance is something you’re going to have to consider. As we saw with Compute Engine, Container Engine, and App Engine, any time a user request comes into a cloud resource, it runs the risk of needing a boot from scratch, or cold boot. This is a problem because the response time goes up in these scenarios.
That’s why, when CAPITAL FUN approached me, asking about how to improve their cold-boot performance for Cloud Functions, I wasn’t surprised at all.
When does a GCF cold boot occur?
Functions are considered a “serverless” architecture, but they can still be at the mercy of cold-start times. The best way to figure out when you’re going to pay the overhead of a cold boot is to consider the two situations you can face when calling your serverless function:
The “Hot” scenario: your function has run recently, and the same machine instance on which it last ran is available to run it again. This results in the shortest round-trip time.
The “Cold” scenario: the next call to your function will result in a new instance being instantiated to run your serverless function. This can happen for a handful of reasons:
- Your function has been deployed, but not triggered yet.
- Your function has been idle long enough for the function provider to tear down the resources used on the last call.
- Your function is auto-scaling to handle capacity, and creating a new instance.
In general, “Hot” functions already exist and are ready to handle your requests as fast as possible. “Cold” functions, on the other hand, have a longer response time because function resources must be provisioned before the code can execute.
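One way to see which scenario a given call hit is to rely on the fact that module-scope code runs once per instance, while the handler body runs on every call. The following is a minimal sketch, not CAPITAL FUN’s code: the handler name `coldAwareHandler` and the Express-style `(req, res)` signature are assumptions for illustration.

```javascript
// Module scope runs once per instance, i.e. once per cold start.
let isColdStart = true;
const instanceStartedAt = Date.now();

// Hypothetical HTTP-style handler; the (req, res) shape is an assumption.
function coldAwareHandler(req, res) {
  const wasCold = isColdStart;
  isColdStart = false; // every later call on this instance is "hot"
  res.send({
    cold: wasCold,
    instanceAgeMs: Date.now() - instanceStartedAt,
  });
}

module.exports = { coldAwareHandler };
```

Logging the `cold` flag alongside the response time makes it easier to separate cold-boot latency from ordinary request latency when you profile.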
In order to get a baseline for the difference between these two, I created a new completely blank GCF, deployed it, and pinged it a few times. In the image below, you can see the red area is the “Cold” boot time for the function (12ms), while the blue box is the general response time for the “Hot” scenario (4ms).
Dependencies & cold-boot performance
Hands-down, the #1 contributor to GCF cold-boot performance is dependencies. When a module boots, Node.js resolves all require() calls from the entry point using synchronous I/O, which can take a significant amount of time if the number of dependencies is large, or if the content itself requires a lot of linking.
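To get a rough feel for what each require() costs, you can time the calls individually. This is a hedged sketch rather than the profiling setup used here: `timeRequire` is a hypothetical helper, and the built-in modules `crypto` and `http` are only stand-ins for your real dependencies.

```javascript
// Times a single synchronous require() call, in milliseconds.
function timeRequire(moduleName) {
  const start = process.hrtime.bigint();
  const loaded = require(moduleName);
  const elapsedMs = Number(process.hrtime.bigint() - start) / 1e6;
  return { moduleName, elapsedMs, loaded };
}

// 'crypto' and 'http' are built-in stand-ins; substitute your own dependencies.
for (const name of ['crypto', 'http']) {
  const { elapsedMs } = timeRequire(name);
  console.log(`${name}: ${elapsedMs.toFixed(2)} ms`);
}
```

Only the first require() of a module in a process is meaningful, since Node.js caches modules after they load; run the measurement in a fresh process to approximate cold-boot conditions.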
So when I saw that CAPITAL FUN’s GCF module had 50 Dependencies (!!!!) I knew right where the problem was, and our profiling proved my point:
Getting this back down to something manageable took a few steps, and a lot of negotiation.
Step 1 : Trimming dependencies
As a first step, we worked with CAPITAL FUN to trim out dependencies. Including a module may pull in a set of other modules. It’s critically important to understand what sub-linking is occurring, as it can change the startup performance of your function significantly. After a lot of negotiation, digging, and spaghetti code cleaning, we were able to get down to 8 packages, including google-cloud. This resulted in a ~2x improvement in cold-boot performance.
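One way to see how many files a single require() actually drags in is to compare `require.cache` before and after the call. The sketch below is illustrative only: `modulesLoadedBy` is a hypothetical helper, and the two throwaway files written to a temp directory stand in for a real package and its sub-dependency.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Returns the file paths newly added to require.cache by loading `id`.
// Built-in modules do not appear in require.cache, so they are not counted.
function modulesLoadedBy(id) {
  const before = new Set(Object.keys(require.cache));
  require(id);
  return Object.keys(require.cache).filter((file) => !before.has(file));
}

// Demo with two throwaway modules: a.js requires b.js.
const dir = fs.mkdtempSync(path.join(os.tmpdir(), 'deps-'));
fs.writeFileSync(path.join(dir, 'b.js'), 'module.exports = 1;');
fs.writeFileSync(
  path.join(dir, 'a.js'),
  'module.exports = require("./b.js") + 1;'
);

const loaded = modulesLoadedBy(path.join(dir, 'a.js'));
console.log(`${loaded.length} files loaded:`, loaded.map((f) => path.basename(f)));
```

Running the same helper against each top-level dependency in a fresh process gives a quick ranking of which packages are worth negotiating over first.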
Step 2 : Using the dependencies cache
One of the factors in dependency resolution is whether or not the dependency already exists in the cache. The dependency cache is shared across GCF users and deployments, so the most popular versions of a module can be reused widely.
Looking at CAPITAL FUN’s dependencies, their version numbers were all over the place. In some cases, their system was changing the version number between deployments, depending on various code and staging factors.
To address this, we unified the production code to use the most popular module versions requested by external users, since these versions are expected to be present in the dependency cache, which makes deployments faster.
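Keeping versions from drifting between deployments is easier if nothing in package.json uses a floating range. The check below is a sketch under that assumption: `findUnpinned` is a hypothetical helper, the package names and versions are made up for illustration, and the check cannot tell you which version is the most popular one, only which entries are not pinned to an exact version.

```javascript
// Matches an exact version such as "1.2.3", optionally with a prerelease or build tag.
const EXACT_VERSION = /^\d+\.\d+\.\d+(?:[-+][0-9A-Za-z.+-]+)?$/;

// Returns "name@spec" for every dependency whose spec is not an exact version.
function findUnpinned(dependencies = {}) {
  return Object.entries(dependencies)
    .filter(([, spec]) => !EXACT_VERSION.test(spec))
    .map(([name, spec]) => `${name}@${spec}`);
}

// Hypothetical package names and versions, for illustration only.
const exampleDeps = {
  'example-lib-a': '^1.4.0',
  'example-lib-b': '2.0.1',
  'example-lib-c': 'latest',
};

console.log(findUnpinned(exampleDeps));
```

In practice you would pass in the `dependencies` object read from your own package.json and fail the build if the returned list is not empty.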
Step 3 : Lazy-loading
Digging further through the Node.js code for CAPITAL FUN, we realized that not all the dependencies needed to be loaded at the boot phase of the function. Although it’s a non-standard practice, we’ve seen before how putting require() calls within the function body, so that modules are pulled in only when necessary, can help cold-boot performance.
This allowed us to require only what is needed at start, and then switch to an async version of require later on for specific request handling, bringing down the cold boot times even further.
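A minimal sketch of the lazy-loading idea follows. It uses a plain synchronous require() inside the function body rather than an async variant, and everything named here is an assumption for illustration: the handler `lazyHandler`, the `compress` query parameter, and the built-in `zlib` module standing in for a heavy dependency that only some requests need.

```javascript
// Nothing heavy is required at module scope, so the cold boot stays cheap.

// Hypothetical handler; the Express-style (req, res) shape is an assumption.
function lazyHandler(req, res) {
  if (req.query && req.query.compress === '1') {
    // Loaded on first use only; Node.js caches the module for later calls.
    const zlib = require('zlib');
    res.send(zlib.gzipSync('hello').length);
    return;
  }
  // The common path never pays for loading zlib.
  res.send('ok');
}

module.exports = { lazyHandler };
```

The trade-off is that the first request needing the lazily loaded module pays the load cost instead of the cold boot, so this fits best for dependencies used by a minority of requests.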
The fix is in.
Working with CAPITAL FUN, we were able to reduce the number of dependencies, align the version numbers to increase dependency caching, and then lazy-load certain modules. The result? Going from a 20 sec cold-boot time to 1.83 sec, which is amazing considering their warm-response time is around 800 ms or so.