GAE Startup time and the dependency problem
I was grabbing lunch last week with a good developer friend of mine who used to work in the games industry. She was lamenting the limits of Google.com as a calculator:
“Google.com is one of the easiest calculators I’ve ever used. But I’ve been hitting its limits a lot lately. It won’t even compute the factorial of 1024.”
The result was that she created her own version of the Google.com calculator (even copying the CSS/HTML from the main site) called “Platy-Calc”. Her version, however, ran on App Engine and evaluated the submitted math expressions using the Python runtime (which can easily calculate the factorial of 1024).
To me, this was brilliant, and I started using Platy-Calc as my primary calculator immediately.
However, I started noticing that the first request of the day was always a little slower than subsequent requests. Seeing as I’ve been talking about cold-boot time a lot lately, I reached out to my friend and asked to do some profiling on the application.
An initial profile
A first profile of Platy-Calc’s cold-boot time showed pretty much what I was expecting: higher-than-normal cold-startup time.
The code for Platy-Calc was pretty simple: there are no global variables, no locking issues, and no database calls in the startup path. The only thing happening globally at import time that differed from a normal “Hello World” was these imports:
Given that we’ve already timed a simple “Hello World” app and seen what its startup times look like, these imports looked like the culprit.
As a test, we removed those imports and re-timed the startup process, to see whether GAE startup time had suddenly regressed or whether some other issue was going on.
Thankfully, we weren’t going crazy, as our timings showed pretty much the same startup behavior as I had seen before.
At this point, process of elimination was clearly pointing to the imports as the source of the startup time issues. However, it felt wrong to simply discard them outright; we needed more information.
As such, we put together a small code snippet (similar to this SO post) which timed each of the imported modules, to see which one may be causing us heartache:
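The snippet we used isn’t reproduced here, but a minimal sketch of the idea looks something like the following. The module names in MODULES are placeholders I picked from the standard library so the example runs anywhere (they are not Platy-Calc’s real imports), and time_imports is just a helper name made up for illustration.

```python
import importlib
import time

# Placeholder module list (an assumption for this sketch);
# substitute the imports your own app performs at startup.
MODULES = ["json", "decimal", "fractions"]


def time_imports(names):
    """Import each module by name and return {module_name: seconds_taken}."""
    timings = {}
    for name in names:
        start = time.perf_counter()
        importlib.import_module(name)
        timings[name] = time.perf_counter() - start
    return timings


if __name__ == "__main__":
    results = time_imports(MODULES)
    # Print the slowest imports first.
    for name, seconds in sorted(results.items(), key=lambda kv: kv[1], reverse=True):
        print(f"{name:<12} {seconds * 1000:8.2f} ms")
```

Run it in a fresh interpreter: a module that has already been imported is served from the cache and will look nearly free. On Python 3.7 and later, running your app with `python -X importtime` gives a similar per-module breakdown without any extra code.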
The graph above tells two main stories:
1) Together, the imports easily add ~1.25 seconds to the application’s cold-boot time
2) Some imports clearly take longer than others.
#2 is really the interesting point here, and is worth taking a second to discuss.
Woe is dependencies
Back in my day, we had to worry about how the complexity of C++ headers influenced link and compile time for our 20-million-line applications. In modern dynamic languages, most of that overhead has moved from compile time to runtime: a library or module can be loaded when you need it rather than at the beginning of the application. To support this, most of these languages have evolved a significant amount of flexibility in how they load and instantiate code, which can often hurt your load time.
During cold boot, your application is busy locating and importing its dependencies. The longer this takes, the longer it is before your first line of code executes. Some languages optimize this process to be exceptionally fast; other languages are slower but provide more flexibility. And to be fair, most of the time, a standard application importing a few modules barely notices the cost. However, when 3rd party libraries get big enough, we start to see them do weird things with import semantics, which can significantly hurt your boot time.
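To make that concrete for Python, here’s a small sketch showing that import cost is paid on the first import and then served from the module cache (sys.modules), which is why only the first request on a fresh instance feels slow. The slow_dep module is entirely made up for the demo: it’s a throwaway file that sleeps at import time to stand in for a heavy 3rd party library.

```python
import importlib
import os
import sys
import tempfile
import time


def timed_import(name):
    """Import a module by name and return how long the import call took."""
    start = time.perf_counter()
    importlib.import_module(name)
    return time.perf_counter() - start


def demo():
    """Return (cold_seconds, warm_seconds) for a fake slow dependency."""
    with tempfile.TemporaryDirectory() as tmp:
        # Hypothetical module: sleeping simulates expensive import-time work.
        with open(os.path.join(tmp, "slow_dep.py"), "w") as f:
            f.write("import time\ntime.sleep(0.2)\n")
        sys.path.insert(0, tmp)
        importlib.invalidate_caches()
        try:
            cold = timed_import("slow_dep")  # runs the module body
            warm = timed_import("slow_dep")  # served from sys.modules
        finally:
            sys.path.remove(tmp)
            sys.modules.pop("slow_dep", None)
    return cold, warm


if __name__ == "__main__":
    cold, warm = demo()
    print(f"cold import: {cold * 1000:.1f} ms, warm import: {warm * 1000:.3f} ms")
```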
For App Engine, the perfect example of this issue is the Spring framework for Java, which relies heavily on aspects of the language that create less-than-ideal startup overhead: Spring requires scanning all classes at load time (in order to determine dependencies), which seriously impacts instance startup time and thus your overall responsiveness. In fact, just googling for “App engine Spring Slow” has the first 6 topics all relating to Spring influencing cold boot time:
While it seems I’m picking specifically on Spring, I’m not. Chances are, the larger or more feature-rich your 3rd party library is, the more likely it has needed to do some weird stuff to get its code up and moving and to provide that functionality.
A simple solution
Given that the imports were required for operation, our hands were a bit tied with respect to an elegant solution. Nonetheless, we came up with a few ideas to address the issue:
- Use a warm-up request: When the website is loaded from static assets, send a quick ping to the calculator service, which will import the required libraries in the time it takes you to type in your equation.
- Lazy load imports: Python lets you import a module inside a function as well as at global scope, so we could move all the imports to be function-local. That way, if you just want to do 2+2, you don’t have to wait to import the entire SciPy module. Only when we detect some specific keyword would we import those specific things.
- Prune the dependency tree: We found a lot of SO posts and tutorials about how to manually prune the dependency tree by either explicitly importing sub-portions of it, or some fun hacks that included mirroring the module and removing things.
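For the lazy-loading idea, a rough sketch might look like this. To be clear, this is not Platy-Calc’s code: the keyword-to-module table and the function names are hypothetical, and I’m using standard-library modules as stand-ins for heavier dependencies like SciPy.

```python
import importlib

# Hypothetical keyword -> module table (an assumption for this sketch).
# In a real app these would be the heavy modules you want to defer.
KEYWORD_MODULES = {
    "factorial": "math",
    "Fraction": "fractions",
    "Decimal": "decimal",
}


def modules_needed(expression):
    """Return the sorted module names whose keywords appear in the expression."""
    return sorted({mod for kw, mod in KEYWORD_MODULES.items() if kw in expression})


def load_for(expression):
    """Import only the modules this expression needs; return {name: module}.

    Nothing is imported at startup. The cost is paid the first time a
    keyword shows up, and later calls reuse the cached module.
    """
    return {name: importlib.import_module(name) for name in modules_needed(expression)}


if __name__ == "__main__":
    print(modules_needed("2+2"))              # no heavy imports needed
    print(modules_needed("factorial(1024)"))  # only math is needed
```

The trade-off is that the import cost doesn’t go away, it just moves from instance startup to the first request that actually needs the module.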
Given all that... we decided to do nothing.
Yup. Platy-Calc is not a huge, million QPS service. It’s just a small application used by a few friends who really need to compute large numbers from a command-line interface (for some crazy reason..). So waiting an extra second every few days for a query wasn’t that big of a deal in our case.
And as much as I’d like to beat the performance drum, the truth is that sometimes the benefit to the end user from a particular perf improvement doesn’t justify the work needed to get it.
So remember: every millisecond counts! (Except in the few places where it really doesn’t matter. ;)
Want to know more about how to profile App Engine startup time?
Want to know more about GAE’s scheduler settings?
Want to know ways to avoid starting new instances in the first place?