App Engine Startup time and the Global Variable problem

Published in

Google Cloud - Community

5 min readApr 13, 2017

As I said before, my friend (Carter) pinged me for help improving the performance of his site “Bait and Stitch”, which matches up fishing enthusiasts with local designers to create custom apparel for fishing trips. While it’s an awesome idea, the beta testers encountered cold-boot performance problems. Like most performance issues, the first step is to profile it properly.

The profiling had shown that the issue wasn’t with GAE itself (GAE standard cold-boot time is very fast). The problem was the code that ran when the instance was cold-booting, and not while handling other requests. Next step, of course, is to dig into the results of that data and figure out what was keeping Bait and Stitch users from getting up and running as fast as possible.

Too busy to read? Check out the video above for the summary.

Go jump on a log

Looking through the logs in the Google Cloud Console, we were greeted with a confirmation that there was something going on with cold-boot requests. The logs below show that the first set of 8–10 requests after a loading request are significantly slower that ones afterwards:

Looking at the timestamps in the logs, I realized that the issue wasn’t requests that came in immediately after the loading request, but rather at the same time as the loading request. The timestamps show that the slow requests arrive at pretty much the same millisecond as the loading request:

Given this fact, the way we should really be looking at this log data is that there’s two groups of requests : One that arrives at the same time as the loading-request (slow) and one that arrives afterwards (fast).

Given this shift in thinking, I suddenly realized what was wrong: Parallel requests.

Parallel requests and cold boots

If you have threadsafe enabled in GAE standard, then your application can receive parallel requests, which will all spin off onto separate threads to execute work on the same instance. This is normally really important, given the fact that it allows you to maximize your usage of the instance resources you have.

In this case, however, this is causing a problem, since something in the loading-request is causing the rest of the parallel request to block until the loading request completes. Check out the logs below, you can see that each parallel request gets subsequently slower while it waits; and in some cases, is 2x slower than the loading request itself:

But this isn’t normal behavior for parallel requests. Something in the code was causing those other requests to wait until the first one finished. Taking a hard look, the problem materialized pretty quickly: Carter was messing around with global variables.

Global variables and cold-boot time

When application code is imported, instantiated, and begins execution, there’s typically a block of global data, common functions, or state information which needs to be initiated as well. Most of the time, these systems tend to be global in scope, since they need to be used by so many subsystems (for example, a logging system.) In the case of App Engine, this global initialization work can end up delaying start-time, since it must complete before a request can be serviced.

Let’s consider the following lines of code, which do a database call (or in our case, an NDB fetch) to populate a global value which is used for the rest of the service lifetime.

Since this code is part of our global initialization process, as soon as the instance starts, this will do an NDB query for some data. The impact on startup performance of this small change is dramatic. Below is a graph showing the difference in cold-startup vs a cold-startup with this code running. The graph below shows the 95th percentile of time taken to perform a startup for 50 iterations.

You can see that for this simple test, the startup time was about 4x slower. To be fair, this initialization fetch only occurs on the start of this instance, subsequent requests (even on multiple threads) will share this data, and will not incur the overhead. Although boot time is still < 1 second, it’s now gotten 3x slower, for a single NDB fetch during this global path.

Lazy loading globals

With all this in mind, Bait and Stitch’s problem becomes very clear:

Loading requests have extra cold-boot time associated with them
Each request was trying to instantiate the same global variable, causing blocking contention across threads.
As such, the loading-request would acquire the lock on the global variable, take longer to complete the request, and all the other parallel requests would wait as a result.

The solution to our problem came from a very simple realization: not every one of the requests actually used the global variable. But since the global logic was occurring, no matter what, other requests ended up blocking on the initialization of this variable as well.

The fix was pretty easy at this point : Change the global to be lazy initialized based on the first instance that uses it:

In this scenario, the app still pays the cost (eventually) to do the NDB query, but the trick here is that it’s more performant when the getCDN function, is not in the main path. If this global variable is only used in only 20% of your fetches, then your general cold-start time will improve in the situations where you don’t have to do this global fetch until absolutely needed.

For our use case, this was pretty important, as you can see the loading requests typically didn’t use the global variable in computation:

Obviously this is a specific case, but you can see quickly cold-boot time can cause problems that do, in fact, impact your bottom line.

Every millisecond counts

For those folks who’ve worked with managed languages in situation where startup-time matters, most of this information is old-hat. But in the world of cloud computing, these small performance anomalies have serious consequences, not only for user-perception of performance, but also in pure cost: Having 100 loading requests a day, at 187x slower than regular requests starts to result in significant overhead in compute hours.