App Engine, Scheduler settings, and instance count.

Colt McAnlis
Google Cloud - Community
9 min read · Apr 27, 2017

After deferring a lot more fetches for the Cloud-Clouds-View application to the client, I was able to significantly reduce the number of instances being spun up. But this started to worry me, because I felt like I was just patching over the real problem. Truth be told, I still didn’t know why those instances were being spun up, or why 4 seconds per request was having such an impact on the instance count.

Like most engineers, when I don’t know something, it starts to gnaw at my brain until I finally get a chance to sit down and dig into it. (I’m sure there’s a zombie joke in there somewhere…) In my case, I decided to spin up my profiling code and spend a little time getting familiar with the scheduler settings inside App Engine.

Too busy to read? Check out the TL;DR video above!

Background: when does GAE spin up instances?

The App Engine serving algorithms are constantly deciding whether it’s better to queue a request or to spin up a new instance. That decision takes into account a significant number of factors (such as queue depth, current QPS, average request latency, etc.). It’s easy to reason about for a single request, but it gets much more complicated under high request load, or with varying types of request loads.

With this in mind, there are four primary settings you need to consider in your configuration file, each of which can directly impact the decisions that GAE makes about spinning up new instances:

  • Number of concurrent requests per instance
  • How long a request can sit in the work queue
  • If there’s an idle instance available
  • What type of instance you’re using

A baseline

Let’s dig into each one a bit and take a look at how it impacts your startup performance.

As a baseline, here is 100k QPS with the standard scheduler settings for a 10-minute test with 800 concurrent connections:

Note: in all the graphs below, we want to be watching the blue line; that’s the count of instances. My server code is pretty simple; I’m creating a very isolated case of heavy workload so that I can test how these settings react when each request takes a constant 4 seconds:

Obviously this is not a real-world scenario. If your server responses are taking 4 seconds, you’ve got a serious number of other things you need to take a look at and optimize. This does, however, allow me to isolate a specific variable and see how the scheduler settings respond as a result. All tests below use this same server code, and the only thing that changes is a single scheduler setting.
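For the curious, the test server boils down to something like the sketch below. This is my reconstruction, assuming a Python 2.7 / webapp2 standard-environment app; the handler and route names are mine, not from the original code.

    import time

    import webapp2


    class SlowHandler(webapp2.RequestHandler):
        """Holds every request open for a fixed 4 seconds, giving the
        scheduler a perfectly constant per-request latency to react to."""

        def get(self):
            time.sleep(4)  # simulate 4 seconds of "work" per request
            self.response.write('done')


    # Note: app.yaml needs threadsafe: true for an instance to accept
    # concurrent requests at all.
    app = webapp2.WSGIApplication([('/', SlowHandler)])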

Concurrent Requests

Probably one of the easiest knobs to turn is how many concurrent requests an instance can handle at one time. Once this number is exceeded, the scheduler can spawn a new instance. The max_concurrent_requests setting allows you to control this value (default: 8, maximum: 80).
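In app.yaml, that’s one line under automatic_scaling. A minimal sketch, assuming the standard environment (the rest of the file is omitted; for Python 2.7 you also need threadsafe: true, as noted above):

    # app.yaml (sketch): only the scaling section is shown.
    automatic_scaling:
      # Allow each instance to work on up to 80 requests at once.
      max_concurrent_requests: 80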

Here’s what it looks like when I change max_concurrent_requests from 8 (the default) to 80 (10-minute duration):

You can see that the number of created and active instances (blue line) dropped significantly, from ~100 to ~20. That’s important, since it means my bill has been cut by a significant amount.

The tradeoff here is straightforward: the higher this value, the fewer instances you need to spin up, which can be very helpful (and a big cost savings) if the overhead of all those requests doesn’t bog down the performance of your instance too much. Otherwise, you can fall into a trap.

For example, if each request spends its time searching through image data, you may exhaust the memory of your instance, since the machine may be doing many times more work than normal.

Basically, if this number is too high, you might experience increased request latency, so watch out!

Pending Latency

When a request comes in, GAE’s front-ends will place it in a queue until an instance becomes available to service the request. If there are no available instances waiting around, then GAE has to make a decision — either it waits for one of the instances to become available or it starts up a new instance. The values of min/max pending latency directly control how long GAE will allow a request to wait in this queue before spinning up a new instance.

The higher the min pending latency value, the longer a request will wait in the queue before triggering a new instance to be spawned. This results in fewer instances being started (thus reducing the amount of total startup time you have to deal with), but it also results in higher user-visible latency during increased load. (By the way, the minimum value for this setting is 30ms.)
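In app.yaml, that looks something like this sketch (other scaling settings left at their defaults):

    automatic_scaling:
      # A request must sit in the queue for at least 6 seconds before
      # it can trigger a new instance.
      min_pending_latency: 6s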

The graph below shows the normal scheduler settings, but with min-pending-latency set to 6 seconds instead of the default (10-minute test):

The result is that we spin up about 65 instances over the 10 minutes to handle the load we were generating. Obviously this isn’t an absolute win vs. max-concurrent-requests, but it does reduce the total number of instances vs. our default settings.

Max pending latency, on the other hand, is the maximum time a request is allowed to wait in the queue; once a request has waited that long, App Engine must start up a new instance.

A low maximum results in new instances being started sooner for pending requests (meaning more instances spawned, and more startup time incurred). A high maximum means users might wait longer for their requests to be served (if there are pending requests and no idle instances to serve them), but your application will cost less to run.

The graph below shows the normal scheduler settings, but with max-pending-latency set to 4 seconds (10 minutes):

Since the server work takes about 4 seconds, this one caused a massive spike in the number of instances that were spawned.

Just to sanity-check how these two values work together, let’s combine them, setting max to 8 seconds and min to 6 seconds, and see how this influences our instance count:
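The corresponding app.yaml sketch for this combination would be:

    automatic_scaling:
      min_pending_latency: 6s  # wait at least 6s before spawning
      max_pending_latency: 8s  # after 8s of waiting, spawning is forced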

What was interesting about this test was the clear rise toward the end of it. We stayed at about 61 instances for most of the test, but jumped to about 75 toward the end.

For our particular test, these values didn’t have much impact. In a real-world scenario, though, adjusting these values is about smoothing out some of the instance spikes that occur during heavy load, and balancing that against user perception and the general cost of running your application. Under sustained heavy load, however, this may not be as impactful as other flags we might want to change.

Instance class

App Engine has excellent support for different instance classes (see the documentation on instance classes), which you typically tune to find the sweet spot between bandwidth, memory, and the number of instances you need to spin up.

There are two ways we can chart the impact of this setting: first, how long it takes an instance to start up; and second, whether it impacts the number of instances.

Startup time based on instance class

Check out the following graph, which charts instance class vs startup time for a basic hello-world application:

It’s worth noting that for the data above, instance startup time is pretty fast, with most startups averaging < 1 second. Note that this example is a bare-bones Python app that only returns “hello world”. There can be some outliers, though, so be warned that some instance types are more prone to hitting the outlier path than others.

While it may seem natural to assume that the startup difference between B* and F* instances has to do with the machine type, my tests show that it has more to do with provisioning than anything else. See, when you request a new instance, GAE has to go through and find a machine of that type, provision some space on it for your instance to run, and start the machinery.

Instance count based on instance class

The default value for instance_class is F1, which we have a graph for already. We can’t test the B* classes, since those modify our scheduler settings (maybe that’s a separate blog post), so let’s test an F4 instance, and see how that changes our instance count (if at all):
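One thing to note: instance_class is a top-level app.yaml setting, not part of the automatic_scaling block. A sketch:

    # app.yaml (sketch)
    instance_class: F4  # the default for automatic scaling is F1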

We can see that moving from F1 to F4 had some slight impact on the number of instances created. We at least had a smoother ramp-up, and stayed below 100 instances for most of the time.

Side note: fast machines are fast.

There are many cases where the factor that causes GAE to spin up a new instance is a machine limit. For example, a compute-heavy application could exhaust the total memory available for the instance type, causing another instance to spin up.

A solution to this is to move up to an instance class that has more resources.

This may significantly increase the cost per instance, but it can reduce the overall number of instances by a similar amount, which may end up being the cheaper solution.

As such, it’s important to tune your instance class not just for the sake of scheduling, but also to reduce the number of spun-up instances.

Putting it all together

So, let’s combine these settings to figure out the best way to minimize our instance count without making user-perceived latency fall through the floor. For our 4-second block of work, that means we want the average response latency to be as close to 4 seconds as possible.

To do this, let’s set the following (a combined app.yaml sketch follows the list):

instance_class = F4 ; We know our work isn’t sensitive to the performance of the server we’re running on, so a faster machine won’t service our requests any faster. We choose F4, though, so that the maximum number of concurrent requests can be handled without causing weird side effects like running out of memory.

max_concurrent_requests = 80 ; This lets us maximize each instance’s potential.

max_pending_latency = 8s ; If any request has been waiting 8 seconds, we should start a new instance.

min_pending_latency = 6s ; Don’t spawn a new instance unless a request has been waiting at least 6 seconds.
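Put together, the relevant slice of app.yaml looks something like this sketch (standard environment assumed; everything else omitted):

    # app.yaml (sketch): the four settings used for this test.
    instance_class: F4

    automatic_scaling:
      max_concurrent_requests: 80  # maximize each instance's potential
      min_pending_latency: 6s      # don't spawn until a request waits 6s
      max_pending_latency: 8s      # must spawn once a request waits 8s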

Given our same workload and test, let’s see how we fare:

We ended up with about 18 instances throughout the entire test, which is significantly lower than our other tests. On the other side of things, we need to consider how our settings have impacted latency for our test:

Since we know our workload is exactly 4 seconds, we’re seeing about +400ms latency being added to the requests (on average) due to all the other scheduler settings we have in place.

From here, we can start tweaking the numbers to make trade-offs between the number of instances we’re willing to pay for and the latency overhead it causes for our users. Which is a good spot to be in.

Learnings & takeaways

Even though the load balancers built into App Engine will handle the lion’s share of scheduling and launching your instances, taking control of a few key settings can help you hyper-optimize the system and get consistent, expected behavior from your app.

To be fair, it takes a bit of work to get all these things tested right, but optimizing instance count is well worth it ;)
