Understanding and Profiling GCE cold-boot time
I gave a talk in NYC last week, where I met the people behind “FoodiePics”, a social network that only allows you to share pictures of food. No text. No audio. No emoji. Just a picture of food, and the geolocation tag of where it was taken.
Turns out, they had read my previous articles on startup performance in App Engine, and came to the talk to get my help with their app, which was seeing similar cold-boot issues with their Google Compute Engine instances. Their GCE instances were taking the uploaded photos, running some logic on them (to make sure that each picture was of food), then compressing and storing them for later. They showed me a graph which looked somewhat like this:
What they were seeing was 380+ second cold boot times for their applications, while the response latency for a request was in the 300ms range.
At this point, I hadn’t had a chance to dig into GCE cold-boot performance much, so I was happy to use this as an excuse to investigate further.
Let’s dive in!
How fast is GCE startup, by itself?
You may see a common trend in my performance-debugging pattern at this point: When trying to track down performance issues like this, I first like to remove my code, and establish a baseline, to know how badly my code is impacting things.
As such, the first step in figuring out the problem was to run a quick test to answer two questions:
1) Given a static configuration, how long does a GCE instance take to go from “plz create” to “ok, you can SSH to me”?
2) Given changes to the configuration, how does this number change?
So, let’s do a simple test: just 20 iterations of starting an instance, timing how long it takes until we can SSH to it, and then killing it.
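A rough sketch of what that loop could look like in Python. It assumes the gcloud CLI is installed and authenticated, that the firewall allows port 22, and it treats “TCP port 22 accepts a connection” as a stand-in for “you can SSH to me”. The helper names (wait_until, port_open, time_one_boot, run_trials) and the instance naming scheme are made up for illustration; this is not the script that produced the numbers in this article.

```python
import socket
import statistics
import subprocess
import time


def port_open(host, port=22, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def wait_until(predicate, timeout=600.0, interval=1.0):
    """Poll predicate() until it returns True; return the elapsed seconds."""
    start = time.monotonic()
    while not predicate():
        if time.monotonic() - start > timeout:
            raise TimeoutError("condition not met within timeout")
        time.sleep(interval)
    return time.monotonic() - start


def gcloud(*args):
    """Run a gcloud command and return its stdout (assumes gcloud is on PATH)."""
    result = subprocess.run(
        ["gcloud", *args], check=True, capture_output=True, text=True
    )
    return result.stdout.strip()


def time_one_boot(name, zone, machine_type, image_family, image_project):
    """Create one instance, wait for port 22, delete it; return seconds taken."""
    start = time.monotonic()
    gcloud(
        "compute", "instances", "create", name,
        "--zone", zone,
        "--machine-type", machine_type,
        "--image-family", image_family,
        "--image-project", image_project,
    )
    try:
        ip = gcloud(
            "compute", "instances", "describe", name,
            "--zone", zone,
            "--format=get(networkInterfaces[0].accessConfigs[0].natIP)",
        )
        wait_until(lambda: port_open(ip))
        return time.monotonic() - start
    finally:
        # Always clean up, even if the instance never became reachable.
        gcloud("compute", "instances", "delete", name, "--zone", zone, "--quiet")


def run_trials(n, **config):
    """Run n boots (n >= 2) and return (mean, sample standard deviation)."""
    samples = [time_one_boot(f"boot-test-{i}", **config) for i in range(n)]
    return statistics.mean(samples), statistics.stdev(samples)
```

Calling run_trials(20, ...) with a zone, machine type, image family and image project of your choosing would give one average per configuration. Keeping the standard deviation alongside the mean matters later, when deciding whether two configurations really differ.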
On average, a 2-CPU, 8 GB VM (running Debian) took about 41.5 seconds to boot up. So what if I increased the core and memory count? I would assume that adding resources should improve boot time, since there are more cycles available to help the image get up and running.
Average here was about 49.39 seconds. A little worse, but still within error. Does this continue to get worse, linearly? Let’s jump ahead 4x and see.
Average this time was about 52 seconds, which doesn’t conform to my assumption that VM startup time would scale linearly with the resources requested; these numbers all seem to be within error tolerances. We easily see ~30 second boot times, with spikes around that depending on provisioning overhead.
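“Within error” is easier to judge with a number attached. One common option (not necessarily what was done for these tests) is Welch’s t statistic, which compares the difference between two sample means against the spread of each sample. The function name below is an assumption for illustration, and it expects the raw per-iteration boot times for two configurations, which are not reproduced in this article.

```python
import math
import statistics


def welch_t(a, b):
    """Welch's t statistic for two independent samples.

    Each sample needs at least two values, and at least one of them
    must have non-zero variance (otherwise the standard error is 0).
    """
    var_a = statistics.variance(a)  # sample variance (n - 1 denominator)
    var_b = statistics.variance(b)
    standard_error = math.sqrt(var_a / len(a) + var_b / len(b))
    return (statistics.mean(a) - statistics.mean(b)) / standard_error
```

As a rule of thumb, with around 20 samples per configuration, an absolute value well below 2 suggests the difference is comparable to run-to-run noise, while a much larger value suggests the configurations really do boot at different speeds.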
That being said, GCE recently increased the max number of CPUs you can request with a VM to 64, and when we tested that startup time, we did see that the average cold-boot time was consistently higher than other configurations:
The average here seems to be about 73 seconds, which is a bit higher than the others.
In any case, these numbers let me know two things:
1) Cold-boot time of GCE VMs is not correlated with CPU/RAM configuration (except in the 64-core case, which is higher)
2) The 300+ second cold-boot time of FoodiePics VMs is being caused by something else.
Let’s check the image?
Just to make sure I’ve kicked the tires on this one, let’s run one more test that looks at the type of image you can use. Google Compute Engine uses operating system images to create the root persistent disks for your instances. If you’re not familiar with this term, we’re not talking JPGs here. An “image” in this sense is a binary data file that contains a boot loader, an operating system, and a root file system. You specify an image when you create an instance, and when it’s created, one of the first steps is to copy the image data onto the disk of the instance you’ve created.
So, let’s do a small test: keep the configuration the same, but change the image type. How does that impact cold-boot time? (For this test, I’m using ubuntu-14-04 vs windows-2012-r2.)
Not surprisingly, there was some variation in the results. Most of the Linux-based builds acted pretty similarly, but the Windows builds were drastically slower.
But FoodiePics was using Linux VMs, and only launching one or two at a time, so the choice of image doesn’t have anything to do with their slow cold-boot time.
Aside: Request & provisioning time
One nice side effect of profiling things is that you typically end up learning more about some part of the system you didn’t have visibility into before (or rather, you end up adding transparency to previously black-box systems). For me, running these tests showed off a few parts of the system I didn’t know about; namely, that GCE has distinct latencies for responding to your request, provisioning your VM, and then booting your instance. (You can read more on the official stages here.)
With the help of fellow Cloud DA Terry Ryan, I was able to get a profile on each stage of their application, which looks like this:
- Request is the time between asking for a VM and getting a response back from the API acknowledging that you’ve asked for it. In my tests, this was the time it took the service to respond to the insert instance REST command.
- Provisioning is the time GCE takes to find space for your VM on its massive architecture, which you can measure by polling the Get Instance API on a regular basis and waiting for the “status” flag to change from “provisioning” to “running”.
- Boot time is what you’d expect: image installation and custom code execution, up to the point my service is available.
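The polling approach from the provisioning step can be sketched roughly as follows. This is a minimal illustration, not the tooling used for the graph: it assumes the gcloud CLI is installed and authenticated, the function names are made up, and it relies on the instance status values the API reports (such as PROVISIONING, STAGING and RUNNING). It only covers what is observable by polling. The request stage would need a timer wrapped around the insert call itself, and the boot stage needs a check that your own service is answering.

```python
import subprocess
import time


def instance_status(name, zone):
    """Ask gcloud for the instance's current status string."""
    result = subprocess.run(
        ["gcloud", "compute", "instances", "describe", name,
         "--zone", zone, "--format=value(status)"],
        check=True, capture_output=True, text=True,
    )
    return result.stdout.strip()


def record_transitions(get_status, final="RUNNING", interval=1.0,
                       timeout=600.0, clock=time.monotonic, sleep=time.sleep):
    """Poll get_status() and record (status, seconds_since_start) each time
    the status changes, stopping once `final` is observed."""
    start = clock()
    seen = []
    while True:
        status = get_status()
        now = clock() - start
        if not seen or seen[-1][0] != status:
            seen.append((status, now))
        if status == final:
            return seen
        if now > timeout:
            raise TimeoutError(f"never reached {final}")
        sleep(interval)


def stage_durations(transitions):
    """Seconds between first seeing each status and first seeing the next."""
    return {
        current[0]: following[1] - current[1]
        for current, following in zip(transitions, transitions[1:])
    }
```

Passing lambda: instance_status(name, zone) as get_status right after the create request returns would give a per-stage breakdown. The resolution is limited by the polling interval, so short stages can be missed entirely.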
What’s revealing about this graph is that roughly 7 seconds go to responding to the request, and about 18 seconds go to provisioning the VM, any time you want to spin up a new instance.
It’s not clear (just yet) if these times are similar when the request to create an instance comes from the load balancer, but it’s something to keep in mind.
Have another helping... of performance
There are lots of little things that all contribute to the cold-boot time of a GCE instance. When you’re trying to track down issues, eliminating potential problems is just as important as finding the cause.
So, with these small tests, we’ve found:
1) Cold-boot time is essentially invariant with respect to CPU/RAM configuration (apart from the 64-core case, which is slower)
2) Startup time is impacted by your choice of image. Most Linux ones will act the same, but there’s a large variation between Linux and Windows
3) Neither of those explains the FoodiePics numbers, so the slowdown must be coming from something their own code does when starting up the instance.
After doing a little more digging, we did catch on to what the problem was, but that’s a subject for a different article ;)
Want to know more about how to profile App Engine startup time?
Want to know more about GAE’s scheduler settings?
Want to know ways to avoid starting new instances in the first place?
Want to become a data compression expert?