Improving GCE boot times with custom images
Working with the FoodiePics app was an interesting experience. Their app is a social network that only allows you to share pictures of food. No text. No audio. No emoji. Just a picture of food, and the geolocation tag of where it was taken. As I wandered down the streets of NYC, it would use my location to show me pictures of food nearby. Neat!
To accomplish all this, FoodiePics uses a GCE backend for processing and storage. Sadly, during bursty traffic there’s significant latency, because the load balancer causes a bunch of GCE instances to spin up and initialize themselves.
After profiling GCE cold boot time from a number of angles, we were able to start building a mental model of how the process works, which looks a little like this:
We’ve already accounted for the time taken to request and provision the VM (which is non-trivial), and we understand that our image choice can cause variation; what’s left is for us to time the user code, since that must be the problem.
Which... is where we pick up now.
Food, scripts, and start-up time
Once a FoodiePics GCE instance was created, its very first step was to execute a startup script that installed the various services and code libraries the instance needed to function and be accessible externally. Their script was significantly more complex, but it generally looked something like this:
There’s not really a lot of custom code here. It’s mostly pulling down packages and pre-built scripts locally so things can run.
We’ve eliminated all other performance issues at this point, so our startup-time problem had to be due to one of these script functions taking some large block of time. But which one was it?
How to profile GCE startup scripts
To profile each of these stages, I need two things:
- A way to time each subsection execution in the startup script
- A way to report those timings to some external system so that I can visualize them.
For the first problem, I’m working in Linux and dealing with bash scripts, so the built-in SECONDS variable should work just fine: read it before and after each section of my script and take the difference.
For the second problem, there are lots of options. Chances are that you’ve got code that already uses powerful tools like statsd or brubeck. But for my purposes, that would require pulling down yet another package, not to mention the time required to start those services before I could use them. I’ll skip that for now.
On the other hand, the Stackdriver Custom Metric API might work. All GCE instances come with access to write to this API, so that would get rid of the need to bring down new packages or host complex backends. Sadly though, writing to the API requires some complexity (namely auth) that I’m not willing to take on right now, just to record some timings.
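As a minimal sketch of that idea (not the actual FoodiePics script), a small bash helper can read SECONDS before and after a command and log the difference. The helper name `time_stage`, the stage names, and the `TIMINGS_FILE` path are all hypothetical. Note that SECONDS only has whole-second resolution, which is fine for stages that take tens of seconds.

```shell
#!/usr/bin/env bash
# Sketch: time each stage of a startup script with bash's SECONDS variable.
# TIMINGS_FILE and the stage names are hypothetical placeholders.
TIMINGS_FILE="${TIMINGS_FILE:-/tmp/startup-timings.csv}"

# Run a command, then append "<stage>,<elapsed seconds>" to TIMINGS_FILE.
# The command's own exit status is passed through to the caller.
time_stage() {
  local stage="$1"
  shift
  local start=$SECONDS
  "$@"
  local rc=$?
  echo "${stage},$(( SECONDS - start ))" >> "$TIMINGS_FILE"
  return "$rc"
}

# Example usage inside a startup script (placeholder commands):
#   time_stage install_packages apt-get install -y nginx
#   time_stage fetch_code       git clone "$REPO_URL" /opt/app
```

Wrapping whole commands this way keeps the timing logic in one place, so the rest of the startup script stays readable.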
For this case, I did something much more hack-tastic: Store the timings in a file, and serve them as a new endpoint.
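The original serving code isn’t shown here, so the following is only a sketch of one way to do it, assuming python3 is already present on the image. It writes rows to a file and serves that file’s directory with Python’s built-in `http.server` module. The directory, port, and function names are hypothetical, and the port would also need to be reachable through your firewall rules.

```shell
#!/usr/bin/env bash
# Sketch: store timings in a file and expose them over HTTP.
# METRICS_DIR, METRICS_PORT and the function names are hypothetical.
METRICS_DIR="${METRICS_DIR:-/tmp/startup-metrics}"
METRICS_PORT="${METRICS_PORT:-8080}"

# Append one "stage,seconds" row to the file that will be served.
record_timing() {
  mkdir -p "$METRICS_DIR"
  printf '%s,%s\n' "$1" "$2" >> "$METRICS_DIR/timings.csv"
}

# Serve METRICS_DIR in the background, so that
# http://<instance-ip>:<port>/timings.csv returns the recorded rows.
# Assumes python3 (3.7+ for --directory) is installed on the image.
serve_timings() {
  python3 -m http.server "$METRICS_PORT" --directory "$METRICS_DIR" >/dev/null 2>&1 &
}

# Example usage at the end of a startup script:
#   record_timing install_packages 41
#   serve_timings
```

This is deliberately throwaway: it avoids pulling down any new package, at the cost of exposing a plain file with no auth, so it only makes sense on test instances.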
This allowed me to poll the endpoint from an external location and get back my data w/o too much heavy lifting. Below is a graph showing the timings for each section, across 100 iterations of the startup script.
It’s clear from the graph that the time per stage of the startup script doesn’t fluctuate much between tests; it’s somewhere between 60 and 75 seconds.
After some Q&A with FoodiePics’ CTO, it became clear that they only updated those scripts about once or twice a month at this point, so the content was mostly static.
This was an opportunity for improvement: FoodiePics is basically paying a minute of boot time, every cold boot, just to grab the same packages and install them on the VM. If the scripts aren’t changing that much, then we can expect the content they are pulling down to not change that much either.
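If you collect timings from many runs into one file, a quick summary per stage makes that kind of stability easy to check without a graph. The sketch below assumes a hypothetical input format of one `stage,seconds` row per stage per run; the function name is also made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch: summarize "stage,seconds" rows collected across many runs.
# Prints the mean, min and max number of seconds for each stage.
summarize_timings() {
  LC_ALL=C awk -F',' '
    {
      v = $2 + 0                      # force numeric comparison
      n[$1]++
      sum[$1] += v
      if (!($1 in lo) || v < lo[$1]) lo[$1] = v
      if (!($1 in hi) || v > hi[$1]) hi[$1] = v
    }
    END {
      for (s in n)
        printf "%s mean=%.1f min=%d max=%d\n", s, sum[s] / n[s], lo[s], hi[s]
    }
  ' "$1" | LC_ALL=C sort
}

# Example usage:
#   summarize_timings all-runs.csv
```

A small gap between min and max for every stage tells you the cost is consistent, which is what makes it a good candidate for baking into an image.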
And that means we can use Custom Images!
So far, we’ve only been talking about “public images”, that is, preconfigured combinations of OS and bootloader. These images are great when you want to get up and running fast, but as you start building production-level systems, you’ll soon realize that the largest portion of boot-up time is no longer spent booting the OS, but rather in the user-executed startup sequence that grabs packages and binaries and initializes them. (AKA exactly what FoodiePics is seeing.)
Since most of your instances start in an identical state with your user code (same binaries, same distributions, same networking setup, etc.), you can avoid paying that setup overhead on every cold boot by using custom images for your boot disks.
A custom image is basically a snapshot of a boot disk’s contents; when the target VM is booted, the image is copied right onto its disk. This is ideal for situations where you have created and modified a root persistent disk to a certain state and would like to save that state to be reused with new instances, or where your setup includes installing (and compiling) a number of big libraries or pieces of software.
Since the majority of the startup time was taken up by installing external packages, it makes a lot of sense for FoodiePics to just create a custom image with these things already installed. Following the tutorials for making your own custom images was straightforward, which allowed us to build and profile a solution pretty quickly.
Since we’ve moved from a public image to a custom image, we needed to test that the “boot” phase of our startup time didn’t get impacted. We’d hate to move to a custom image and suddenly see this phase get worse. So, we re-ran the tests from before and compared the graphs:
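For reference, here is a rough sketch of what that workflow can look like with the gcloud CLI: bake an image from the boot disk of an instance that already has everything installed, then create new instances from it. The image, disk, instance, and zone names are hypothetical, and the helpers default to a dry run that only prints the commands. Stopping the source instance before imaging its disk is the safe route.

```shell
#!/usr/bin/env bash
# Sketch: bake a custom image and boot from it with gcloud.
# All resource names passed in are hypothetical placeholders.

# Print the command unless DRY_RUN=0, in which case actually run it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Create an image from the boot disk of a fully set-up instance.
bake_image() {
  local image="$1" disk="$2" zone="$3"
  run gcloud compute images create "$image" \
    --source-disk="$disk" --source-disk-zone="$zone"
}

# Create a new instance whose boot disk comes from the custom image.
boot_from_image() {
  local instance="$1" image="$2" zone="$3"
  run gcloud compute instances create "$instance" \
    --image="$image" --zone="$zone"
}

# Example usage:
#   bake_image foodiepics-base builder-disk us-east1-b
#   DRY_RUN=0 boot_from_image web-1 foodiepics-base us-east1-b
```

With the packages baked in, the startup script only needs to do per-instance work, and the image has to be rebuilt only when the installed content changes.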
You can see that overall, the “boot” phase of installing their image didn’t fluctuate much, which is fantastic, considering that not only are we installing our core OS, but also a bunch of other software, which was previously pulled down via network commands.
However, the startup-script phase improved significantly: we went from over a minute down to under a second, which is a big win.
Every millisecond counts!
The FoodiePics team ran into a problem that’s very common among developers: once you reach a certain level of complexity in your startup scripts, the cold-boot time of your application starts to suffer. At that point, it’s a great idea to consider using custom images to bring your boot time back down to something normal.
Now, if you’re in the cloud space, you know that the big brother to custom images is a concept called containers, which, I soon found out, has its own set of performance concerns. But that’s a topic for a different article.