After I was able to help the startup time for the “Bait and Stitch” application, a few new people reached out to me with requests to assist their performance woes. So, we’re now awash in a fantastic plethora of problems to solve!
One of the more interesting cases, was an old friend of mine, Alicia, who built an application called “Partly Cloudy” which provides high-resolution images of clouds for developers who need to use them in professional presentations. (Kinda like Google Earth View.. But for Clouds..) The app was launched the closed-beta a few weeks ago, and had about 200+ users within the first few days. After about a week of use, however, Alicia woke up one morning to see this little gem waiting for her in the Google Cloud Developer console:
The blue line on this graph charts the number of active GAE instances that her application was using. That’s roughly 16 instances on average, with a peak there of 18. This is problematic, considering this particular application really doesn’t do much except serve some images to the user.
Being billed for 18x concurrent connections wasn’t in Alicia’s business prospectus, so it was time to dig in and figure out why in the world so many instances are being spun up.
How are things setup?
After a few meetings with Alicia, I was able to understand how CCV was set up.
The whole thing worked something like this:
- GAE will handle the request from the client, and do a database query (in our case an NDB query) to data store. This gives a full table-of-contents of all 20,000 images available in the database.
- From there, App engine will randomly pick one, grab the URL details from it
- Fetch the full image from GCS bucket
- Return the full image back to the client.
To get more insight into what’s going on here, let’s take a look at the trace results.
A trace too far.
The developer of Partly Cloudy had a hunch: she was seeing really high latency numbers in the trace results, and was assuming that this was responsible for the high number of instances. (In some cases, high response time can cause new instances to be spun up).
After looking at the List of traces, I had to agree, this looked like the obvious problem.
However, when I dug into the tracing information, things started to get really weird:
The latency was being reported as 6sec, but the trace information said that the sum of all RPC calls was only taking about ~180ms.
As mentioned before, Stackdriver’s Trace tool does a great job of automatically listing all RPC commands, and since these two numbers were a bit confusing, I decided to put some custom profiling in (again) and see if there was maybe something not being reported
Nope. Not only did the timings not match, but the submission of the custom tracing information added an extra 143 ms to our request (Mental note: fix that later…).
At this point, I was seriously confused. I asked for a drop of the source code so I could try to repo the problem at my local workstation. This made things even weirder, since I couldn’t get the same high trace numbers from my desktop:
The breakthrough came when I looked at the logging information for the slower request:
It was coming from a mobile device.
This gave me an idea : What if the connection speed is causing the trace numbers to be inflated?
It seemed like we were getting some traction here.
Lunch and learning.
It’s worth pointing out something here, which caused me a large amount of strife in debugging this issue. Traces can get logged, and become visible almost immediately as seen below:
However, over a 30 minute span, the frontend information will be added to it, and you’ll end up seeing an inflation in the overall tracing time. Here’s that same traceline, 17 minutes later.
And the tracing data, 30 minutes later:
The point here is that some additional data needs to be calculated by the Cloud Load Balancer, and will roll in over time. As such, in order to do proper analysis on the traces, make sure you give it a good 30–50 minutes before digging in too deep. (NOTE** The Stackdriver Team is aware of this nuance, and is hoping to get it addressed in the future )
What does this have to do with instance count?
It’s important to note something here : Connection speeds have no impact on instance count.
More specifically, a connection will route through an App Engine Frontend, which will be responsible as a liaison between the client and the GAE instance. Once the GAE instance is done with the request, it’s passed back to the frontend, and then goes idle. The AEF, from that point on, will stay open, and connected to the client during the duration of sending the information down, while the GAE instance can be servicing other requests.
This is one of the benefits of App Engine being a managed architecture; those front-end instances can stay open and transfer traffic, and you don’t really have to deal with the logic or stress involved with it. Moreover, it doesn’t influence your billable instance count whatsoever.
So what was causing Partly Cloudy’s instances to be so high?
Turns out, the reason for the high instance count was resounding success. When we flipped over to the traffic graph, we saw that over the past 6 hours, people have been using the Partly Cloudy service pretty aggressively:
Partly cloudy was a victim of it’s own success! But still, the instances are too high. Thankfully though, we’ve eliminated all the weird, hidden issues, the solution became really straightforward.
Getting back to the client
One of the biggest reasons for Partly Cloudy having such a high outbound traffic count, was that the instances were always responsible for passing the image data back to the client. Which.. Seemed redundant, Why was the client fetching the images through app engine? Couldn’t they fetch it directly from Cloud Storage?
The solution then, is straightforward : Let’s find a way to spend less time having the clients talk to GAE, and more time having them ping the GCS buckets directly. Alicia had an answer for this problem, immediately :
- When the client fetches an image from GAE, append the next 10 image IDs as an extra header field.(“x-cloudish-nexts”=”123,456,789,…”)
- The client will fetch the next 10 images directly from GCS, rather than needing to go through GAE for it.
- When we’re out of pre-fetched images, go back to the server, and grab the next batch.
This cuts the amount of traffic going through Alicia’s GAE instances by 10x, and the solution footprint is pretty small, from a code perspective.
However, from an instances perspective, the results are better than expected. Letting this run for about 8 hours saw a drop from about 16 instances on average, to about 1.5.
Pretty nice for my bottom line.
Every millisecond counts
App engine has a ton of amazing features in it; One of the more powerful ones, is the fact that tracing and profiling are built right into the platform. You can see old traces, and get drilldowns to how long RPC functions take inside your requests. In this particular use-case, the problem had nothing to do with any RPC functions, but with client-connections. So, we didn’t so much use Trace to find the issue, but rather rule out what wasn’t the issue, allowing us to focus more on the issues at hand, than chasing ghosts.
So my catch phrase remains true : Keep Calm, and profile your code!