App Engine under heavy load

Val Deleplace
Feb 28 · 5 min read

Yesterday I opened the Logs page of my App Engine Standard application in the Google Cloud console to check a specific log, only to discover much more traffic than I expected.

Unexpected logs: many requests per second

Confirmation in the App Engine “Summary” dashboard:

50 req/s. That’s a lot.

“50 requests per second” is definitely not the traffic from my real users. The real traffic usually looks like this:

0.1 req/s. That’s the usual load.

It’s also not caused by search engine crawlers and spam bots. And (this time at least) it’s also not caused by my negligence in letting a test load injection script run forever from my workstation, or a failing background task retrying forever server-side in the cloud.

The culprit

So, what is this about?

Spikes last longer than usual

It turns out that I have an “IT dept security thing” that finds it appropriate to bombard my app with thousands of shady requests on a regular basis, just to see what happens. This usually runs once a week, lasts for 2 hours (A), and that’s it. When the blitz traffic is ongoing, it dwarfs the “real user traffic” so much that we can’t even see the real traffic anymore in the charts.

Now the security thing spikes (B) seem to be willing to run for 9+ days in a row, which is clearly unusual.

The security tool did find something useful: By injecting malicious URL parameters, it was able to trigger a JS pop-up in the application, and reported it in the issue tracker. This is a potential vulnerability that I will fix right away. And (as far as I understand) this is why it kept searching for days and days, rediscovering the same, or very similar, vulnerability.

This is a happy story. There security tool found a bug and caused no harm despite its offensive strategy:

  • There was no service disruption. Nobody was aware that the security tool was running for much longer than usual. The App Engine autoscaler spawned enough concurrent stateless instances to handle the traffic gracefully.
Autoscaling working just fine
  • The security tool did not accidentally destroy any data. This is because only registered users have write access to the data, and that policy held true during the traffic spike.There was no spike in real usage error rate. Only a spike in 4XX response codes sent back to the numerous invalid requests:
  • The extra cost incurred by App Engine instance was acceptable (i.e. cheap).
  • I didn’t set any monitoring pager alerts in this project, so nothing woke me up in the middle of the night.

Minor impact?

I mentioned that we didn’t experience any service disruption, that no user was aware that anything abnormal was going on, and that the cloud scaling capability shone.

The app uses a Cloud SQL database accepting only a limited number of simultaneous connections, but we didn’t hit this limit because (I suspect) most of the traffic was either immediately rejected, or was served cached content without calling the database.

We use Memorystore Redis as fast server cache, with 1GB capacity which is plenty for our data. We have ~2,000 valid paths (URLs) generating each less than 100KB of JSON, and we store the (path, data) pairs for a while, which usually accounts for a few hundreds cache entries at any given time.

It turns out that the security traffic consists in many invalid requests, but also many valid requests, which end up with a 200 code and the (path, data) saved in cache.

The server cache memory usage chart follows the traffic spike timeline:

and it looks like the 1GB limit is not hit very often, so it should not impact the app performance in a negative way.

But actually, if we zoom a bit to the last 2 days window:

…we see that the cache is very often saturated and flushed.

Let’s use BigQuery to dig a bit deeper! Having all the logs exported to BigQuery is very useful, and I highly recommend creating a “sink” in most projects.

SQL to mine request count and latency

To reveal the real user traffic, I filter out requests from the security tool, based on the user agent string. The resulting daily traffic is too small and noisy to be statistically significant, which is why I’m grouping the results by week.

This provides me with a data grid that I can now save in Google Sheets…

…to produce this chart:

My reasoning is: if the security tool (weeks 7, 8, 9) had significantly perturbed the server cache, then the cache hit ratio would have plummeted for the actuel user requests, and the mean service time would have increased.

Here we see that the (real user traffic) latency may have grown from 240ms to 275ms during the “security tool blitz” weeks, but the delta is too small to conclude anything for sure.

Bottom line, the service was not degraded in any meaningful way. That’s a win for Google Cloud’s infrastructure.

Thanks to Laurent Picard

Val Deleplace

Written by

Engineer on cloudy things @Google. Opinions my own. Twitter @val_deleplace

Welcome to a place where words matter. On Medium, smart voices and original ideas take center stage - with no ads in sight. Watch
Follow all the topics you care about, and we’ll deliver the best stories for you to your homepage and inbox. Explore
Get unlimited access to the best stories on Medium — and support writers while you’re at it. Just $5/month. Upgrade