Surviving traffic spike from Hacker News: my dreaded Google Cloud invoice

Val Deleplace
Google Cloud - Community
10 min read · Oct 9, 2019

This story relates events that happened on September 26th, 2019. They are all true. A website that I built, “Programming Idioms”, hit the first page of the “orange site” (aka Hacker News). My cloud provider’s monitoring console let me find out whether the sudden traffic had caused an outage. And the invoice a few days later taught me that running a service in the cloud is not a free lunch.

Disclaimer 1: I currently work for Google Cloud. However, I had designed and deployed the website Programming Idioms before joining Google.

Disclaimer 2: This is a faithful report of my specific experience. It doesn’t necessarily apply to other, similar cloud projects. YMMV.

1. Symptoms

At 5:07 PM, I receive a notification: someone is asking me to add support for their favorite programming language on my website. Five minutes later, another notification, from someone else, asking for the same language. What’s going on?

On Programming Idioms, writing a snippet in a non-supported language results in a pleasant, vintage error page:

A pleasant, vintage error page

I decide to log into the Google Cloud Platform (GCP) web console to check if many such errors have recently occurred. The contribution system might be broken for some reason.

The first thing I notice is an unusual shape in the “requests per second” dashboard. I also notice a high instance count.

Unusual # of requests per second
Unusual number of server instances currently running

My first thought is that I made a blunder like an infinite loop of failing tasks launched by a cronjob. I have to fix this right away! Infinite loops in the cloud are not good for the bank account.

At this point, it is useful to know that GCP does not yet provide a simple way to cap the cost of a given Project, i.e. to automatically prevent the billed $$$ from skyrocketing. This would be nice to have, to say the least.

At the Project level, it is possible to set up “Budget Alert” emails when the cost exceeds specified thresholds. But in the case of a script running amok and wreaking havoc (or a genuine sudden spike of traffic), I will most probably read the email and take action only after a lot of cash has already been burned.

Nevertheless, at the App Engine Standard level there is a “Daily spending limit” option. Good to know!

This option could save your retirement plan

Edit: At the Service level (GCP component), there is an effective way to Set up cost controls with quota. This covers a large share of the possible bad scenarios.

At this point, I was not yet thinking of a huge influx of visitors.

My experience with scalability of web services is that uncertainties around expected traffic volume are often addressed with ambitious defensive designs and over-provisioning, following advice and wisdom from this great book:

Credits: ThePracticalDev on Twitter

Basically, entrepreneurs have a high opinion of the probability of their service going viral and becoming successful, and care a lot about being able to handle the inevitable huge load before other crucial concerns. In my opinion, market research, UI, UX, usability, having an MVP, gathering early feedback, etc. are more pressing than dealing with a million concurrent users. Success will come eventually (or maybe not at all), successive waves of traffic increase will happen, and each significant increase will be an opportunity to revise the scalability strategy, taking into account the actual traffic and error monitoring metrics, as well as a better understanding of the needs of the project than was possible early in the prototyping phase.

I acknowledge however that an unexpected exposure in popular media is a tangible threat. Nobody wants their service to fail when a spike of traffic does happen. Actually, it’s the worst time to go down!

In my case, Google Analytics confirms an unusual influx of visitors:

# of sessions per hour

The traffic per hour before the spike is definitely not zero (~6000 sessions per month is ~8 per hour, on average). It’s simply dwarfed by the magnitude of the spike.

The culprit is just a few clicks away, in the “Acquisition” section:

Now that I understand what is going on, I want to know if the service quality is suffering from the load.

2. Architecture

Programming Idioms uses the following infrastructure components:

These components per se are scalable: App Engine spins up as many stateless web server instances as needed (autoscaling, yay), and the Datastore has “a distributed architecture to automatically manage scaling”.

This doesn’t automagically mean that my website won’t crumble under the load. Careful design and implementation are needed to correctly leverage the scalability of the platform. My main challenge is data consistency: Programming Idioms accepts contributions from anyone without needing to create an account, and I have to take care of edit conflicts when users are modifying the same “idiom” at the same time.
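
To make the conflict handling concrete, here is a minimal sketch of one way to detect concurrent edits, using a Datastore transaction and a version counter. The type and field names (Idiom, Version, saveEdit) are assumptions for illustration, not the actual implementation.

```go
// Minimal sketch of optimistic conflict detection with a Datastore
// transaction. Names (Idiom, Version, saveEdit) are illustrative.
package storage

import (
	"context"
	"errors"

	"google.golang.org/appengine/datastore"
)

// Idiom is a document holding several snippets; Version is bumped on
// every successful edit.
type Idiom struct {
	Title   string
	Version int
}

var ErrConflict = errors.New("idiom was modified by someone else, please retry")

// saveEdit applies an edit only if the idiom has not changed since the
// user loaded it (baseVersion), all within a single transaction.
func saveEdit(ctx context.Context, key *datastore.Key, baseVersion int, newTitle string) error {
	return datastore.RunInTransaction(ctx, func(tc context.Context) error {
		var idiom Idiom
		if err := datastore.Get(tc, key, &idiom); err != nil {
			return err
		}
		if idiom.Version != baseVersion {
			return ErrConflict // a concurrent edit happened in the meantime
		}
		idiom.Title = newTitle
		idiom.Version++
		_, err := datastore.Put(tc, key, &idiom)
		return err
	}, nil)
}
```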

I designed the website using a few performance optimization techniques, with responsiveness in mind, which also happen to decrease the costs:

Frontend

  • Gzip is enabled for all responses.
  • JS is concatenated and minified into a single file, which contains jQuery, Bootstrap, Prettify, and my own JS code.
  • The JS file is included at the bottom of the HTML.
  • CSS is concatenated and minified.
  • All static files are served with a very long public cache header. Updating the static files is done through a revving technique: whenever I modify a static file, I bump the virtual folder of static assets, implicitly invalidating obsolete resources in the browser cache and server cache (see the sketch after this list).
  • The website is mostly text, with a limited number of small images.
  • All the pages are server-side rendered. Per my measurements, templating is reasonably fast and is not the bottleneck. The alternative (client-side rendering) would involve JSON marshalling, which is not free either.
  • All static resources are declared as “static_dir” in app.yaml, which means that they are served by an efficient distributed CDN, and don’t use my App Engine server instances at all.
  • The number of HTTP requests is kept small: about 10 requests for a page, 8 of which (the static assets) end up in the browser cache and don’t need to be requested again.
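
As an illustration of the revving technique mentioned above, here is a minimal sketch of how versioned static URLs can be produced in Go templates. The names (staticVersion, asset) are assumptions for illustration; the matching handler in app.yaml would declare the /static directory as a static_dir with a long expiration.

```go
// Minimal sketch of URL "revving": bumping staticVersion on deployment
// changes every static URL, which implicitly invalidates the old
// entries in the browser and server caches. Names are illustrative.
package web

import (
	"fmt"
	"html/template"
)

const staticVersion = "v42" // bump whenever a static file changes

// asset builds a versioned URL, e.g. /static/v42/css/site.min.css.
// The /static directory itself is declared as static_dir in app.yaml,
// with a long public cache expiration.
func asset(path string) string {
	return fmt.Sprintf("/static/%s/%s", staticVersion, path)
}

var pageTemplate = template.Must(template.New("page").Funcs(template.FuncMap{
	"asset": asset,
}).Parse(`
<link rel="stylesheet" href="{{asset "css/site.min.css"}}">
<script src="{{asset "js/all.min.js"}}"></script>`))
```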

Backend

  • The Datastore is a NoSQL document database. I have ~180 “idioms” containing ~3000 snippets. Each idiom is stored as one document containing several snippets. The rate of reads far exceeds the rate of writes.
  • Accessing Memcache is orders of magnitude faster than accessing the Datastore. As I have a grand total of ~2MB of data, I aggressively store all the snippets in Memcache.
  • A modification of a snippet must invalidate the snippet in the cache. Cache invalidation is reputed to be hard. My strategy consists of selectively evicting all cache entries related to the modified element, rather than flushing the whole cache on every write operation (a sketch appears further below).
  • Generating a page from an HTML template is fast, but still incurs some work. I decided to aggressively cache (in Memcache) all the HTML pages.
  • Indexing is important for search, but is not allowed to slow down the handling of a request. I delay the indexing work: it is put in a task queue and executed outside the scope of the user request.
  • I keep a history of all the successive versions of all the snippets, but writing to the history is a non-pressing concern which is also delegated to a background task queue.
  • Go is a nice choice for App Engine standard because the clones start very fast and the runtime has a small memory footprint. A single instance of the smallest class F1 is usually more than enough to serve many concurrent visitors.
  • External API calls (e.g. to the database) are usually the latency bottleneck in the processing of a web request by a serverless instance. Thus, it makes sense to leverage goroutines and waitgroups to launch several API calls concurrently, as they are network-bound, not CPU-bound. I don’t actually need to do this for the idiom view page; however, I do use concurrency to improve the text search experience (a sketch follows below).
A server instance vCPU spends a lot of its time idle, even when serving several concurrent requests, because most of the latency is spent waiting for external components. This is an example; I don’t use GCS and Pub/Sub in this project.
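
Since most of the wall-clock time is spent waiting on those external components, independent calls can be overlapped. Here is a minimal sketch of the fan-out pattern with goroutines and a sync.WaitGroup; the two search functions are hypothetical placeholders, not the actual search implementation.

```go
// Minimal sketch of launching independent, network-bound calls
// concurrently, so their latencies overlap instead of adding up.
package search

import (
	"context"
	"sync"
)

// search queries two backends concurrently; the backend names are illustrative.
func search(ctx context.Context, query string) (titleHits, fullTextHits []string) {
	var wg sync.WaitGroup
	wg.Add(2)
	go func() {
		defer wg.Done()
		titleHits = searchTitles(ctx, query) // e.g. a Datastore query
	}()
	go func() {
		defer wg.Done()
		fullTextHits = searchFullText(ctx, query) // e.g. a full-text index lookup
	}()
	wg.Wait() // total latency ≈ the slowest call, not the sum
	return titleHits, fullTextHits
}

// Placeholders standing in for the real, network-bound calls.
func searchTitles(ctx context.Context, q string) []string   { return nil }
func searchFullText(ctx context.Context, q string) []string { return nil }
```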

For further insights about Go and cloud performance see this article and this video.
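
Coming back to the caching and deferred-work strategy from the Backend list, here is a minimal sketch of a read-through Memcache cache with selective invalidation, plus deferring the history write with the delay package. Keys, types, and function names are assumptions for illustration, not the actual implementation.

```go
// Minimal sketch of a read-through cache with selective invalidation,
// plus deferring non-urgent work to a task queue. Names are illustrative.
package storage

import (
	"context"

	"google.golang.org/appengine/delay"
	"google.golang.org/appengine/memcache"
)

// getRenderedPage returns a cached HTML page, regenerating it on a miss.
func getRenderedPage(ctx context.Context, idiomID string) ([]byte, error) {
	key := "page-idiom-" + idiomID
	if item, err := memcache.Get(ctx, key); err == nil {
		return item.Value, nil // cache hit: no Datastore read, no templating
	}
	html, err := renderIdiomPage(ctx, idiomID) // Datastore read + template execution
	if err != nil {
		return nil, err
	}
	// Best effort: a failed Set only means a future cache miss.
	_ = memcache.Set(ctx, &memcache.Item{Key: key, Value: html})
	return html, nil
}

// saveHistoryLater writes the edit history outside the user request.
var saveHistoryLater = delay.Func("save-history", saveHistory)

// onIdiomSaved evicts only the cache entries related to the modified
// idiom (not the whole cache), then defers the non-urgent work.
func onIdiomSaved(ctx context.Context, idiomID string) error {
	_ = memcache.DeleteMulti(ctx, []string{
		"page-idiom-" + idiomID,
		"page-grid", // the "all idioms, all snippets" grid page
	})
	return saveHistoryLater.Call(ctx, idiomID)
}

// Placeholders standing in for the real implementation.
func renderIdiomPage(ctx context.Context, idiomID string) ([]byte, error) { return nil, nil }
func saveHistory(ctx context.Context, idiomID string) error              { return nil }
```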

App Engine comes with a generous daily free quota. In fact, my usual 10K pageviews per month rarely consume more than a few percent of the quota.

Slightly worried but not too much.

This freebie somewhat “masks” what will actually be charged once the quota is exceeded. Little did I know what I would really be facing.

But it’s not time for money considerations yet; first, I want to know if the service quality is suffering from the load.

3. Monitoring

The [ERROR] level in the logging view is not too bad. Mostly a dozen occurrences of the “not a supported language” BSOD. So far, no abnormal error rate. (From a UX perspective, I should fix this, though.)

ERROR level in the Stackdriver Logging view

The [WARNING] level in the logging view is not too bad either. It mostly consists of a 404 for a minor missing image, oops. So far, no abnormal error rate.

WARNING level in the Stackdriver Logging view

To get detailed insights about the server-side performance, I usually go to the BigQuery interactive console:

There, I discover that the logs for today are missing from BigQuery, and only much later do I realize that a configuration glitch occurred during a migration to the go111 runtime just two days earlier. Oops. I still have the logs exported to Cloud Storage, though the JSON files are a bit less convenient to dig through. jq is still a friend!

In the Latency dashboard of App Engine, I can check the 50th, 95th, and 99th percentiles before and after the traffic spike.

This looks bad but is actually pretty good, except for the 99th percentile

The median latency is always extremely low (~2ms) because the static files are served by a CDN-like system. This is good, but also not a very relevant metric.

The 95th percentile figure is much more important to me, and it turns out it’s consistently below 100ms, which is fine.

I should, however, check why the impressive turquoise line of the 99th percentile rose from 200ms before the 26th to ~1000ms after the 28th. Even if less than 1% of the requests hit that latency, a 5x jump is kind of worrying. It’s also intriguing that the degraded 99th percentile kicks in one day after the spike, and then remains durably degraded well after the spike.

A grid containing a square for each snippet

It turns out that all the slow requests are about displaying a specific page featuring a grid dynamically built out of the current state of the whole database: All idioms, all snippets.

The grid data is cached server-side, and frequently updated. 65% of the time this specific request hits the cache and is fast, 35% of the time the grid gets regenerated and the response time is high.

For some reason, the grid now takes 50% more time to serve, in the fast case (cached) and in the slow case as well. This may be related to having migrated recently to App Engine 2nd gen runtimes, though I’m not 100% sure at this point.

The homepage and the idiom detail page are fine: From the server perspective, their latency has not been degraded during the traffic spike.

4. The cleaver

A cleaver (credits)

Before the traffic spike

I receive an invoice every month. As I mentioned, the traffic (6K sessions, 10K pageviews) usually fits easily within the free quota. Thus App Engine, Datastore, and Memcache usually cost me zero.

A typical month’s cost

I do have 5GB of non-crucial files in GCS (logs), which I could delete if I want to save 11¢.

Have you noticed the $0.01 to write all my logs to BigQuery, for analytics? I highly recommend setting up this log export. BigQuery is great for log investigations.

After the traffic spike

The month of September saw a total of 44K sessions, 160K pageviews. 94% of them occurred after the 26th.

Six days after the spike, I received the invoice for September:

September 2019

The infrastructure costs didn’t drive me to bankruptcy after all.

Two things surprise me in the Out Bandwidth section:

  • 25GB seems like a lot, at first. Let’s do the math: 25GB/160K pages is 150KB per page, on average. This makes sense, as an idiom detail page view downloads 240KB gzipped (700KB uncompressed), and the homepage is slightly heavier (340KB). Subsequent pages with a warm cache are super-small: 15KB only!
  • Static files in App Engine used to be served for free, if I remember correctly. I don’t know if it was a bug or a policy, but it seems not to be the case anymore. Static files probably account for ~95% of my 25GB of egress network traffic.

Here I am, six dollars poorer, and impressed by a massive community of developers who came to have a look and stayed to contribute a snippet of their own: implementing a two-liner may look like a trivial no-brainer, but it is actually very time-consuming to write sensible code and link to the official documentation. I want to thank everyone, heartily.

If you’d like to contribute a snippet, here’s an entry point: Look at the grid, and find an empty circle in the column of a language you’re familiar with. I also encourage you to edit existing snippets to improve them.

More tech, cloud, and Go elucubrations on my Twitter! https://twitter.com/val_deleplace
