Reducing Stackdriver’s Logging Resource Usage

danrl
Google Cloud - Community
6 min read · Feb 3, 2018

Yesterday I received an alarming email from Google informing me about the new pricing model for Stackdriver Logging and that I was exceeding the free tier limit. The Stackdriver pricing model had a rough start, including some adjustments and postponements. As of today, charging is expected to start on March 31, 2018. This means that if I want to stay within the free tier, I should not exceed 50 GB of log intake per month. That is quite a lot for my small cluster, so why would it use more than that?

First Look

I decided to take a look at how bad the situation really was.

Woah! 😱 The morning of day 2 of the month, and I am already 37 GB in? Good thing charging has not yet started. Facing reality, I moved on to drill down into where the logs come from. Since I had a good portion of log data, chances were high I’d find something in the logs, right? 😉 The resource table clearly showed me where to find the low-hanging fruit. The Month To Date (MTD) and projected End Of Month (EOM) numbers for the resource GKE Container topped everything else by orders of magnitude.

Reason 1: Google Kubernetes Engine Bug

Looking through the logs I found that there is a bug in the Kubernetes dashboard’s synchronizer. It had been firing multiple times per second for days:

09:18:54 Restarting synchronizer: kubernetes-dashboard-key-holder-kube-system.
09:18:54 Synchronizer kubernetes-dashboard-key-holder-kube-system exited with error: kubernetes-dashboard-key-holder-kube-system watch ended with timeout
09:18:54 Restarting synchronizer: kubernetes-dashboard-key-holder-kube-system.
09:18:54 Synchronizer kubernetes-dashboard-key-holder-kube-system exited with error: kubernetes-dashboard-key-holder-kube-system watch ended with timeout
09:18:54 Restarting synchronizer: kubernetes-dashboard-key-holder-kube-system.
09:18:54 Synchronizer kubernetes-dashboard-key-holder-kube-system exited with error: kubernetes-dashboard-key-holder-kube-system watch ended with timeout

This produces quite some log volume for Stackdriver to ingest, and it piles up, adding to the overall bill. It’s one of those moments where I catch myself mumbling exponential backoff…
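If you want to confirm the spam rate for yourself, tailing the pod’s recent output is enough. A minimal sketch (the pod name is the one from my cluster; yours will differ):

$ kubectl -n kube-system logs kubernetes-dashboard-768854d6dc-j26qx --tail=20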

To stop the torrent of log lines from the broken dashboard, I restarted the Kubernetes dashboard pod. The hard way, of course:

$ kubectl -n kube-system delete pod kubernetes-dashboard-768854d6dc-j26qx
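Since the dashboard is managed by a Deployment, a replacement pod comes up automatically. The pod name suffix is generated, so look it up before deleting and verify afterwards that a fresh pod is running, for example with:

$ kubectl -n kube-system get pods | grep kubernetes-dashboard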

Reason 2: Verbose Services

Note: This subsection’s data is sourced from a different cluster which did not experience the aforementioned bug but had a huge log intake for a different reason.

In another cluster I also experienced a huge intake of logs. However, there was no log spamming; this cluster was simply full of regular log lines. To find out whether some services produce significantly more log lines than others, I created a log-based metric.

This metric is basically just a counter of log lines, grouped by the resource label namespace_id. With this metric in place, I headed over to Stackdriver Monitoring and created a graph that plots the log lines per second grouped by namespace.
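For reference, such a plain counter metric can also be created from the command line. The metric name and description below are made up for illustration; the grouping by namespace_id happens later in Stackdriver Monitoring via the resource label, so the metric itself only needs a filter:

$ gcloud logging metrics create container-log-lines \
    --description="Number of GKE container log entries" \
    --log-filter='resource.type="container"'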

Obviously, this is most valuable when every service is confined to exactly one namespace. I was then able to spot the most verbose services and dig a bit deeper into them to reduce their verbosity.

Mitigation 1: Exclusion

The first solution to the high log intake problem is to take in fewer logs. How unexpected! Luckily, there is a feature for that called Exclusion. On the resources page we can create exclusion rules (filters, if you will) to reduce the log intake in a reasonable way. Reasonable here means allowing important log entries to enter the system while dropping the less useful ones.

The following rule, for example, discards all log entries of log level INFO. It is a pretty simple example; however, we are free to use all the nice operators we know from regular log filtering. Exclusions are a powerful tool!

Here is a copy’n’paste friendly version of the same rule.

resource.type="container" severity="INFO"
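Before enabling an exclusion, it can be worth previewing what it would actually drop. Assuming the Cloud SDK is set up for the project, the same filter can be run through gcloud logging read:

$ gcloud logging read 'resource.type="container" AND severity="INFO"' --limit=5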

Note that you can even sample logs by creating an exclusion filter and setting the drop rate to a value less than 100%. For my use case, an exclusion rate of 95% provides me with just enough samples to assess a past problem while keeping the log intake reasonable. During issue triage I recommend temporarily disabling exclusions, or at least adjusting them so that all related logs pass through.

Fun fact: Stackdriver logs the actions (create, delete, etc.) performed on exclusion rules, thus creating just another log source, the Log Exclusion log source. #inception

I wonder if one can create an exclusion rule for log exclusion. 🤔

Mitigation 2: Monitoring

The next log overdose mitigation technique I’d like to share uses a log-based metric to alert before things turn ugly. Stackdriver comes with some handy system metrics, which are metadata emitted by the logging system itself. One of those data points is byte_count. I use this metric in Stackdriver Monitoring to get an early warning if log intake exceeds the expected level.

Here is my policy using a Metric Threshold condition:

Let’s have a closer look at the metric threshold.

I am monitoring the resource type Log Metrics and, within it, the metric “Log bytes”.

An acceptable intake rate for me is 10 kB/s. If hit constantly, that is roughly 0.86 GB per day, which results in about 24.2 GB of total log intake in a 28-day month and about 26.8 GB in one of those longer 31-day months. Both values leave some good room for unforeseen issues and reaction time.
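I configured the policy in the Stackdriver Monitoring UI. For those who prefer configuration as code, something along these lines should also work via the (currently alpha) gcloud monitoring surface. Treat it as a sketch of my setup: it assumes the system metric behind “Log bytes” is logging.googleapis.com/byte_count, and notification channels still have to be attached separately. The policy goes into a file, say policy.yaml:

# Alert when log intake exceeds 10 kB/s sustained for 60 seconds
displayName: "Log intake above 10 kB/s"
combiner: OR
conditions:
- displayName: "Log bytes rate too high"
  conditionThreshold:
    filter: 'metric.type="logging.googleapis.com/byte_count"'
    aggregations:
    - alignmentPeriod: 60s
      perSeriesAligner: ALIGN_RATE
    comparison: COMPARISON_GT
    thresholdValue: 10000
    duration: 60s

$ gcloud alpha monitoring policies create --policy-from-file=policy.yaml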

As you can see in the graph, my cluster was way beyond that threshold for quite a while. That was the bug I described earlier, which took me some time to find. With the alert in place, the same or similar bugs will trigger a notification after a 1-minute grace period that allows for short log bursts.

Before I wrap this up, one word of caution: thresholds set too low may harm your inbox! 😅 Been there, done that.

Conclusion

Stackdriver’s warning email may sound scary, but there are ways to gain control over the log intake and also be prepared for unforeseen issues by having metrics-based alerts in place.

Originally published at danrl.com.
