Log-based Alerting in GCP

Published in Appsbroker CTS Google Cloud Tech Blog, Apr 5, 2022 (5 min read)

Source: https://cloud.google.com/products/operations

Introduction

One of my favourite things in the cloud is the tooling that makes observing our services and systems so much more straightforward. As a former ops guy, I have spent many hours staring at dashboards, scouring through logs, and fighting with tooling such as SCOM, Nagios (including the checkmk flavour), and SolarWinds, often in combination, to achieve good observability and assist in troubleshooting.

GCP Tooling

In GCP, the tooling that provides this is called the “Operations Suite”, sometimes still referred to by its old name, “Stackdriver”. It comprises multiple tools (see Priyanka’s brilliant sketchnote below), but in my experience the two most commonly used are “Cloud Monitoring” and “Cloud Logging”. As a top-level summary, Cloud Monitoring is used to explore metrics, build dashboards, and create alerting policies. Cloud Logging instead focuses on aggregating logs from a variety of sources, categorising them, and making them easily searchable.

Source: https://thecloudgirl.dev/ops.html

The topic I want to explore today is where these two tools intersect, and some new functionality that has the potential to massively simplify how they work together: Log-Based Alerting (in preview at the time of writing, April 2022).

Cloud Monitoring works with metrics (or dare I say SLIs): things like CPU, RAM, and disk utilisation, response latency, data or request throughput, and many, many more. These are incredibly useful in determining what SLOs and SLAs should be, and for tracking errors, for example HTTP 500 responses received by a load balancer. Cloud Monitoring is also where alert policies and notification channels are defined.

Source: https://cloud.google.com/monitoring/charts/metrics-selector

Cloud Logging works with logs, storing them in a structure that is easily searchable. It can receive logs from a multitude of sources, including from within GCE instances that have the Ops Agent installed.

Why Log-Based Alerting?

So why might I want these two tools to work together? Simply put, when I want to take log information and either be notified about it or build a metric from it. Let’s explore this second reason first. For a long time, there has existed a mechanism to create a log-based metric in GCP, with two sub-options: counter metrics and distribution metrics. I have only ever used the former, but it has proved very useful in a couple of scenarios, namely:

Counting log entries from a Cloud Function that could sometimes be delayed due to resource contention (a print statement was written every time a delay was experienced)
Counting error messages received from ‘legacy’ Windows-based apps that write to Event Viewer
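To make the counter option more concrete, here is a minimal sketch of what such a metric definition might look like as a request body. It only builds a plain dictionary; the field names follow my understanding of the Cloud Logging LogMetric resource, and the metric name and filter are hypothetical, so check both against the current documentation and your own logs before relying on them.

```python
import json


def build_counter_metric(name: str, log_filter: str, description: str = "") -> dict:
    """Build a dict shaped like a Cloud Logging LogMetric (counter type)."""
    return {
        "name": name,
        "description": description,
        # Only log entries matching this filter are counted.
        "filter": log_filter,
        # A counter metric records how many matching entries arrived,
        # so it is modelled as a DELTA metric with integer values.
        "metricDescriptor": {"metricKind": "DELTA", "valueType": "INT64"},
    }


# Hypothetical name and filter: adjust to however your instances ship their logs.
legacy_errors = build_counter_metric(
    name="legacy_app_errors",
    log_filter='resource.type="gce_instance" AND severity>=ERROR',
    description="Errors written to Event Viewer by a legacy app",
)

print(json.dumps(legacy_errors, indent=2))
```

A distribution metric would need a value extractor and bucket options on top of this, which is part of why the counter variant is the easier one to start with.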

In the second scenario above, I used the log-based metric as an intermediary: I created an alert policy on the metric that notified me whenever the count exceeded zero. This works reasonably well; however, I had a couple of gripes with this approach:

Having to manage both logging (where the metric is created) and monitoring (where the alert policy is defined) isn’t super seamless (and requires two different Terraform resources, with one dependent on the other!)
With a counter log-based metric, when no matching logs arrive in a given period, no data point is written to the metric rather than a zero. This has the unfortunate consequence that once the alert policy threshold is breached and an incident is created, it won’t resolve until the auto-close duration is met, which by default is 7 days! (Thankfully, this can be reduced to 30 minutes.)
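The second half of this older approach, the alert policy that sits on top of the metric, could be sketched roughly as follows. Again this only builds a dictionary, with field names based on my reading of the Cloud Monitoring AlertPolicy resource; the metric name and notification channel path are hypothetical placeholders. The auto-close value is set to 30 minutes to work around the incident-resolution issue described above.

```python
def build_metric_alert_policy(
    metric_name: str, channel: str, auto_close_seconds: int = 1800
) -> dict:
    """Build a dict shaped like an AlertPolicy on a user-defined log-based metric."""
    # User-defined log-based metrics are exposed under this metric type prefix.
    metric_type = f"logging.googleapis.com/user/{metric_name}"
    return {
        "displayName": f"{metric_name} above zero",
        "combiner": "OR",
        "conditions": [
            {
                "displayName": f"{metric_name} count > 0",
                "conditionThreshold": {
                    "filter": f'metric.type="{metric_type}" AND resource.type="gce_instance"',
                    "comparison": "COMPARISON_GT",
                    "thresholdValue": 0,
                    # Fire as soon as a single aligned point exceeds the threshold.
                    "duration": "0s",
                    "aggregations": [
                        {"alignmentPeriod": "60s", "perSeriesAligner": "ALIGN_SUM"}
                    ],
                },
            }
        ],
        # With no data points (rather than zeros) arriving afterwards, the
        # incident only closes once this auto-close duration has elapsed.
        "alertStrategy": {"autoClose": f"{auto_close_seconds}s"},
        "notificationChannels": [channel],
    }


# Hypothetical project and channel ID, for illustration only.
policy = build_metric_alert_policy(
    "legacy_app_errors", "projects/my-project/notificationChannels/1234567890"
)
print(policy["alertStrategy"])
```

Note that the policy cannot exist before the metric it filters on, which is the dependency between the two resources mentioned in the first gripe.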

What’s new with Log-Based Alerting?

So, having discussed the old method, what’s new? Well, Log-Based Alerts are a new mechanism for defining alerts based directly on logs. The blog post announcing the feature covers it in more depth, as does a video from the product team.
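For comparison with the metric-based route, a log-based alert needs only one resource: an alert policy whose condition matches log entries directly. The sketch below builds such a policy as a plain dictionary. The field names reflect my understanding of the AlertPolicy resource (in particular `conditionMatchedLog` and `notificationRateLimit`), and the filter and channel path are made up for illustration, so treat it as a starting point rather than a reference.

```python
def build_log_alert_policy(
    display_name: str, log_filter: str, channel: str, rate_limit_seconds: int = 300
) -> dict:
    """Build a dict shaped like an AlertPolicy with a log-match condition."""
    return {
        "displayName": display_name,
        "combiner": "OR",
        "conditions": [
            {
                "displayName": "Log match condition",
                # The condition is a Cloud Logging filter, not a metric threshold,
                # so no intermediate log-based metric is required.
                "conditionMatchedLog": {"filter": log_filter},
            }
        ],
        "alertStrategy": {
            # Minimum time between notifications for this policy.
            "notificationRateLimit": {"period": f"{rate_limit_seconds}s"},
            "autoClose": "1800s",
        },
        "notificationChannels": [channel],
    }


# Hypothetical filter and channel, for illustration only.
log_policy = build_log_alert_policy(
    display_name="Legacy app error logged",
    log_filter='resource.type="gce_instance" AND severity>=ERROR',
    channel="projects/my-project/notificationChannels/1234567890",
)
print(log_policy["conditions"][0])
```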

This makes things much easier, and I thought I had found the perfect use case with a recent customer who wanted to be notified about certain user and system actions that are logged. Unfortunately, after some testing, we found some limitations that made it unsuitable:

We needed to be notified every time the entry appeared, but currently the minimum time between notifications is five minutes. I appreciate this might be to prevent overloading some notification channels (like email), but our GCP target was Pub/Sub (before being pushed into Splunk), which could easily handle the throughput without issue
Our preference was to use log sinks to redirect all the logs throughout the org into a logging bucket; unfortunately, we found we couldn’t alert on logs stored within this bucket, so policies would have had to be defined in each project. This means that you can’t define log-based alerts at an org level without ‘cheating’ and defining potentially hundreds (if not thousands in some larger orgs!) of alerts at the project level
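To illustrate why the first limitation mattered for us, here is a deliberately simplified model of a minimum-gap rate limit. It is not how the service is implemented internally; it just assumes that a notification is sent only if at least five minutes have passed since the previous one, and shows how many matching entries would go unannounced.

```python
def notifications_sent(entry_times: list[int], min_gap_seconds: int = 300) -> list[int]:
    """Return the entry timestamps (in seconds) that would trigger a notification.

    Simplified assumption: an entry notifies only if at least
    `min_gap_seconds` have passed since the last notification.
    """
    sent = []
    last_sent = None
    for t in sorted(entry_times):
        if last_sent is None or t - last_sent >= min_gap_seconds:
            sent.append(t)
            last_sent = t
    return sent


# Seven matching log entries over fifteen minutes (timestamps in seconds).
entries = [0, 10, 45, 299, 300, 310, 900]
sent = notifications_sent(entries)
print(f"{len(entries)} entries, {len(sent)} notifications at {sent}")
# prints "7 entries, 3 notifications at [0, 300, 900]"
```

Under this model, four of the seven entries never produce a notification, which is exactly the gap we could not accept when every entry needed to reach Splunk.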

Concluding thoughts

I really like the idea behind log-based alerting. It addresses a common issue, especially when working with ‘legacy’ systems that perhaps can’t be modified to write metrics indicating health directly into Cloud Monitoring. The new solution is much simpler to configure, but it still has some quirks which unfortunately prevented its use for my particular use case. But perhaps it’s ideal for yours! If these are the sort of challenges you find interesting, at CTS we are constantly looking for people who share our passion. Otherwise, until next time, keep it Googley ;)

About CTS:

CTS is the largest dedicated Google Cloud practice in Europe and one of the world’s leading Google Cloud experts, winning 2020 Google Partner of the Year Awards for both Workspace and GCP.

We offer a unique full stack Google Cloud solution for businesses, encompassing cloud migration and infrastructure modernisation. Our data practice focuses on analysis and visualisation, providing industry-specific solutions for Retail, Financial Services, Media and Entertainment.

We’re building talented teams ready to change the world using Google technologies. So if you’re passionate, curious and keen to get stuck in — take a look at our Careers Page and join us for the ride!