The hidden superpowers of Stackdriver Logging

Stackdriver Logging is one of Google Cloud’s most under valued product in its portfolio. First of all it’s taken for granted, but second a lot of people don’t know of its superpowers.

It’s a bit Google’s own fault though, because some of the idiomatic client libraries don’t expose these features. But with a bit of trickery you can add them to the libraries yourself, as I did for this article on the google-cloud-go client library.

In this article I’ll be using Reactor. Reactor is a microservice written in golang that I wrote to learn about the observability features in GCP. You can use it to reproduce everything described in this article. Follow the instructions in the README to set it up locally or in a Kubernetes cluster. This article assumes installation on GKE.
traces and logs correlated, with a bonus load balancer log entry

Meta data and trace correlation

The first superpower is correlating the logs with traces. (Do you know of another product that has this ability? If you do, enlighten me!).

If you’re using OpenCensus for tracing it’s fairly easy to make this work as the current trace and span id are available in the context. Have a look at reactor’s internal logger package (func addSpan). It shows how to extract the span id from the context and add it to LogEntry that is send to the stackdriver API.

The above trace was generated by calling reactor with http://<your_host>/rr/split?molecule=[[H]]^[O] . Reactor will do a few http calls to itself, but when you go to the trace console you will notice that between the traces you see the log entries. The only thing you need to do is make sure the Trace and SpanID are added to the LogEntry.

As reactor uses OpenCensus Trace configured with Stackdriver export and header propagation it understands the Stackdriver header format. Reactor was installed on Google Kubernetes Engine and incoming calls routed through an HTTP load balancer (using k8s ingress). Because the load balancer generates the first span, OpenCensus picks this up and will continue adding spans within the load balancer trace. That is why you will also see the load balancers logs in the trace (log line 3 in the screenshot).

Correlating k8s stdout/err with custom logging

At startup time it’s impossible to log directly to Stackdriver logging because you need to initialize the client connection first. Nothing wrong with logging to stdout at that time. Because Google Kubernetes Engine aggregates the logs in Stackdriver Logging we have an opportunity to connect out custom logs with the standard logs that GKE captures. This is done using MonitoredResource.

custom logs correlated with the container stdout/err

When the cluster has the new beta Stackdriver Kubernetes Monitoring enabled the containers standard out and error will be captured under the k8s_container resource.

When we log to the same resource it will be easier to group your custom logs and stdout. It even works out of the box in the Kubernetes workloads screen. If you select one of the pods and go to View Logs you will see the custom logs and stdout/err together.

Error Reporting

Another superpower is the ability to create Error Reports from the log entries.

Have a look at the trace that’s been generated for the following call: http://<your_host>/rr/split?molecule=broken . Reactor is not really happy with the broken molecule, so it returns an error to the user and logs an error. You see the error again mixed with the traces as well as the load balancer log entry as a warning.

If you format your LogEntry in a specific way, an Error Report is created. You do this by adding a location in the source where the error happens or a complete stack-trace. For more detail on how to format the message see the documentation here: Formatting Errors in Stackdriver Logging

Error reporting is a great service. It keeps track of all your major errors including a lot of useful information.

Information like: the first time and the last time an error occurred, the number of occurrences and a detail graph of all the occurrences over the last days. It also shows all the versions this error occurred in, a helpful metric that could help you track down the first version the error happened in. It also shows some recent samples of the log entries.

An important thing to keep in mind is that not all exceptions should end up in Error Reporting. If an error occurs in your code and it can be handled, don’t log it as such, otherwise the error reporting will be full of errors and it becomes useless. So be wise on when logging your errors.

InsertId

When designing a API, you need to be aware that you don’t give away too much information to your api clients when an error occurs. If it can help the API user help to correct mistakes, by all means return that information but make sure not to leak sensitive information.

In the case where you can’t return sensitive information through the API you can still log that and return a unique InsertId that privileged users can use to search the log.

This doesn’t give you carte blanche to log everything though. Still think on what information you log to protect the privacy of your user.

BigQuery

If we are talking about superpowers, we should talk about the biggest superpower of them all: BigQuery. BigQuery’s predecessor was created internally at Google to process its huge amount of logs, so why shouldn’t we make use of this to process our own logs?

The Logging interface makes it easy to export the logs to BigQuery, Cloud Storage or Pub/Sub.

The exported logs generate a partitioned table per log with structured rows. BigQuery has the ability to have different repeated structures and columns within a single row, making it very easy to represent rich log entries as structured data. As a bonus you don’t have to worry about Cloud Logging’s retention policy as BigQuery can keep your logs forever.

Open Census Logging

I think a user still has to do too much plumbing to get most of the logging superpowers though, but I have hope: Currently the Open Census Logging spec cycle is ongoing and hope that it will bring the tools to the user where they don’t need to think about the plumbing.

Here is my list wishlist for the Open Census Logging spec:

  • Set the resource (MonitoredResource) at startup, preferably autodetect or provide clear guidance on how to set it (eg. standard environment variables).
  • Set the application name and version at startup.
  • Request Context aware.
  • Span Context aware.
  • Access to the InsertId so it can be returned to the user.
  • User aware, but having the possibility to obfuscate the user (pluggable).
  • Structured formatted logging with named parameters, these parameters could then be exposed as labels in the log entries. Making it easy to search on other values then a user.
  • Source Code location awareness

Let’s hope the people doing the specs will read this article and not take away my ability to use the features by locking them away behind an abstraction.


Have fun exploiting the superpowers, it will make the people that run your code in production happy.