A quick GKE logs primer

Ludovico Magnocavallo
Google Cloud - Community
6 min read · Jan 22, 2021

One topic that inevitably comes up when using GKE is how to leverage its logging integration with Google Cloud Operations (formerly Stackdriver). This is an important topic, not only for customers planning multi-tenant clusters and how they fit into their GCP foundations, but also for troubleshooting individual clusters or deployments.

Narrowing down the GKE logging firehose to surface audit, security, and health signals is not as hard as it looks, once you have a clear understanding of the different types of logs and resources at play. This article will give you a quick primer that will hopefully be enough to get you started and remove the initial pain of dealing with GKE logs.

One note before starting: discussing the different options for logging on GKE is outside the scope of this article, and we start from the assumption that your clusters have logging to Cloud Operations enabled (which is the default for new clusters).

Log names and resources

Before dealing with the actual logs, let’s quickly review two key fields we are going to use in log queries to isolate specific information.

My favorite field, especially when dealing with logs at a general level (e.g. when setting up organization sinks, or figuring out what information ends up where, like we’re doing here), is logName. This field is a unique handle that identifies the “resource name” of a specific log stream, and allows you to clearly and quickly isolate groups of records belonging to a single log type (audit, instance logs, etc.).

One other incredibly useful field is of course resource, and specifically the resource.type attribute, which allows filtering records emitted by a specific GCP resource (cluster, instance, etc.).
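
To make the shape of these two fields concrete, here is a minimal filter that combines them (the project and cluster names are placeholders; both fields are explored in more detail below):

logName="projects/myprj/logs/cloudaudit.googleapis.com%2Factivity"
resource.type="gke_cluster"
resource.labels.cluster_name="my-cluster"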

We will use a few other fields in our queries, and of course the GKE logging agent also includes resource and Kubernetes metadata in log records which can be used in filters, but with the basics out of the way let’s start looking at actual logs.

Activity log

First in order of importance is of course the activity log in the audit logs family, which cannot be disabled and traces all operations that change a resource state. This is the first (and often only) log that is exported to on-premises systems or stored for archival purposes, since it traces who (which identity) did what (created, modified, or deleted) on which resource, when.

The activity log is usually queried via a logName filter for a specific project:

logName="projects/myprj/logs/cloudaudit.googleapis.com%2Factivity"

Or via the log_id function:

log_id("cloudaudit.googleapis.com/activity")

For GKE, the activity log contains two separate streams of information:

  • operations done on GCP cluster resources (create a cluster, etc.)
  • operations done on Kubernetes objects “inside” clusters

While the first stream of information is common to all other GCP resources, the second one is specific to GKE, and is implemented by leveraging the Kubernetes Audit Policy. At a very simple level, any operation in a cluster that modifies a Kubernetes object will result in a log entry in the activity log, while read operations will be logged to the data access log, but only if data access logging is specifically enabled at the project level.
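
If you do enable data access audit logs, the matching read operations can be isolated with a very similar filter (a sketch only, since what gets recorded depends on which data access log types you turn on):

log_id("cloudaudit.googleapis.com/data_access")
resource.type="k8s_cluster"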

What this means in practice, given the very dynamic nature of Kubernetes clusters where controllers constantly manipulate objects, is that as soon as you start using GKE your super-important activity log balloons in size by at least an order of magnitude, with obvious repercussions for log exporting (especially where on-premises systems are the target) and analysis.

And here is where the resource.type field comes into play, allowing us to split the activity log into its two separate streams:

  • the GCP resource view, filtering on the gke_cluster resource type, and
  • the Kubernetes view, filtering on the k8s_cluster resource type

For example, if you only want to include cluster-resource level events in your audit log export, and exclude audit events internal to the clusters, you would use this filter:

log_id("cloudaudit.googleapis.com/activity")
resource.type != "k8s_cluster"

To isolate cluster-internal audit events related to Kubernetes objects, the matching filter is of course:

log_id("cloudaudit.googleapis.com/activity")
resource.type="k8s_cluster"

This distinction is fairly important, and I often see customers implementing it, as it allows you to limit the size of the audit export and make sure it only contains resource-level events, while still being able to run parallel queries or sinks for Kubernetes-specific audit events.
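
If you implement this distinction via a log sink, the same filter can be passed to gcloud. This is only a sketch: the sink name and Cloud Storage bucket are placeholders, and your destination might just as well be BigQuery or Pub/Sub:

gcloud logging sinks create gke-audit-export \
  storage.googleapis.com/my-audit-log-bucket \
  --log-filter='log_id("cloudaudit.googleapis.com/activity") AND resource.type != "k8s_cluster"'

For organization-level sinks, the same command accepts the --organization and --include-children flags.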

One more consideration needs to be made for Kubernetes audit events: the filter above selects all operations that modify a cluster object, including those made by system components. If this is not what you want (and it usually isn’t given their verbosity), an extra filter can be added on the identity that performed the operation so as to only include user or machine identities:

log_id("cloudaudit.googleapis.com/activity")
resource.type="k8s_cluster"
NOT protoPayload.authenticationInfo.principalEmail: "system"

One last useful filter allows selecting operations that resulted in authentication failures:

log_id("cloudaudit.googleapis.com/activity")
resource.type="k8s_cluster"
protoPayload.authenticationInfo.principalEmail="system:anonymous"

This concludes our overview of the activity log for GKE. If you need more granularity to surface specific events from the audit logs, the Accessing Audit Logs and Sample Queries Using the Logs Explorer pages of our documentation have several query examples. Let’s now look at more specific GKE logs.

Event logs

As you probably know, “Kubernetes events are objects that provide insight into what is happening inside a cluster, such as what decisions were made by scheduler or why some pods were evicted from the node”. Events are an important way to understand what’s going on inside a cluster, so it’s not surprising that Cloud Operations has a specific log type for them, and a dedicated page in the Kubernetes OSS documentation (from which the quote above is taken).

This filter allows isolating GKE events:

logName="projects/myprj/logs/events

Or more simply:

log_id("events")

This log is used for pretty much the same purposes as consuming Kubernetes events via kubectl, with the added benefits provided by Cloud Operations: much longer retention, searchability and aggregation via the Logs Explorer, the ability to derive custom metrics (and potentially alerts) from events, etc.
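
As an example of the latter, a log-based metric counting warning events could be created with a gcloud command along these lines (the metric name is just a placeholder):

gcloud logging metrics create gke-warning-events \
  --description="GKE warning events" \
  --log-filter='log_id("events") AND severity=WARNING'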

Some events are specific to GKE and notify administrators of potential configuration issues. For example, when the GKE service agent is unable to create a firewall rule for a new load balancer (a common occurrence in tightly controlled Shared VPC setups), an event is created and the corresponding warning is logged to the event log, so that administrators can run the equivalent commands themselves. A filter along these lines should surface those warnings (the exact message text may vary across GKE versions, so treat it as a sketch):
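
log_id("events")
severity=WARNING
jsonPayload.message:"Firewall change required"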

Node logs

Services running on GKE nodes (kubelet, node problem detector, container runtime, etc.) emit their own logs, which are captured and stored, each with an individual log name corresponding to the component name and all using the same resource type of k8s_node.

You can view aggregated node component logs by filtering on the resource type:

resource.type="k8s_node"

Or filter individual logs on their logName:

log_id("kubelet")

You can then, of course, combine different filters to narrow down to specific nodes via their instance id, node name, labels, etc. As an example, a filter like the following (the node name is a placeholder) isolates the kube-node-configurator log for a single node:
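
resource.type="k8s_node"
resource.labels.node_name="gke-cluster-1-default-pool-12345678-abcd"
log_id("kube-node-configurator")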

Container logs

Container stdout and stderr streams are captured to two separate logs, with a resource type of k8s_container:

log_id("stdout")

and

log_id("stderr")

Resource and Kubernetes labels in each log record allow you to easily drill down to specific namespaces, applications, containers, etc. This is roughly what a stdout log record looks like (abridged, with made-up values):

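{
  "logName": "projects/myprj/logs/stdout",
  "resource": {
    "type": "k8s_container",
    "labels": {
      "project_id": "myprj",
      "location": "europe-west1-b",
      "cluster_name": "cluster-1",
      "namespace_name": "default",
      "pod_name": "hello-server-5bd6b545b7-xkdvp",
      "container_name": "hello-server"
    }
  },
  "labels": {
    "k8s-pod/app": "hello-server"
  },
  "textPayload": "Server listening on port 8080",
  "severity": "INFO",
  "timestamp": "2021-01-22T10:15:30.123456789Z"
}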

Instance logs

Probably not terribly important unless you run into configuration issues, but still worth mentioning here, is the serial console log, where GKE nodes store serial console output like any other Compute Engine VM (unless of course you have explicitly disabled it).

Its filter is:

log_id("serialconsole.googleapis.com/serial_port_1_output")

And if you need to isolate logs for a specific node, you can of course use its instance id:

log_id("serialconsole.googleapis.com/serial_port_1_output")
resource.labels.instance_id="2139255898313623576"

The GKE-specific information in this log relates to the kubelet bootstrap process, and it helps you narrow down issues caused by IAM or service account misconfiguration, which sometimes result in nodes failing to register with the cluster. Not something you need to do every day, hopefully, but when you do, a query along these lines (reusing the placeholder instance id from above) will probably point you in the right direction:
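
log_id("serialconsole.googleapis.com/serial_port_1_output")
resource.labels.instance_id="2139255898313623576"
textPayload:"kubelet"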

One other instance-related log worth pointing out is the Linux auditd log, which can be optionally enabled on Container-Optimized OS based nodes.
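
Once enabled, its records can be isolated with a filter like the one below (assuming the log id used by the auditd logging DaemonSet is linux-auditd, as in the GKE documentation at the time of writing):

log_id("linux-auditd")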

We have only scratched the surface of GKE logs, but hopefully this gave you a quick head start on figuring out where to look to understand your clusters’ behaviour.
