My interest in observability in Google Cloud grew largely out of working with GCP customers running workloads on GKE, and one of my very first posts here covered using Stackdriver for those workloads. The very first episode of Stack Doctor also went over what were, at the time, the “new” GKE monitoring capabilities. That was over two years ago, and there have been some great updates to those capabilities since then. I thought it was time to revisit Cloud Ops for GKE, have another look at the dashboards, and try out the new capabilities. Let’s dive in!
Right away, the new GKE monitoring dashboard looks very different from what we saw released at KubeCon in May of 2018. Instead of three tabs (infrastructure, workloads, and services), you now have lists of all the different entities in the workspace (which, as you may recall, can aggregate monitoring information from multiple projects). The first thing I thought of when I saw this was “Well, that’s great, but there is a default namespace in every cluster, so how will I ever find what I need?” As it turns out, someone has already thought of that!
There’s a really cool filtering feature that helps you find exactly what you’re looking for. For example, if you have a copy of the “frontend” service running in three different clusters, the filter lets you select the exact one you want. Once you apply the filter, the entire dashboard is filtered to match:
One of the things that’s been kind of hard to do in the past is getting an aggregated view of the data. For example, what does my resource utilization look like across my namespace? This new view makes that kind of aggregation really easy, and it scales no matter how many resources are being aggregated, even if you have thousands of pods across hundreds of namespaces. If you do want to see all of those resources, you can click View All.
That will let you see all of the entities in that category. If that list is too overwhelming, you can filter from here, too!
You can still select a row in the table to get its details. For example, you can see metrics for a pod by clicking on it, which opens the details panel:
From there, click on the Logs tab to get the logs for the pod:
If needed, you can filter logs by severity:
Click the Open in Logging button:
That opens the (new!) Logs Viewer with the query to get those logs pre-populated and executed:
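To give a feel for what such a pre-populated query looks like, here is a minimal sketch that builds a similar filter by hand. The `k8s_container` resource type, its labels, and the `severity` comparison are part of the Cloud Logging query language; the helper name and the sample cluster, namespace, and pod values are hypothetical, and the exact query the console generates may well differ.

```python
def build_pod_log_filter(cluster, namespace, pod, min_severity=None):
    """Build a Cloud Logging filter for one pod's container logs.

    Hypothetical helper: it only assembles a query string and does not
    call any Google Cloud API.
    """
    clauses = [
        'resource.type="k8s_container"',
        f'resource.labels.cluster_name="{cluster}"',
        f'resource.labels.namespace_name="{namespace}"',
        f'resource.labels.pod_name="{pod}"',
    ]
    if min_severity:
        # Severity levels compare in order, so ">=WARNING" also matches ERROR.
        clauses.append(f"severity>={min_severity.upper()}")
    # Clauses on separate lines are implicitly ANDed together.
    return "\n".join(clauses)


# Sample values below are made up for illustration.
print(build_pod_log_filter("demo-cluster", "default", "frontend-abc123", "warning"))
```

You could paste the resulting text into the query box in the Logs Viewer, which is handy when you want the same view for a different pod without clicking through the dashboard again.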
One other thing I really like is that you can easily create an Alerting Policy from the Metrics tab of the details screen:
If you’ve been following me for a while, you probably know that I don’t necessarily think it’s a good idea to alert on these kinds of infrastructure metrics, but I can absolutely envision a use case where you might need to know about, for example, containers hitting their memory limits.
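As a rough illustration of that memory-limit case, the sketch below assembles a policy definition as a plain dictionary. The field names follow the Cloud Monitoring `AlertPolicy` resource as I understand it, and `kubernetes.io/container/memory/limit_utilization` is one of the GKE system metrics. The function name, the 90% threshold, the five-minute duration, and the namespace are all assumptions for the example, not what the console generates for you.

```python
import json


def memory_limit_alert_policy(namespace, threshold=0.9, duration_s=300):
    """Return an AlertPolicy-shaped dict for container memory limit utilization.

    Hypothetical helper: it only builds the data structure and does not
    create anything in Cloud Monitoring.
    """
    metric_filter = (
        'resource.type="k8s_container" '
        'AND metric.type="kubernetes.io/container/memory/limit_utilization" '
        f'AND resource.labels.namespace_name="{namespace}"'
    )
    return {
        "displayName": f"Container memory near limit ({namespace})",
        "combiner": "OR",
        "conditions": [
            {
                "displayName": f"memory limit utilization > {threshold:.0%}",
                "conditionThreshold": {
                    "filter": metric_filter,
                    "comparison": "COMPARISON_GT",
                    "thresholdValue": threshold,
                    # The condition must hold this long before it fires.
                    "duration": f"{duration_s}s",
                    "aggregations": [
                        {
                            "alignmentPeriod": "60s",
                            "perSeriesAligner": "ALIGN_MEAN",
                        }
                    ],
                },
            }
        ],
    }


print(json.dumps(memory_limit_alert_policy("default"), indent=2))
```

Keeping the threshold and duration as parameters makes it easy to experiment with how noisy an alert like this would be before committing to it.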
I’m really happy to see the team make great progress on this experience in the last two years, and I really like where this is going. Next time, I want to approach this from a different angle by forcing an alert/incident and seeing how useful it’s going to be for troubleshooting.
Thanks for reading, and stay healthy out there!