Non-Trivial Kubernetes — Monitoring the Cluster

Ethan Steinman
Nov 5 · 4 min read

At Infinia ML, we’re building a machine learning management solution based around Kubernetes. Understanding the health of the Kubernetes cluster is critical to providing a product our customers can rely on.

This article will cover how we monitor our Kubernetes clusters and their underlying infrastructure. It’s written for someone who’s familiar with both Kubernetes and the monitoring landscape, but may not be an expert in either. I’m explicitly going to avoid covering monitoring workloads running in the cluster because so many great tools and guides already exist. Check out a few of my favorites from Sysdig, Sensu and Datadog.

Can the cluster monitor itself?

Running a service like Sensu or the Node Problem Detector in the cluster is frequently recommended as a way to monitor the cluster itself. But there are two problems with relying on in-cluster services like this. First, the nodes or cluster may fail in a way that prevents the monitoring tool from running or alerting about a problem. Second, testing load balancers, ingress, and other networking components is harder to do from within the cluster.

To be fair, I think running tools like these can be useful for data collection because they add another layer of insight about past outages and can hopefully catch issues before they become future outages. Just be careful not to rely too heavily on them.

Monitoring core Kubernetes components

The “cluster” here refers to the control plane components, etcd, the kubelets, and kube-proxies. Since these components can’t reliably be monitored by in-cluster services, they need to be monitored using more traditional methods. That means running tools like Sensu or Prometheus outside the cluster on their own infrastructure. Alternatively, cloud-based monitoring tools like Datadog or SignalFx can be used if your budget and business cases allow it.

Regardless of the tool, the list of things to monitor isn’t long or complicated. Check that the various processes are running and not crash looping, check health endpoints where available, and scrape the Prometheus metrics endpoint that each component exposes. You’ll want to pay special attention to error rates and long request timings from the API server, controller manager, and scheduler, as well as queue length for the controller manager. For etcd, check that there is a leader and that the leader isn’t changing frequently. Also watch etcd for failed or pending proposals.
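As a sketch, the basic checks can be as simple as curling health endpoints and grepping the metrics etcd exposes. The hosts and ports below are assumptions for a typical control-plane node (adjust for your deployment); `etcd_server_has_leader` is one of etcd’s standard Prometheus metrics.

```shell
#!/bin/sh
# Return success if the given etcd /metrics text reports a leader.
# etcd_server_has_leader is 1 when this member sees a leader, 0 otherwise.
etcd_has_leader() {
  printf '%s\n' "$1" | grep -q '^etcd_server_has_leader 1$'
}

# Example poll (commented out -- needs live endpoints):
# curl -sk https://127.0.0.1:6443/healthz          # kube-apiserver health
# metrics=$(curl -s http://127.0.0.1:2379/metrics) # etcd Prometheus metrics
# etcd_has_leader "$metrics" || echo "ALERT: etcd member sees no leader"
```

The same grep-and-compare pattern extends to the other metrics mentioned above (leader changes, failed and pending proposals).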

My preference is not to page on-call engineers aggressively on these checks and metrics when a single node has issues. The cluster should be able to reschedule workloads onto other nodes and keep everything in a healthy state. In a small cluster, where there isn’t capacity to reschedule everything or where etcd leader election becomes impossible, fire alerts on those conditions rather than on the failing node(s).
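To make that paging rule concrete, here’s a minimal sketch assuming a simple capacity model: you know the minimum number of Ready nodes required to hold all workloads (a cluster-specific number you’d derive from your own sizing; the function name is illustrative).

```shell
#!/bin/sh
# Page only when the cluster can no longer absorb the node loss.
# $1 = currently Ready nodes, $2 = minimum Ready nodes needed to
# reschedule everything (an assumed, cluster-specific threshold).
should_page() {
  [ "$1" -lt "$2" ]
}

# e.g. feed it the count from: kubectl get nodes --no-headers | grep -c ' Ready'
```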

Monitoring the infrastructure

The strategy for monitoring the VMs or servers that back the Kubernetes cluster isn’t different from monitoring the cluster itself. Use monitoring tools that aren’t running on, or co-located with, the cluster components. If the VMs are cloud-based, the cloud provider will likely provide metrics and health checks that work in most cases. The focus here should be on basic resource utilization (CPU, RAM, network bandwidth, etc.) and watching for ports opened by stray NodePort or LoadBalancer services.
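For the stray-port check, one lightweight approach (a sketch, not a specific product’s feature) is to compare what your monitoring sees open on the nodes against what Kubernetes says should be open, e.g. by filtering `kubectl get svc -A` output:

```shell
#!/bin/sh
# Read `kubectl get svc -A --no-headers` style lines on stdin
# (NAMESPACE NAME TYPE ...) and print services that open node-level ports.
node_port_services() {
  awk '$3 == "NodePort" || $3 == "LoadBalancer" { print $1 "/" $2 }'
}

# Usage (needs a cluster): kubectl get svc -A --no-headers | node_port_services
```

Any open node port that doesn’t trace back to one of these services is worth investigating.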

Cloud providers also generally provide a channel for communicating maintenance activities. These channels need to be watched so nodes can be gracefully removed from the cluster prior to being taken down by the cloud provider. For this, we follow the AWS recommendation of using an SNS topic to email us about maintenance activities.
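When a maintenance notice arrives, the graceful removal itself is ordinary kubectl: cordon, drain, delete. A hedged sketch (the flags are the common defaults; tune them for your workloads, and note that `--delete-emptydir-data` was `--delete-local-data` on older kubectl versions):

```shell
#!/bin/sh
# Gracefully remove a node before the cloud provider takes it down.
drain_node() {
  node=$1
  kubectl cordon "$node" || return 1
  kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data --timeout=5m || return 1
  kubectl delete node "$node"
}
```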

In addition to VMs, most clusters will have one or more load balancers sitting in front of them. These should also be monitored with tools that aren’t co-located with the cluster. Monitor total request and error rates at the load balancer and compare them against the same metrics at the ingress gateway to ensure traffic isn’t getting lost between the two. Lastly, this is a good place to check that the various sites or APIs hosted inside the cluster are reachable from the outside world.
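The load-balancer-versus-ingress comparison boils down to checking that the two request counters agree within some tolerance. A sketch, where the counts would come from your monitoring tool and the tolerance is an assumption you’d tune:

```shell
#!/bin/sh
# Succeed (i.e. alert) when the load balancer saw meaningfully more
# requests than the ingress gateway over the same window.
# $1 = LB request count, $2 = ingress request count, $3 = tolerance.
requests_lost() {
  [ $(($1 - $2)) -gt "$3" ]
}

# External reachability check for a hosted site (URL is a placeholder):
# curl -fsS -m 10 -o /dev/null https://app.example.com/healthz
```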

Who watches the watcher?

So what happens when that sweet monitoring stack crashes? The easiest solution is to give it a heartbeat; most alerting tools, like PagerDuty and OpsGenie, provide this functionality. Another easy option is to have the in-cluster monitoring tools monitor the out-of-cluster monitoring tools, and vice versa. With either approach, alerts fire whenever a monitoring tool has an issue.
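The heartbeat is just a periodic ping from the monitoring host to the alerting service, which pages when the pings stop. A sketch, with a placeholder URL standing in for whatever heartbeat integration your alerting tool provides:

```shell
#!/bin/sh
# Ping the alerting service's heartbeat endpoint; meant to run from cron
# every minute. If this host (and the monitoring stack on it) dies, the
# pings stop and the alerting service pages on the missed heartbeat.
send_heartbeat() {
  curl -fsS -m 10 -o /dev/null "$1"
}

# crontab entry (URL is a placeholder):
# * * * * * curl -fsS -m 10 -o /dev/null https://heartbeat.example.com/ping/MONITOR_ID
```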

Wrapping Up

I hope this helps you architect a reliable Kubernetes cluster you can count on for production workloads. Next up in this series, we’ll cover how we manage our YAML files and deploy them to our various clusters.

Ethan Steinman, Software Engineer @ Infinia ML
