If you haven’t read Part I of this series, you can find it here.
In Part I of this series, we looked at how to access the Grafana web UI and dug deeper into the etcd dashboard. In this post we will take a closer look at the Kubernetes/Compute Resources/Cluster dashboard, second on the list of default dashboards provided out of the box with your OpenShift 4 cluster.
Before we begin, it helps to know where the metrics come from. Some are provided by kube-state-metrics, which generates metrics from the Kubernetes API and focuses on objects such as deployments, nodes and pods. Others are provided by cAdvisor, which is embedded in the kubelet and reports the metrics exposed by the underlying Linux cgroup implementation. Together, these let us monitor the node resources used by the containers running on each node.
You might also wonder what the kubernetes-mixin tag to the right of the dashboard name means. It is an upstream project that provides some standard Prometheus alerts, rules and Grafana dashboards. You can read more about it here and here.
Kubernetes/Compute Resources/Cluster Dashboard
This dashboard gives a good overview of the CPU, memory and network metrics of the cluster.
Headlines
1. CPU Utilization
There are 10 CPU modes (user, system, nice, idle, iowait, guest, guest_nice, steal, soft_irq and irq). This panel uses the node_exporter metric node_cpu_seconds_total to track the CPU utilization of your cluster. Rather than summing all modes except idle, the query takes the equivalent shortcut of subtracting the average idle fraction from 1:
1 - avg(rate(node_cpu_seconds_total{mode="idle", cluster="$cluster"}[1m]))
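To make the arithmetic concrete, here is a minimal Python sketch of what the query computes: the per-second increase of each CPU's idle counter over the window, averaged across CPUs and subtracted from 1. The function name and the sample values are hypothetical, and the sketch ignores the counter-reset handling and extrapolation that Prometheus' rate() performs.

```python
def cpu_utilization(idle_samples, window_seconds):
    """Approximate 1 - avg(rate(idle counter)) for a set of CPUs.

    idle_samples maps a CPU label to (first, last) readings of its idle
    counter (in seconds), taken window_seconds apart.
    """
    idle_rates = [
        (last - first) / window_seconds  # fraction of the window spent idle
        for first, last in idle_samples.values()
    ]
    return 1 - sum(idle_rates) / len(idle_rates)


# Hypothetical readings for two CPUs over a 60-second window.
samples = {"cpu0": (1000.0, 1045.0), "cpu1": (2000.0, 2030.0)}
utilization = cpu_utilization(samples, 60)
```

Here cpu0 was idle for 75% of the window and cpu1 for 50%, so the cluster-wide utilization comes out at 37.5%.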
2. CPU Requests Commitment
This is calculated using the total number of requested cores by all containers divided by the cluster’s allocatable CPU cores.
The kube-state-metrics metrics used here are kube_pod_container_resource_requests_cpu_cores and kube_node_status_allocatable_cpu_cores.
sum(kube_pod_container_resource_requests_cpu_cores{cluster="$cluster"}) / sum(kube_node_status_allocatable_cpu_cores{cluster="$cluster"})
3. CPU Limits Commitment
This takes the sum of limits on CPU cores that can be used by all containers divided by the sum of the cluster’s allocatable CPU cores.
The kube-state-metrics metrics used here are kube_pod_container_resource_limits_cpu_cores and kube_node_status_allocatable_cpu_cores.
sum(kube_pod_container_resource_limits_cpu_cores{cluster="$cluster"}) / sum(kube_node_status_allocatable_cpu_cores{cluster="$cluster"})
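The requests and limits commitment panels share the same arithmetic: a cluster-wide sum divided by the total allocatable cores. A small Python sketch with made-up numbers (the helper name and all values are assumptions for illustration) shows how limits can exceed 100% while requests stay well below it:

```python
def commitment(per_container_cores, allocatable_per_node):
    """Ratio of requested (or limited) cores to allocatable cores."""
    return sum(per_container_cores) / sum(allocatable_per_node)


# Hypothetical cluster: two nodes with 4 allocatable cores each.
allocatable = [4.0, 4.0]
requests = [0.5, 0.25, 1.0]  # CPU requests of three containers
limits = [1.0, 2.0, 8.0]     # CPU limits of the same containers

requests_commitment = commitment(requests, allocatable)  # 1.75 / 8
limits_commitment = commitment(limits, allocatable)      # 11 / 8
overcommitted = limits_commitment > 1
```

A limits commitment above 1 means the containers could, in total, be allowed more CPU than the cluster can actually provide.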
4. Memory Utilization
The cluster’s memory utilization is calculated by dividing the sum of all nodes’ available memory by the total allocatable memory of the cluster and subtracting the result from 1.
1 - sum(:node_memory_MemAvailable_bytes:sum{cluster="$cluster"}) / sum(kube_node_status_allocatable_memory_bytes{cluster="$cluster"})
5. Memory Requests Commitment
This takes the sum of containers’ requested memory bytes (given by the kube-state-metrics metric kube_pod_container_resource_requests_memory_bytes) and divides that by the total allocatable memory of the cluster.
sum(kube_pod_container_resource_requests_memory_bytes{cluster="$cluster"}) / sum(kube_node_status_allocatable_memory_bytes{cluster="$cluster"})
6. Memory Limits Commitment
This is the sum of the limits on the amount of memory that can be used by containers (given by the kube-state-metrics metric kube_pod_container_resource_limits_memory_bytes) divided by the total allocatable memory of the cluster.
sum(kube_pod_container_resource_limits_memory_bytes{cluster="$cluster"}) / sum(kube_node_status_allocatable_memory_bytes{cluster="$cluster"})
CPU
1. CPU Usage by namespace
This panel shows you the CPU usage by namespace:
sum(node_namespace_pod_container:container_cpu_usage_seconds_total:sum_rate{cluster="$cluster"}) by (namespace)
It uses the metric node_namespace_pod_container:container_cpu_usage_seconds_total:sum_rate, which is given by:
sum by (namespace, pod, container) (rate(container_cpu_usage_seconds_total{job="kubelet", image!="", container!="POD"}[5m])) * on (namespace, pod) group_left(node) max by(namespace, pod, node) (kube_pod_info)
2. CPU Quota by namespace
The CPU quota panel shows you the following metrics for each namespace:
- Pods
This is a counter for the number of pods running in a namespace. If you hover over the pod count, it shows ‘Drill down to pods’. Click on it and it takes you to another dashboard Kubernetes/Compute Resources/Namespace (Pods), filtered by that namespace.
This pod count is given by the query count(mixin_pod_workload{cluster="$cluster"}) by (namespace), where mixin_pod_workload is a recording rule that maps each pod to the workload that owns it (a Deployment via its ReplicaSet, a DaemonSet or a StatefulSet):
- expr: |
    max by (cluster, namespace, workload, pod) (
      label_replace(
        label_replace(
          kube_pod_owner{job="kube-state-metrics", owner_kind="ReplicaSet"},
          "replicaset", "$1", "owner_name", "(.*)"
        ) * on(replicaset, namespace) group_left(owner_name) topk by(replicaset, namespace) (
          1, max by (replicaset, namespace, owner_name) (
            kube_replicaset_owner{job="kube-state-metrics"}
          )
        ),
        "workload", "$1", "owner_name", "(.*)"
      )
    )
  labels:
    workload_type: deployment
  record: mixin_pod_workload
- expr: |
    max by (cluster, namespace, workload, pod) (
      label_replace(
        kube_pod_owner{job="kube-state-metrics", owner_kind="DaemonSet"},
        "workload", "$1", "owner_name", "(.*)"
      )
    )
  labels:
    workload_type: daemonset
  record: mixin_pod_workload
- expr: |
    max by (cluster, namespace, workload, pod) (
      label_replace(
        kube_pod_owner{job="kube-state-metrics", owner_kind="StatefulSet"},
        "workload", "$1", "owner_name", "(.*)"
      )
    )
  labels:
    workload_type: statefulset
  record: mixin_pod_workload
Note: At the time of writing, static pods running in the namespace are not included in the pod count.
- Workloads
Similarly, the workload count is given by the query count(avg(mixin_pod_workload{cluster="$cluster"}) by (workload, namespace)) by (namespace). Clicking on it will take you to the dashboard Kubernetes/Compute Resources/Namespace (Workloads), filtered by namespace.
- CPU Usage
CPU usage by namespace is given by:
sum(node_namespace_pod_container:container_cpu_usage_seconds_total:sum_rate{cluster="$cluster"}) by (namespace)
where node_namespace_pod_container:container_cpu_usage_seconds_total:sum_rate uses the metric reported by cAdvisor:
sum by (cluster, namespace, pod, container) (rate(container_cpu_usage_seconds_total{job="kubelet", metrics_path="/metrics/cadvisor", image!="", container!="POD"}[5m])) * on (cluster, namespace, pod) group_left(node) topk by (cluster, namespace, pod) (1, max by(cluster, namespace, pod, node) (kube_pod_info{node!=""}))
- CPU Requests
This takes the total number of requested cores (given by the kube-state-metrics metric kube_pod_container_resource_requests_cpu_cores) across all containers in the cluster and groups them by namespace.
sum(kube_pod_container_resource_requests_cpu_cores{cluster="$cluster"}) by (namespace)
- CPU Requests %
This is the CPU usage divided by total number of requested cores, for each namespace.
sum(node_namespace_pod_container:container_cpu_usage_seconds_total:sum_rate{cluster="$cluster"}) by (namespace) / sum(kube_pod_container_resource_requests_cpu_cores{cluster="$cluster"}) by (namespace)
- CPU Limits
This sums the total limits on CPU cores (given by the kube-state-metrics metric kube_pod_container_resource_limits_cpu_cores) that can be used by containers in each namespace.
sum(kube_pod_container_resource_limits_cpu_cores{cluster="$cluster"}) by (namespace)
- CPU Limits %
This takes the CPU usage divided by the total limit on CPU cores that can be used by all containers, for each namespace.
sum(node_namespace_pod_container:container_cpu_usage_seconds_total:sum_rate{cluster="$cluster"}) by (namespace) / sum(kube_pod_container_resource_limits_cpu_cores{cluster="$cluster"}) by (namespace)
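The two percentage columns above divide one per-namespace sum by another. A rough Python sketch of that pattern follows; the helper names and sample values are hypothetical. It also illustrates a consequence of PromQL's one-to-one vector matching: a namespace that has usage but no requests (or limits) set has nothing to match against, so it drops out of the result.

```python
from collections import defaultdict


def sum_by_namespace(samples):
    """Mimic sum(...) by (namespace) over (namespace, value) samples."""
    totals = defaultdict(float)
    for namespace, value in samples:
        totals[namespace] += value
    return dict(totals)


def ratio_by_namespace(numerator, denominator):
    """Keep only namespaces present on both sides, as vector matching does.

    PromQL would return +Inf for a zero denominator; this sketch simply
    skips such namespaces to avoid a ZeroDivisionError.
    """
    return {
        ns: numerator[ns] / denominator[ns]
        for ns in numerator
        if ns in denominator and denominator[ns] != 0
    }


# Hypothetical per-container CPU usage (cores) and CPU requests.
usage = [("shop", 0.25), ("shop", 0.25), ("infra", 0.25), ("scratch", 0.125)]
requests = [("shop", 1.0), ("infra", 0.5)]

requests_pct = ratio_by_namespace(
    sum_by_namespace(usage), sum_by_namespace(requests)
)
```

In this example the "scratch" namespace has usage but no requests, so it does not appear in requests_pct.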
Memory
1. Memory Usage w/o cache
This panel shows the cluster’s memory usage as the resident set size (RSS) of containers, summed per namespace.
sum(container_memory_rss{cluster="$cluster", container!=""}) by (namespace)
2. Memory Requests by namespace
The memory requests panel shows you the following metrics for each namespace:
- Pods
See the pod count in the CPU Quota section above.
- Workloads
See the workload count in the CPU Quota section above.
- Memory Usage
We looked at this metric in the previous panel.
sum(container_memory_rss{cluster="$cluster", container!=""}) by (namespace)
- Memory Requests
This sums the kube-state-metrics metric kube_pod_container_resource_requests_memory_bytes, which reports the number of memory bytes requested by a container, and groups the result by namespace.
sum(kube_pod_container_resource_requests_memory_bytes{cluster="$cluster"}) by (namespace)
- Memory Requests %
This takes the memory usage of the cluster (size of RSS) and divides that by the total requested memory of containers, for each namespace.
sum(container_memory_rss{cluster="$cluster", container!=""}) by (namespace) / sum(kube_pod_container_resource_requests_memory_bytes{cluster="$cluster"}) by (namespace)
- Memory Limits
This takes the kube-state-metrics metric kube_pod_container_resource_limits_memory_bytes to monitor the total limit on the amount of memory that can be used by containers per namespace.
sum(kube_pod_container_resource_limits_memory_bytes{cluster="$cluster"}) by (namespace)
- Memory Limits %
This takes the memory usage of the cluster (size of RSS) and divides that by the total limit on the amount of memory that can be used by containers per namespace.
sum(container_memory_rss{cluster="$cluster", container!=""}) by (namespace) / sum(kube_pod_container_resource_limits_memory_bytes{cluster="$cluster"}) by (namespace)
Network
1. Current Network Usage by namespace
- Current Receive Bandwidth
The following two queries give the network utilization for all pods by namespace, measured in bytes per second.
sum(irate(container_network_receive_bytes_total{cluster="$cluster", namespace=~".+"}[$interval])) by (namespace)
- Current Transmit Bandwidth
sum(irate(container_network_transmit_bytes_total{cluster="$cluster", namespace=~".+"}[$interval])) by (namespace)
- Rate of Received Packets
The following two queries show the network utilization for all pods by namespace, measured in packets per second.
sum(irate(container_network_receive_packets_total{cluster="$cluster", namespace=~".+"}[$interval])) by (namespace)
- Rate of Transmitted Packets
sum(irate(container_network_transmit_packets_total{cluster="$cluster", namespace=~".+"}[$interval])) by (namespace)
- Rate of Received Packets Dropped
Dropped packets are a close proxy for network saturation, and the following two queries report the rate at which they occur.
sum(irate(container_network_receive_packets_dropped_total{cluster="$cluster", namespace=~".+"}[$interval])) by (namespace)
- Rate of Transmitted Packets Dropped
sum(irate(container_network_transmit_packets_dropped_total{cluster="$cluster", namespace=~".+"}[$interval])) by (namespace)
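All of the network queries above rely on irate(), which looks only at the last two samples inside the range window and therefore reacts quickly to bursts. The Python sketch below is a simplified model of that behaviour, not Prometheus' actual implementation; the function name and the sample values are made up for illustration.

```python
def irate(samples):
    """Per-second rate from the last two (timestamp, value) samples."""
    if len(samples) < 2:
        return None  # not enough data points in the window
    (t_prev, v_prev), (t_last, v_last) = samples[-2], samples[-1]
    # A drop in the counter means it was reset, so count from zero.
    delta = v_last if v_last < v_prev else v_last - v_prev
    return delta / (t_last - t_prev)


# Hypothetical receive-bytes counter scraped every 15 seconds.
samples = [(0, 1_000), (15, 4_000), (30, 10_000)]

instant_rate = irate(samples)  # uses only the last two samples
# Plain average over the whole window, shown for comparison.
window_average = (samples[-1][1] - samples[0][1]) / (samples[-1][0] - samples[0][0])
```

With these numbers the instant rate is 400 bytes per second while the average over the full window is 300, which is why irate-based panels look spikier than ones built on an averaged rate.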
2. Receive Bandwidth
As mentioned previously, receive bandwidth and transmit bandwidth (next panel) will show the network utilization in bytes for all pods in a namespace.
sum(irate(container_network_receive_bytes_total{cluster="$cluster", namespace=~".+"}[$interval])) by (namespace)
3. Transmit Bandwidth
sum(irate(container_network_transmit_bytes_total{cluster="$cluster", namespace=~".+"}[$interval])) by (namespace)
4. Average Container Bandwidth by Namespace: Received
This panel gives you the overall average receive bandwidth by namespace.
avg(irate(container_network_receive_bytes_total{cluster="$cluster", namespace=~".+"}[$interval])) by (namespace)
5. Average Container Bandwidth by Namespace: Transmitted
This panel gives you the overall average transmit bandwidth by namespace.
avg(irate(container_network_transmit_bytes_total{cluster="$cluster", namespace=~".+"}[$interval])) by (namespace)
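The only difference between the bandwidth panels and the average bandwidth panels is the aggregation: sum versus avg over the series in each namespace. A tiny Python sketch with hypothetical per-series rates (in bytes per second) shows the contrast:

```python
from statistics import mean

# Hypothetical per-series transmit rates, grouped by namespace.
rates = {"shop": [100.0, 300.0], "infra": [50.0]}

total = {ns: sum(values) for ns, values in rates.items()}     # sum(...) by (namespace)
average = {ns: mean(values) for ns, values in rates.items()}  # avg(...) by (namespace)
```

A namespace with many quiet pods can therefore show a high total but a low average, and the two panels are best read side by side.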
6. Rate of Received Packets
This panel and the next give the network utilization for all pods by namespace, measured in packets per second.
sum(irate(container_network_receive_packets_total{cluster="$cluster", namespace=~".+"}[$interval])) by (namespace)
7. Rate of Transmitted Packets
sum(irate(container_network_transmit_packets_total{cluster="$cluster", namespace=~".+"}[$interval])) by (namespace)
8. Rate of Received Packets Dropped
Like before, we can monitor network saturation by monitoring both container_network_receive_packets_dropped_total and container_network_transmit_packets_dropped_total.
sum(irate(container_network_receive_packets_dropped_total{cluster="$cluster", namespace=~".+"}[$interval])) by (namespace)
9. Rate of Transmitted Packets Dropped
sum(irate(container_network_transmit_packets_dropped_total{cluster="$cluster", namespace=~".+"}[$interval])) by (namespace)
That’s all! Some of the metrics are repeated, depending on how they are visualized or categorized on the dashboard. On the whole, a sweeping look at the dashboards should give you a general idea of the health of your cluster and help you identify which components need closer inspection.
Stay tuned for Part III!