OpenShift 4 Monitoring — Exploring Grafana (Part II)

May 15, 2020

If you haven’t read Part I of this series, you can find it here.

In Part I of this series, we looked at how to access the Grafana web UI and dug deeper into the etcd dashboard. In this post we will take a closer look at the Kubernetes/Compute Resources/Cluster dashboard, second on the list of default dashboards provided out of the box with your OpenShift 4 cluster.

Before we begin, it helps to know where the metrics come from. Some are provided by kube-state-metrics, which generates metrics from the Kubernetes API about objects such as deployments, nodes and pods. Others are provided by cAdvisor, which is embedded in the kubelet and reports the resource usage exposed by the underlying Linux cgroups. Together, these let us monitor the node resources used by the containers running on each node.

You might also wonder what the kubernetes-mixin tag to the right of the dashboard name means. It is an upstream project that provides some standard Prometheus alerts, rules and Grafana dashboards. You can read more about it here and here.

Kubernetes/Compute Resources/Cluster Dashboard

This dashboard gives a good overview of the CPU, memory and network metrics of the cluster.

Headlines

1. CPU Utilization

There are 10 CPU modes (user, system, nice, idle, iowait, guest, guest_nice, steal, softirq and irq). This panel uses the node_exporter metric node_cpu_seconds_total to track the CPU utilization of your cluster. Rather than summing every mode except idle, the query subtracts the average idle rate from 1, which amounts to the same thing:

1 - avg(rate(node_cpu_seconds_total{mode="idle", cluster="$cluster"}[1m]))
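To see why "1 minus the average idle rate" equals the busy fraction, here is a minimal Python sketch. The counter samples, the 60-second spacing and the three modes shown are made-up assumptions for illustration; they are not taken from a real cluster, and real Prometheus rate() also handles counter resets and extrapolation.

```python
# Hypothetical node_cpu_seconds_total samples per (cpu, mode), taken 60 seconds apart.
WINDOW_SECONDS = 60.0

before = {
    ("0", "idle"): 1000.0, ("0", "user"): 200.0, ("0", "system"): 50.0,
    ("1", "idle"): 900.0, ("1", "user"): 300.0, ("1", "system"): 60.0,
}
after = {
    ("0", "idle"): 1045.0, ("0", "user"): 212.0, ("0", "system"): 53.0,
    ("1", "idle"): 930.0, ("1", "user"): 324.0, ("1", "system"): 66.0,
}


def per_second_rate(series_key):
    """Simplified rate(): counter increase divided by the window length."""
    return (after[series_key] - before[series_key]) / WINDOW_SECONDS


def cpu_utilization():
    """Mirror of 1 - avg(rate(...{mode="idle"}[1m])) across all CPUs."""
    idle_rates = [per_second_rate(key) for key in before if key[1] == "idle"]
    return 1.0 - sum(idle_rates) / len(idle_rates)


def busy_fraction_from_other_modes():
    """Sum the rates of every non-idle mode, averaged over the number of CPUs."""
    cpus = {cpu for cpu, _ in before}
    busy = sum(per_second_rate(key) for key in before if key[1] != "idle")
    return busy / len(cpus)


print(cpu_utilization(), busy_fraction_from_other_modes())
```

With these numbers both functions agree, because each CPU's mode rates add up to one second of CPU time per second.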

2. CPU Requests Commitment

This is the total number of CPU cores requested by all containers divided by the cluster’s allocatable CPU cores.

The kube-state-metrics metrics used here are kube_pod_container_resource_requests_cpu_cores and kube_node_status_allocatable_cpu_cores.

sum(kube_pod_container_resource_requests_cpu_cores{cluster="$cluster"}) / sum(kube_node_status_allocatable_cpu_cores{cluster="$cluster"})

3. CPU Limits Commitment

This takes the sum of limits on CPU cores that can be used by all containers divided by the sum of the cluster’s allocatable CPU cores.

The kube-state-metrics metrics used here are kube_pod_container_resource_limits_cpu_cores and kube_node_status_allocatable_cpu_cores.

sum(kube_pod_container_resource_limits_cpu_cores{cluster="$cluster"}) / sum(kube_node_status_allocatable_cpu_cores{cluster="$cluster"})
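The two commitment ratios can be sketched in a few lines of Python. The container and node figures below are hypothetical, chosen only to show that the limits commitment can go above 1 (that is, above 100%) when limits are overcommitted, while the requests commitment stays well below it.

```python
# Hypothetical per-container CPU requests and limits, in cores.
containers = [
    {"namespace": "app", "requests": 0.5, "limits": 2.0},
    {"namespace": "app", "requests": 0.25, "limits": 1.5},
    {"namespace": "db", "requests": 1.0, "limits": 5.25},
]

# Hypothetical allocatable CPU cores per node.
node_allocatable_cores = {"worker-0": 3.5, "worker-1": 3.5}


def commitment(field):
    """Sum one resource field over all containers, divided by total allocatable cores."""
    total = sum(container[field] for container in containers)
    return total / sum(node_allocatable_cores.values())


requests_commitment = commitment("requests")  # 1.75 / 7.0 = 0.25
limits_commitment = commitment("limits")      # 8.75 / 7.0 = 1.25

print(requests_commitment, limits_commitment)
```

A limits commitment above 1 is not an error in itself; it only means the containers could not all run at their limits at the same time.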

4. Memory Utilization

The cluster’s memory utilization is calculated by dividing the sum of all nodes’ available memory by the total allocatable memory of the cluster and subtracting the result from 1.

1 - sum(:node_memory_MemAvailable_bytes:sum{cluster="$cluster"}) / sum(kube_node_status_allocatable_memory_bytes{cluster="$cluster"})

5. Memory Requests Commitment

This takes the sum of containers’ requested memory bytes (given by the kube-state-metric kube_pod_container_resource_requests_memory_bytes) and divides that by the total allocatable memory of the cluster.

sum(kube_pod_container_resource_requests_memory_bytes{cluster="$cluster"}) / sum(kube_node_status_allocatable_memory_bytes{cluster="$cluster"})

6. Memory Limits Commitment

This is the sum of the limits on the amount of memory that can be used by containers (given by the kube-state-metric kube_pod_container_resource_limits_memory_bytes) divided by the total allocatable memory of the cluster.

sum(kube_pod_container_resource_limits_memory_bytes{cluster="$cluster"}) / sum(kube_node_status_allocatable_memory_bytes{cluster="$cluster"})

CPU

1. CPU Usage by namespace

This panel shows you the CPU usage by namespace:

sum(node_namespace_pod_container:container_cpu_usage_seconds_total:sum_rate{cluster="$cluster"}) by (namespace)

It uses the recording rule node_namespace_pod_container:container_cpu_usage_seconds_total:sum_rate, which is defined as:

sum by (namespace, pod, container) (rate(container_cpu_usage_seconds_total{job="kubelet", image!="", container!="POD"}[5m])) * on (namespace, pod) group_left(node) max by(namespace, pod, node) (kube_pod_info)
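The second half of that rule, `* on (namespace, pod) group_left(node)`, is a many-to-one join: every per-container series picks up the node label from the matching kube_pod_info series. Below is a rough Python sketch of that behavior. The series values and names are hypothetical, and kube_pod_info is reduced to a simple lookup table; its sample value is 1, so multiplying by it leaves the CPU rate unchanged.

```python
# Hypothetical per-container CPU rates keyed by (namespace, pod, container).
container_cpu_rate = {
    ("app", "web-1", "nginx"): 0.20,
    ("app", "web-1", "sidecar"): 0.05,
    ("db", "pg-0", "postgres"): 0.50,
    ("app", "orphan", "job"): 0.10,  # no kube_pod_info series exists for this pod
}

# kube_pod_info reduced to (namespace, pod) -> node; each series has the value 1.
pod_node = {
    ("app", "web-1"): "worker-0",
    ("db", "pg-0"): "worker-1",
}
KUBE_POD_INFO_VALUE = 1


def join_node_label():
    """Attach the node label to each container series, matching on (namespace, pod)."""
    joined = {}
    for (namespace, pod, container), rate in container_cpu_rate.items():
        node = pod_node.get((namespace, pod))
        if node is None:
            continue  # series without a match on the other side are dropped
        joined[(node, namespace, pod, container)] = rate * KUBE_POD_INFO_VALUE
    return joined


for labels, value in join_node_label().items():
    print(labels, value)
```

Both containers of web-1 match the same kube_pod_info series, which is why the "many" side has to be declared with group_left.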

2. CPU Quota by namespace

The CPU quota panel shows the following metrics for each namespace:

Pods

This is a counter for the number of pods running in a namespace. If you hover over the pod count, it shows ‘Drill down to pods’. Click on it and it takes you to another dashboard Kubernetes/Compute Resources/Namespace (Pods), filtered by that namespace.

This pod count is given by the query count(mixin_pod_workload{cluster="$cluster"}) by (namespace)

where mixin_pod_workload is recorded by three rules, one each for pods owned by Deployments (through their ReplicaSets), DaemonSets and StatefulSets:

- expr: |
    max by (cluster, namespace, workload, pod) (
      label_replace(
        label_replace(
          kube_pod_owner{job="kube-state-metrics", owner_kind="ReplicaSet"},
          "replicaset", "$1", "owner_name", "(.*)"
        ) * on(replicaset, namespace) group_left(owner_name) topk by(replicaset, namespace) (
          1, max by (replicaset, namespace, owner_name) (
            kube_replicaset_owner{job="kube-state-metrics"}
          )
        ),
        "workload", "$1", "owner_name", "(.*)"
      )
    )
  labels:
    workload_type: deployment
  record: mixin_pod_workload
- expr: |
    max by (cluster, namespace, workload, pod) (
      label_replace(
        kube_pod_owner{job="kube-state-metrics", owner_kind="DaemonSet"},
        "workload", "$1", "owner_name", "(.*)"
      )
    )
  labels:
    workload_type: daemonset
  record: mixin_pod_workload
- expr: |
    max by (cluster, namespace, workload, pod) (
      label_replace(
        kube_pod_owner{job="kube-state-metrics", owner_kind="StatefulSet"},
        "workload", "$1", "owner_name", "(.*)"
      )
    )
  labels:
    workload_type: statefulset
  record: mixin_pod_workload
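The rules above lean on label_replace to copy owner_name into a new workload label. As a rough illustration of what that function does to a single series, here is a small Python sketch. It is an approximation under stated assumptions: Prometheus uses RE2 rather than Python's re module, and the example pod and owner names are hypothetical.

```python
import re


def label_replace(labels, dst_label, replacement, src_label, regex):
    """Return a copy of labels with dst_label built from a regex match on src_label."""
    # The regex must match the whole source value, as in PromQL.
    match = re.fullmatch(regex, labels.get(src_label, ""))
    if match is None:
        return dict(labels)  # no match: the series is returned unchanged
    # Expand $1, $2, ... in the replacement with the captured groups.
    value = re.sub(
        r"\$(\d+)",
        lambda ref: match.group(int(ref.group(1))) or "",
        replacement,
    )
    return {**labels, dst_label: value}


# Hypothetical kube_pod_owner series for a DaemonSet pod.
pod_owner = {
    "namespace": "monitoring",
    "pod": "node-exporter-x7k2p",
    "owner_kind": "DaemonSet",
    "owner_name": "node-exporter",
}

with_workload = label_replace(pod_owner, "workload", "$1", "owner_name", "(.*)")
print(with_workload["workload"])
```

With the pattern "(.*)" and the replacement "$1", the whole owner_name value is copied, so the call is effectively a label rename that keeps the original label in place.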

Note: At the time of writing, static pods running in the namespace are not included in the pod count.

Workloads

Similarly, the workload count is given by the query count(avg(mixin_pod_workload{cluster="$cluster"}) by (workload, namespace)) by (namespace). Clicking on it will take you to the dashboard Kubernetes/Compute Resources/Namespace (Workloads), filtered by namespace.
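The difference between the pod count and the workload count is easy to miss in the nested query: the inner avg collapses all pods of a workload into a single series, and the outer count then counts those series per namespace. A minimal Python sketch, using hypothetical mixin_pod_workload label sets:

```python
from collections import Counter

# Hypothetical mixin_pod_workload series, reduced to (namespace, workload, pod).
series = [
    ("app", "web", "web-1"),
    ("app", "web", "web-2"),
    ("app", "worker", "worker-1"),
    ("db", "pg", "pg-0"),
]

# count(mixin_pod_workload) by (namespace): one series per pod.
pods_per_namespace = Counter(namespace for namespace, _, _ in series)

# count(avg(mixin_pod_workload) by (workload, namespace)) by (namespace):
# first collapse to one entry per (namespace, workload), then count per namespace.
distinct_workloads = {(namespace, workload) for namespace, workload, _ in series}
workloads_per_namespace = Counter(namespace for namespace, _ in distinct_workloads)

print(dict(pods_per_namespace), dict(workloads_per_namespace))
```

In this made-up data the app namespace has three pods but only two workloads, which is the distinction the two columns are meant to show.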

CPU Usage

CPU usage by namespace is given by: sum(node_namespace_pod_container:container_cpu_usage_seconds_total:sum_rate{cluster="$cluster"}) by (namespace)

where node_namespace_pod_container:container_cpu_usage_seconds_total:sum_rate uses the metric reported by cAdvisor:

sum by (cluster, namespace, pod, container) (rate(container_cpu_usage_seconds_total{job="kubelet", metrics_path="/metrics/cadvisor", image!="", container!="POD"}[5m])) * on (cluster, namespace, pod) group_left(node) topk by (cluster, namespace, pod) (1, max by(cluster, namespace, pod, node) (kube_pod_info{node!=""}))

CPU Requests

This takes the total number of requested cores (given by the kube-state-metric kube_pod_container_resource_requests_cpu_cores) by all containers in the cluster and groups them by namespace.

sum(kube_pod_container_resource_requests_cpu_cores{cluster="$cluster"}) by (namespace)

CPU Requests %

This is the CPU usage divided by total number of requested cores, for each namespace.

sum(node_namespace_pod_container:container_cpu_usage_seconds_total:sum_rate{cluster="$cluster"}) by (namespace) / sum(kube_pod_container_resource_requests_cpu_cores{cluster="$cluster"}) by (namespace)
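One detail worth knowing when reading this column: dividing one per-namespace vector by another only yields a result for namespaces that appear on both sides. A namespace whose containers set no CPU requests therefore has no value here. The sketch below illustrates this in Python with hypothetical usage and request figures.

```python
# Hypothetical per-namespace CPU usage (cores) and CPU requests (cores).
cpu_usage = {"app": 0.6, "db": 0.5, "tools": 0.1}
cpu_requests = {"app": 1.5, "db": 1.0}  # "tools" sets no CPU requests


def ratio_by_namespace(numerator, denominator):
    """Divide two per-namespace vectors, keeping only namespaces present in both."""
    return {
        namespace: numerator[namespace] / denominator[namespace]
        for namespace in numerator
        if namespace in denominator
    }


requests_percent = ratio_by_namespace(cpu_usage, cpu_requests)
print(requests_percent)
```

A ratio above 1 would mean the namespace is using more CPU than it requested; the "tools" namespace simply does not appear in the result.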

CPU Limits

This sums the total limits on CPU cores (given by the kube-state-metric kube_pod_container_resource_limits_cpu_cores) that can be used by containers in each namespace.

sum(kube_pod_container_resource_limits_cpu_cores{cluster="$cluster"}) by (namespace)

CPU Limits %

This takes the CPU usage divided by the total limit on CPU cores that can be used by all containers, for each namespace.

sum(node_namespace_pod_container:container_cpu_usage_seconds_total:sum_rate{cluster="$cluster"}) by (namespace) / sum(kube_pod_container_resource_limits_cpu_cores{cluster="$cluster"}) by (namespace)

Memory

1. Memory Usage w/o cache

This panel shows memory usage as the resident set size (RSS) of all containers, summed by namespace.

sum(container_memory_rss{cluster="$cluster", container!=""}) by (namespace)

2. Memory Requests by namespace

The memory requests panel shows the following metrics for each namespace:

Pods

See the pod count in the CPU Quota section above.

Workloads

See the workload count in the CPU Quota section above.

Memory Usage

We looked at this metric in the previous panel.

sum(container_memory_rss{cluster="$cluster", container!=""}) by (namespace)

Memory Requests

This sums the kube-state-metrics metric kube_pod_container_resource_requests_memory_bytes, which reports the number of memory bytes requested by a container, and groups the result by namespace.

sum(kube_pod_container_resource_requests_memory_bytes{cluster="$cluster"}) by (namespace)

Memory Requests %

This takes the memory usage of the cluster (size of RSS) and divides that by the total requested memory of containers, for each namespace.

sum(container_memory_rss{cluster="$cluster", container!=""}) by (namespace) / sum(kube_pod_container_resource_requests_memory_bytes{cluster="$cluster"}) by (namespace)

Memory Limits

This takes the kube-state-metrics metric kube_pod_container_resource_limits_memory_bytes to monitor the total limit on the amount of memory that can be used by containers per namespace.

sum(kube_pod_container_resource_limits_memory_bytes{cluster="$cluster"}) by (namespace)

Memory Limits %

This takes the memory usage of the cluster (size of RSS) and divides that by the total limit on the amount of memory that can be used by containers per namespace.

sum(container_memory_rss{cluster="$cluster", container!=""}) by (namespace) / sum(kube_pod_container_resource_limits_memory_bytes{cluster="$cluster"}) by (namespace)

Network

1. Current Network Usage by namespace

Current Receive Bandwidth

The following two queries give the network bandwidth used by all pods in each namespace, measured in bytes per second.

sum(irate(container_network_receive_bytes_total{cluster="$cluster", namespace=~".+"}[$interval])) by (namespace)

Current Transmit Bandwidth

sum(irate(container_network_transmit_bytes_total{cluster="$cluster", namespace=~".+"}[$interval])) by (namespace)
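These network queries use irate rather than rate. irate looks only at the last two samples inside the window, so it reacts quickly to bursts but is more volatile. The Python sketch below contrasts the two on hypothetical byte-counter samples; `window_average_rate` is a simplified stand-in for rate() that ignores Prometheus's extrapolation and reset handling.

```python
def irate(samples):
    """Per-second rate from the last two samples. samples: [(timestamp_s, value), ...]."""
    if len(samples) < 2:
        return None  # not enough data points in the window
    (t_prev, v_prev), (t_last, v_last) = samples[-2], samples[-1]
    # If the counter was reset, the last value is smaller; count it as the increase.
    delta = v_last if v_last < v_prev else v_last - v_prev
    return delta / (t_last - t_prev)


def window_average_rate(samples):
    """Simplified rate(): increase between the first and last sample over elapsed time."""
    (t_first, v_first), (t_last, v_last) = samples[0], samples[-1]
    return (v_last - v_first) / (t_last - t_first)


# Hypothetical received-bytes counter scraped every 30 seconds, with a burst at the end.
samples = [(0, 1000.0), (30, 4000.0), (60, 5000.0), (90, 14000.0)]

print(irate(samples))                # bytes per second over the last 30 seconds
print(window_average_rate(samples))  # bytes per second averaged over 90 seconds
```

The burst in the last interval dominates the irate value, while the window average smooths it out.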

Rate of Received Packets

The following two queries show the rate of network utilization for all pods by namespace, measured in packets.

sum(irate(container_network_receive_packets_total{cluster="$cluster", namespace=~".+"}[$interval])) by (namespace)

Rate of Transmitted Packets

sum(irate(container_network_transmit_packets_total{cluster="$cluster", namespace=~".+"}[$interval])) by (namespace)

Rate of Received Packets Dropped

Dropped packets are a close proxy for network saturation. They are given by the following two queries.

sum(irate(container_network_receive_packets_dropped_total{cluster="$cluster", namespace=~".+"}[$interval])) by (namespace)

Rate of Transmitted Packets Dropped

sum(irate(container_network_transmit_packets_dropped_total{cluster="$cluster", namespace=~".+"}[$interval])) by (namespace)

2. Receive Bandwidth

As mentioned previously, receive bandwidth and transmit bandwidth (next panel) will show the network utilization in bytes for all pods in a namespace.

sum(irate(container_network_receive_bytes_total{cluster="$cluster", namespace=~".+"}[$interval])) by (namespace)

3. Transmit Bandwidth

sum(irate(container_network_transmit_bytes_total{cluster="$cluster", namespace=~".+"}[$interval])) by (namespace)

4. Average Container Bandwidth by Namespace: Received

This panel gives you the overall average receive bandwidth by namespace.

avg(irate(container_network_receive_bytes_total{cluster="$cluster", namespace=~".+"}[$interval])) by (namespace)

5. Average Container Bandwidth by Namespace: Transmitted

This panel gives you the overall average transmit bandwidth by namespace.

avg(irate(container_network_transmit_bytes_total{cluster="$cluster", namespace=~".+"}[$interval])) by (namespace)

6. Rate of Received Packets

The following two queries give the network utilization for all pods by namespace, measured in packets.

sum(irate(container_network_receive_packets_total{cluster="$cluster", namespace=~".+"}[$interval])) by (namespace)

7. Rate of Transmitted Packets

sum(irate(container_network_transmit_packets_total{cluster="$cluster", namespace=~".+"}[$interval])) by (namespace)

8. Rate of Received Packets Dropped

As before, we can watch for network saturation by monitoring both container_network_receive_packets_dropped_total and container_network_transmit_packets_dropped_total.

sum(irate(container_network_receive_packets_dropped_total{cluster="$cluster", namespace=~".+"}[$interval])) by (namespace)

9. Rate of Transmitted Packets Dropped

sum(irate(container_network_transmit_packets_dropped_total{cluster="$cluster", namespace=~".+"}[$interval])) by (namespace)

That’s all! Some of the metrics are repeated, depending on how they are visualized or categorized on the dashboard. On the whole, a sweeping look at the dashboards should give you a general idea of the health of your cluster and help you identify which components need closer inspection.

Stay tuned for Part III!


Written by Jane Ho