Monitoring Kubernetes Pods Resource Usage with Prometheus and Grafana

Explore how Prometheus and Grafana can be leveraged to monitor and visualize the resource usage of Kubernetes pods.

SANKET RAI
May 19, 2023

Disclaimer: Please note that the step-by-step guide provided below focuses on monitoring a single-node Kubernetes (K3s) cluster deployed within a DMZ. It is important to acknowledge that this solution does not take into account security aspects. Therefore, it is strongly advised to treat this guide as a reference and implement robust security measures for production-level Kubernetes deployments.

In the world of containerized applications, Kubernetes has become the de facto orchestration platform for managing container workloads. As the complexity of deployments grows, so does the need for robust monitoring solutions to ensure the optimal performance and resource utilization of Kubernetes pods. One popular combination for monitoring Kubernetes clusters is Prometheus and Grafana. In this article, we will explore how these powerful open-source tools can be leveraged to monitor and visualize the resource usage of Kubernetes pods.

Why Monitor Kubernetes Pods?

Monitoring Kubernetes pods is crucial for several reasons. First and foremost, it helps ensure the efficient utilization of resources. Pods that are consuming excessive CPU or memory can negatively impact the overall performance of the cluster. By monitoring resource usage, administrators can identify and rectify any inefficiencies or bottlenecks.

Monitoring also aids in capacity planning. It provides insights into the historical usage patterns of pods, enabling administrators to allocate resources more effectively and anticipate future scaling needs. Additionally, monitoring facilitates troubleshooting and debugging by offering real-time visibility into the behavior of pods, making it easier to pinpoint issues and analyze performance trends.

Step-by-step Guide

Step 1: Create an Authorization Token for the Kubelet API

The Kubelet API is accessible through port 10250 on the host and accepts connections on all network interfaces, but it only serves requests from authenticated, authorized clients. Here we will create a bearer token that allows our Prometheus server to authenticate to the Kubelet API.
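As a quick check that the endpoint really does require authorization, you can query it without credentials and expect the request to be rejected (a sketch; replace <host> with your Kubernetes host):

# Unauthenticated request to the Kubelet's cAdvisor endpoint; with default
# K3s settings this should be rejected with HTTP 401 or 403.
curl -sk -o /dev/null -w '%{http_code}\n' https://<host>:10250/metrics/cadvisor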

First, we create a service account named ‘prom-api-user’ for the Prometheus API user. Creating a new service account is not strictly required; any existing service account can be used instead. On the Kubernetes host, run the following ‘kubectl’ command to create the service account.

kubectl apply -f - <<EOF
apiVersion: v1
kind: ServiceAccount
metadata:
  name: prom-api-user
EOF

Next, we create a cluster role binding to assign the ‘cluster-admin’ cluster role to the newly created (or existing) service account using the following ‘kubectl’ command. Please note that a cluster role named ‘cluster-admin’ with cluster-wide privileges is created by K3s during installation; if you are using another Kubernetes distribution, such a cluster role might not exist or might be named differently.

kubectl apply -f - <<EOF
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: cluster-admin-prom-api-user
subjects:
- kind: ServiceAccount
  name: prom-api-user
  namespace: default
roleRef:
  kind: ClusterRole
  name: cluster-admin
  apiGroup: rbac.authorization.k8s.io
EOF

Finally, we create a long-lived API token for the given service account using the following ‘kubectl’ command.

kubectl apply -f - <<EOF
apiVersion: v1
kind: Secret
metadata:
  name: prom-api-user-secret
  annotations:
    kubernetes.io/service-account.name: prom-api-user
type: kubernetes.io/service-account-token
EOF

Get the token using the following command.

TOKEN=$(kubectl get secret prom-api-user-secret -o jsonpath='{.data.token}' | base64 --decode)
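Optionally, verify that the token works by querying the cAdvisor endpoint directly (a quick sanity check, assuming it is run on the Kubernetes host itself):

# Authenticated request to the Kubelet's cAdvisor endpoint using the token from above.
curl -sk -H "Authorization: Bearer ${TOKEN}" https://localhost:10250/metrics/cadvisor | head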

Step 2: Identify the Metrics of Interest

Now that we have an API token for using the Kubelet API, we should identify the metrics of interest to us. A number of container and hardware metrics are exposed in the Prometheus exposition format through the ‘/metrics/cadvisor’ endpoint of the Kubelet API. Since our goal is to monitor pod resource utilization, we will use the CPU, memory, network I/O, and disk I/O container metrics. These metrics are labeled with pod names, making it easy to query them by pod name in graphing tools. The following are some useful container resource utilization metrics exposed by the cAdvisor embedded in the Kubelet; we will use a subset of these to monitor resource utilization by our application pods.

container_cpu_usage_seconds_total: cumulative CPU time consumed by the container, in seconds
container_memory_working_set_bytes: current working set memory of the container, in bytes
container_fs_reads_bytes_total / container_fs_writes_bytes_total: cumulative bytes read from and written to the container's filesystems
container_network_receive_bytes_total / container_network_transmit_bytes_total: cumulative bytes received and transmitted on the container's network interfaces

Step 3: Deploy Prometheus and Update Config

In our case, we deploy the Prometheus server outside of the Kubernetes cluster. The Prometheus server could also be deployed as a Kubernetes pod/service, and the procedure described here would still apply. After deploying the Prometheus server, update the ‘scrape_configs’ block in the Prometheus configuration file to add the new target.


scrape_configs:
  - job_name: "Kubelet"
    metrics_path: "/metrics/cadvisor"
    scheme: "https"
    honor_timestamps: true
    bearer_token_file: <filename>
    tls_config:
      insecure_skip_verify: true
    static_configs:
      - targets: ['<host>:10250']

Replace <filename> with the path to the file containing the bearer token (value of TOKEN generated in step 1), and <host> with the IP address or FQDN of the Kubernetes host.
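For example, the token generated in step 1 can be written to a file that the Prometheus process can read, and that path used as the value of ‘bearer_token_file’ (the path below is only an illustration; choose one that suits your setup):

# Persist the token from step 1 to a file readable only by the Prometheus user.
echo -n "${TOKEN}" > /etc/prometheus/kubelet-token
chmod 600 /etc/prometheus/kubelet-token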

The Prometheus version used here is 2.9.2. Please check the documentation for the Prometheus version you are using for the appropriate field names. For example, in newer versions the bearer token is configured under the ‘authorization’ field, and the ‘bearer_token_file’ field is deprecated.
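For reference, a roughly equivalent job definition for newer Prometheus versions might look like the following sketch (using the ‘authorization’ field; consult the documentation for your exact version):

  - job_name: "Kubelet"
    metrics_path: "/metrics/cadvisor"
    scheme: "https"
    authorization:
      type: Bearer
      credentials_file: <filename>
    tls_config:
      insecure_skip_verify: true
    static_configs:
      - targets: ['<host>:10250']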

Reload the Prometheus configuration without restarting the process by sending it the SIGHUP signal. On Linux this can be done with ‘kill -s SIGHUP <PID>’, replacing <PID> with your Prometheus process ID.
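For example (the first command assumes a single ‘prometheus’ process on the host; the second only works if Prometheus was started with the --web.enable-lifecycle flag):

# Send SIGHUP to the running Prometheus process to trigger a configuration reload.
kill -s SIGHUP $(pidof prometheus)

# Alternative: trigger the reload over HTTP when --web.enable-lifecycle is enabled.
curl -X POST http://localhost:9090/-/reload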

Step 4: Deploy Grafana and Create Dashboard

Deploy a Grafana instance and configure it to use the configured Prometheus server as a data source. Create a new dashboard, and create new panels for the different resource usage metrics. Use the following PromQL queries to construct panels for the metrics.
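The data source can also be provisioned from a file instead of through the UI. A minimal sketch, assuming Grafana's file-based provisioning (for example, a file under /etc/grafana/provisioning/datasources/) and with <prometheus-host> standing in for your Prometheus server address:

# Grafana data source provisioning file; loaded when Grafana starts.
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://<prometheus-host>:9090
    isDefault: true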

CPU Utilization

The following query will give the number of vCPU cores that are being used by each pod in the default namespace.

sum by (pod) (rate(container_cpu_usage_seconds_total{job="Kubelet",namespace="default"}[1m]))

Memory Utilization

We use the ‘container_memory_working_set_bytes’ metric since this is what the OOM killer is watching. It is a gauge (an instantaneous value), so we query it directly rather than applying rate().

sum by (pod) (container_memory_working_set_bytes{job="Kubelet",namespace="default"})

Disk I/O Utilization

The most basic disk I/O utilization metrics are bytes written and read. These metrics are not appropriately labeled by pod names, so we will evaluate the total disk I/O utilization for each device across all containers.

(sum by (device) (rate(container_fs_writes_bytes_total{job="Kubelet"}[1m]))) + (sum by (device) (rate(container_fs_reads_bytes_total{job="Kubelet"}[1m])))

Network Utilization

Important network utilization metrics are bytes transmitted and received. Again, these metrics aren’t appropriately labeled by pod names, so we will evaluate the total network I/O utilization for each interface across all containers.

(sum by (interface) (rate(container_network_transmit_bytes_total{job="Kubelet"}[1m]))) + (sum by (interface) (rate(container_network_receive_bytes_total{job="Kubelet"}[1m])))

Use {{pod}} as the legend for the CPU and memory utilization panels, {{device}} for the disk I/O utilization panel, and {{interface}} for the network utilization panel, and give each panel an appropriate title and description.

Conclusion

Monitoring the resource usage of Kubernetes pods is essential for maintaining the optimal performance and efficiency of your cluster. Prometheus and Grafana provide a powerful combination of monitoring and visualization tools that enable administrators to gain deep insights into the behavior of their pods.
