Performance Monitoring Best Practices for IBM Cloud Pak for Data — Part 2

Yongli An
IBM Data Science in Practice
Mar 16, 2021

Quick Recap

Part 1 of this blog covered how to use metrics from the following two layers to support performance monitoring tasks:

  • Application and Cloud Pak for Data-level metrics
  • Cluster-level basic monitoring

This is Part 2 of the blog, which covers "Advanced monitoring and analysis" and shows how to use customized queries in the Prometheus UI for more targeted monitoring.

Prometheus Queries to Zoom in

When the out-of-the-box Prometheus dashboards are not enough, custom Prometheus queries in the Prometheus UI can help. The Prometheus UI is accessed from the “Metrics” option of the OCP console by clicking on “Prometheus UI”, as shown in the picture below.

Prometheus UI

This will open the Prometheus UI in a separate page, as shown below, where the user can create and run customized queries.

The Prometheus UI page where customized queries can be run

Here are some commonly used queries to gain insight at the cluster or pod level.

  • CPU/Memory usage vs node capacity by node
  • CPU/Memory request total vs node capacity by node
  • CPU/Memory usage vs limits percentage by pod

CPU usage vs node capacity

The query below shows the CPU usage on the cluster nodes over time as a percentage of the node capacity. This is an easy way to check if any node has a capacity issue for the workloads. It also helps to show if any major undesirable load imbalance exists across the nodes in the cluster.

100* sum (node_namespace_pod_container:container_cpu_usage_seconds_total:sum_rate) by (node)/(sum(kube_node_status_allocatable_cpu_cores) by (node))
CPU usage vs node capacity, by node

Any spikes mark interesting time frames to zoom into. While the screenshot above shows a history of the past 16 days, the one below zooms into the high load peaks of the first day.

CPU usage vs node capacity, covering a 1-day period

By selecting the busiest period only, more details are visible. Some interesting correlations can be made between what’s observed or experienced by the end user (i.e., the application performance) and what’s going on within each node of the cluster.

CPU usage vs node capacity, showing the busiest node only

Further, selecting only the busiest worker node shows that its highest CPU utilization reaches 50% of the node capacity. In other cases, there could be a node with usage reaching 99% of the node capacity. Overly stressed nodes should raise concerns. If other nodes have more free capacity while only one or two nodes are heavily utilized, the cluster can be made more balanced by moving some of the services from the busier nodes to other nodes. One common approach to balancing a cluster is documented in the Cloud Pak for Data Knowledge Center. Extra caution should be taken, as moving pods from one node to another is a disruptive operation with some risks.

But if all the nodes have high CPU or memory utilization (say 90%+), then this is an overall cluster capacity issue, and adding extra resources to the cluster should be considered.

Memory usage vs node capacity

The following query shows the memory usage of each node over time as a percentage of the node capacity. This graph shows whether the cluster memory is enough for the workload(s) running on it. A user can see any memory constraint on the nodes, and whether the workload distribution across nodes is reasonably even.

100* sum(container_memory_rss{container!=""}) by (node)/sum by (node) (kube_node_status_allocatable_memory_bytes)
Memory usage vs node capacity, by node

The graph above shows the memory usage picture. Each node has a varying level of usage, but all stay below 50% of the node capacity, similar to the CPU usage. You may notice that the memory usage on each node always comes down to a certain base level after each busy period, without any upward trend over time. This indicates a well-behaved system and application.

In general, any node with usage constantly spiking above 90% is worth some investigation and planning. For example, if all the nodes are more or less evenly utilized and all have usage above 90%, then extra cluster-level capacity should be considered. One option is to add more resources to each node. Another option is to add extra nodes to the cluster. Both will lower the overall utilization on each node to allow for workload growth or spikes.
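As a quick command-line cross-check of the node-level CPU and memory pictures above, the same utilization numbers can be sampled with oc adm top nodes. Below is a minimal Python sketch in the same spirit as the automation script later in this article; it assumes an existing oc login and a working cluster metrics API, and the 90% threshold is just an example value to adjust.

#!/usr/bin/env python3
#minimal sketch: flag nodes whose current CPU or memory utilization exceeds a threshold
#assumes an existing oc login and a working cluster metrics API; the 90% threshold is an example value
import subprocess
THRESHOLD = 90
output = subprocess.check_output("oc adm top nodes", shell=True).decode("utf-8")
for line in output.splitlines():
    #expected columns: NAME, CPU(cores), CPU%, MEMORY(bytes), MEMORY%
    fields = line.split()
    if len(fields) < 5:
        continue
    try:
        cpu_pct = float(fields[2].rstrip("%"))
        mem_pct = float(fields[4].rstrip("%"))
    except ValueError:
        continue  #skip the header line or nodes reporting "unknown"
    if cpu_pct > THRESHOLD or mem_pct > THRESHOLD:
        print("Node {} is heavily utilized: CPU {}%, memory {}%".format(fields[0], cpu_pct, mem_pct))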

CPU request total vs node capacity

Most service pods start with certain resource request and limit settings. The request setting is the amount of resource reserved for the pod even if no workload is running in the pod. (Note that this is not the same as the actual CPU or memory usage on each node.) The request total on each worker node is an important metric to watch. If the request total is a high percentage of the node capacity (say > 85%) on any node, that node has limited resources for dynamic workloads that need to start new pods. If an application that needs to start a new pod cannot get the requested resources, it will fail to start.
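For reference, the same request-versus-capacity view can also be derived directly from the Kubernetes API, without Prometheus. The following is a minimal sketch, assuming an existing oc login and handling only the plain-core and "m" (millicore) CPU quantity formats; it sums the CPU requests of running pods per node and compares them with each node's allocatable CPU.

#!/usr/bin/env python3
#minimal sketch: total CPU requests per node as a percentage of allocatable CPU, via the Kubernetes API
#assumes an existing oc login; only plain-core and "m" (millicore) CPU quantities are handled
import subprocess
import json
from collections import defaultdict
def cpu_cores(quantity):
    #convert a CPU quantity such as "500m" or "2" into cores
    return float(quantity[:-1]) / 1000 if quantity.endswith("m") else float(quantity)
nodes = json.loads(subprocess.check_output("oc get nodes -o json", shell=True))
allocatable = {n["metadata"]["name"]: cpu_cores(n["status"]["allocatable"]["cpu"]) for n in nodes["items"]}
pods = json.loads(subprocess.check_output("oc get pods --all-namespaces -o json", shell=True))
requested = defaultdict(float)
for pod in pods["items"]:
    if pod["status"].get("phase") != "Running":
        continue
    node = pod["spec"].get("nodeName")
    for container in pod["spec"]["containers"]:
        cpu = container.get("resources", {}).get("requests", {}).get("cpu")
        if node and cpu:
            requested[node] += cpu_cores(cpu)
for node, capacity in allocatable.items():
    print("{}: {:.1f} of {:.1f} cores requested ({:.0f}%)".format(node, requested[node], capacity, 100 * requested[node] / capacity))

Unlike the Prometheus query that follows, this only gives a point-in-time snapshot rather than a history over time.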

The following query shows the request total in percentage of the node capacity. While limits could be over-committed (i.e., the limit total exceeds the cluster capacity), the request total must be less than the node capacity so that the pods can be successfully deployed and started.

100*(sum by (node) ((sum by (node,pod,namespace) (kube_pod_container_resource_requests_cpu_cores{container!=""}))* on (pod,namespace) group_left()(sum by (node,pod,namespace) (kube_pod_status_phase{phase=~"(Running).*"} == 1)))/sum by (node) (kube_node_status_allocatable_cpu_cores))
CPU request total vs node capacity, by node

By comparing the “usage percentage” shown in the earlier sections with the “request percentage” in the graph above, some tuning opportunities can be identified. For example, if the CPU request total goes up to 100% on one worker node but the total CPU usage (as shown in the earlier section) on the same node only reaches 50%, it could mean one or more of the following:

  • the request settings might be oversized for the present workload and could therefore be reduced,
  • the application is not able to properly scale to take advantage of the available CPU resource for better performance and higher throughput, or
  • the workloads are unevenly distributed within the cluster, and some balancing work should be done to allow effective use of the full cluster capacity.

Memory request total vs node capacity

Similarly, the query below will show the memory requests as a percentage of the node capacity.

100* (sum by (node) ((sum by (pod, node, namespace) (kube_pod_container_resource_requests_memory_bytes{container!=""}))* on (pod,namespace) group_left()(sum by (pod, node,namespace) (kube_pod_status_phase{phase=~"(Running).*"} == 1)))/(sum by (node) (kube_node_status_allocatable_memory_bytes)))
Memory request total vs node capacity, by node

Similar analysis and recommendations can be made as those in the CPU request total section.

CPU usage vs limit by pod

A pod starts with the requested resources, and its resource usage can grow up to its limit when needed. All running pods on the same node compete for resources above their own requests up to their own limits when resources are available. In other words, the needed CPU or memory above the request is not guaranteed.

One important thing to look for with this type of query is whether any pods frequently run at 100% of their limit setting. Such cases could mean application performance and scalability are constrained by the current pod limit setting. You should consider increasing the limit to support the workload better. One option is to scale the service in question to its next size (if applicable) using the documented scaling command in the Knowledge Center. Another option is to increase the limit setting for that particular pod only.
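Before changing any limit, it helps to confirm what the current request and limit settings actually are. Below is a minimal sketch of listing them for the zen namespace, assuming an existing oc login; it simply prints whatever values are set on each container.

#!/usr/bin/env python3
#minimal sketch: print the CPU/memory requests and limits of every container in the zen namespace
#assumes an existing oc login
import subprocess
import json
pods = json.loads(subprocess.check_output("oc get pods -n zen -o json", shell=True))
for pod in pods["items"]:
    for container in pod["spec"]["containers"]:
        resources = container.get("resources", {})
        print("{}/{}: requests={} limits={}".format(pod["metadata"]["name"], container["name"], resources.get("requests", {}), resources.get("limits", {})))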

The following query shows CPU usage vs the limit setting for all pods in the zen namespace:

100*(sum by (pod) (pod:container_cpu_usage:sum{namespace="zen",pod=~".*"}) / sum by (pod)(kube_pod_container_resource_limits_cpu_cores{namespace="zen"}))
CPU usage vs limit by pod

Hovering over the pods that reach 100% of their limit shows their pod names. In this example it mainly affects wdp-couchdb* pods. To see this more clearly, the query can be further customized to target these pods only as shown below.

100*(sum by (pod) (pod:container_cpu_usage:sum{namespace="zen",pod=~"(wdp-couchdb).*"}) / sum by (pod) (kube_pod_container_resource_limits_cpu_cores{namespace="zen"}))
CPU usage vs limit by pod, showing wdp-couchdb pods only

We can see that the wdp-couchdb pods often reach 100% of their CPU limit. It could be a good idea to increase this limit for better performance.

Memory usage vs limit by pod

A similar query can show the memory usage versus limit setting for all the pods in the zen namespace.

100*(sum by (pod) (container_memory_working_set_bytes{image!="", namespace="zen"}) / sum(kube_pod_container_resource_limits_memory_bytes{namespace="zen"}) by (pod))
Memory usage vs limit by pod

Hovering over the pods that reach 100% of their limit shows their pod names. In this example, those are the zen-metastoredb pods.

Customizing the query to only zen-metastoredb pods shows a clearer picture:

Memory usage vs limit by pod, showing metastoredb pods only

The graph shows that the memory usage of the zen-metastoredb pods grew continuously but with several sharp drops, suggesting that something dramatic happened. The usage could have dropped to zero, but due to the sampling in metric collection it may not show as exactly zero. If this was not due to a planned manual restart of the pods, then it is likely due to unexpected events such as an out-of-memory kill. To mitigate, the pod limit should be increased to allow more memory as needed. One way to get a larger limit setting is to scale the Cloud Pak for Data control plane to its next size; the default out of the box is the small size.

Identify pods with OutOfMemoryKilled

It’s important to check if any pods are getting OutOfMemoryKilled (OOMKilled) errors. Such events are very disruptive to application stability and performance. The following query shows how many such events happened in the time window, which can be selected interactively in the UI. As this query returns a count, the graph should be “stacked” to show multiple cases more clearly.

kube_pod_container_status_last_terminated_reason{namespace="zen", reason ="OOMKilled"}==1
pods with OutOfMemoryKilled, in a stacked graph

As suspected from the earlier graph showing the sharp memory usage drops, the graph above confirms that the zen-metastoredb pods indeed had OOMKilled errors and were therefore restarted. That is why the memory usage dropped down to zero. The other two pods that had the same issue are the “zen-core-api” and “wkc-glossary-service” pods.
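The same finding can be cross-checked outside Prometheus by inspecting the last-terminated state of each container. The short sketch below, assuming an existing oc login, lists containers in the zen namespace whose last termination reason was OOMKilled, along with their restart counts.

#!/usr/bin/env python3
#minimal sketch: list containers in the zen namespace whose last termination reason was OOMKilled
#assumes an existing oc login
import subprocess
import json
pods = json.loads(subprocess.check_output("oc get pods -n zen -o json", shell=True))
for pod in pods["items"]:
    for status in pod["status"].get("containerStatuses", []):
        terminated = status.get("lastState", {}).get("terminated", {})
        if terminated.get("reason") == "OOMKilled":
            print("{}/{}: OOMKilled at {}, restart count {}".format(pod["metadata"]["name"], status["name"], terminated.get("finishedAt"), status.get("restartCount")))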

Automate Prometheus Metric Collection

While it is convenient to use the graphical ad-hoc Prometheus query page, this method is limited to the data available within the Prometheus retention period.

Pros of using the web UI:

  • already integrated with OCP console
  • immediate graph views available out of the box
  • good for short/medium term ad-hoc investigation

Cons of using the web UI:

  • the retention period of the Prometheus data is limited (default two weeks, extensible to four weeks)
  • not suited for long-term observation
  • manual work

To compare selected test runs over a period longer than the retention period, query data can be collected programmatically using the Prometheus HTTP API and saved for automated processing.

Sample Automation Script — Python

The Python requests module offers an elegant way to script calls to the Prometheus HTTP API and collect the related data.

Below are the OpenShift commands to get the Prometheus route and token that are needed to get a connection:

oc get route prometheus-k8s -n openshift-monitoring --template='{{.spec.host }}'
oc sa get-token -n openshift-monitoring prometheus-k8s

An HTTP request is constructed from the URL, headers, and payload. Using the Python requests module, you can easily create such requests, as shown below. You need to be logged in with oc before running the script. The output is in JSON format. Below is the sample source code.

#!/usr/bin/env python3
import requests
import subprocess
import json
#construct HTTP URL
prometheus_route_cmd = "oc get route prometheus-k8s -n openshift-monitoring --template='{{ .spec.host }}'"
prometheus_route = subprocess.check_output(prometheus_route_cmd, shell=True).decode("utf-8").rstrip()
url = "https://{}/api/v1/query".format(prometheus_route)
#construct HTTP header
prometheus_token_cmd = "oc sa get-token -n openshift-monitoring prometheus-k8s"
prometheus_token = subprocess.check_output(prometheus_token_cmd, shell=True).decode("utf-8").rstrip()
headers_string = """{{"Authorization": "Bearer {}"}}""".format(prometheus_token)
headers = json.loads(headers_string)
#construct HTTP payload
query = "sum by (node)((sum by (node,pod,namespace)(kube_pod_container_resource_requests_cpu_cores{container!=\"\"}))* on (pod,namespace) group_left()(sum by (node,pod,namespace) (kube_pod_status_phase{phase=~\"(Running).*\"} == 1)))"
duration = "2m"
resolution = "30s"
payload_string = """{} [{}:{}]""".format(query,duration,resolution)
payload = {'query': payload_string}
#run the HTTP request
r = requests.get(url, headers=headers, params=payload, verify=False)
#get the HTTP result
print(r.json())
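To save the collected data for longer-term analysis, the JSON result can be flattened into rows. The snippet below continues the script above and is a minimal sketch assuming the standard Prometheus API response structure for the subquery (a "matrix" result with per-series values); the output file name is arbitrary.

#continuation of the script above: flatten the matrix result into CSV rows (timestamp, node, value)
import csv
result = r.json()["data"]["result"]
with open("cpu_requests_by_node.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["timestamp", "node", "cpu_cores_requested"])
    for series in result:
        node = series["metric"].get("node", "")
        for timestamp, value in series["values"]:
            writer.writerow([timestamp, node, value])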

Summary

It’s a challenging task in the era of cloud computing to ensure the stability, availability and consistent performance of any system. An environment running Cloud Pak for Data, a cloud native Data and AI platform, is no exception. This makes systematically monitoring the cluster extremely important to support proactive measures.

This article has covered some basic steps and best practices for monitoring such environments. These steps give you early insight into your system so potential constraints and risks can be addressed proactively. Ongoing monitoring that follows a systematic approach can cut downtime and help quickly identify any high-level capacity or configuration issues. A stable, consistently performant system means happy customers and is essential for business success.

Acknowledgement

The author would like to thank Heike Leuschner, Yuan-Hsin Chen and Eling Chen for their contributions to the content and for developing some of the best practices.

References

Cloud Pak for Data 3.5.0 Knowledge Center
Prometheus query language
Prometheus HTTP API
Python requests module
