Performance Monitoring Best Practices for IBM Cloud Pak for Data - Part 1
Overview
Performance and scalability have always been important aspects of any production system. Ensuring them requires a systematic approach and ongoing monitoring. This task becomes more critical, and often more challenging, in the era of cloud computing, and environments running Cloud Pak for Data, IBM’s Data and AI platform, are no exception.
Cloud Pak for Data offers a large number of services and runs in various cloud environments. Users enjoy its rich functionality and the benefits of cloud computing. At the same time, they need to address the challenge of ensuring consistent performance and stability. The good news is that there are simple steps to achieve this using existing monitoring tools.
These two articles (Part 1 and Part 2) share some basic steps and best practices that we have used in our day-to-day work. The steps leverage existing tools to cover the different layers of the cloud environment, from the Cloud Pak for Data layer down to the Red Hat OpenShift layer. These articles cover what is available at each layer and explain how to get the usage metrics needed to understand system health and performance.
To build a holistic view of the system, the key layers to cover at a high level include:
Application and Cloud Pak for Data level metrics
- Application-level metrics specific to your workloads
- Key Cloud Pak for Data platform and service-level metrics from the Cloud Pak for Data web client
Cluster-level basic monitoring
- Cluster-level metrics under the “Monitoring” section of the OpenShift Container Platform (OCP) console
- Additional details from the Grafana dashboards, accessed via the OCP console
Advanced monitoring and analysis
- Targeted monitoring using customized queries from the Prometheus UI
The instructions in these articles apply to systems running Cloud Pak for Data 3.5 on OpenShift Container Platform 4.x.
Cloud Pak for Data Platform Management
The platform management page in the web client offers a convenient way of monitoring a set of important metrics. Users can use these metrics to get a sense of overall system health and status.
Non-functional concerns normally surface as users experiencing system slowness or application failures. As a reactive step, using the right tools to check the key metrics can lead to the right solutions. But an even better approach is to be proactive, with ongoing, regular monitoring of the key metrics at different levels. This provides a chance to address root causes before they have any real impact on users.
To achieve this, some measures should be put in place to cover the whole system stack.
First, the application should track its performance through some built-in application-level metrics. These should include throughput measures, such as jobs per hour or transactions per second, along with response times for the core API calls or HTTP requests. At this level, it is important to retain these metrics to show historical trends and to set thresholds that trigger useful alerts. All of this is very workload specific, so the details are beyond the scope of this article.
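As a minimal illustration of a response-time probe, the following samples the status code and total time of one HTTP request with curl; the endpoint is a hypothetical placeholder to be replaced with one of your workload’s core APIs:

    # Sample the response time of a core API endpoint (hypothetical URL)
    curl -k -s -o /dev/null -w "%{http_code} %{time_total}s\n" \
        https://<cpd-cluster-route>/api/v1/your-endpoint

Running such a probe on a schedule, and recording the results, yields the historical trend data mentioned above.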
The next level is the “Platform management” page in the Cloud Pak for Data web client. The console in the Cloud Pak for Data 3.5 release has some very useful new enhancements.
To access this page, you need to log into the Cloud Pak for Data web client (aka console), as shown below for a hypothetical cluster.
Once you are on the home page, you will see “Welcome, admin!”. As shown above, click the “Manage the platform” option to reach the “Platform management” page.
As shown above, the page provides a very high-level view of what is running on this cluster. In this case, an idle system has nine services installed and 249 pods running. The vCPU section shows the amount of CPU currently in use, along with the total CPU requests and limits. The Memory section shows the amount of memory currently in use, along with the total memory requests and limits.
In this example cluster, usage is far below the reserved resources because the system is idle. A healthy system under active load can have CPU and memory usage exceed the total requests by 30%-50%. For example, with total requests of 100 vCPUs, sustained usage of 130-150 vCPUs is normal. But usage should not approach the total limits, because limits are expected to be overcommitted.
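The same usage-versus-commitment picture can be cross-checked from the command line. A minimal sketch, assuming the cluster monitoring stack is running and the node name placeholder is replaced with a real one:

    # Current CPU and memory usage for every node
    oc adm top nodes
    # Requests and limits committed on one node, as a share of its allocatable resources
    oc describe node <worker-node-name> | grep -A 8 "Allocated resources"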
The “Platform management” page has four tabs. The “Services” tab shows more details at the individual service level. The other tabs cover service instances, environments, and pods.
For example, a user can see the aggregated usage, request, and limit data for CPU and memory at the service level. Users can use this information to identify the top consumers or track usage trends, and then plan ahead. Based on the same information, users can set service-level quotas with proper values. This ensures that one service cannot overuse cluster resources and limits the impact of one service on the others.
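In Cloud Pak for Data 3.5, these service-level quotas are set through the platform management page itself. If your cluster also enforces namespace-level quotas through standard Kubernetes ResourceQuota objects, they can be inspected from the CLI; the sketch below assumes the default “zen” namespace:

    # List any Kubernetes ResourceQuota objects in the Cloud Pak for Data namespace
    oc get resourcequota -n zen
    # Show the full definitions, including any hard caps on CPU and memory
    oc get resourcequota -n zen -o yaml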
Users can customize what is in the view by clicking the “gear” icon (where the red arrow points in the picture above). Given the limited width of the page, picking the metrics of most interest allows for an easy view of the key data. The choices shown above are a good example for performance-focused monitoring.
This customization is available at both the service level and the pod level. Users can click a service name to show only the pods of that service, or view all the pods in one list under the “Pods” tab.
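The same pod-level view can be approximated from the CLI, which is handy when the web client itself is slow. A quick sketch against the default “zen” namespace:

    # List all Cloud Pak for Data pods with status and restart counts
    oc get pods -n zen -o wide
    # Narrow the list to pods that are not in the Running state
    oc get pods -n zen | grep -v Running

Note that the second command also lists pods in the Completed state, which is normal for finished jobs.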
The following sections cover what the OpenShift Container Platform (OCP) console offers, with breakdown views at the node and pod level. Such breakdown views show historical usage and trends to support proactive mitigation.
OpenShift Container Platform console
The OpenShift Container Platform (OCP) console offers more options for monitoring. Once logged into the OCP Console, users will see the options under the “Monitoring” section.
- From the “Alerting” option, users can filter to look for high severity alerts. Users can check what alert rules are configured by default as well. This part is not the focus of this article.
- From the “Metrics” option, you can access the Prometheus UI as shown in the picture below. The next post has more details on running customized Prometheus queries.
- From the “Dashboards” option, the Grafana UI link will lead users to the Grafana dashboards. Note that sometimes you might need to log in again with the same kubeadmin credentials. The URLs for all three UIs can also be found from the CLI, as shown below.
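These UIs are exposed as routes in the openshift-monitoring namespace, so a quick way to find their URLs (assuming cluster-admin access) is:

    # List the monitoring routes (Grafana, Prometheus, Alertmanager)
    oc get routes -n openshift-monitoring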
The rest of this section shows how to use the cluster-level metrics available from the Grafana dashboards in the OCP console.
Once you are on the home page of the Grafana dashboard, click “Home”.
Then click “Default”, and a list of dashboards will be shown.
First, let’s check the “Kubernetes/Compute Resources/Cluster” dashboard. Note that the graph has a time frame selection in the upper right corner. By default, it shows the data points for the last 30 minutes. This setting can be adjusted to cover the time period of interest.
The sample above shows the last 12 hours of CPU and Memory usage by namespace. The top row above the CPU usage graph shows the high-level metrics of resource usage, request, and limit commitments from all the services running at the cluster level.
Next is the “Kubernetes/Compute Resources/Node (Pods)” dashboard, which shows, for a selected node, resource usage by all of that node’s pods in a stacked graph. Multiple nodes can be chosen via multi-select in the node selection box, but the clearest way is to check one node at a time. The picture above shows the CPU and memory usage for one worker node.
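To cross-check a single node outside of Grafana, you can list the pods scheduled on it and its current usage; the node name below is a placeholder:

    # List all pods scheduled on a specific node
    oc get pods --all-namespaces -o wide --field-selector spec.nodeName=<worker-node-name>
    # Current CPU and memory usage for that node
    oc adm top node <worker-node-name>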
The “Kubernetes/Compute Resources/Namespace (Pods)” dashboard lets users zoom into a particular namespace and see CPU and memory usage by the pods of that namespace. In the case of Cloud Pak for Data, this is a namespace called “zen” by default. The dashboard shows which pods are the top resource consumers in the selected namespace.
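A quick CLI equivalent of this top-consumers view, assuming the monitoring stack is available:

    # Current CPU and memory usage for each pod in the zen namespace
    oc adm top pods -n zen

The Grafana view remains better for spotting trends, since the CLI output is only a point-in-time snapshot.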
By clicking one or more individual pods in the graph legend, users can limit the graph to the usage of just the selected pods.
Below are sample graphs for CPU and Memory usage from clicking the pod called “zen-metastoredb-2”.
Interestingly, the “Memory usage” graph above shows a drop to 0 between 22:00 and 23:00. That could indicate a pod restart, caused either by a planned restart by the user or by a problem such as an out-of-memory (OOM) kill. Further details on what triggered the restart can be gathered from the pod description and from the logs of the previous container instance.
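A minimal sketch of that investigation, using the pod from this example in the default “zen” namespace:

    # Show the restart count and the last termination reason (e.g., OOMKilled)
    oc describe pod zen-metastoredb-2 -n zen
    # Retrieve the logs of the previous (terminated) container instance
    oc logs zen-metastoredb-2 -n zen --previous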
Here are example log entries from a pod restart situation:
    I201005 06:41:28.433027 1 cli/start.go:765 received signal 'terminated'
    I201005 06:41:28.447901 1 cli/start.go:830 initiating graceful shutdown of server
Quick Recap
Up to this point, this post has covered how to use the metrics from the following layers to support performance monitoring tasks:
- Application and Cloud Pak for Data-level metrics
- Cluster-level basic monitoring
This concludes Part 1 of this blog.
Part 2 will cover “Advanced monitoring and analysis,” showing how to use customized queries from the Prometheus UI for more targeted monitoring.
References
Cloud Pak for Data 3.5.0 Knowledge Center
Prometheus query language
Prometheus HTTP API
Python requests module