Using IBM Turbonomic for Monitoring Cloud Pak for Data — Part 1

Built-in metrics from basic UI navigation

Yongli An
IBM Data Science in Practice
Nov 24, 2023 · 11 min read


Overview

In this article, we share our experience with IBM Turbonomic in Cloud Pak for Data (CP4D) environments. We focus on using its monitoring capabilities to help our development teams analyze resource usage and identify performance and resource optimizations. What we have learned should be helpful to you as well. Turbonomic's monitoring capability gives a clear view of what's going on in your development, testing, and production environments at various levels. With better visibility and observability, you can take timely action to prevent unplanned production outages.

Why Turbonomic? Better together with monitoring

While Kubernetes provides the benefits of application elasticity, it can be hard to manage the platform complexity that comes with dynamic workloads and infrastructure. When you face resource challenges, the starting point is to see what's going on in the cluster at the node level, and then at the pod and container level within the various applications. These initial steps are basic monitoring, done before taking any corresponding actions. While OpenShift Container Platform offers tools to monitor various layers of a cluster, it doesn't make any recommendations (refer to this monitoring blog). Turbonomic provides unique advantages. It simplifies the process by having built-in key metrics. With the reporting feature, you can use Turbonomic to create customized Grafana dashboards in a few easy steps to capture the most important metrics.

Turbonomic can be used for monitoring the Red Hat OpenShift clusters running CP4D workloads to help customers achieve optimal CP4D performance in different environments, such as public cloud or on prem. You can try this yourself — Turbonomic is now available for use with Cloud Pak for Data by using a free, limited-time Evaluation Edition.

Turbonomic installs onto a Red Hat OpenShift cluster in a single namespace deployment, by using a certified operator available through the Red Hat OpenShift Container Platform Operator Hub, making it an easy choice for those already running CP4D in Red Hat OpenShift clusters.

In addition to the monitoring capability, Turbonomic automatically determines the right resource allocation actions. It then makes recommendations around how to scale, provision, deprovision, and optimize resources for performance, cost, and availability. These optimization actions are in read-only mode in the Evaluation Edition and aren't covered in this article. These more advanced scenarios might be disruptive operations because pods must restart to resize. Such advanced use requires extra considerations and further validation. Nevertheless, Turbonomic is still valuable as a powerful monitoring tool in any CP4D environment.

What metrics and why

It’s a relatively involved task to monitor Red Hat OpenShift clusters, analyze various metrics, and identify the performance bottlenecks. This requires hands-on skills and experience in these various environments. It also demands a high level of performance knowledge and practices to do the job efficiently. Turbonomic offers an easy way to accomplish some of the core steps automatically, with flexibility to build reporting dashboards focusing on metrics that matter the most. Turbonomic offers the potential for seamless observability, enhanced visibility through reporting, and streamlined troubleshooting.

Out of the many metrics in the system, there are a few most important ones that we recommend our users always monitor. These metrics have been used in our day-to-day performance engineering for Cloud Pak for Data, either through manual steps or homegrown scripts-based dashboards. Now all of these are readily available in Turbonomic.

In this article, we focus on the benefit of Turbonomic’s monitoring capabilities to help you understand the resource usage situation in the CP4D cluster.

This is part 1 of the article, in which we introduce some basic UI navigation to show the built-in metrics. Because there are limitations in what’s offered, we go to the report service for help, which will be explained in part 2 of this article.

This article assumes that you have a Turbonomic server up and running, and the client-side setup (also referred to as the Kubeturbo probe) is available in your CP4D cluster. For more help, refer to the reference section at the end of this article for installation and setup instructions.

UI navigation for basic monitoring

Turbonomic offers a list of views and metrics focusing on different aspects of the objects and entities in the Red Hat OpenShift clusters. This helps you evaluate the resource efficiency and any potential performance constraints.

Given the rich functions offered by Turbonomic, we strongly recommend that you focus on the metrics, charts, and navigation scenarios covered in this article. These are validated for use with CP4D. While other metrics and information might be useful, we have not been able to validate them sufficiently to offer guidance at this point.

Here are some navigation tips:

  • Avoid using your browser controls to go back or forward.
  • Instead, use the navigation buttons in Turbonomic, such as:
    - The go back or go forward arrows in the top left corner.
    - The clickable tabs and links to switch to different pages.
  • Use the links within views to open tabs that provide a drill-down view.
  • When available, you can click the X icon in the top right corner to close the current view. This returns you to the previous view.
  • If you know the object or resource you are looking for, you can use the Search button in the left navigation.

Scenario 1 — Cluster-level monitoring

Usually, performance analysis requires you to start from the cluster level and dig down, layer by layer, to find the root cause. This scenario demonstrates how to use Turbonomic for high-level monitoring and analysis.

First, log in to Turbonomic using your user ID and password. Depending on the type of permission that you have for your user ID, you might have access to more than one cluster. We want to first focus on the cluster of interest. To do so:

  1. Click the Container Platform Cluster entity ring.
  2. Switch to the List of Container Platform Cluster tab.
  3. Search for the short name of the cluster to investigate.

At this point, your scope should be the cluster of interest. You can review the charts on the Overview tab. Some of the actions you can take are:

  • Change the chart time frame. The default is 24 hours.
  • Expand each chart with the SHOW ALL button.
  • Sort by certain metrics with the SORT BY drop down.

Let’s look at two useful charts for cluster-level monitoring.

The Top Virtual Machines chart provides vCPU and memory usage over request and limit at the node level. For each node, there are six metrics you can use to see whether there are any resource constraints. Specifically, we want to focus on these four metrics:

  • Virtual CPU: vCPU usage vs. node capacity
  • Virtual CPU Request: vCPU request vs. node capacity
  • Virtual memory: memory usage vs. node capacity
  • Virtual memory Request: memory request vs. node capacity
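If you want to sanity-check what the chart shows, these four ratios can be computed directly from the node figures. Below is a minimal sketch using hypothetical numbers; real values come from your own nodes:

```python
# Hypothetical node figures for illustration only; in practice these come
# from the Top Virtual Machines chart (or from the nodes themselves).
node_capacity = {"vcpu": 16.0, "mem_gib": 64.0}   # allocatable node capacity
node_usage    = {"vcpu": 8.0,  "mem_gib": 48.0}   # current usage on the node
node_requests = {"vcpu": 12.0, "mem_gib": 56.0}   # sum of pod requests on the node

def pct(part, whole):
    """Return part as a percentage of whole, rounded to one decimal place."""
    return round(100.0 * part / whole, 1)

metrics = {
    "Virtual CPU (usage vs. capacity)":            pct(node_usage["vcpu"], node_capacity["vcpu"]),
    "Virtual CPU Request (request vs. capacity)":  pct(node_requests["vcpu"], node_capacity["vcpu"]),
    "Virtual memory (usage vs. capacity)":         pct(node_usage["mem_gib"], node_capacity["mem_gib"]),
    "Virtual memory Request (request vs. capacity)": pct(node_requests["mem_gib"], node_capacity["mem_gib"]),
}

for name, value in metrics.items():
    print(f"{name}: {value}%")
```

A request percentage near 100% means the node is nearly fully committed by scheduling requests even if actual usage is lower, which is often the first sign of a capacity constraint.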

Next, the Top Namespaces chart provides the resource usage, request, and limit at the namespace level. You can scope the chart to a specific namespace for further investigation.

Below is one example showing the top namespaces sorted by VIRTUAL CPU USED. As expected, the namespace zen is at the top.

Scenario 2 — Dig deeper by looking at the containers/pods

Once you analyze at the high level, you might want to dig a little deeper to see details at the container/pod level.

For this navigation flow, start with the same steps as in scenario 1. Alternatively, you can use the go back arrow in Turbonomic or click Home.

When you see the hierarchy tree in the left panel:

  1. Click the Container Platform Cluster entity ring.
  2. Switch to the List of Container Platform Cluster tab, and search for the short name of the cluster of interest.
  3. Click the Namespace entity ring, switch to the List of Namespaces, and search for the namespace that you want to investigate. For example, type zen in the search box and click the namespace zen.

Now, your view is scoped to this one namespace of the cluster. You can review the charts on the Overview tab, but click Show all on the Top Container Spec chart to focus on it. This chart gives us container-level information with the following five metrics:

  1. Virtual CPU: vCPU usage vs. Limit
  2. Virtual CPU Request: vCPU usage vs. Request
  3. Virtual memory: memory usage vs. Limit
  4. Virtual memory Request: memory usage vs. Request
  5. CPU Throttling
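CPU Throttling is the least intuitive of these metrics. A container with a CPU limit is scheduled by the kernel in fixed CFS periods, and one common way to express throttling is the share of those periods in which the container hit its limit and was throttled. The sketch below uses hypothetical counters; Turbonomic's exact formula may differ:

```python
# Hypothetical CFS counters, as exposed by the cgroup cpu.stat file:
#   nr_periods   - scheduling periods in which the container was runnable
#   nr_throttled - periods in which it hit its CPU limit and was throttled
nr_periods = 2000
nr_throttled = 240

# Throttling expressed as a percentage of runnable periods.
throttle_pct = 100.0 * nr_throttled / nr_periods if nr_periods else 0.0
print(f"CPU throttling: {throttle_pct:.1f}%")
```

Sustained throttling on a container usually means its CPU limit is set too low for its workload, even if the node itself has spare capacity.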

Depending on what kind of analysis you are doing, you can sort the list of containers/pods the way you need.

For example, if you are investigating whether there are any resource constraints, you might want to sort by VIRTUAL CPU or MEMORY in descending order, which shows you the containers with the highest vCPU or memory usage relative to their limit. If you are investigating whether any pods are candidates for reduced resource settings, you might want to sort by VIRTUAL CPU or MEMORY in ascending order, which shows you the pods with the lowest usage relative to their limit.
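The two sort orders amount to ranking containers by their usage-to-limit ratio. A minimal sketch with hypothetical container names and numbers:

```python
# Hypothetical per-container figures for illustration; in the UI this
# corresponds to sorting the Top Container Spec chart by VIRTUAL CPU.
containers = [
    {"name": "zen-core", "vcpu_used": 0.9, "vcpu_limit": 1.0},
    {"name": "spark-hb", "vcpu_used": 0.2, "vcpu_limit": 2.0},
    {"name": "couchdb",  "vcpu_used": 1.5, "vcpu_limit": 4.0},
]

def utilization(c):
    """Usage as a fraction of the container's CPU limit."""
    return c["vcpu_used"] / c["vcpu_limit"]

# Descending: likely resource constraints (closest to their limit first).
constrained = sorted(containers, key=utilization, reverse=True)
# Ascending: right-sizing candidates (lowest utilization first).
oversized = sorted(containers, key=utilization)

print("most constrained:", constrained[0]["name"])  # zen-core at 90% of limit
print("most oversized:",   oversized[0]["name"])    # spark-hb at 10% of limit
```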

Scenario 3 — Analyzing a problematic container/pod

If you have a problematic container/pod to target, you can bypass some of the steps and go straight to the source.

Log in using your user ID and password. From the home page, click the Container Platform Cluster ring and select the cluster name to scope to the targeted cluster. Click the Container Specs entity ring to navigate directly to the Container Specs view of the cluster.

Next, switch to the LIST OF CONTAINER SPECS tab and search for the container name.

Select the container/pod that you want to investigate and review all the charts on the OVERVIEW tab. For the most part, your focus should be on the Capacity and Usage chart, which shows the details of resource usage at the container/pod level. For example, enter spark-hb and click the matching container. You see this view:

The last column shows the current utilization as compared to capacity.

Some of the charts with names that contain Multiple Resources include the history of the metrics. Seeing the history is important: patterns and trends can be identified for monitoring and problem determination.

When you focus on one container, the chart is much easier to view because it includes fewer metrics. You can also click any of the metrics to show or hide it. This example chart shows the four important metrics while hiding the others. The container in this example appears to be healthy.

At the time of publication, we noticed a limitation: dynamic pods that started and completed in the past are not included in the views or graphs, even in the historical view. Some of them might show up if they are still running while you are browsing the charts. The advanced report feature addresses this limitation to some degree, as covered in part 2 of this article.

Other best practices and reminders

For any serious application such as CP4D in production, a proper process is required to govern preproduction testing and change control in production. Most of our customers start with preproduction load testing that uses realistic data and realistic load levels to evaluate and ensure quality and performance before moving to production.

In those preproduction test environments, Turbonomic is very useful for understanding whether there are any potential node capacity issues or pod-level constraints. The required actions might be either scaling the CP4D service up or down to the next t-shirt size or custom tuning of certain pods of a service after consulting IBM support. Once the changes are validated and finalized in preproduction, they can be rolled out to the production environment within a planned time window for minimal impact.

In the production environment, Turbonomic can be used to continue monitoring the system and help decide on proactive corrective actions. Rule-based automatic tuning or automatic scaling is most likely not acceptable when such actions cause service disruption.

For those interested in the recommendations made by Turbonomic, the test environments can be used to apply those recommendations and validate their impact. Some of those recommendations, when validated in test environments, can be promoted following your standard change control and promotion process. In other words, even if you have a fully licensed version of Turbonomic, it’s not recommended to enable auto execution of the recommended actions in production.

At the time of publication, some additional integration work between CP4D and Turbonomic is still pending before such seamless auto execution of the recommended tuning can work properly in the testing environment.

With the Turbonomic Evaluation Edition, we expect everything to work as expected for enhanced visibility and observability (excluding the action/execution part). But be aware that some environment and infrastructure issues might cause Turbonomic to stop working properly. Users should try the typical practices of restarting pods and/or waiting for environmental issues to settle down. If the setup still does not recover, it's better to re-create it, assuming there is no need to keep and recover the old data. There is no formal support with the Evaluation Edition.

Summary

Turbonomic is a powerful tool that you can use to gain much better visibility into Cloud Pak for Data at the cluster, service, and pod levels, which also means much-enhanced observability. The built-in views cover a wide range of entities and metrics. In part 2 of this article, you learn how the advanced report dashboards make tracking the most important metrics much easier. You can focus on analyzing the trends and patterns because the key metrics are constantly and continuously available in the Grafana dashboards that are pre-built for Cloud Pak for Data clusters.

Authors

Yongli An is a Senior Technical Staff Member and technical leader as part of the IBM Data and AI development organization. He currently works with various experts and stakeholders focusing on optimizing the performance, scalability, elasticity, stability, and serviceability of Cloud Pak for Data. You can reach him at yongli@ca.ibm.com.

Judy Liu is a member of the Cloud Pak for Data Platform Performance team. She has over 10 years of experience improving the performance of various IBM products. Her latest role has her focusing on performance optimization, scalability, and sizing of Cloud Pak for Data. You can reach her at judyliu@ca.ibm.com.

Acknowledgment

The authors would like to thank Eva Tuczai and her colleagues in the Turbonomic organization for their continued collaboration and support to improve the integration between Turbonomic and Cloud Pak for Data. Eva is an IBM Turbonomic Product Manager who has focused on Kubernetes and OpenShift Container Platform optimization for 7 years. You can reach her at eva.tuczai@ibm.com.

References

