Improve Cluster Balance with Cloud Pak for Data Scheduler — Part 2

Yongli An
IBM Data Science in Practice
8 min read · Jul 5, 2023

The default Kubernetes scheduler has some limitations that can cause unbalanced clusters. You can find more details in Part 1 on this topic.

In an unbalanced cluster, some worker nodes are overloaded while others are under-utilized. Such clusters hurt both resource usage efficiency and the consistency of the performance experience. In this article, “cluster balance” means that resource usage is roughly even across the worker nodes; we will use “cluster balance” and “resource usage balance” interchangeably.

Part 1 introduced the IBM Cloud Pak for Data (CPD) scheduler and explained how it works. It also included proof points, from both dynamic workloads and a simple initial CPD deployment, showing that the CPD scheduler can mitigate cluster balance issues.

This article, Part 2, continues the same topic with more comprehensive test cases. We will show that the CPD scheduler can significantly and consistently improve cluster balance.

Past experience with the default k8s scheduler shows unpredictability in cluster balance: resource usage can vary widely across the worker nodes after a set of applications or services is deployed into the cluster. Factors behind this unpredictability include the following:

  • cluster size;
  • the total number of services and the total amount of resources that initially needs to be allocated for the deployed services (i.e., the total request from all the pods);
  • the total resources needed relative to the cluster size; or
  • the timing of one deployment relative to another with the same services in the same cluster.

To ensure high confidence in the test results, we designed more tests. These tests cover various conditions for the initial services installation, taking the above factors into account.

Let me explain the test design in more detail in the next section.

Design the Tests

We want to confirm whether the CPD scheduler consistently provides better resource usage balance in a cluster. To do so, we designed a set of initial installation tests to cover various conditions, then ran them to check the CPD scheduler’s behavior and its impact on cluster balance.

The tests focus on the relationship between cluster capacity and total deployment size, where deployment size is measured as the sum of the request settings of all the application pods.
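
As a rough illustration of what “total request” means here, the CPU requests of all the pods can be tallied straight from the pod specs. The snippet below is only a sketch; the namespace name cpd-instance is an assumption, not necessarily where your services live.

# Sketch: sum the CPU requests of all pods in an assumed namespace "cpd-instance",
# normalizing whole cores to millicores.
oc -n cpd-instance get pods -o jsonpath='{..resources.requests.cpu}' \
  | tr ' ' '\n' \
  | awk '/m$/ {sub(/m$/, ""); total += $1; next}
         NF   {total += $1 * 1000}
         END  {printf "total CPU request: %dm\n", total}'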

To be specific, the tests cover the following three major conditions.

  • High resource usage pressure (cluster capacity = 150% of the total request)
    - Small cluster relative to the allocated resources
    - Cluster CPU capacity = total CPU request x 150%; that is, ~67% of cluster capacity is allocated
  • Medium resource usage pressure (cluster capacity = 200% of the total request)
    - Medium cluster relative to the allocated resources
    - Cluster CPU capacity = total CPU request x 200%; that is, ~50% of cluster capacity is allocated
  • Low resource usage pressure (cluster capacity = 400% of the total request)
    - Large cluster relative to the allocated resources
    - Cluster CPU capacity = total CPU request x 400%; that is, only 25% of cluster capacity is allocated

We use the request setting as the main factor in the test design because the pod request is the metric the scheduling algorithm uses. As the number of services in a deployment increases, the total CPU request grows, the total allocated CPU and memory in the cluster grow with it, and the resource usage pressure on the cluster increases.
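
For reference, the request setting lives in each pod’s container spec. A generic way to set it on a hypothetical deployment named demo-svc is shown below; this is just an illustration of where the request values come from, not how the CPD services themselves are configured.

# Sketch: set CPU/memory requests (and limits) on a hypothetical deployment
# named "demo-svc"; the scheduler places pods based on the request values.
oc set resources deployment/demo-svc \
  --requests=cpu=500m,memory=1Gi \
  --limits=cpu=2,memory=4Gi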

To achieve the above conditions, while working within the resource limitations of our lab, we created the following combinations:

Table 1: The Deployment Combinations

Note: To contrast with the default scheduler’s behavior, the default scheduler was tested as well, but only for the 3 combinations marked in Table 1. These are representative enough; there is no need to repeat the other combinations for similar results.

Each worker node has 16 vCPUs and 64 GB of memory. We ran each of the 9 installation combinations at least 3 times to check the consistency of the scheduler behavior, except for the heavy deployment on the large cluster, which we skipped due to resource availability.

Data Collection and Metric Calculation

To quantify how balanced a cluster is, we use the allocated CPU percentage for each worker node, taken from the “Allocated resources” section of the `oc describe node` output. We collect this after the deployment process has fully completed.

Below is a sample of that output for one worker node:

Resource           Requests      Limits
--------           --------      ------
cpu                2294m (14%)   15450m (99%)
memory             5426Mi (8%)   23016Mi (37%)
ephemeral-storage  428Mi (0%)    1312Mi (0%)
hugepages-1Gi      0 (0%)        0 (0%)
hugepages-2Mi      0 (0%)        0 (0%)
Events: <none>

The example above shows that 14% of the CPU capacity on this worker node is allocated (i.e., its CPU resource allocation percentage is 14%).
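
To reproduce that output, a command along these lines pulls the “Allocated resources” section for a single worker node (the node name is a placeholder; the exact command used in our tooling may differ):

# Show the "Allocated resources" section for one worker node; replace
# <worker-node-name> with a real node name from `oc get nodes`.
oc describe node <worker-node-name> | grep -A 8 'Allocated resources'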

Below are the steps taken as part of the result analysis:

  1. capture the CPU allocation percentage for every worker node in the cluster from the `oc describe node` output
  2. identify the maximum and minimum values across all the worker nodes
  3. define the “balance indicator” as the CPU resource allocation percentage variation, calculated as “maximum allocation percentage minus minimum allocation percentage” (the lower the value, the smaller the gap and the more balanced the cluster, and vice versa)
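
As a sketch of how these steps can be automated, the script below loops over the worker nodes, extracts each node’s CPU request percentage from the “Allocated resources” section, and prints the balance indicator. It assumes `oc` is logged in to the cluster and that worker nodes carry the standard node-role.kubernetes.io/worker label; it illustrates the calculation rather than the exact tooling we used.

#!/usr/bin/env bash
# Sketch: compute the balance indicator (max - min CPU allocation %) across
# all worker nodes.
max=0; min=100
for node in $(oc get nodes -l node-role.kubernetes.io/worker -o name); do
  # In the "Allocated resources" section, the cpu line looks like:
  #   cpu   2294m (14%)   15450m (99%)
  pct=$(oc describe "$node" \
        | awk '/Allocated resources/,0' \
        | awk '$1 == "cpu" {gsub(/[()%]/, "", $3); print $3; exit}')
  [ -n "$pct" ] || continue   # skip nodes we could not parse
  echo "$node: ${pct}% CPU requested"
  (( pct > max )) && max=$pct
  (( pct < min )) && min=$pct
done
echo "balance indicator = $(( max - min ))%"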

Baseline from the Default Kubernetes Scheduler

With that definition in mind, the following graph shows the balance indicator values from the default k8s scheduler. The graph also includes the average allocation %, maximum allocation % and minimum allocation %.

In the test systems, the default scheduler is customized with the “least CPU requests” policy.

Figure 1: Cluster Balance Results from the Default Scheduler

The 3 combinations above were chosen to cover the 3 representative resource usage pressure levels: the highest comes from a heavy deployment in a relatively small cluster, and the lowest from a light deployment in a relatively large cluster. We expect the default scheduler to lead to unbalanced clusters, and hence expect to see the following:

  • the average allocation percentage goes from high to low
  • the minimum allocation percentage may likely go from high to low
  • the maximum allocation percentage may likely go from high to low

What is hard to predict is how the balance level changes among these 3 combinations. We measure the balance level by the gap between the maximum and minimum allocation percentages, shown by the “balance indicator” bars on the left side of the graph above. The medium usage pressure scenario (medium cluster with a medium deployment) showed the biggest gap, at 67%. The light usage pressure scenario (large cluster with a light deployment) produced a relatively more balanced cluster.

This kind of variation in balance level also happens often in the field and in our day-to-day internal deployments: we see a lack of repeatability and consistency. Repeatability matters for a system to maintain a good performance experience.

Results from the CPD scheduler

This section will show the cluster balance results from the combinations covering the 3 resource usage pressure levels.

In the test systems, the CPD scheduler is installed with the “least CPU requests” policy enabled.

# To enable the feature, edit the configmap ibm-cpd-scheduler-scheduler and 
# set nodePreference to LessCPURequest.
# Restart the scheduler pod ibm-cpd-scheduler-scheduler-* in ibm-common-services.
# Configure node preference, valid values are:
# LessGPURequest LessCPURequest LessMemRequest LessGPULimit LessCPULimit
# LessMemLimit
# To configure more than one values, separate the values by space, like
# "nodePreference: LessCPURequest LessMemRequest"
nodePreference: LessCPURequest
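
In practice, the steps described in those comments boil down to editing the configmap and restarting the scheduler pod, for example along these lines (the configmap name, pod name pattern and ibm-common-services namespace come from the comments above; the placeholder pod name and the assumption that the configmap lives in that same namespace are ours):

# Set nodePreference to LessCPURequest in the scheduler configmap,
# then restart the scheduler pod so it picks up the change.
oc -n ibm-common-services edit configmap ibm-cpd-scheduler-scheduler
oc -n ibm-common-services get pods | grep ibm-cpd-scheduler-scheduler
oc -n ibm-common-services delete pod <ibm-cpd-scheduler-scheduler-pod-name>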

High Usage Pressure Deployments

Figure 2 below shows the balance indicator and other metrics from the 3 deployment combinations. These are the combinations under the first column in Table 1. The tests focus on the high resource usage pressure scenario.

Using the CPD scheduler, the balance indicator values for all three deployments are under 20%; in particular, the medium and heavy deployments have values only slightly above 10%. This indicates the clusters are very balanced. You can see the same pattern in the small gaps among the other 3 resource allocation metrics (average, max and min).

Figure 2: High Usage Pressure Deployment Balance Results

Compare this with the baseline for the high resource usage pressure deployment: the baseline balance indicator value is 46%, indicating a very unbalanced cluster, with the least loaded node at 53% allocated and the most heavily loaded node at 99%.

Medium Usage Pressure Deployments

Figure 3 below shows the balance indicator and other metrics from the 3 deployment combinations. These are the combinations under the second column in Table 1. The tests focus on the medium resource usage pressure scenario.

The balance indicator values for all three deployments are under 20%; in particular, the medium and heavy deployments have values only slightly above 10%. This indicates the clusters are very balanced. You can conclude the same from the small gaps among the other 3 resource allocation metrics (average, max and min). As expected, all of those metrics are lower in these medium usage pressure deployments than in the high usage pressure deployments.

Figure 3: Medium Usage Pressure Deployment Balance Results

Compare this with the baseline for the medium resource usage pressure deployment: the baseline balance indicator value is 67%, the largest gap seen in our baseline tests, indicating a very unbalanced cluster with a 67-percentage-point spread between the least and the most heavily loaded nodes.

Low Usage Pressure Deployments

Figure 4 below shows the balance indicator and other metrics from the 3 deployment combinations. These are the combinations under the last column in Table 1. The tests focus on the low resource usage pressure scenario.

The balance indicator values range from 15% to 23%, a bit higher than in the other 2 scenarios but still indicating very balanced clusters. You can conclude the same from the small gaps among the other 3 resource allocation metrics (average, max and min). All of those metrics are lower in these low usage pressure deployments than in the medium usage pressure deployments. This is no surprise.

Figure 4: Low Usage Pressure Deployment Balance Results

Compared with the baseline for the low usage pressure deployment, the balance indicator values are similar, and neither case has major cluster balance issues. However, the balance achieved by the default k8s scheduler is unlikely to be as consistent or repeatable as that from the CPD scheduler.

Conclusions

We carefully designed a broader set of tests and analyzed results that proved repeatable. The CPD scheduler clearly works better than the default k8s scheduler, giving more balanced clusters under various resource usage pressure levels. If you are seeing issues caused by cluster imbalance, remember that the CPD scheduler can be a great option to help.

Acknowledgments

The author would like to thank Jun Zhu, who helped deliver the test results in this report, and Jun Feng Liu and Michael Closson for supporting us on this effort.
