Fortifying IBM Cloud Private for large enterprise Power systems

E980 exploded view

Running distributed applications on large IBM Power servers can be technically challenging because of the sheer size and complexity of the machines. For example, an IBM Power AC922 server, the building block of the Summit supercomputer, can have 40 cores and 1 TB of memory. Similarly, the IBM Power System E980, a multi-node enterprise POWER9 server, can have up to 192 cores and 64 TB of memory.

You can read more about IBM's big iron Power servers in The Next Platform's article on these systems.

It’s never good when end users encounter issues on these larger servers, but such issues are an opportunity for engineers to show their mettle. This article briefly describes some of the problems that came up when our team ran an instance of IBM Cloud Private on these large systems, and how we overcame them.

The Background

IBM Cloud Private is an application platform that includes the container orchestrator Kubernetes, a private image registry, a management console, and monitoring frameworks. Container deployment best practices recommend that you run all containers with resource limits. This ensures that the environment remains stable and that no single container can consume all the resources and starve the others. Setting the right limits is not easy and requires multiple iterations of running the containers in different environments.
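As a minimal sketch of what such a limit looks like, here is a Kubernetes Pod spec fragment. The container name, image, and sizes are illustrative only, not IBM Cloud Private's actual settings:

```yaml
# Hypothetical Pod spec fragment; names and sizes are illustrative.
apiVersion: v1
kind: Pod
metadata:
  name: sample-app
spec:
  containers:
  - name: sample-app
    image: sample-app:latest
    resources:
      requests:          # what the scheduler reserves for the container
        memory: "256Mi"
        cpu: "250m"
      limits:            # hard caps; exceeding the memory limit gets the
        memory: "512Mi"  # container stopped with an out-of-memory error
        cpu: "1"
```

Picking the numbers in the `limits` section is the hard part, as the rest of this article shows.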

The Problem

“When solving problems, dig at the roots instead of just hacking at the leaves.” — Anthony J. D’Angelo

A few weeks into running IBM Cloud Private 3.1 in their IBM Power environments, some of our large enterprise customers reported issues. Two examples of the issues include the following:

  1. The system froze after some time.
  2. The UI returned gateway timeouts.

The problems appeared different and isolated, but an important part of resolving issues is the art of finding the thread that links them. Service engineers on our team started debugging and worked closely with the development team to identify the causes. We ran kernel dumps and analyzed system logs, Docker logs, and Kubernetes logs for each of the problems. The common theme was multiple out-of-memory errors on the system.

Interestingly, the system had plenty of free memory; it was the individual Kubernetes Pods that were running out of memory. If you are familiar with Kubernetes, you might guess that the out-of-memory errors were due to Pod memory limits being reached.

We were surprised that we never saw these out-of-memory issues in our benchmarking and testing. After more troubleshooting, our team discovered the following explanations for the frequent out-of-memory issues:

  • Certain combinations were not represented in our benchmarking and tests. Power servers support bare metal (no hypervisor), KVM, and PowerVM, with 2, 4, or 8 hardware threads per core (SMT2, SMT4, or SMT8). The SMT mode can be changed dynamically on a running system, and the default memory limits for some containers were not enough when running with a higher SMT mode (4 or 8) or with a large number of vCPUs.
  • Analyzing the kernel dump also pointed to memory exhaustion on the system. Out-of-memory errors and overall memory exhaustion after a few days of uptime indicated a memory leak. The 3.10.x series kernel in Red Hat Enterprise Linux had bugs that leaked memory for short-lived containers, and containers stopped by out-of-memory errors are, by definition, short-lived. The leak eventually consumed all of the available memory and the system became unstable. For more details about these issues, see Red Hat Bugzilla bug 1507149. Unfortunately, there is no easy fix other than upgrading to the latest 4.x series kernel, and in many situations, upgrading the kernel is not an option.
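The SMT effect in the first bullet can be made concrete with a small sketch. Assuming, purely for illustration, a fixed per-worker memory overhead for a program that spawns one worker per visible CPU, memory consumption scales with cores × hardware threads:

```python
def visible_cpus(cores: int, smt: int) -> int:
    """Logical CPUs the OS reports: physical cores times hardware threads per core."""
    return cores * smt

def worker_memory_mb(cores: int, smt: int, per_worker_mb: int = 50) -> int:
    """Memory used by a program that spawns one worker per logical CPU.

    The 50 MB per-worker figure is an illustrative assumption, not a measured value.
    """
    return visible_cpus(cores, smt) * per_worker_mb

# The same 16-core partition, switched from SMT2 to SMT8 on a live system,
# quadruples the footprint and can blow past a limit tuned for SMT2.
print(worker_memory_mb(16, 2))  # 1600
print(worker_memory_mb(16, 8))  # 6400
```

A container limit sized against the SMT2 number looks generous until the SMT mode changes underneath it.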

The Solution

The critical thing to tackle was how to avoid out-of-memory errors by setting optimal limits for the containers. Our testing indicated that there could be occasional spikes in container memory usage, which could breach the limit and get the container stopped by an out-of-memory error. We could increase the limit, but by how much? Certain limits might be good to start with, but they might need to change over time based on usage.

Our team decided to take the following actions to resolve these issues:

  • We added a small swap partition and disabled swap accounting, which absorbed short spikes in container memory usage beyond the set limit and avoided out-of-memory errors. Using swap in Kubernetes is a much-debated topic, which you can learn more about by reading Kubernetes GitHub issue 53533. A small swap space of 2 - 4 GB on a solid-state drive is enough to solve the problem, but it should not share a drive with I/O-heavy paths such as /var/lib/docker and /var/log.
  • We checked whether programs spawn a fixed number of worker threads or scale the number with the total CPU count. The latter can produce surprising results on large systems. For example, nginx by default spawns one worker process per CPU, which on a system with hundreds of logical CPUs multiplies memory usage with unintended consequences. Configuring a fixed number of workers is the more consistent choice.
  • We set up notifications to identify when memory limits were being approached. We needed a way to monitor the resource usage of the containers to determine whether any changes were needed to resource limits. Setting up these notifications can be done easily with Prometheus and Alertmanager. For example, you can add an alert that notifies you when the memory usage of a container is continuously above 90% of its limit, which indicates that the sizing might need to be increased.
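The swap setup described above can be sketched as a pair of configuration fragments. The device name and the rest of the GRUB command line are illustrative; `swapaccount=0` disables cgroup swap accounting on kernels where it is enabled, and the GRUB change takes effect only after rebuilding the GRUB configuration and rebooting:

```
# /etc/fstab -- a small dedicated swap partition on an SSD (device name illustrative)
/dev/sdb1  none  swap  sw  0  0

# /etc/default/grub -- disable cgroup swap accounting via the kernel command line
GRUB_CMDLINE_LINUX="... swapaccount=0"
```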
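For the nginx example above, the fix is a one-line change in nginx.conf: pin the worker count instead of letting it track the CPU count. The value 4 is illustrative; size it for the actual workload:

```
# nginx.conf -- "auto" spawns one worker per logical CPU, which on a large
# scale-up system can mean hundreds of workers; a fixed count keeps memory
# usage predictable regardless of the SMT mode.
worker_processes 4;    # instead of: worker_processes auto;
```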
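A Prometheus alerting rule along the lines described above might look like the following sketch. It uses the standard cAdvisor metrics rather than anything specific to IBM Cloud Private, and label names (`container` versus `container_name`) vary by Kubernetes version:

```yaml
groups:
- name: container-memory
  rules:
  - alert: ContainerNearMemoryLimit
    # Working-set memory above 90% of the container's limit for 10 minutes.
    # The "> 0" filter on the denominator skips containers with no limit set.
    expr: |
      container_memory_working_set_bytes{container!=""}
        / (container_spec_memory_limit_bytes{container!=""} > 0) > 0.9
    for: 10m
    annotations:
      summary: "Container {{ $labels.container }} is above 90% of its memory limit; consider resizing."
```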

You can find all the recommended settings for IBM Cloud Private in the Configuring for an IBM Power environment IBM Knowledge Center topic.

The problems that our team encountered were not unique to Power, but they manifested much faster on large scale-up systems. They also highlighted the need to test community code across varied infrastructure combinations and scenarios. It’s a continuous learning process.

Hopefully, this article helps you with your own continued learning.

I will end this with one of my favorite quotes by an unknown source:

“Never stop learning because life never stops teaching.”