Demystifying Kernel Dumps on OpenShift

Oct 28, 2023

In the world of Kubernetes, managing containers and orchestrating applications is akin to sailing a modern vessel through the vast and ever-changing seas of technology. But as any sailor knows, even the sturdiest ships may encounter turbulent waters, sometimes leading to unexpected shipwrecks. When your Kubernetes nodes experience kernel panics or system crashes, it’s like stumbling upon a hidden treasure chest on a mysterious island. To uncover the riches, you need a map, and in the digital realm, that map is the kernel dump.

Imagine a kernel dump as the X on a treasure map, revealing the secrets of what caused your system to hiccup or even halt. Just as intrepid adventurers decipher cryptic clues to find the buried loot, we’ll embark on a journey to demystify the world of kernel dumps on OpenShift nodes. This article will provide you with a step-by-step guide on how to generate, read, and interpret these digital treasure maps, enabling you to uncover the precious backtraces hidden within.

Whether you’re a system administrator, a DevOps engineer, or simply someone with a curiosity for digital exploration, this article aims to equip you with the knowledge and tools necessary to decode the mysteries of system crashes in your OpenShift cluster. So, imagine yourself as a modern-day pirate, ready to sail the high seas of kernel dumps, unravel the secrets of your digital shipwrecks, and unearth the vital information needed to navigate your OpenShift environment safely.

Before we get our hands dirty (and wet), let’s review what we are about to do.

To accomplish this process end to end, we will enable kdump on our cluster’s nodes as the kernel crash dumping mechanism, install the crash utility on our bastion as the dump analysis tool, and install the matching kernel-debuginfo packages so that crash can analyze the kernel dump properly.

Ahoy, sailors! Let’s start the journey.

Pre-Requisites

OpenShift cluster-admin permissions
Access to the cluster’s nodes
Bastion (or dedicated machine)
RHEL Subscription for the relevant repos

Step by Step guidelines

Enable Kdump

To enable kdump and configure it properly, we will create a MachineConfig.

99-worker-kdump.yaml:

apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker
  name: 99-worker-kdump
spec:
  config:
    ignition:
      version: 3.2.0
    storage:
      files:
        - contents:
            compression: ""
            source: data:,path%20%2Fvar%2Fcrash%0Acore_collector%20makedumpfile%20-l%20--message-level%207%20-d%2031%0A
          mode: 420
          overwrite: true
          path: /etc/kdump.conf
        - contents:
            compression: gzip
            source: data:;base64,H4sIAAAAAAAC/2SPvW4qMRBGe55i5J6rS3oXCFYIkQVElCid5bVnvVbGHuOfTcLTR4lEmlTfqT6dc9g+92e1OfX9+rh93B87den600snxdQcJu2wwC/doFAblMWhObg2jxWInRraqAgjlHfPlQax+Hu6Pp+741YKn6+JiSBmZVIrcgUZC1ZlcfYGCxiXuSVlfdEDoQwYOH9CMCh5HCG2oH+gWZz/mcmTzRiXQX/IB0g6eiNX/yEz15G0KzLyqD2BNsmryCpgmLgmag5q1rEknTFWdc+TEWfMEPlNF/reORjOaC1MWOrdScDi0L12G7W+7J6kWJZ77r7fSTEH8rHdxOIrAAD//7olvARYAQAA
          mode: 420
          overwrite: true
          path: /etc/sysconfig/kdump
    systemd:
      units:
        - enabled: true
          name: kdump.service
  kernelArguments:
    - crashkernel=256M

oc apply -f 99-worker-kdump.yaml
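Applying a MachineConfig that changes kernel arguments makes the Machine Config Operator roll the change out node by node, rebooting each one, so the worker pool takes a while to settle. A minimal sketch for waiting on that rollout is shown here. The names OC and wait_for_pool and the 30m default are assumptions for illustration, and the pool may briefly still report Updated right after the apply, so it is worth checking oc get mcp as well.

```shell
#!/bin/sh
# Sketch: wait for a MachineConfigPool to finish rolling out a new MachineConfig.
# OC and wait_for_pool are illustrative names, not part of any official tooling.
OC="${OC:-oc}"

wait_for_pool() {
  pool="${1:-worker}"   # pool name, matching the role label in the MachineConfig
  timeout="${2:-30m}"   # assumed time budget; adjust to the size of your pool
  "$OC" wait "mcp/${pool}" --for=condition=Updated "--timeout=${timeout}"
}

# Example, from a machine that is logged in to the cluster:
#   oc get mcp worker
#   wait_for_pool worker 45m
```

Keeping the oc binary behind a variable makes the function easy to dry-run, for example with OC=echo.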

kernelArguments: kernel arguments that reserve memory for the crash kernel (crashkernel=256M).
contents: the contents of ‘/etc/kdump.conf’ and ‘/etc/sysconfig/kdump’, encoded as data URLs.

/etc/kdump.conf :

path /var/crash
core_collector makedumpfile -l --message-level 7 -d 31

/etc/sysconfig/kdump :

KDUMP_COMMANDLINE_REMOVE="hugepages hugepagesz slub_debug quiet log_buf_len swiotlb"
KDUMP_COMMANDLINE_APPEND="irqpoll nr_cpus=1 reset_devices cgroup_disable=memory mce=off numa=off udev.children-max=2 panic=10 rootflags=nofail acpi_no_memhotplug transparent_hugepage=never nokaslr novmcoredd hest_disable" 
KEXEC_ARGS="-s"
KDUMP_IMG="vmlinuz"
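The source value for /etc/sysconfig/kdump in the MachineConfig is this file gzip-compressed and then base64-encoded into a data URL. The sketch below shows one way such a value could be produced and inspected. The function names are hypothetical, and it assumes gzip and a base64 that accepts -d are available.

```shell
#!/bin/sh
# Sketch: build and inspect a gzip+base64 data URL like the one used for
# /etc/sysconfig/kdump in the MachineConfig. Function names are illustrative.

# Read a file on stdin, print a "data:;base64,..." URL on stdout.
encode_gzip_data_url() {
  printf 'data:;base64,%s\n' "$(gzip -c | base64 | tr -d '\n')"
}

# Read a "data:;base64,..." URL on stdin, print the original file on stdout.
decode_gzip_data_url() {
  sed 's/^data:;base64,//' | base64 -d | gunzip -c
}

# Example:
#   encode_gzip_data_url < sysconfig-kdump.txt
#   echo 'data:;base64,<value from the MachineConfig>' | decode_gzip_data_url
```

Decoding the value before applying the MachineConfig is a cheap way to confirm that the node will get the file you expect.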

2. Access one of your cluster’s nodes and check that kdump is active

systemctl status kdump
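Besides the service status, it can help to confirm that the node actually booted with the crashkernel argument. A common way to reach a node shell is oc debug node/<node-name> followed by chroot /host. The helper below is a small sketch with a hypothetical name (has_crashkernel) that inspects a kernel command line.

```shell
#!/bin/sh
# Sketch: check whether a kernel command line contains a crashkernel= argument.
# has_crashkernel is an illustrative helper name.
has_crashkernel() {
  # Use the first argument if given, otherwise read the running kernel's cmdline.
  cmdline="${1-$(cat /proc/cmdline)}"
  case " ${cmdline} " in
    *" crashkernel="*) return 0 ;;
    *) return 1 ;;
  esac
}

# Example, on the node:
#   systemctl is-active kdump
#   has_crashkernel && echo "crashkernel argument is present"
```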

NOTE: Enabling kdump on each node has consequences:

Less available RAM due to memory being reserved for the crash kernel.
Node unavailability while the kernel is dumping the core.
Additional storage space being used to store the crash dumps.

You might want to have an additional kdump-enabled node.
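To put numbers on the first and last of these costs, the kernel exposes the reserved crash memory in bytes under /sys/kernel/kexec_crash_size, and du reports the space taken by the dumps. A rough sketch follows; the helper names are assumptions made for illustration.

```shell
#!/bin/sh
# Sketch: report how much memory is reserved for the crash kernel.
# bytes_to_mib and reserved_crash_mib are illustrative helper names.

bytes_to_mib() {
  echo $(( $1 / 1024 / 1024 ))
}

reserved_crash_mib() {
  # The file holds the reservation size in bytes; the path can be overridden.
  f="${1:-/sys/kernel/kexec_crash_size}"
  bytes_to_mib "$(cat "$f")"
}

# Example, on the node:
#   echo "reserved for crash kernel: $(reserved_crash_mib) MiB"
#   du -sh /var/crash
```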

Force a kernel crash

To test kdump, and to learn how to analyze a kernel crash, we will force a kernel crash on a specific node by executing the following as root:

echo c > /proc/sysrq-trigger

Here’s what this command does:

echo: This command writes its arguments to standard output; the > redirection sends that output to the file on the right.
c: In this case, it's the data being written. The letter "c" corresponds to the "crash" action.
/proc/sysrq-trigger: This is the special file in the /proc directory that allows you to send various commands to the kernel for debugging and system management purposes. Writing "c" to this file triggers a kernel crash.
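Since this command takes the node down immediately, it can be worth wrapping it in a guard. The sketch below refuses to run without an explicit confirmation flag and first checks /sys/kernel/kexec_crash_loaded, which reads 1 when a crash kernel is loaded. The function names and the flag are hypothetical, and the trigger itself must run as root on the node.

```shell
#!/bin/sh
# Sketch: guarded wrapper around "echo c > /proc/sysrq-trigger".
# crash_kernel_loaded, trigger_crash and the confirmation flag are illustrative.

crash_kernel_loaded() {
  f="${1:-/sys/kernel/kexec_crash_loaded}"
  [ -r "$f" ] && [ "$(cat "$f")" = "1" ]
}

trigger_crash() {
  if [ "${1:-}" != "--yes-crash-this-node" ]; then
    echo "refusing: pass --yes-crash-this-node to confirm" >&2
    return 2
  fi
  if ! crash_kernel_loaded; then
    echo "no crash kernel loaded; a panic now would not produce a vmcore" >&2
    return 1
  fi
  echo c > /proc/sysrq-trigger   # the node panics here and kdump takes over
}

# Example, as root on the node you intend to crash:
#   trigger_crash --yes-crash-this-node
```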

2. After the kernel has crashed and the node has rebooted, let’s check what is happening under /var/crash

ls /var/crash
ls /var/crash/<timestamp-dir>

We can see that a directory named after the crash timestamp has been created, and it contains three files: vmcore, vmcore-dmesg.txt, and kexec-dmesg.log
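If a node has crashed more than once, there will be several of these directories. The sketch below picks the most recent one by name, assuming the usual layout where each local dump lands in a directory such as 127.0.0.1-2023-10-28-12:34:56, so a plain sort orders them by time. latest_crash_dir is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: print the newest crash directory under a base path (default /var/crash).
# Assumes directory names end in a sortable timestamp; latest_crash_dir is illustrative.
latest_crash_dir() {
  base="${1:-/var/crash}"
  for d in "$base"/*/; do
    [ -d "$d" ] && printf '%s\n' "${d%/}"
  done | sort | tail -n 1
}

# Example, on the node:
#   ls -lh "$(latest_crash_dir)"
```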

But what is this vmcore?

A vmcore file, often named “vmcore” or “vmcore.X,” is a memory dump file that contains a snapshot of the system’s physical memory at the time of a kernel panic or system crash in a Linux or Unix-like operating system. This file captures the state of the system’s memory, including the kernel, running processes, and other data structures when a critical error or kernel panic occurs.

Think of a vmcore file as a freeze-frame captured during a movie's dramatic scene. Just as a freeze-frame captures a specific moment in the movie, a vmcore file captures a snapshot of a computer's memory at a critical point in its operation.

Imagine you’re watching a thrilling action sequence in a movie. During an intense moment, you hit the pause button, freezing the image on the screen. That frozen frame shows you everything that was happening at that exact instant: the hero, the villains, the environment, and the intensity of the action. Similarly, a vmcore file is like that pause button for a computer system. It captures the entire memory of the system, including the core components like the "heroic" kernel and the "characters" represented by running processes.

Alright folks, let’s continue.

Install Crash & Configure

Access your bastion (or additional machine) and install the ‘crash’ utility

dnf install crash -y

To use crash we MUST have two things: the vmcore, and the vmlinux that matches the vmcore’s kernel version.

You already know what a vmcore is, so let’s understand what vmlinux is and why we need it.

vmlinux is the uncompressed, statically linked Linux kernel binary that contains the core code and data structures of the Linux kernel. In our case, crash uses its symbol information to map the kernel’s behavior and function calls in the ‘frozen frame’ we captured.

2. Import the vmcore from your node by copying it to the bastion

scp core@node-1:/var/crash/<timestamp-dir>/vmcore .
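Before hunting for debug packages, it helps to know exactly which kernel release produced the dump. If you also copy vmcore-dmesg.txt from the same directory, its boot banner ("Linux version ...") carries the release string. The helper below is a sketch with a hypothetical name. The crash utility also has an --osrelease option that reads the release from the dump file itself.

```shell
#!/bin/sh
# Sketch: extract the kernel release from the "Linux version <release>" banner
# in a dmesg text file. kernel_version_from_dmesg is an illustrative name.
kernel_version_from_dmesg() {
  sed -n 's/.*Linux version \([^ ]*\).*/\1/p' "$1" | head -n 1
}

# Example, on the bastion:
#   kernel_version_from_dmesg vmcore-dmesg.txt
#   crash --osrelease vmcore
```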

3. Now all we have left is to make sure we have the correct vmlinux, so that its symbol information matches the vmcore.

To accomplish that we have two options

Option A:

If the bastion’s kernel is the same version as the node’s kernel:

[ "$(uname -r)" = "$(ssh core@node-1 uname -r)" ] && echo "kernel versions match"

2. We can enable the relevant yum repos and install the kernel-debuginfo and kernel-debuginfo-common packages

subscription-manager repos --enable=rhel-8-for-$(uname -m)-baseos-debug-rpms --enable=rhel-8-for-$(uname -m)-appstream-debug-rpms

yum install kernel-debuginfo-$(uname -r) kernel-debuginfo-common-$(uname -m)-$(uname -r)

Option B:

1. Navigate to https://access.redhat.com

2. Downloads → Red Hat Packages → search for the kernel-debuginfo and kernel-debuginfo-common packages that match your OpenShift node’s kernel version.

3. Download the packages and Install them on your bastion

rpm -i kernel-debuginfo-<node-kernel-version> kernel-debuginfo-common-<node-kernel-version>

NOTE: you can use --oldpackage if necessary

Check that the new vmlinux has been created under /usr/lib/debug/lib/modules/<kernel-version>/
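That check can be scripted. The sketch below builds the expected vmlinux path for a given kernel release and reports whether the file is there. The helper names are assumptions for illustration.

```shell
#!/bin/sh
# Sketch: locate the debuginfo vmlinux for a given kernel release.
# vmlinux_path and check_vmlinux are illustrative helper names.

vmlinux_path() {
  printf '/usr/lib/debug/lib/modules/%s/vmlinux\n' "$1"
}

check_vmlinux() {
  p="$(vmlinux_path "$1")"
  if [ -f "$p" ]; then
    echo "found: $p"
  else
    echo "missing: $p" >&2
    return 1
  fi
}

# Example, on the bastion (Option A, where the bastion kernel matches the node):
#   check_vmlinux "$(uname -r)"
```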

Use crash to analyze

To start crash, pass it the vmlinux and the vmcore:

crash /usr/lib/debug/lib/modules/<vmcore-kernel-version>/vmlinux <path-to-vmcore>

To display the kernel message buffer, type the log command
To display the kernel stack trace, type the bt command or bt pid to display the backtrace of the selected process
To display status of processes in the system, type the ps command
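The same commands can also be run non-interactively, which is handy when you want to attach the output to a ticket. The sketch below writes the three commands above, plus quit, into a file that crash can read through its -i option. The function name and file names are hypothetical, and the crash invocation is left as a comment since it needs a real vmlinux and vmcore.

```shell
#!/bin/sh
# Sketch: prepare a command file for a non-interactive crash session.
# write_crash_commands and the file names below are illustrative.
write_crash_commands() {
  out="$1"
  cat > "$out" <<'EOF'
log
bt
ps
quit
EOF
}

# Example, on the bastion:
#   write_crash_commands cmds.txt
#   crash -s -i cmds.txt /usr/lib/debug/lib/modules/<vmcore-kernel-version>/vmlinux <path-to-vmcore> > report.txt
```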

You are now able to analyze a kernel crash and extract backtraces from it!