Troubleshooting CrashLoopBackOff with Stratoshark

Nigel Douglas
7 min read · Jan 20, 2025


CrashLoopBackOff issues in Kubernetes are among the most common challenges developers and operators face. These errors occur when a pod repeatedly crashes and is restarted by Kubernetes, signalling underlying issues that need attention. In this blog, we’ll dive into how you can use Stratoshark to analyse system call activity during a CrashLoopBackOff scenario. By capturing and analysing system calls, you can better understand the root cause of the issue and refine your troubleshooting process.

To set the stage, I created a basic environment on AWS using a standard Ubuntu 20.04 AMI instance. After setting up the instance, I installed containerd as my container runtime and Cilium as the CNI plugin to build a fully functional Kubernetes cluster.

Capturing CrashLoopBackOff Activity with Sysdig OSS

To analyse the issue, I installed Sysdig OSS (also referred to as Sysdig CLI), a powerful open-source tool for capturing and inspecting system call activity. If you haven’t already installed Sysdig, here’s the command I used on my Linux instance:

curl -s https://s3.amazonaws.com/download.draios.com/stable/install-sysdig | sudo bash
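
If you want to confirm the install worked before capturing anything, the CLI can report its version. This is just a quick sanity check:

sudo sysdig --version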

Once installed, I wanted to capture a 30-second snapshot of all system call activity related to containers, excluding anything from the Ubuntu host itself. Here’s the command I ran to create the .scap capture file:

sudo timeout 30 sysdig -w crashloopbackoff.scap container.name!=host
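
Before moving the file into Stratoshark, you can sanity-check it by replaying it with the same CLI. Here's a quick sketch using the capture file and filter from the command above:

sudo sysdig -r crashloopbackoff.scap container.name!=host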

The Unexpected Discovery in Stratoshark

After collecting the capture file, I opened it in Stratoshark, a tool designed to provide deep visibility into application-level behaviour by analysing syscall data. My goal was to filter the activity related to the nginx process using the query:

proc.name == "nginx"

Surprisingly, the filtered search in Stratoshark returned no results. This was unexpected since the query seemed correct, and I had just deployed an nginx workload designed to trigger CrashLoopBackOff activity. So, what could be causing this discrepancy?

In the following sections, we’ll investigate the root cause of this issue, highlight common pitfalls, and demonstrate how to ensure the necessary system calls are captured correctly for troubleshooting.

How I initially forced the CrashLoopBackOff

Below is an example of a Kubernetes pod YAML file that creates an nginx workload deliberately configured to result in a CrashLoopBackOff state. This configuration introduces a non-existent command in the args section, causing the container to fail during startup.

apiVersion: v1
kind: Pod
metadata:
  name: nginx-crashloop
  labels:
    app: nginx
spec:
  containers:
  - name: nginx
    image: nginx:latest
    args:
    - invalid-command # Deliberate invalid argument
    resources:
      requests:
        memory: "64Mi"
        cpu: "250m"
      limits:
        memory: "128Mi"
        cpu: "500m"
    ports:
    - containerPort: 80
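
To reproduce the state, save the manifest (assumed here as nginx-crashloop.yaml) and apply it, then watch the pod status cycle through Error and CrashLoopBackOff:

kubectl apply -f nginx-crashloop.yaml
kubectl get pod nginx-crashloop -w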

So why didn’t the nginx image show any activity for a process named “nginx” in my .scap file?

The issue lies in what we expect proc.name == "nginx" to match in the .scap file when the nginx container is crashing due to an invalid command. Here’s an explanation of what might be happening and how to troubleshoot it:

  1. Command Fails Before nginx Starts:
    The args parameter specifies an invalid command (invalid-command) that prevents the nginx process from starting. As a result, the nginx binary never gets executed, and no proc.name == "nginx" entry is created.

  2. No Successful Execution:
    The .scap file only captures system calls from processes that actually run. If the container fails immediately during initialisation, no meaningful syscalls will be captured for the nginx process (one way to verify this from the capture itself is sketched below).
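
To confirm this from the capture, one option is to filter on exec events from containers rather than on the process name. A sketch of that idea, assuming the same capture file as above:

sudo sysdig -r crashloopbackoff.scap "evt.type=execve and container.name!=host"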

I ran the below command to check the log output of my pod, and I received an error message:

kubectl logs nginx-crashloop 
/docker-entrypoint.sh: 47: exec: invalid-command: not found

The error /docker-entrypoint.sh: 47: exec: invalid-command: not found confirms that the issue lies with the args field in the pod YAML file. The specified argument (invalid-command) is not a valid executable, so the container fails during initialisation without ever starting the nginx process. Consequently, there are no syscalls matching proc.name == "nginx" in my .scap capture.

Explanation of the Error

  1. The nginx image’s docker-entrypoint.sh script is responsible for starting the container.
  2. When I provide the invalid args, they override the image’s default command, so the entrypoint script attempts to execute invalid-command instead of nginx.
  3. Since invalid-command is not a valid executable, the entrypoint script fails, and the container crashes immediately (the status checks below show the same failure from the Kubernetes side).
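
These are ordinary Kubernetes status checks, not Stratoshark-specific. The describe output shows the restart history, and the jsonpath query pulls out the last exit code of the failed container:

kubectl describe pod nginx-crashloop
kubectl get pod nginx-crashloop -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'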

Corrected YAML for CrashLoopBackOff

The pod should go into a CrashLoopBackOff state when the container exits with a non-zero status (exit 1), with restartPolicy: Always forcing Kubernetes to restart it repeatedly. This time, though, the goal is for nginx to actually start (and generate syscalls) before the container fails. Here’s the revised version with adjustments to ensure that behaviour:

apiVersion: v1
kind: Pod
metadata:
  name: nginx-crashloop
  labels:
    app: nginx
spec:
  containers:
  - name: nginx
    image: nginx:latest
    command:
    - /bin/sh
    - -c
    - |
      # Start nginx (triggers system calls)
      nginx -g "daemon off;" &
      # Intentionally cause a failure after a short delay
      sleep 2
      exit 1
    resources:
      requests:
        memory: "64Mi"
        cpu: "250m"
      limits:
        memory: "128Mi"
        cpu: "500m"
    ports:
    - containerPort: 80
  restartPolicy: Always

Changes Made

  1. Command Fix:
  • Changed sh to /bin/sh for compatibility across Linux distributions.
  • Added an “&” to run nginx in the background while the script continues.

  2. Crash Timing:
  • Introduced a sleep 2 delay to allow nginx to run briefly before the container exits with exit 1.

  3. Verification Step:
  • Ensured restartPolicy: Always is set, which is crucial for the pod to restart repeatedly.

In the first terminal window, run the below command to confirm we are receiving system call activity from the nginx pod:

sudo sysdig proc.name=nginx

In the second tab, we can apply the nginx Pod YAML manifest, and check the pod status to confirm the processes are terminating due to the CrashLoopBackOff state:

kubectl apply -f https://raw.githubusercontent.com/NigelDouglas30/sysdig-inspect/refs/heads/main/nginx.yaml
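
To check the status, a plain kubectl get is enough; after a couple of restarts the STATUS column should read CrashLoopBackOff:

kubectl get pod nginx-crashloop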

Since we know that our nginx process is starting successfully this time around, let’s record a 15-second system call capture using the Sysdig CLI:

sudo timeout 15 sysdig -w nginx-record.scap
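
Before opening the file in Stratoshark, you can confirm that nginx events actually made it into the capture by replaying it with the CLI:

sudo sysdig -r nginx-record.scap proc.name=nginx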

Monitoring system call activity in Stratoshark

Once again, let’s use the filtered search proc.name == "nginx" in Stratoshark, this time on the newly captured .scap file. Now we are seeing activity. Great!

With the below filtered search in Stratoshark, we can see all instances of write activity to log files — whether these are related to performance issues or security incidents:

evt.is_io_write == True && evt.dir == "<" && (fd.name contains "/var/log" || fd.name contains ".log" || fd.name contains "_log") && proc.name == "nginx"

The below filtered search shows all failed system calls in the capture. Again, I have filtered this down to the “nginx” process only.

(evt.failed == True) && !(evt.res == "EAGAIN") && !(evt.res == "EALREADY") && proc.name == "nginx"

Looking closer into the process hierarchy, you can spot the intentional command that I inserted in my nginx definition file to force this CrashLoopBackOff issue:

Using the below filter, we can see the exact moment in time that the nginx pod definition file was applied. This helps security teams understand what was applied, by whom, and when those changes were made:

evt.type == "execve" and evt.dir == "<" and (proc.pname == "bash" or proc.pname == "zsh" or proc.pname == "tcsh" or proc.pname == "ksh" or proc.pname == "fish")
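
If you only care about kubectl invocations rather than every command launched from a shell, a narrower variant of the same idea is the following (the process name here is an assumption; adjust it to your environment):

evt.type == "execve" and evt.dir == "<" and proc.name == "kubectl"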

Conclusion

CrashLoopBackOff issues can be complex, but with the right tools and techniques, you can efficiently diagnose and resolve them. By combining Sysdig OSS for syscall capture and Stratoshark for detailed analysis, you gain unparalleled visibility into system-level activity in Kubernetes environments.

In this guide, we explored how to configure a deliberately failing nginx pod to generate system call activity while triggering a CrashLoopBackOff state. We demonstrated how to capture this activity, analyse it using filters in Stratoshark, and identify the root cause of the issue.

By crafting targeted YAML definitions, running precise filtered searches, and leveraging powerful tools like Sysdig and Stratoshark, you can ensure that every security incident or performance issue is traceable and actionable. This workflow is invaluable not only for resolving CrashLoopBackOff scenarios but also for improving overall Kubernetes observability and response capabilities.
