Debugging CrashLoopBackoffs with Init-Containers

Sidhartha Mani
Published in Koki
Feb 6, 2018

TL;DR This blog post discusses the technique of using a debug container during the initialization of a Kubernetes Pod for diagnosing and fixing crashes.

Kubernetes pod crashes are among the hardest issues to debug. Some are well known and relatively easy to diagnose. Others are time-consuming and hard to track down.

The following are some well known Pod Startup issues related to Images:

  • ImagePullBackOff
  • ImageInspectError
  • ErrImagePull
  • ErrImageNeverPull
  • RegistryUnavailable
  • InvalidImageName

Here are some Pod issues associated with runtime:

  • CrashLoopBackOff
  • RunContainerError
  • KillContainerError
  • VerifyNonRootError
  • RunInitContainerError
  • CreatePodSandboxError
  • ConfigPodSandboxError
  • KillPodSandboxError
  • SetupNetworkError
  • TeardownNetworkError

CrashLoopBackOff covers a large set of errors that are all hidden behind the same error condition. Here’s one that I ran into a few days back:

  • A Pod in an AWS cluster was unable to talk to the AWS EC2 API. The logs simply said “RequestError: send request failed caused by: Post https://ec2.us-east-1.amazonaws.com/: dial tcp: i/o timeout”.
  • The Pod would exit with status 1 and crash-loop.
  • The container image held only a statically compiled binary and no shell, so I could not exec into it.

Debugging Steps

As with any Pod crash, these are the steps I usually follow:

  • kubectl describe pod
  • kubectl logs pod -c container (or) kubectl logs -p pod
  • kube-controller-manager logs (not applicable for above issue)
  • kubelet logs
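The kubectl portion of these steps can be wrapped in a small helper. This is only a sketch: the function name triage_pod and the KUBECTL override variable are my own illustrative names, not something the cluster or kubectl provides.

```shell
# KUBECTL is a hypothetical override hook (handy for dry runs);
# it defaults to the real kubectl binary.
KUBECTL="${KUBECTL:-kubectl}"

# triage_pod POD CONTAINER: run the first-pass checks for a crashing pod.
triage_pod() {
  # Events, container states, exit codes and restart counts.
  "$KUBECTL" describe pod "$1"
  # Logs from the current attempt of the container.
  "$KUBECTL" logs "$1" -c "$2"
  # Logs from the previous (crashed) attempt; --previous is the long form of -p.
  "$KUBECTL" logs "$1" -c "$2" --previous
}

# Usage (hypothetical names): triage_pod my-pod my-container
```

Setting KUBECTL=echo before calling the function prints the commands instead of running them, which is a cheap way to check what would be executed.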

These steps usually serve as a good start to understanding pod startup and crash issues. This time, however, they did not yield any useful breadcrumbs to follow.

The next step is to log in to the host where the container is failing to start and check networking and syslog.

The issue in question is a networking issue — one container could not talk to an external endpoint.

We start by logging into the host that is running the pod. You can discover the host using this command:

kubectl get pods -o wide

The Node column will contain information about the node on which the pod is scheduled. You can then SSH into the node (i.e. host).
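If you only want the node name, for example to feed it into a script, a jsonpath query is one option. A minimal sketch, assuming the pod lives in the given namespace; pod_node and KUBECTL are hypothetical names chosen for this example.

```shell
# KUBECTL is a hypothetical override hook; it defaults to the real kubectl.
KUBECTL="${KUBECTL:-kubectl}"

# pod_node POD [NAMESPACE]: print the name of the node the pod is scheduled on.
pod_node() {
  "$KUBECTL" get pod "$1" -n "${2:-default}" -o jsonpath='{.spec.nodeName}'
}

# Usage (hypothetical pod name): pod_node my-pod
```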

Inside the host

Once inside the host, the first step is to check whether the host’s networking permits talking to the external endpoint. If the external endpoint is unreachable via the host network, then the container will not be able to reach it either.

$ ssh admin@host
$ curl https://ec2.us-east-1.amazonaws.com
301 Permanently moved

This showed that the external endpoint was reachable via host network.

For some reason, the Pod could not do the same.
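When repeating the host-level check, it helps to bound the wait and print only the status code, so a hang and an HTTP error are easy to tell apart. This is a sketch rather than the exact command I ran; http_status and the CURL override are hypothetical names.

```shell
# CURL is a hypothetical override hook; it defaults to the real curl.
CURL="${CURL:-curl}"

# http_status URL: print only the HTTP status code, giving up after 10 seconds.
# A timeout or connection failure makes curl exit non-zero with an error message.
http_status() {
  "$CURL" --silent --show-error --output /dev/null \
    --max-time 10 --write-out '%{http_code}\n' "$1"
}

# Usage: http_status https://ec2.us-east-1.amazonaws.com
```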

The next debugging step is to look into syslog.

Syslog includes Docker daemon logs. Here’s the command to filter for Docker logs:

syslog -k Sender Docker

In this case, the Docker daemon logs did not contain any useful information about the issue, so this approach was a dead end as well.
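On hosts where Docker and the kubelet run as systemd units, the same logs are usually available through journalctl. The sketch below assumes such a host and assumes the units are named docker and kubelet, which may differ on your distribution; unit_logs and JOURNALCTL are hypothetical names.

```shell
# JOURNALCTL is a hypothetical override hook; it defaults to the real journalctl.
JOURNALCTL="${JOURNALCTL:-journalctl}"

# unit_logs UNIT [SINCE]: print recent journal entries for one systemd unit.
unit_logs() {
  "$JOURNALCTL" --unit "$1" --since "${2:-1 hour ago}" --no-pager
}

# Usage (unit names are assumptions):
#   unit_logs docker
#   unit_logs kubelet "30 min ago"
```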

Debugging with Init Container

Logging into the host running the container did not yield any useful breadcrumbs to follow. A method to log in to the container itself was required. However, this was not possible because the image was a static binary.

The next debugging step is to change the pod spec to include an init container with debugging tools and a fully working shell.

There are a few nuances to note here:

  • The init-container should contain all the debugging tools required in the image. When debugging networking issues, it may not be possible to download debugging tools post startup.
  • The debug container should start with a long-running command that can be shut down gracefully. Commands like sleep infinity cannot be shut down gracefully, so the result of a diagnosis and fix cannot be tested right away in the same pod lifecycle.
  • Init-containers are a better choice than normal containers, because they start sequentially and run before the actual containers of the pod. This allows the pod environment to be tested manually before running the containers in the pod.
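One way to get a long-running command that can still finish cleanly is to wait for a marker file. This is a sketch of the idea, not the command I used; the function name wait_for_release and the path /tmp/debug-done are hypothetical.

```shell
# wait_for_release FILE [INTERVAL]: block until FILE exists, then return 0.
# Returning 0 lets the init container complete successfully, so the pod
# moves on to its regular containers in the same lifecycle.
wait_for_release() {
  while [ ! -e "$1" ]; do
    sleep "${2:-5}"
  done
}

# As the init container's main command this would be, for example:
#   wait_for_release /tmp/debug-done
# and from a shell inside the debug container you would end the wait with:
#   touch /tmp/debug-done
```

Because the loop exits with status 0 once the file appears, the application containers start right after you finish debugging, which lets you verify the fix without recreating the pod.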

Here’s a debugging container image with useful debugging tools that I found on Docker Hub:

teran/ubuntu-network-troubleshooting

This debug container contains a working shell, so it is possible to log in to it.

Since this container shares the same pod environment as the failing application container, it behaves the same way as the application container, except for any application-specific setup.

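To use this, update the pod spec to include the debug init container, and then kubectl replace the pod with the new pod spec. Because most fields of a running pod cannot be changed in place, this generally means recreating the pod, for example with kubectl replace --force. Once the init container is running, you can log in to it using the command: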

kubectl exec -it pod -c debug -- sh
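For reference, the pod spec change described above might look roughly like the manifest printed by this helper. Everything except the debug image name is an assumption made for illustration: the pod name my-app, the container names, the application image, and the marker path /tmp/debug-done.

```shell
# debug_pod_manifest: print a pod manifest with a debug init container.
# Pod name, container names and the application image are hypothetical.
debug_pod_manifest() {
  cat <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: my-app
spec:
  initContainers:
    - name: debug
      image: teran/ubuntu-network-troubleshooting
      # Stay alive until someone runs: touch /tmp/debug-done
      command: ["sh", "-c", "until [ -e /tmp/debug-done ]; do sleep 5; done"]
  containers:
    - name: app
      image: example.com/my-app:latest
EOF
}

# Recreate the pod from the manifest, then open a shell in the init container:
#   debug_pod_manifest | kubectl replace --force -f -
#   kubectl exec -it my-app -c debug -- sh
```

The init container keeps the pod in its initialization phase until the marker file is created, so the application container does not start (and crash) while you are still investigating.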

Once inside the debug container, you can debug environment issues like the issue stated above. In the case of my issue, I attempted to connect to the endpoint from inside the debug container:

$ curl https://ec2.us-east-1.amazonaws.com

The command just hung for a while, so I killed it (Ctrl+C) and checked whether there was a route to the internet from this pod:

$ ping 8.8.8.8
PING 8.8.8.8 (8.8.8.8) 56(84) bytes of data.
64 bytes from 8.8.8.8: icmp_seq=1 ttl=59 time=15.9 ms
64 bytes from 8.8.8.8: icmp_seq=2 ttl=59 time=12.7 ms
64 bytes from 8.8.8.8: icmp_seq=3 ttl=59 time=13.9 ms
64 bytes from 8.8.8.8: icmp_seq=4 ttl=59 time=23.9 ms
64 bytes from 8.8.8.8: icmp_seq=5 ttl=59 time=12.9 ms

This showed that there was a route to the outside world by IP address, yet curl against the hostname hung. I suspected a DNS issue.
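The reasoning in this step (reachable by IP address, failing by name, therefore probably DNS) can be captured in a small check. This is a sketch under assumptions: it presumes ping and getent are present in the debug image, and classify_connectivity, PING_CMD and RESOLVE_CMD are hypothetical names.

```shell
# Both hooks are overridable; the defaults assume Linux ping and getent exist.
PING_CMD="${PING_CMD:-ping -c 1 -W 2}"
RESOLVE_CMD="${RESOLVE_CMD:-getent hosts}"

# classify_connectivity HOSTNAME IP: print "no-route", "dns-failure" or "ok".
classify_connectivity() {
  if ! $PING_CMD "$2" >/dev/null 2>&1; then
    # Not even a known IP address answers: routing or firewall problem.
    echo "no-route"
  elif ! $RESOLVE_CMD "$1" >/dev/null 2>&1; then
    # The IP answers but the name does not resolve: look at DNS.
    echo "dns-failure"
  else
    echo "ok"
  fi
}

# Usage: classify_connectivity ec2.us-east-1.amazonaws.com 8.8.8.8
```

Note that "ok" only means the IP answered and the name resolved; it says nothing about whether the endpoint itself is healthy.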

I checked the kube-dns pods in the kube-system namespace and they were not running. This was due to a configuration error, which was an easy fix. Once kube-dns came up, the pod started working as well.
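Checking the DNS pods can be done with a label selector. The sketch below assumes the cluster DNS pods carry the commonly used k8s-app=kube-dns label; dns_pods and KUBECTL are hypothetical names.

```shell
# KUBECTL is a hypothetical override hook; it defaults to the real kubectl.
KUBECTL="${KUBECTL:-kubectl}"

# dns_pods: list the cluster DNS pods with their status and node.
# The label selector is an assumption about how the DNS add-on is labeled.
dns_pods() {
  "$KUBECTL" get pods --namespace kube-system --selector k8s-app=kube-dns -o wide
}

# Usage: dns_pods
```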

Conclusion

This debugging process seems really simple, but it took me almost a whole day. I hope this serves as a guide for others facing similar issues and helps resolve them sooner!

Stay tuned for Kubernetes deep dives!
