TL;DR: This blog post discusses using a debug container during the initialization of a Kubernetes Pod to diagnose and fix crashes.
Kubernetes pod crashes are among the hardest issues to debug. A few are well known and relatively easy to diagnose. Others are time-consuming and much harder to track down.
The following are some well known Pod Startup issues related to Images:
Here are some Pod issues associated with runtime:
CrashLoopBackOff encapsulates a large set of errors that are all hidden behind the same error condition. Here’s an error that I ran into a few days ago:
- A Pod in an AWS cluster was unable to talk to the AWS EC2 API. The logs simply said:
RequestError: send request failed caused by: Post https://ec2.us-east-1.amazonaws.com/: dial tcp: i/o timeout
- Pod would exit with status 1 and crash-loop
- The container image contained only a statically compiled binary, i.e. there was no shell, so I could not exec into it.
As with any Pod crash, here are the usual steps that I follow:
- kubectl describe pod
- kubectl logs pod -c container (or) kubectl logs -p pod
- kube-controller-manager logs (not applicable for above issue)
- kubelet logs
These steps usually serve as a good start to understanding pod startup and crash issues. This time, however, they did not yield any useful breadcrumbs to follow.
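The first three steps above can be bundled into a small helper. This is only a sketch: the function name triage_pod and the pod and container names passed to it are placeholders of mine, not something from the original incident.

```shell
# First-pass triage for a crashing pod.
# Arguments are placeholders: $1 = pod name, $2 = container name.
triage_pod() {
    pod="$1"
    container="$2"

    # Events, container states, exit codes and restart counts.
    kubectl describe pod "$pod"
    # Logs from the current container instance.
    kubectl logs "$pod" -c "$container"
    # Logs from the previous (crashed) instance, same as -p.
    kubectl logs "$pod" -c "$container" --previous
}

# Usage: triage_pod my-pod my-container
```

For a crash-looping pod, the --previous logs are usually the interesting ones, because the current instance may not have written anything yet.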
The next step is to log in to the host where the container is failing to start and check networking and syslog.
The issue in question is a networking issue — one container could not talk to an external endpoint.
We start by logging into the host that is running the pod. You can discover the host using this command:
kubectl get pods -o wide
The Node column will contain information about the node on which the pod is scheduled. You can then SSH into the node (i.e. host).
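If you would rather not read the node off the table by eye, the node name can be pulled out directly with a jsonpath query. A minimal sketch, assuming a hypothetical helper name pod_node and a pod in the current namespace:

```shell
# Print the name of the node a pod is scheduled on.
# $1 = pod name (placeholder).
pod_node() {
    kubectl get pod "$1" -o jsonpath='{.spec.nodeName}'
}

# Usage (the admin user is taken from the SSH example in this post):
#   node="$(pod_node my-pod)" && ssh "admin@$node"
```

Depending on the cluster, the node name may or may not be directly resolvable over SSH, so treat the usage line as an illustration.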
Inside the host
Once inside the host, the first step is to check whether the host’s networking permits talking to the external endpoint. If the endpoint is unreachable via the host network, then the container will not be able to reach it either.
$ ssh admin@host
$ curl https://ec2.us-east-1.amazonaws.com
301 Permanently moved
This showed that the external endpoint was reachable via the host network.
For some reason, the Pod could not do the same.
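A plain curl can hang for a long time when something is wrong, so it helps to wrap the check with a timeout and report the exit code. The following is a sketch under my own assumptions: the helper name check_endpoint and the 10 second limit are arbitrary choices.

```shell
# Report whether a URL answers at the HTTP level within 10 seconds.
# $1 = URL to test.
check_endpoint() {
    url="$1"
    # -s: silent, -o /dev/null: discard the body,
    # -w: print only the HTTP status code, --max-time: give up after 10s.
    if status="$(curl -s -o /dev/null -w '%{http_code}' --max-time 10 "$url")"; then
        echo "reachable (HTTP $status)"
    else
        rc=$?
        echo "unreachable (curl exit code $rc)"
        return 1
    fi
}

# Usage: check_endpoint https://ec2.us-east-1.amazonaws.com
```

The curl exit code is a useful hint by itself: 6 means the hostname could not be resolved, 7 means the connection failed, and 28 means the operation timed out.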
The next debugging step is to look into syslog.
Syslog includes Docker daemon logs. Here’s the command to filter for Docker logs:
syslog -k Sender Docker
In this case, the Docker daemon logs did not contain any useful information about the issue, so this approach was a dead end as well.
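On hosts that run Docker and the kubelet as systemd services, the same logs can also be read through journalctl. This is an assumption about the host setup, not what I used above, and the helper name daemon_logs is hypothetical.

```shell
# Show recent logs for a systemd unit such as docker or kubelet.
# $1 = unit name, $2 = optional time window (defaults to the last hour).
daemon_logs() {
    unit="$1"
    since="${2:-1 hour ago}"
    journalctl -u "$unit" --since "$since" --no-pager
}

# Usage:
#   daemon_logs docker
#   daemon_logs kubelet "15 minutes ago"
```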
Debugging with Init Container
Logging into the host running the container did not yield any useful breadcrumbs to follow. A method to log in to the container itself was required. However, this was not possible because the image contained only a static binary and no shell.
The next debugging step is to change the pod spec to include an init container with debugging tools and a fully working shell.
There are a few nuances to note here:
- The init container image should already contain all the debugging tools required. When debugging networking issues, it may not be possible to download debugging tools after startup.
- The debug container should start with a long-running command that can be gracefully shut down. Commands like sleep infinity cannot be gracefully shut down, and therefore the result of a diagnosis and rectification cannot be tested right away in the same pod lifecycle.
- Init-containers are a better choice than normal containers, because they start sequentially and run before the actual containers of the pod. This allows the pod environment to be tested manually before running the containers in the pod.
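To make these points concrete, here is a sketch of what such a pod spec could look like, written out by a small shell helper. Everything in it is an assumption for illustration: the pod name, both image names, and the /tmp/debug-done sentinel file are hypothetical. The idea is that the init container waits until the sentinel file exists and then exits with status 0, which lets the application container start in the same pod lifecycle.

```shell
# Write a pod spec that runs a debug init container before the application.
# All names and images below are hypothetical placeholders.
# $1 = output file for the pod spec.
write_debug_pod_spec() {
    cat > "$1" <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: my-app
spec:
  initContainers:
  - name: debug
    image: example/debug-tools:latest
    # Block until /tmp/debug-done exists, then exit 0 so the app container starts.
    command: ["sh", "-c", "until [ -f /tmp/debug-done ]; do sleep 1; done"]
  containers:
  - name: app
    image: example/my-app:latest
EOF
}

# Usage (not run here):
#   write_debug_pod_spec debug-pod.yaml
#   kubectl replace --force -f debug-pod.yaml
#   kubectl exec -it my-app -c debug -- sh
#   touch /tmp/debug-done    # run inside the debug shell when finished
```

Waiting on a sentinel file is just one way to get a long-running command that ends cleanly. Any command that blocks until you tell it to exit successfully would serve the same purpose.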
Here’s a debugging container with useful debugging tools that I found on Docker Hub.
This debug container contains a working shell, and therefore it is possible to log in to this container.
Since this container shares the same environment as the failing application container, it behaves exactly like the application container, except for any application-specific setup.
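To use this, update the pod spec to include the debug init container, and then kubectl replace the pod with the new pod spec. Because the list of init containers of an existing pod cannot be changed in place, this generally means kubectl replace --force, which deletes the pod and recreates it from the new spec. Once the new pod is up, you can log in to it using the following command: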
kubectl exec -it pod -c debug -- sh
Once inside the debug container, you can debug environment issues like the issue stated above. In the case of my issue, I attempted to connect to the endpoint from inside the debug container:
$ curl https://ec2.us-east-1.amazonaws.com
The command just hung for a while, so I killed it (Ctrl+C) and checked whether there was a route to the internet from this pod:
$ ping 126.96.36.199
PING 126.96.36.199 (126.96.36.199) 56(84) bytes of data.
64 bytes from 126.96.36.199: icmp_seq=1 ttl=59 time=15.9 ms
64 bytes from 126.96.36.199: icmp_seq=2 ttl=59 time=12.7 ms
64 bytes from 126.96.36.199: icmp_seq=3 ttl=59 time=13.9 ms
64 bytes from 126.96.36.199: icmp_seq=4 ttl=59 time=23.9 ms
64 bytes from 126.96.36.199: icmp_seq=5 ttl=59 time=12.9 ms
This showed that there was a route to the external world, but the curl command could not resolve the hostname. I suspected a DNS issue.
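The reasoning here (ping to an IP works, but a request by hostname hangs) can be turned into a quick check to run inside the debug container. This is a sketch that assumes the debug image ships ping and nslookup and that nslookup exits non-zero when resolution fails. The helper name diagnose_connectivity is mine.

```shell
# Distinguish "no route to the outside" from "DNS is broken".
# $1 = a public IP address known to answer ping
# $2 = the hostname the application fails to reach
diagnose_connectivity() {
    ip="$1"
    host="$2"
    if ! ping -c 3 "$ip" >/dev/null 2>&1; then
        echo "no route: cannot reach $ip"
    elif ! nslookup "$host" >/dev/null 2>&1; then
        echo "dns: cannot resolve $host"
    else
        echo "ok: route and DNS both work"
    fi
}

# Usage: diagnose_connectivity <public-ip> ec2.us-east-1.amazonaws.com
```

Looking at /etc/resolv.conf inside the pod is a useful companion step, since it shows which nameserver the pod is actually configured to ask.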
I checked the kube-dns pods in the kube-system namespace and they were not running. This was due to a configuration error, which was an easy fix. Once kube-dns came up, the pod started working as well.
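Checking the cluster DNS pods is quick enough that it is worth doing early. A minimal sketch, assuming the DNS pods carry the conventional k8s-app=kube-dns label. The helper name running_dns_pods is hypothetical.

```shell
# Count cluster DNS pods whose phase is Running.
# Assumes the DNS deployment uses the k8s-app=kube-dns label.
running_dns_pods() {
    kubectl get pods -n kube-system -l k8s-app=kube-dns \
        --field-selector=status.phase=Running -o name | grep -c .
}

# Usage:
#   running_dns_pods
#   kubectl get pods -n kube-system -l k8s-app=kube-dns -o wide
```

A count of zero points straight at DNS. A non-zero count is not proof of health, since a pod in the Running phase can still be failing its readiness checks, so the second command is there for a closer look.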
This debugging process seems really simple, but it took me almost a whole day. I hope this serves as a guide for others facing similar issues and helps resolve them sooner!
Stay tuned for Kubernetes deep dives!