Kubernetes Tip: How Do You Disambiguate A Pod Crash Between The Application And The Kubernetes Platform?
Recently, I was debugging two completely different Pod crash scenarios. In the first case, the Pod went into CrashLoopBackOff almost immediately after it was created, while in the other case the Pod was killed seemingly at random after running for a few days.
I spent about an hour determining the root cause of these crashes. In the first case, I was able to isolate the problem to the application, while the fix for the second crash was in the Kubernetes platform. So the ownership of the fix lay with a different team in each case.
This situation got me thinking about how quickly one can isolate Pod crashes, if not fix them, so as to assign them to the right team and reduce turnaround time. In other words, I was looking for a method or procedure to reduce MTTR (Mean Time To Resolution) in Pod crash scenarios.
Let us consider a simple example of each Pod crash scenario.
Application-Related Pod Crash.
In this simple example, an application-related Pod failure is created by putting an invalid command in the args field of the YAML file. The YAML file looks as in Figure-1. Notice that the args field has the invalid command /bin/eho instead of /bin/echo.
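The original YAML is in Figure-1; for readers following along without the figure, a minimal sketch of such a Pod might look like the following. The Pod name, container name, and image are illustrative assumptions, not the article's exact manifest:

```yaml
# A minimal sketch of a Pod whose command contains the typo /bin/eho.
# Names and image are assumptions; the article's actual YAML is in Figure-1.
apiVersion: v1
kind: Pod
metadata:
  name: pod-demo
spec:
  containers:
  - name: demo
    image: busybox
    command: ["/bin/eho"]   # typo: should be /bin/echo
    args: ["hello"]
```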
Let’s apply this YAML as in Figure-2.
Let’s get the exit code of the container as shown in Figure-3.
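For reference, on a live cluster the exit code in Figure-3 can also be pulled out with a jsonpath query instead of scanning the full describe output. The Pod name pod-demo is assumed here, and this of course requires a running cluster:

```shell
# Read the last terminated exit code of the first container in the Pod.
# (Assumes a Pod named pod-demo in the current namespace.)
kubectl get pod pod-demo \
  -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'

# The same value appears under "Last State: Terminated" in:
kubectl describe pod pod-demo
```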
What Do Container Exit Codes Mean?
Exit codes are numbers that represent how the container exited the last time it ran. They are similar to process exit codes.
In my case, the container runtime is Docker, and the exit codes seen here are Docker's exit codes. Reference-1 provides a list of Docker exit codes. For 127, it says the following:
Docker Exit Code: 127. if the contained command cannot be found
How cool! Now I know the problem is related to the command in the YAML file, or to something inside the application, and not to the Kubernetes platform. So I can hand the issue to the application team to fix.
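Container exit codes follow the same convention as any POSIX shell, so the 127 above can be reproduced locally without a cluster, assuming /bin/eho does not exist on your machine:

```shell
# Running a command that does not exist exits with 127 ("command not found").
sh -c '/bin/eho hello'   # typo of /bin/echo
echo $?                  # prints 127
```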
Let’s consider another example where the problem is related to the Kubernetes Platform.
Kubernetes-Related Pod Crash.
In this example, a Kubernetes-related Pod failure is created for a normal Pod. The YAML for the Pod is as in Figure-4. Figure-5 shows that the Pod is running normally.
I ssh'ed into the node where the Pod runs, ran docker kill on the pod-demo container, and looked at the exit code, as in Figure-6:
According to Reference-2,
Docker Exit Code: 137, Container received SIGKILL where SIGKILL could be due to OOMKill or random kill.
This is awesome. Now I know that the container received a SIGKILL, which could be due to an OOMKill, an eviction, or an accidental kill by a user. Problem isolation can now start from the Kubernetes platform instead of the application.
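The 137 is not arbitrary: exit codes of 128+N mean the process was terminated by signal N, and SIGKILL is signal 9, so 128 + 9 = 137. This too can be reproduced in a plain shell:

```shell
# A process killed by signal N reports exit code 128 + N.
sh -c 'kill -9 $$'   # the child shell SIGKILLs itself (signal 9)
echo $?              # prints 137
```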
The Fun Part.
What does the exit code look like when the Pod completes normally? Figure-7 captures this scenario.
What does docker exit code 0 mean?
According to Reference-3,
Exit Code 0 does not mean something is actually bad; it means the container does not have any foreground process attached to it. This occurs normally when a Docker container has completed its job.
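Again, this matches ordinary process semantics, which a quick local check confirms:

```shell
# A process that finishes its work successfully exits 0.
sh -c '/bin/echo done'
echo $?              # prints 0
```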
The rule of thumb is to look at the exit code when the Pod crashes. Based on the exit code, one can disambiguate the problem to either the application or the Kubernetes platform using the following rules.
Exit Codes 1 & 2: These errors are due to the application. Start the RCA from the application.
Exit Codes 126 & 127: Start by looking at the Dockerfiles and YAML files, and possibly the application.
Exit Codes 128 & above: Start debugging from the Kubernetes platform perspective and work up the stack.
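The rules above can be sketched as a small shell helper. This is a hypothetical triage function of my own, not a standard tool; the function name and messages are illustrative:

```shell
# Hypothetical triage helper encoding the rules of thumb above.
triage() {
  code="$1"
  case "$code" in
    0)       echo "normal completion - nothing to fix" ;;
    1|2)     echo "application - start RCA in the app" ;;
    126|127) echo "check Dockerfile, YAML, then the application" ;;
    *)
      if [ "$code" -ge 128 ]; then
        echo "platform - killed by signal $(( code - 128 ))"
      else
        echo "application - other error"
      fi ;;
  esac
}

triage 127   # check Dockerfile, YAML, then the application
triage 137   # platform - killed by signal 9
```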