Kubernetes Tip: How To Disambiguate A Pod Crash To Application Or To Kubernetes Platform?

Recently, I was debugging two completely different Pod crash scenarios. In the first case, the Pod went into CrashLoopBackOff almost immediately after it was created, while in the other case the Pod was killed at random after running for a few days.

I spent about an hour determining the root cause of these crashes. In the first case, I was able to isolate the problem to the application, while the fix for the second crash was in the Kubernetes platform. So, the ownership of the fix lay with different teams for these crashes.

This situation got me thinking about how quickly one can isolate a Pod crash, even without fixing it, so that it can be assigned to the right team for a reduced turnaround time. In other words, I was looking for a procedure to reduce MTTR (Mean Time To Resolution) in Pod crash scenarios.

Let us consider simple examples to isolate the Pod crash scenarios.

Application Related Pod Crash.

In this simple example, an application-related Pod failure is created by putting an invalid command in the YAML file. The YAML file is shown in Figure-1. Notice that the argument field has the invalid command /bin/eho instead of /bin/echo.

Figure-1: YAML Having Incorrect CMD.

Let’s apply this YAML as in Figure-2.

Figure-2: Pod in CrashLoopBackOff State.

Let’s get the exit code of the container as shown in Figure-3.

Figure-3: Exit Code Of The Container Is 127.
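
If you would rather pull the exit code out programmatically than read it off the screen, here is a minimal sketch. It assumes you have the Pod's JSON (for example from kubectl get pod -o json) and reads the exitCode field from the container's terminated state. The sample JSON and the container name demo are made up for illustration and heavily trimmed; the function name container_exit_codes is my own.

```python
import json

# Hypothetical, heavily trimmed sample of a Pod's JSON status.
# The container name "demo" is an assumption for illustration.
SAMPLE_POD_JSON = """
{
  "status": {
    "containerStatuses": [
      {
        "name": "demo",
        "state": {"waiting": {"reason": "CrashLoopBackOff"}},
        "lastState": {"terminated": {"exitCode": 127}}
      }
    ]
  }
}
"""


def container_exit_codes(pod: dict) -> dict:
    """Map each container name to its most recent exit code.

    A container that is currently terminated reports the code under
    state.terminated; one that has been restarted (e.g. CrashLoopBackOff)
    reports the previous run under lastState.terminated.
    """
    codes = {}
    for cs in pod.get("status", {}).get("containerStatuses", []):
        terminated = (cs.get("state", {}).get("terminated")
                      or cs.get("lastState", {}).get("terminated"))
        codes[cs["name"]] = terminated["exitCode"] if terminated else None
    return codes


print(container_exit_codes(json.loads(SAMPLE_POD_JSON)))  # {'demo': 127}
```

A container that has never terminated maps to None, so a healthy Pod does not get mistaken for one that exited with 0.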

What Do Container Exit Codes Mean?

Exit codes are numbers that represent how the container exited the last time it ran. They are similar to process exit codes.

In my case, the container runtime is Docker, and the exit codes seen here are Docker’s exit codes. Reference-1 provides a list of Docker exit codes. For 127, it says the following:

Docker Exit Code: 127. if the contained command cannot be found

How cool! Now I know the problem is related to the command in the YAML file or to something inside the application, and is not a problem with the Kubernetes platform. So, I can hand it over to the application team to fix the issue.
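
The 127 convention is not specific to Docker; POSIX shells report it too when a command cannot be found. The sketch below reproduces it outside of any container, assuming a POSIX system with /bin/sh and /bin/echo available. The helper name run_in_shell is my own, and this only imitates the failure from Figure-1 rather than running the actual Pod.

```python
import subprocess


def run_in_shell(command: str) -> int:
    """Run a command line through /bin/sh and return its exit code."""
    result = subprocess.run(
        ["/bin/sh", "-c", command],
        stdout=subprocess.DEVNULL,  # output is irrelevant here, only the code matters
        stderr=subprocess.DEVNULL,
    )
    return result.returncode


print(run_in_shell("/bin/echo hello"))  # valid command
print(run_in_shell("/bin/eho hello"))   # misspelled command, as in the YAML above
```

On a POSIX shell the first call returns 0 and the second returns 127, which is the same signal to look at the command line before looking at the cluster.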

Let’s consider another example where the problem is related to the Kubernetes Platform.

Kubernetes Related Pod Crash.

In this example, a Kubernetes-related Pod failure is simulated on a normal Pod. The YAML for the Pod is shown in Figure-4. Figure-5 shows that the Pod is running normally.

Figure-4: YAML of Pod
Figure-5: Demo Pod Running Normally.

I ssh’ed into the node where the Pod runs, did a docker kill on the pod-demo container, and looked at the exit code as in Figure-6:

Figure-6: Exit Code Of Container Is 137.

According to Reference-2,

Docker Exit Code: 137, Container received SIGKILL where SIGKILL could be due to OOMKill or random kill.

This is awesome. Now I know that the container received SIGKILL, which could be due to an OOM kill, an eviction, or an accidental kill by a user. Problem isolation can now start from the Kubernetes platform instead of the application.
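
Why 137? By the usual Unix convention, a process killed by signal N is reported with exit code 128 + N, and SIGKILL is signal 9, so 128 + 9 = 137. Here is a small sketch that decodes such codes using Python's standard signal module. The function name describe_exit_code is my own, and signal numbers other than the common ones can differ between operating systems.

```python
import signal


def describe_exit_code(code: int) -> str:
    """Describe an exit code, decoding 128 + N as 'killed by signal N'."""
    if code > 128:
        signum = code - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # No signal with this number on the current platform.
            return f"exit code {code} (no known signal {signum})"
        return f"killed by {name} (signal {signum})"
    return f"exited with status {code}"


for code in (0, 127, 137, 143):
    print(code, "->", describe_exit_code(code))
```

The same arithmetic explains 143, which you will often see when a container is stopped gracefully: 128 + 15, where 15 is SIGTERM.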

The Fun Part.

What does the exit code look like when the Pod completes normally? Figure-7 captures this scenario.

Figure-7: Fun Pod Having Exit Code As 0

What does docker exit code 0 mean?

According to Reference-3

Exit Code 0 does not mean that something is actually wrong; it means that the container does not have any foreground process attached to it. This normally occurs when the Docker container has completed its job.

Recommendation.

The rule of thumb is to look at the exit code when the Pod crashes. Based on the exit code, I would attribute the problem to either the application or the Kubernetes platform using the following rules.

Exit Code 1 & 2: These errors are due to the application. I would start the RCA from the application.

Exit Code 126 & 127: I would start by looking at the Dockerfiles and YAML files, and possibly the application.

Exit Code 128 & above: I would start debugging from the Kubernetes platform perspective and go up the stack.
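
The rules above are simple enough to write down as a small lookup. This is only a sketch of my rule of thumb, not an official mapping. The function name triage_owner and the returned labels are my own, and codes that the rules do not cover are deliberately left unclassified.

```python
def triage_owner(exit_code: int) -> str:
    """Suggest where to start the RCA, following the rule of thumb above."""
    if exit_code == 0:
        return "none (container completed normally)"
    if exit_code in (1, 2):
        return "application"
    if exit_code in (126, 127):
        return "Dockerfile/YAML, then application"
    if exit_code >= 128:
        return "Kubernetes platform"
    # Codes 3..125 are application-defined and not covered by the rules above.
    return "unclassified"


for code in (0, 1, 127, 137):
    print(code, "->", triage_owner(code))
```

With the two examples from this article, 127 points at the Dockerfile/YAML and 137 points at the Kubernetes platform, which matches how the two crashes were actually assigned.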

References.

Reference-1: https://docs.docker.com/engine/reference/run/#exit-status

Reference-2: https://stackoverflow.com/questions/43803610/where-to-find-more-explicit-errors-given-container-error-status-codes

Reference-3: https://medium.com/better-programming/understanding-docker-container-exit-codes-5ee79a1d58f6
