Fixing a crashed container on OpenShift
This article walks through how I fixed a pod that did not start because of a configuration error in the command to run. There are certainly other ways to diagnose the cause of the problem, but I’m sharing this one since it worked for me.
Lately, I’ve been trying out the NATS Streaming Server to learn more about its support for durable subscriptions. Since the streaming server couldn’t be deployed on OpenShift with an operator yet, I opted for a regular deployment using the existing Docker image.
The Deployment and Service manifests that I used looked like this:
$ cat nats-streaming-server.yaml
---
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: nats-streaming-server
spec:
  replicas: 1
  template:
    metadata:
      labels:
        name: nats-streaming-server
    spec:
      containers:
      - name: "nats-streaming-server"
        image: "nats-streaming:0.9.2"
        command: ["nats-streaming-server"]
        args: ["-m","8222"]
---
apiVersion: v1
kind: Service
metadata:
  labels:
    name: nats-streaming-server
  name: nats-streaming-server
spec:
  ports:
  - name: client
    port: 4222
    protocol: TCP
    targetPort: 4222
  - name: mgmt
    port: 8222
    protocol: TCP
    targetPort: 8222
  selector:
    name: nats-streaming-server
  type: ClusterIP
Note that I needed to specify the `command` and `args` values in the `spec.template.spec.containers` element to enable the management port, so I could later monitor the activity on the platform, but that’s another story.
After running the `oc apply` command, I checked the pods and…
$ oc get pods
NAME READY STATUS
nats-streaming-server-6d56df9445-tjwdw 0/1 RunContainerError
Ahem…
Ok then, time to look at what’s happening with this pod. First, let’s see if the pod’s events can provide us with some information:
$ oc describe pod/nats-streaming-server-6d56df9445-tjwdw
Name: nats-streaming-server-6d56df9445-tjwdw
...
Events:
Type Reason Message
---- ------ -------
Normal Scheduled Successfully assigned nats-streaming-server-6d56df9445-tjwdw to localhost
Normal SuccessfulMountVolume MountVolume.SetUp succeeded for volume "default-token-bvdmj"
Normal Pulled Container image "nats-streaming:0.9.2" already present on machine
Normal Created Created container
Warning Failed Error: failed to start container "nats-streaming-server": executable not found in $PATH
Warning BackOff Back-off restarting failed container
So here is the problem: failed to start container "nats-streaming-server": executable not found in $PATH
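When the event list gets long, it can help to keep only the warnings. Here is a minimal sketch, assuming the events are printed in the same layout as above, with the event type in the first column. `warning_events` is a hypothetical helper name, not part of `oc`.

```shell
# Hypothetical helper: keep only the lines whose first column is "Warning".
# Assumes the event table layout printed by `oc describe pod`.
warning_events() {
  awk '$1 == "Warning"'
}

# Usage (requires a logged-in `oc` client):
#   oc describe pod/nats-streaming-server-6d56df9445-tjwdw | warning_events
```

Filtering on the first column rather than grepping for the word "Warning" avoids matching event messages that merely contain that word.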
The `command` value `["nats-streaming-server"]` in the deployment manifest is somehow wrong, but what should it be, then? Can we just `oc rsh` into the container to compare the value of `$PATH` with the path to `nats-streaming-server`?
$ oc rsh nats-streaming-server-6d56df9445-tjwdw
error: unable to upgrade connection: container not found ("nats-streaming-server")
Nope.
`docker export` to the rescue
The `docker export` command exports a container’s filesystem as a tar file on the host, so it’s easy to check the content afterwards. But first, the CLI needs to be configured so that the `docker` command targets the Docker daemon running on OpenShift (well, here, Minishift):
# configure the docker environment variables
$ eval $(minishift docker-env)

# retrieve the id of the nats-streaming-server image
$ docker images | grep nats-streaming
nats-streaming 0.9.2 bf688abfd477 8 weeks ago 10.7MB

# retrieve the id of the container running this
# 'nats-streaming' image
$ docker ps -a | grep bf688abfd477
c55940575b59 bf688abfd477 "nats-streaming-serv…" Created

# export the content of the container in a tar file
$ docker export -o nats-streaming.tar c55940575b59

# inspect the content of the tar file
$ tar tvf nats-streaming.tar | grep nats-streaming-server
-rwxrwxr-x 10725344 Apr 3 21:40 nats-streaming-server
Note that the container is in a `Created` state: it was never started, which explains why the `oc rsh` command tried earlier could not work.
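The state can also be read directly from the container metadata instead of the `docker ps` table. This is a small sketch, assuming the `docker` CLI still points at the Minishift daemon. `container_state` is a hypothetical wrapper name introduced here for illustration.

```shell
# Hypothetical wrapper: print the lifecycle state of a container
# (for example "created", "running" or "exited").
container_state() {
  docker inspect -f '{{ .State.Status }}' "$1"
}

# Usage (the id comes from the `docker ps -a` output):
#   container_state c55940575b59
```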
On the one hand, the archive contains the expected `nats-streaming-server` binary, but it is located at the root of the filesystem. On the other hand, the value of the `PATH` environment variable can be found by inspecting the container, in particular the `Config.Env` value (an array of strings):
$ docker inspect c55940575b59 -f '{{ range $index, $env := .Config.Env }}{{ println $env }}{{ end }}' | grep "PATH="
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
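The comparison between the binary's location and the `PATH` value can be done by eye, but it is also easy to script. The sketch below is only an illustration: `dir_in_path` is a hypothetical helper that takes an absolute file path and a `PATH`-style string and reports whether the file's directory is one of the listed entries.

```shell
# Hypothetical helper: succeed (exit status 0) if the directory containing
# the given file is one of the colon-separated entries of the given PATH value.
dir_in_path() {
  file=$1
  path_value=$2
  dir=$(dirname "$file")
  # Surround both sides with ':' so that only whole entries can match.
  case ":$path_value:" in
    *":$dir:"*) return 0 ;;
    *) return 1 ;;
  esac
}

# Usage, with the values found above:
#   container_path=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
#   dir_in_path /nats-streaming-server "$container_path" || echo "not in PATH"
```

The file's directory here is `/`, which does not appear in the list, so the check fails for a binary sitting at the root of the filesystem.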
So that’s it! The `nats-streaming-server` file at the root of the container is not in `$PATH`. It’s now just a matter of changing the value of the `command` element to `["/nats-streaming-server"]` in the deployment manifest and applying it again, and the new pod is now running \o/
$ oc apply -f openshift/nats-streaming-deployment.yaml
deployment "nats-streaming-server" configured
service "nats-streaming-server" unchanged

$ oc get pods
NAME READY STATUS
nats-streaming-server-66d45c8746-wwf7l 1/1 Running
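A possible shortcut for next time: in a pod spec, `command` replaces the image's `ENTRYPOINT` and `args` replaces its `CMD`, so looking at what the image declares should reveal the path to use. The sketch below assumes the `docker` CLI can see the image. `image_entrypoint` is a hypothetical wrapper name, and I have not included its output since I did not capture it at the time.

```shell
# Hypothetical wrapper: print the ENTRYPOINT and CMD declared by an image,
# as two JSON values separated by a space.
image_entrypoint() {
  docker inspect -f '{{ json .Config.Entrypoint }} {{ json .Config.Cmd }}' "$1"
}

# Usage:
#   image_entrypoint nats-streaming:0.9.2
```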