Fixing the Dreaded ClickHouse Crash Loop on Kubernetes
As countless ClickHouse users have learned, Kubernetes is a great platform for data. It’s portable to almost every IT environment. Managed Kubernetes services like Amazon EKS simplify operation. And the Altinity Kubernetes Operator for ClickHouse lets you start complex ClickHouse clusters from a single resource file.
But there’s still the occasional dark cloud. One of these is pod crash loops, which occur when a ClickHouse pod crashes on startup. Here’s an example that shows a pod crash loop in progress.
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
chi-crash-demo-ch-0-0-0 0/1 CrashLoopBackOff 4 (19s ago) 116s
chi-ok-demo-ch-0-0-0 1/1 Running 0 2m4s
Pod crash loops often arise because of ClickHouse misconfiguration or version upgrade issues. They are generally straightforward to fix once you know the cause. But how can you figure out what’s happening? This blog article walks through the steps to diagnose and fix crash loop problems.
Dawn of a crash loop
The examples I’m about to provide use Kubernetes 1.22 running on Minikube, ClickHouse 21.8.11.1 (Altinity Stable build), and Altinity Operator for ClickHouse version 0.18.3. To keep things simple, storage definitions, AZ assignments, and other niceties are omitted.
OK, let’s create a healthy ClickHouse server. We start with a very simple resource definition, which we’ll store in the file crash.yaml.
apiVersion: "clickhouse.altinity.com/v1"
kind: "ClickHouseInstallation"
metadata:
  name: "crash-demo"
spec:
  configuration:
    clusters:
      - name: "ch"
        templates:
          podTemplate: clickhouse-stable
  templates:
    podTemplates:
      - name: clickhouse-stable
        spec:
          containers:
            - name: clickhouse
              image: altinity/clickhouse-server:21.8.11.1.altinitystable
We apply the definition and have a look at the resulting pod.
$ kubectl apply -f crash.yaml
clickhouseinstallation.clickhouse.altinity.com/crash-demo created
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
chi-crash-demo-ch-0-0-0 1/1 Running 0 10s
Everything is healthy so far. Now let’s break it by adding a bad configuration file into the resource definition. Here’s the new definition with the offending text highlighted.
apiVersion: "clickhouse.altinity.com/v1"
kind: "ClickHouseInstallation"
metadata:
  name: "crash-demo"
spec:
  configuration:
    clusters:
      - name: "ch"
        templates:
          podTemplate: clickhouse-stable
    files:
      badconfig.xml: |
        <yandex>
          <foo><baz>
        </yandex>
  templates:
    podTemplates:
      - name: clickhouse-stable
        spec:
          containers:
            - name: clickhouse
              image: altinity/clickhouse-server:21.8.11.1.altinitystable
Let’s apply and see what happens. Note: You may have to wait a minute or two before ClickHouse picks up the change and the pod restarts.
$ kubectl apply -f crash.yaml
clickhouseinstallation.clickhouse.altinity.com/crash-demo configured
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
chi-crash-demo-ch-0-0-0 0/1 CrashLoopBackOff 1 (12s ago) 15s
Boom! We have just created a pod crash loop. The pod will continuously restart until we figure out what is wrong and fix it.
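When a namespace holds many pods, it can help to filter the listing down to the ones that are actually crash looping. Here is a minimal sketch of one way to do that. The function name `crashlooping_pods` is made up for illustration, and the filter assumes the default column layout of `kubectl get pods` (NAME, READY, STATUS, RESTARTS, AGE).

```shell
# Print the names of pods whose STATUS column is CrashLoopBackOff.
# Assumes the default column order: NAME READY STATUS RESTARTS AGE.
crashlooping_pods() {
  kubectl get pods --no-headers | awk '$3 == "CrashLoopBackOff" { print $1 }'
}
```

Run against the pods in this example, `crashlooping_pods` should print only chi-crash-demo-ch-0-0-0.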
Getting to the root cause
We now have a broken pod to play with. How do we figure out what happened? Let’s go through the steps in order.
Check pod events using `kubectl describe`
The `kubectl describe` command shows you configuration data and events related to a pod. This should be your first stop if a pod is not coming up for any reason, including pod crash loops. Here is an example that describes our pod.
$ kubectl describe pod/chi-crash-demo-ch-0-0-0
Name: chi-crash-demo-ch-0-0-0
Namespace: default
. . .
(Lots of configuration information)
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 4m28s default-scheduler Successfully assigned default/chi-crash-demo-ch-0-0-0 to logos2
Normal Pulled 3m44s (x4 over 4m27s) kubelet Container image "altinity/clickhouse-server:21.8.11.1.altinitystable" already present on machine
Normal Created 3m44s (x4 over 4m27s) kubelet Created container clickhouse
Normal Started 3m44s (x4 over 4m27s) kubelet Started container clickhouse
Warning BackOff 3m13s (x13 over 4m25s) kubelet Back-off restarting failed container
If you made a simple configuration mistake, such as picking a bad pod name, you’ll see it here. The event output also prints useful messages if you can’t allocate storage or don’t have enough resources to schedule the pod, e.g., insufficient memory or CPU. If you see a problem, correct the resource file and apply it again with kubectl.
In this case there’s nothing useful in the message, so we’ll have to proceed to the next step.
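If the configuration dump from `kubectl describe` gets in the way, the events can also be pulled on their own. The following is a sketch rather than a required step. The helper name `pod_events` is hypothetical, and it assumes the events for the pod carry a `lastTimestamp` value to sort on.

```shell
# Show only the events that refer to one pod, sorted by last-seen time.
# Usage: pod_events chi-crash-demo-ch-0-0-0
pod_events() {
  kubectl get events \
    --field-selector "involvedObject.name=$1" \
    --sort-by=.lastTimestamp
}
```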
Check pod logs with `kubectl logs`
The ClickHouse pod may be crashing, but that does not mean we can’t see the logs from outside. Our next step is to use the `kubectl logs` command, which will show messages as ClickHouse starts. Here’s an example.
$ kubectl logs pod/chi-crash-demo-ch-0-0-0
Poco::Exception. Code: 1000, e.code() = 0, e.displayText() = Exception: Failed to merge config with '/etc/clickhouse-server/config.d/badconfig.xml': SAXParseException: Tag mismatch in '/etc/clickhouse-server/config.d/badconfig.xml', line 3 column 2 (version 21.8.11.1.altinitystable (altinity build))
. . .
In this case, the logs show us a useful message right away. There is something wrong with file badconfig.xml, so we now know where to look. We can fix the resource definition and apply it with kubectl.
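One wrinkle with crash loops: while the pod is backing off, the current container may not have written anything yet. The `--previous` flag of `kubectl logs` asks for the log of the prior, terminated container instead. A small sketch follows; the wrapper name `previous_logs` and the default of 50 lines are arbitrary choices for illustration.

```shell
# Print the tail of the log from the previous (crashed) container instance.
# Usage: previous_logs chi-crash-demo-ch-0-0-0 [lines]
previous_logs() {
  kubectl logs "pod/$1" --previous --tail="${2:-50}"
}
```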
Change the pod entry point and debug using `kubectl exec`
In some cases, it’s not enough to see logs to figure out what’s going on. We need to get into the pod and go mano-a-mano with ClickHouse. The key to debugging the pod is to make it come up but not run ClickHouse.
To make the pod come up and halt, we need to make two simple changes to the ClickHouse configuration. We’ll change the entrypoint to run a sleep command and we’ll alter the liveness probe. Here’s the configuration file, with changes commented and highlighted.
apiVersion: "clickhouse.altinity.com/v1"
kind: "ClickHouseInstallation"
metadata:
  name: "crash-demo"
spec:
  configuration:
    clusters:
      - name: "ch"
        templates:
          podTemplate: clickhouse-stable
    files:
      badconfig.xml: |
        <yandex>
          <foo><baz>
        </yandex>
  templates:
    podTemplates:
      - name: clickhouse-stable
        spec:
          containers:
            - name: clickhouse
              image: altinity/clickhouse-server:21.8.11.1.altinitystable
              # Add command to bring up pod and stop.
              command:
                - "/bin/bash"
                - "-c"
                - "sleep 9999999"
              # Fix liveness probe so that we won't look for ClickHouse.
              livenessProbe:
                exec:
                  command:
                    - ls
                initialDelaySeconds: 5
                periodSeconds: 5
Apply the updated file and wait until the pod starts successfully. Again, this might take a couple of minutes.
$ kubectl apply -f crash.yaml
clickhouseinstallation.clickhouse.altinity.com/crash-demo configured
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
chi-crash-demo-ch-0-0-0 0/1 Running 0 5s
OK, success! We are ready to connect and figure out what’s wrong. But first, one question: why did we alter the liveness probe? It’s an important trick.
The liveness probe is used by Kubernetes to check whether the pod is working or not. For ClickHouse, clickhouse-operator configures a liveness probe to run an HTTP GET against the ClickHouse /ping URL. If the liveness probe fails (and it will, because ClickHouse can’t start), Kubernetes will eventually notice and restart the pod. That’s a little disappointing if you are right in the middle of debugging problems.
Now that the pod is waiting patiently, let’s use kubectl exec to get in and see what’s going on. We enter the following command to get to the bash prompt.
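To confirm which probe a pod is actually running with, you can read it back from the pod spec. This is a sketch under one assumption: that the ClickHouse container is the first container in the pod, which holds for the single-container definition used here. The function name `show_liveness_probe` is hypothetical.

```shell
# Print the liveness probe definition of the first container in a pod.
# Usage: show_liveness_probe chi-crash-demo-ch-0-0-0
show_liveness_probe() {
  kubectl get pod "$1" -o jsonpath='{.spec.containers[0].livenessProbe}'
}
```

With the debugging definition applied, the output should show the `exec` probe running `ls` rather than an HTTP check.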
$ kubectl exec -it chi-crash-demo-ch-0-0-0 -- bash
root@chi-crash-demo-ch-0-0-0:/#
Cool, we’re in and can start poking around to diagnose the problem. The simplest way is to start ClickHouse manually and see what happens. Here’s what we see.
# clickhouse-server -C /etc/clickhouse-server/config.xml
Processing configuration file '/etc/clickhouse-server/config.xml'.
Merging configuration file '/etc/clickhouse-server/conf.d/chop-generated-macros.xml'.
Merging configuration file '/etc/clickhouse-server/config.d/01-clickhouse-01-listen.xml'.
Merging configuration file '/etc/clickhouse-server/config.d/01-clickhouse-02-logger.xml'.
Merging configuration file '/etc/clickhouse-server/config.d/01-clickhouse-03-query_log.xml'.
Merging configuration file '/etc/clickhouse-server/config.d/01-clickhouse-04-part_log.xml'.
Merging configuration file '/etc/clickhouse-server/config.d/badconfig.xml'.
Poco::Exception. Code: 1000, e.code() = 0, e.displayText() = Exception: Failed to merge config with '/etc/clickhouse-server/config.d/badconfig.xml': SAXParseException: Tag mismatch in '/etc/clickhouse-server/config.d/badconfig.xml', line 3 column 2, Stack trace (when copying this message, always include the lines below):
...
We now see what’s wrong. The bad configuration file we inserted is biting us, just as we saw from looking at logs using kubectl.
Advanced debugging
In some cases we might still not understand why ClickHouse is crashing. Networking issues are a common reason why further work is necessary. In this case you may need more debugging tools than are available on the stripped-down container that runs ClickHouse.
You can get additional packages using `apt install`. For example, say you need the ping command to diagnose network connectivity. Here’s how to get it.
$ kubectl exec -it chi-crash-demo-ch-0-0-0 -- bash
root@chi-crash-demo-ch-0-0-0:/# apt update
. . .
root@chi-crash-demo-ch-0-0-0:/# apt install iputils-ping
Reading package lists... Done
Building dependency tree
Reading state information... Done
The following additional packages will be installed:
libcap2 libcap2-bin libpam-cap
The following NEW packages will be installed:
iputils-ping libcap2 libcap2-bin libpam-cap
0 upgraded, 4 newly installed, 0 to remove and 21 not upgraded.
Need to get 90.5 kB of archives.
After this operation, 333 kB of additional disk space will be used.
Do you want to continue? [Y/n] Y
. . .
root@chi-ok-demo-ch-0-0-0:/# ping www.yahoo.com
PING new-fp-shed.wg1.b.yahoo.com (98.137.11.163) 56(84) bytes of data.
64 bytes from media-router-fp74.prod.media.vip.gq1.yahoo.com (98.137.11.163): icmp_seq=1 ttl=48 time=31.2 ms
Bear in mind that any tools you install will disappear when the pod restarts. This is a feature, not a bug; you don’t have to worry about cleaning up the debris left from diagnosing problems.
Fixing the ClickHouse pod
Depending on the crash loop cause there may be different fixes. Here are three cases we often see and how to fix them.
1. Configuration file error. Fix the configuration in the Kubernetes resource definition and apply using kubectl. Don’t fix configuration issues on the file system. Your fixes will disappear when the pod restarts.
2. Bad SQL file after upgrade. Sometimes old table definitions have SQL that is no longer supported in a new ClickHouse version. If the table is not needed, you can fix it by moving the table definition file out of /var/lib/clickhouse/metadata/ to /var/lib/clickhouse. ClickHouse then won’t see it when trying to boot. (Do this fix using a `kubectl exec` session; it will persist when ClickHouse restarts.)
3. ClickHouse bad upgrade. You have upgraded to a bad version of ClickHouse, for whatever reason. This is rare but happens. Set the version number in the Kubernetes resource definition back to the previous working version and apply using kubectl.
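For the bad SQL file case, the move itself is a one-liner, but a little checking avoids typos on a production system. The sketch below is meant to be run inside the pod from a `kubectl exec` session. It assumes that table definitions live at metadata/<database>/<table>.sql under the ClickHouse data directory; the function name `park_table_metadata` and the optional third argument are made up for illustration.

```shell
# Move one table's definition file out of the metadata tree so that
# ClickHouse skips it on startup. Run inside the pod.
# Usage: park_table_metadata <database> <table> [clickhouse_data_dir]
park_table_metadata() {
  src="${3:-/var/lib/clickhouse}/metadata/$1/$2.sql"
  if [ ! -f "$src" ]; then
    echo "no such metadata file: $src" >&2
    return 1
  fi
  # The file is kept, not deleted, so it can be moved back later.
  mv "$src" "${3:-/var/lib/clickhouse}/"
}
```

As noted above, only do this for tables you do not need; the data stays on disk, but ClickHouse will no longer load the table.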
In our example, the root cause is case 1. We can exit the pod (if we’re using kubectl exec) and go back to the resource definition. Let’s comment out the bad configuration file plus the commands used to halt the pod on startup. Here’s the new file.
apiVersion: "clickhouse.altinity.com/v1"
kind: "ClickHouseInstallation"
metadata:
  name: "crash-demo"
spec:
  configuration:
    clusters:
      - name: "ch"
        templates:
          podTemplate: clickhouse-stable
    # files:
    #   badconfig.xml: |
    #     <yandex>
    #       <foo><baz>
    #     </yandex>
  templates:
    podTemplates:
      - name: clickhouse-stable
        spec:
          containers:
            - name: clickhouse
              image: altinity/clickhouse-server:21.8.11.1.altinitystable
              # # Add command to bring up pod and stop.
              # command:
              #   - "/bin/bash"
              #   - "-c"
              #   - "sleep 9999999"
              # # Fix liveness probe so that we won't look for ClickHouse.
              # livenessProbe:
              #   exec:
              #     command:
              #       - ls
              #   initialDelaySeconds: 5
              #   periodSeconds: 5
Let’s apply it and see what happens. Once again, you may need to wait a minute or two for the pod to restart.
$ kubectl apply -f crash.yaml
clickhouseinstallation.clickhouse.altinity.com/crash-demo configured
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
chi-crash-demo-ch-0-0-0 1/1 Running 0 24s
The configuration file does not look very pretty with the extra comments, but no matter. ClickHouse is up and applications are back online. We can clean up at leisure.
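A Running status is a good sign, but it is reassuring to see ClickHouse answer a query as well. Here is a quick sketch of such a check. It assumes that clickhouse-client is available in the server image and that the default user can connect without a password; the function name `clickhouse_version` is hypothetical.

```shell
# Ask the server for its version through clickhouse-client inside the pod.
# Usage: clickhouse_version chi-crash-demo-ch-0-0-0
clickhouse_version() {
  kubectl exec "$1" -- clickhouse-client --query "SELECT version()"
}
```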
Conclusion
Pod crash loops are rare but they do arise from time to time. The fact that pods are “closed” can make debugging difficult the first time it happens to you, especially on a production system. Fortunately, there are plenty of tools, and you can often diagnose issues without connecting directly to the pod using `kubectl exec`. I hope this blog article will help you the next time you run into the problem.
At Altinity we are huge fans of running ClickHouse on Kubernetes. Don’t hesitate to contact us if you have further questions. You can use the Contact Us form, join our Slack Workspace, or post issues on the Altinity Operator for ClickHouse project on GitHub. It’s open source and we love to help users as well as make the code even better. See you soon!
Originally published on the Altinity Blog on March 14, 2022.