Troubleshooting Kubernetes Cluster Autoscaler: Why Scaling Up to 600 Pods Took an Hour
I want to share my experience troubleshooting a k8s autoscaling scale-out issue. It was rather fun. Read on if you want to follow the journey, and feel free to give me any feedback.
A k8s cluster is meant to scale on demand for cost-effectiveness. I learned that computing resources may be unlimited in the cloud, but balancing cost and effectiveness is an ongoing challenge.
A tenant started testing their service in the cluster. It is a legacy Java app containerized from the on-prem version. It is memory-intensive and needs to scale from 3 replicas to 600 during the peak hour. During testing, we found the scale-up takes more than an hour, which is NOT an acceptable lead time.
We saw the following events happening often using the command below (a filtered variant is shown after the list):
$ kubectl get events -w
- When scaling to 600, it takes almost 40 minutes to download the image; in contrast, when scaling to just 1, it takes 4 minutes.
- Error messages related to disk space: “no space left on device”
- Pods are evicted because “the node was low on resource: ephemeral-storage”
- “ErrImagePull” keeps showing up
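To cut through the noise while reproducing this, a filtered watch on Warning events helps (standard kubectl selectors, nothing cluster-specific):
$ kubectl get events -A --field-selector type=Warning -w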
When scaling up to 600, the Cluster Autoscaler kicks in and deploys 12 more worker nodes to satisfy the per-pod CPU requests.
Check for the obvious (sample commands follow the list):
- Is metrics-server running?
- Is cluster-autoscaler running?
- Is HPA working?
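A quick way to run those checks (the deployment names below assume a typical kube-system install and may differ in your cluster):
$ kubectl -n kube-system get deploy metrics-server cluster-autoscaler
$ kubectl -n kube-system logs deploy/cluster-autoscaler --tail=20
$ kubectl get hpa -A
$ kubectl top nodes        # fails if metrics-server is not serving metrics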
Questions, but no answers…
Are these events related to node startup? Why are we running out of disk space? Why does it take so much longer to pull an image?
Test #1
Let’s see if the Cluster Autoscaler is faulty; test it with a different deployment using a busybox image.
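The test looked roughly like this (a sketch, not our exact manifest; the name and the CPU request are made up, and the request only needs to be large enough to force new nodes):
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ca-smoke-test
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ca-smoke-test
  template:
    metadata:
      labels:
        app: ca-smoke-test
    spec:
      containers:
      - name: busybox
        image: busybox
        command: ["sleep", "3600"]
        resources:
          requests:
            cpu: "500m"
Scale it out and watch whether new nodes come up:
$ kubectl apply -f ca-smoke-test.yaml
$ kubectl scale deployment ca-smoke-test --replicas=100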
The Cluster Autoscaler tested out fine.
I noticed there is a tendency to look for faults in commonly used tools such as CoreDNS, Metrics Server, and Cluster Autoscaler. While that is possible, it is not very likely, as our clusters run the stable versions of these tools. I would not look for bugs in them unless I had exhausted all the configurations and options.
Test #2
Our clusters segregate ingress nodes from application nodes to provide isolation for higher security. So could it be that the Cluster Autoscaler can’t discern these 2 node groups?
How about testing this theory by running 2 Cluster Autoscalers in the same cluster, each one taking care of one node group? It took some time to set up, and the hypothesis turned out to be not completely wrong: scale-up time dropped by about 10 minutes, but the lead time was still NOT ACCEPTABLE.
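For reference, each Cluster Autoscaler was scoped to its own node group with the --nodes flag (a sketch of the container args only; the cloud provider, sizes, and group names are placeholders, assuming AWS-style auto-scaling groups):
# autoscaler #1: application node group
command:
- ./cluster-autoscaler
- --cloud-provider=aws
- --nodes=3:20:app-workers
# autoscaler #2: ingress node group
command:
- ./cluster-autoscaler
- --cloud-provider=aws
- --nodes=2:6:ingress-workers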
Test #3
Looking at a worker node that is stable with pods running, we notice it is only using 19G of disk space.
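We checked this directly on the node; the exact path depends on the container runtime, so the paths below are assumptions:
$ df -kh /var/lib/docker     # where images and writable layers live on a Docker-based node
$ docker system df           # breakdown of image vs. container disk usage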
Does it happen on just one new node? When we scale up by one pod onto a new node, there is no problem; but when we scale up by 10 pods onto the new nodes, it takes an hour again.
OK, we now know it is related to launching pods in parallel on a new node.
We also notice it is super quick on nodes that already have the pods running, because the kubelet doesn’t have to pull the image from the registry; the image is “already present on machine”.
Test #4
Now, let’s find out what happens when 10 pods are launched in parallel onto a new node. Let’s locate the congestion by logging into the worker node and watching “df -kh” as well as “top”.
We saw that the node does run out of disk space, hence the “no space left on device” errors.
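A simple way to keep an eye on both at once while the 10 pods land on the node (plain shell; the Docker path is again an assumption):
$ watch -n 5 'df -kh /var/lib/docker; echo; top -b -n 1 | head -20'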
Test #5
Next, let’s bump the disk size by 5x.
With 5x more disk, we no longer see “no space left on device” or “the node was low on resource: ephemeral-storage”. However, it still takes 35 minutes to scale up.
We watched the processes in “top” and noticed that the CPU is very busy extracting (decompressing) images with the “unpigz” process. That is why it takes so long for an image to be pulled and a pod to run: the worker node is too busy unzipping the images.
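You can confirm this on the node itself:
$ top -b -n 1 | grep -i unpigz    # decompression workers dominating the CPU
$ pgrep -c unpigz                 # number of unpigz processes running in parallel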
Test #6
It looks like the solution would be baking the image into the node or somehow downloading it ahead of time. Neither of these 2 approaches is practical for our configuration.
We then found that the kubelet has an option for this: “--serialize-image-pulls”. It would work for us because there is only one application deployment in this cluster.
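The flag is --serialize-image-pulls on the kubelet command line, or serializeImagePulls in the kubelet config file. A sketch of the config-file form (other fields omitted; how you roll it out depends on how your nodes are provisioned):
# /var/lib/kubelet/config.yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
serializeImagePulls: true   # pull images one at a time instead of in parallel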
Now you see only one “unpigz” process.
We DID it!
Voilà! Adding this option solved the resource problem. There is no need to increase the disk to 5x, because the node no longer uncompresses 10 copies of the same image at once and fills up the disk, and the worker node CPU is no longer completely consumed by the decompression.
We only needed to add the “--serialize-image-pulls” option, and the scale-up time dropped from 60 minutes to 6 minutes.
It is the journey that counts, except when we are under the gun to solve the problem. I hope we all learned something along the way.
Not unlike an Agatha Christie murder mystery: looking back, we had all the clues, but it took some twists and turns to get there.