Kubernetes IO Problem Investigation

Muhammet Ozekli
Published in Trendyol Tech
Nov 3, 2022

Hello everyone, today we will be talking about troubleshooting a bug that affected the performance and response time of our applications.

At Trendyol, as an e-commerce company, November is a very special month, with Black Friday and seller discounts putting a huge amount of load/requests on our infrastructure. To get ready for Black Friday, as an organization, we schedule lots of load tests to find out the limits of our infrastructure under the expected load (RPMs).

During one of the load tests, the team ran into a latency issue between two APIs: the client-side API got timeouts while accessing the server-side API, yet the server-side API showed neither high response times nor the same amount of errors. Besides the high load, the only indicators were increased IO waits and disk r/w times. Digging into it, the team could see the apps hitting the CPU limits defined in the APIs' deployment YAMLs, and the ping response time between the two IPs had increased while the server-side API kept behaving normally. So the investigation continued with the network and OpenStack infra teams.
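For anyone chasing similar symptoms, watching the disk queues directly on a worker node is a quick sanity check; a minimal sketch, assuming the sysstat package is installed on the node:

root@worker-node:~# iostat -x 5   # %iowait per CPU plus await/%util per device, refreshed every 5 seconds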

The network team captured some packets during the problem from both sides (client and server), and the communication looked healthy; the round-trip times were at single-digit millisecond levels.
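A capture along these lines is what was compared on both ends (the IP and port here are placeholders, not our real ones):

root@worker-node:~# tcpdump -i any host 10.0.0.20 and port 8080 -w /tmp/client_side.pcap   # capture traffic towards the server-side API for offline analysis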

With the OpenStack infra team, we checked the ESXi processes for the VMs and the IO levels on ESXi, and since OpenStack hard-limits the CPU reserved for the VMs, the only remaining suspect seemed to be IO. (???)
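For reference, esxtop on the ESXi host is the usual way to eyeball this; its disk views break latency down per adapter, per device and per VM:

esxtop   # interactive; press d for per-adapter, u for per-device, v for per-VM disk stats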

While the troubleshooting continued, November was also getting closer each day, so to get ready we installed two new Kubernetes clusters with version 1.23.7 and the same worker-node resources (30 nodes, 32 CPU, 64 GB RAM). For the next load tests, the team would have four 1.20.7 k8s clusters and two 1.23.7 k8s clusters, all with the same worker-node specs.

During the next load test, we observed the same latencies in the 1.20.7 k8s clusters under a specific amount of load, but the two 1.23.7 k8s clusters were very smooth, with no timeouts, thankfully :)

So, got a case :D

After struggling with some red herrings around disk performance measurements and checking how load affected the issue, and thanks to Dinçer Çelik (our OpenStack infra team lead), we discovered that find and du operations were running every 15 seconds and were costing a lot in disk operations.

root@worker-node:~# tail -f /var/log/audit/audit.log | egrep "/usr/bin/(find|du)"
type=SYSCALL msg=audit(1666719626.492:2574505571): arch=c000003e syscall=59 success=yes exit=0 a0=7fffa9ca4be0 a1=7fffa9ca4e00 a2=7fffa9ca4e38 a3=7f1b768ecac0 items=2 ppid=3234592 pid=1942246 auid=4294967295 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=(none) ses=4294967295 comm="du" exe="/usr/bin/du" subj=unconfined key="rootcmd"
type=PATH msg=audit(1666719626.492:2574505571): item=0 name="/usr/bin/du" inode=1776 dev=08:01 mode=0100755 ouid=0 ogid=0 rdev=00:00 nametype=NORMAL cap_fp=0 cap_fi=0 cap_fe=0 cap_fver=0 cap_frootid=0
type=SYSCALL msg=audit(1666719626.496:2574505572): arch=c000003e syscall=59 success=yes exit=0 a0=c0099ba570 a1=c008d4be60 a2=c00c5254d0 a3=8 items=2 ppid=3234592 pid=1942247 auid=4294967295 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=(none) ses=4294967295 comm="find" exe="/usr/bin/find" subj=unconfined key="rootcmd"
type=PATH msg=audit(1666719626.496:2574505572): item=0 name="/usr/bin/find" inode=1625 dev=08:01 mode=0100755 ouid=0 ogid=0 rdev=00:00 nametype=NORMAL cap_fp=0 cap_fi=0 cap_fe=0 cap_fver=0 cap_frootid=0
type=SYSCALL msg=audit(1666719626.496:2574505578): arch=c000003e syscall=59 success=yes exit=0 a0=7ffda1748ad0 a1=7ffda1748cf0 a2=7ffda1748d28 a3=7f65d6de7ac0 items=2 ppid=3234592 pid=1942248 auid=4294967295 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=(none) ses=4294967295 comm="du" exe="/usr/bin/du" subj=unconfined key="rootcmd"
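For completeness, the key="rootcmd" in these records comes from an audit rule on the nodes; a rule along these lines (an assumption about our exact setup, not a copy of it) is enough to log every execve by root, which is how the find/du calls surfaced:

root@worker-node:~# auditctl -a always,exit -F arch=b64 -S execve -F euid=0 -k rootcmd   # log root exec calls under the "rootcmd" key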

The combination of the find/du operations and the high load on the applications caused high R/W times on the OpenStack disks.

After seeing the problem, we tried to figure out what was running the find and du commands. We checked the parent process of those commands (a quick sketch of this is right after the list below), and it was the kubelet process. Since both master and worker nodes had the same audit logs, we immediately checked the DaemonSets that might have caused the problem and started to scale down the DS below one by one. After scaling them down, nothing changed; the find and du operations were still in the audit logs, so we got suspicious of the kubelet itself.

fluent-bit
fluent-bit-xxx
blackbox-exporter
process-exporter
node-exporter
opentelemetry-collector-agent
falco
twistlock
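The parent-PID check mentioned above is simple: the audit records all share the same ppid, and resolving it to a process name is what pointed at the kubelet (the PID here is the one from our logs; yours will differ):

root@worker-node:~# ps -o pid,ppid,comm,args -p 3234592   # ppid taken from the audit records above; in our case this resolved to the kubelet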

After a quick Google search for the IO problem, the issue below popped up, and it looks like cAdvisor runs these operations to monitor the inodes.
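Roughly speaking, that kind of per-directory accounting amounts to running the commands below against every container's writable layer (illustrative paths and flags, not necessarily cAdvisor's exact invocation), which gets expensive fast under load:

root@worker-node:~# du -s /var/lib/docker/overlay2/<id>/diff                            # bytes used by the layer
root@worker-node:~# find /var/lib/docker/overlay2/<id>/diff -xdev -printf '.' | wc -c   # one dot per entry, i.e. an inode count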

But cAdvisor refactored the relevant code into a native Go implementation with this commit for version 0.33.0, as the changelog shows. So it shouldn't be happening in later versions.

And the clusters that had the problem were on Kubernetes version 1.20.7; according to go.sum, our cAdvisor version must be 0.38.8, so there shouldn't be any finds and dus, but here they are :)
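Checking which cAdvisor version a Kubernetes release vendors is quick if you grep its go.mod at the matching tag (assuming the tag layout on GitHub hasn't changed):

curl -s https://raw.githubusercontent.com/kubernetes/kubernetes/v1.20.7/go.mod | grep cadvisor   # shows the vendored github.com/google/cadvisor version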

And with Kubernetes version 1.23.7, cAdvisor comes in at version 0.43.0, which has no such problem at all.

The team tried to find the cAdvisor version running in the 1.20.7 clusters, but unfortunately the metrics don't expose the version info itself:

cadvisor_version_info{dockerVersion="Unknown", endpoint="https-metrics", instance="xxxx:yyy", job="kubelet", kernelVersion="xxxx", metrics_path="/metrics/cadvisor", namespace="kube-system", node="worker-node", osVersion="xxx", service="prometheus-operator-kubelet"}
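This is what querying the kubelet's cAdvisor endpoint through the API server looks like (the node name is a placeholder):

kubectl get --raw "/api/v1/nodes/worker-node/proxy/metrics/cadvisor" | grep cadvisor_version_info   # same metric, pulled through the API server proxy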

We also tried to find it with the strings command, as below, but unfortunately the version info couldn't be found that way either. Still, the cAdvisor shipped with 1.20.7 behaves suspiciously enough like a version below 0.33.0.

root@worker-node:~# strings /usr/local/bin/kubelet | grep -i cadvisor | grep -E "0.3|0.2"

If you have run into an issue similar to ours, or if you think something is missing, please let us know how you dealt with it; it would be really helpful.

Thanks for reading, Happy Troubleshooting :)

Hopefully Loading …
