How We Mitigated a Memory Leak in Kubernetes with a One-Liner Commit

Muhammet Ozekli
Published in Trendyol Tech
10 min read · May 10, 2023

Hello everyone, today we will talk about a three-month troubleshooting agony: a production problem that caused more than 10 incidents. As the Trendyol SRE team, incidents are a huge focus point for us.

In mid-January, the main gateway APIs that connect customers to Trendyol started to return 503 errors, especially after 18:00 on weekends, when the RPM gets higher (about 100 million RPM at daily peak). We checked the Kubernetes clusters and alerts: 3–4 nodes per cluster were down, and the reason was node OOM. We examined one of the nodes (image-1). There were no RPM spikes beyond the usual traffic (image-2), which is why CPU looked normal until more GC activity was needed, and then CPU usage climbed.

image-1: Examined worker node
image-2: Daily RPM dashboard

When checking the cluster overview, we realized that some of the nodes were burning out because of memory while others seemed idle, even though business pods were running on them.

Last year at Trendyol, thanks to the sfx (storefront experience) framework developed by the Trendyol sfx team, our gateway APIs were transformed from microservices to nano-services, and the apps were converted from REST APIs to gRPC services. The result of this transition is 300+ more apps and 1800+ pods per cluster, with 45 nodes per cluster. So, to understand why the pods were not balanced across the nodes, we checked our descheduler rules; we use something similar to the configuration below.

strategies:
  LowNodeUtilization:
    enabled: true
    thresholdPriority: 10000
    params:
      nodeFit: true
      nodeResourceUtilizationThresholds:
        thresholds:
          cpu: 50
          memory: 50
          pods: 30
        targetThresholds:
          cpu: 70
          memory: 70
          pods: 65

The descheduler looks at requested resources when deciding which pods to evict and rebalance, so we checked the pods' memory and CPU requests. Usage was several times larger than the requests (requested ≈ 64MB, usage ≈ 256MB). We thought this might be why the descheduler was not balancing the nodes: with requests that low, pod count can go up to 110 per node, and the mismatch may have caused the node underutilization problem.
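A quick way to see this gap per workload is to put the live usage next to the requests; a rough sketch, with the namespace and label purely illustrative:

# live memory usage per pod (requires metrics-server)
kubectl -n storefront top pods -l app=product-detail-bff

# requested memory per pod, for comparison
kubectl -n storefront get pods -l app=product-detail-bff \
  -o custom-columns='NAME:.metadata.name,REQ_MEM:.spec.containers[*].resources.requests.memory'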

Tuning-1:

Both as a best practice and for better node utilization, we started to analyze the deployments and define more accurate resource requests (sketched below).
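A hedged sketch of the kind of change, with the numbers purely illustrative: requests are raised toward the observed usage so that the scheduler and the descheduler work with realistic figures.

resources:
  requests:
    cpu: 250m
    memory: 256Mi
  limits:
    memory: 512Mi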

After updating the deployments, we monitored the clusters for 2 weeks and the problem still continued, but now nodes started going down while the apps were being deployed. We realized that 300+ apps were being deployed at the same time since, following our GitOps practices, we use GitLab for CI and Argo CD for CD. With the deployment rollout strategies, ~600 pods were initialized at once.

Tuning-2:

This deployment behavior was considered aggressive, so we decided to split the rollout by app dependencies, because gRPC clients and servers can depend on each other heavily. That decreased the number of apps deploying at the same time, for example to batches of 40, 36, 54, 80, and 120 (one way to stagger such rollouts is sketched below).
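One way to stagger rollouts in an Argo CD app-of-apps setup is sync-wave annotations; this is a sketch of the idea rather than necessarily the exact mechanism we used, and all names are illustrative.

# apps in wave "2" are synced only after wave "1" apps are healthy
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: sfx-product-detail-bff
  annotations:
    argocd.argoproj.io/sync-wave: "2"
spec:
  project: storefront
  source:
    repoURL: https://gitlab.example.com/sfx/product-detail-bff.git
    targetRevision: main
    path: k8s
  destination:
    server: https://kubernetes.default.svc
    namespace: storefront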

After changing the deployment batching, we observed for one week and the problem still continued. So we changed direction: while checking the node memory dashboard, we realized that the difference between node memory usage and pod memory usage was huge. To compare, we checked the other k8s clusters; across the Trendyol infrastructure there are 330+ Kubernetes clusters, as you can see on the public dashboard.

gateway k8s cluster worker node

The difference between node usage and pod usage looks suspicious, while the other clusters seem fine.

other k8s cluster worker node

When we checked the node itself, it looked like the containerd root process was consuming most of the memory:

root@worker-30:~# ps aux --sort -%mem
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 1090 8.4 74.0 122810972 48868756 ? Ssl Feb10 1630:12 /usr/bin/containerd
10001 3730741 0.5 0.5 1202592 364844 ? Ssl Feb22 7:53 /otelcol --config=/conf/relay.yaml

And seeing the bigger picture, it hit us: we have a memory leak!!!

In our inventory we manage multiple Kubernetes cluster versions; after 1.20 we started to use containerd instead of Docker, and we have 200+ Kubernetes clusters that use containerd. Since the other clusters' worker nodes were fine (no leak was observed there) and the problem started in mid-January, we suspected that with the sfx transformation we had somehow triggered a bug in containerd. So let's find out what it is!!

The first thing we checked was whether there was any difference between 1.20.X (containerd 1.4.6) and 1.23.X (containerd 1.6.4). The leak showed up more frequently on 1.4.6, which is why at first we thought it only existed on that version, so we compared the two. There is a config difference between the versions:

#containerd://1.4.6
root@worker-30:~# systemctl show containerd | grep LimitNOFILE
LimitNOFILE=1048576
LimitNOFILESoft=1048576


#containerd://1.6.4
root@worker-31:~# systemctl show containerd | grep LimitNOFILE
LimitNOFILE=infinity
LimitNOFILESoft=infinity

But after a closer look, it seemed both versions had the memory leak, so we compared their behavior:

  • Draining a node releases the memory on the 1.23.X version, but on 1.20.X we have to restart either the node or the containerd service itself.
  • The 1.4.6 version of containerd logs way more than 1.6.4 (it was a cAdvisor issue on Kubernetes 1.20.X; you can read about it in this article).
  • The memory leak frequency is higher on 1.4.6.

Since both versions have the leak, we tried to concentrate on the application side. First we checked what kind of memory was causing the leak; thanks to the detailed Grafana dashboards developed by the Trendyol container team, it turned out that inactive memory was driving the growth, so IO operations might be causing the problem.
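For reference, the inactive pages can also be observed directly on a node; a minimal check:

# page cache that is no longer actively referenced shows up in the Inactive(file) counters
grep -iE '^(active|inactive)' /proc/meminfo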

Tuning-3:

We checked the application side and tried to find a pod pattern. At first there was a pattern for a specific group of applications; our sfx team checked the shared libraries and found an issue with retrieving timezone data via LoadLocationFromTZData, which was being called for every request, a very expensive operation at that frequency. The teams fixed that issue.
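The shape of that fix is to load the location once and reuse it; a minimal sketch, assuming the timezone and zoneinfo path (Europe/Istanbul here) are only illustrative:

package tzcache

import (
	"os"
	"sync"
	"time"
)

var (
	once sync.Once
	loc  *time.Location
)

// Location loads the timezone data from disk once and caches it,
// instead of calling LoadLocationFromTZData on every request.
func Location() *time.Location {
	once.Do(func() {
		data, err := os.ReadFile("/usr/share/zoneinfo/Europe/Istanbul")
		if err != nil {
			loc = time.UTC // illustrative fallback
			return
		}
		l, err := time.LoadLocationFromTZData("Europe/Istanbul", data)
		if err != nil {
			loc = time.UTC
			return
		}
		loc = l
	})
	return loc
}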

But the leak continued!!!

Then, to see where we were going, we decided to take profiling dumps of both the containerd process and the applications, but since pprof was not enabled on the applications, we focused on the containerd process. We looked at profiling tools like bytehound and heaptrack and tried to install heaptrack, but too many missing dependencies and mismatched versions blocked it. We tried valgrind too, but it didn't take us anywhere.

valgrind --help
valgrind --tool=memcheck --leak-check=yes --show-reachable=yes --num-callers=20 --track-fds=yes --pid=pid
valgrind --attach=pid --leak-check=full

We followed the steps described in a Veritas ticket and a Stack Overflow issue on gcore and gdb usage, took a dump with gcore, and tried to analyze it locally with gdb, but symbol resolution failed.

gcore pid                                        # dump the running containerd process to a core file
mv core.pid /tmp/
chmod 777 /tmp/core.pid
gdb /usr/local/go/bin/go containerd_time.dump    # analyze the dump locally (symbols failed)
gdb /usr/local/go/bin/go core.pid

So we tried attaching directly to the running process on the worker node, but it hung the containerd process and containers started to fail. We had symbol issues there too.

gdb /usr/bin/go pid

After a quick search on how to access containerd pprof data, we found the ctr command and how to enable the containerd debug socket. For one specific node running containerd 1.4.6, we changed config.toml under the /etc/containerd/ directory and restarted the service.

[debug]
level = "debug"
address = "/run/containerd/debug.sock"
format = "json"

The debug-level config change allowed us to create a heap.pprof file with the command:

ctr pprof heap >> heap.pprof

We retrieved the pprof file to the local environment and opened it with the Go tool.

go tool pprof -web tmp/heap.pprof

The output showed that most of the memory held by containerd was attributed to the redirectLogs function. Our first thought was that the logging-level change in config.toml, i.e. the debug level itself, might have caused the 10 GB memory consumption; it was a possibility. We decided to open an issue on the containerd GitHub repo and wait for recommendations.

To change the point of view, we went through the containerd issues for known memory leak problems and came across a readiness-probe leak. Since all of our applications have either a REST or a gRPC readiness probe, we decided to upgrade containerd to the latest version to test whether the leak persisted.

Tuning-4:

We followed the procedure below to upgrade containerd to version 1.7.0.

binary upgrade: 1.6.4 to 1.7.0

salt 'worker*' cmd.run "wget https://github.com/containerd/containerd/releases/download/v1.7.0/containerd-1.7.0-linux-amd64.tar.gz"
salt 'worker*' cmd.run "tar Czxvf /usr/local containerd-1.7.0-linux-amd64.tar.gz"
salt 'worker*' cmd.run "systemctl restart containerd.service"

1.4.6 to 1.7.0

salt 'worker*' cmd.run "wget https://github.com/containerd/containerd/releases/download/v1.7.0/containerd-1.7.0-linux-amd64.tar.gz"
salt 'worker*' cmd.run "tar Czxvf /usr containerd-1.7.0-linux-amd64.tar.gz"
salt 'worker*' cmd.run "systemctl restart containerd.service"
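To confirm the rollout across the fleet, something like the following can be used (a sketch in the same salt style as above):

salt 'worker*' cmd.run "containerd --version"
salt 'worker*' cmd.run "systemctl is-active containerd.service"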

After the upgrade, the leak slowed down, but unfortunately,

The leak was still there!

Since the leak continued after the upgrade, we wanted to see what had changed in the containerd profile. In this round we wanted to follow the memory diffs step by step, so we created multiple dump files while the memory increased.
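A hedged sketch of how such step-by-step heap snapshots can be diffed with the Go tooling (file names are illustrative):

ctr pprof heap > heap-1.pprof
# ...wait while the containerd RSS grows...
ctr pprof heap > heap-2.pprof
# show only what grew between the two snapshots
go tool pprof -top -base=heap-1.pprof heap-2.pprof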

Total usage 10.8 GB:

root@worker-1:~# ps aux --sort -%mem | head -10
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 2815080 13.3 5.5 11877272 3627608 ? Ssl 13:10 13:52 /usr/local/bin/containerd

Total usage 13.8 GB:

root@worker-1:~# ps aux --sort -%mem | head -10
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 2815080 12.1 9.9 18278020 6533084 ? Ssl 13:10 35:35 /usr/local/bin/containerd

Total usage 20.1 GB:

root@worker-1:~# ps aux --sort -%mem | head -10
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 2815080 12.1 18.4 20911492 12131872 ? Ssl Apr07 85:36 /usr/local/bin/containerd

While at 20.1 GB usage:

root@-worker-1:~# free -h
total used free shared buff/cache available
Mem: 62Gi 17Gi 36Gi 28Mi 8.9Gi 44Gi
Swap: 0B 0B 0B

Tuning-5:

These outputs convinced us to look deeper into the logging. So we started to scale down our fluent-bit DaemonSets to eliminate the log reading/writing scenarios one by one. Even when the reading part was stopped,

And the leak was still there!
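For reference, a hedged sketch of one way to temporarily take a log-collector DaemonSet off the nodes, by pointing its nodeSelector at a label no node carries (namespace and names are illustrative):

# stop fluent-bit from being scheduled anywhere
kubectl -n logging patch daemonset fluent-bit \
  -p '{"spec":{"template":{"spec":{"nodeSelector":{"scaled-down":"true"}}}}}'

# bring it back by removing the selector again
kubectl -n logging patch daemonset fluent-bit --type json \
  -p '[{"op":"remove","path":"/spec/template/spec/nodeSelector/scaled-down"}]'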

After scaling fluent-bit back up, we saw “Skipping the long lines” warnings on the fluent-bit pods. Our existing fluent-bit input config has a 5 MB maximum buffer size:

[INPUT]
    Name              tail
    Alias             kube_container_logs
    Tag               kube.<namespace_name>.<container_name>.<pod_name>.<docker_id>-
    Tag_Regex         (?<pod_name>[a-z0-9](?:[-a-z0-9]*[a-z0-9])?(?:\\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*)_(?<namespace_name>[^_]+)_(?<container_name>.+)-(?<docker_id>[a-z0-9]{64})\.log$
    Buffer_Chunk_Size 512k
    Buffer_Max_Size   5m
    Path              /var/log/containers/*.log
    Parser            cri
    DB                /var/log/flb_kube.db
    Mem_Buf_Limit     512MB
    Skip_Long_Lines   On
    Refresh_Interval  10

Tuning-6:

When we checked the warnings with the SFX team, they mentioned that some stack-trace errors might have caused those long lines, so they released another version in which stack-trace logs are disabled. After disabling the stack-trace logging, we kept monitoring the memory.

But the leak was still there!

While checking the stack-trace logging, the team discovered something that was worth testing.

At Trendyol we manage 3000+ microservices, and following a business flow across them can be very challenging, so last year we decided to adopt distributed tracing and started using OpenTelemetry. With the SFX nano-service transition, teams started to enable otel on the Golang BFF services.

And the team discovered a code line that prints out the Go context on every request…

fmt.Printf("Context => %v", ctx)

Tuning-7:

So we got another release, this time just without the console printout of the otel context, and deployed it immediately. We started to monitor it; since the leak can take time to show, we had to wait at least one day, but it was looking promising (the first positive progress in the last three months :) ). Printing the context with %v serializes everything it carries, including the otel span data, into one enormous log line per request, which would line up with both the redirectLogs memory in the containerd profile and the “Skipping the long lines” warnings from fluent-bit.

We couldn't sleep on the night of April 12 and checked the memory consumption every 2 hours.

We were very excited to see the flat line on the Grafana dashboards. Can't express the happiness/relief/hopefulness :D

The memory released for some of the apps with higher throughput was massive, up to about 40 GB.

And the one commit that saved us all:
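In essence, it is a single deletion: the line that dumped the request context, and the otel span data it carries, into stdout. A minimal sketch, with the handler and types purely illustrative:

package bff

import "context"

type Request struct{ ID string }
type Response struct{ Body string }

// handle stands in for an sfx gRPC handler. The one-liner fix is removing
// the debug print below, which serialized the whole context into the
// container log on every single request.
func handle(ctx context.Context, req *Request) (*Response, error) {
	// removed: fmt.Printf("Context => %v", ctx)
	return &Response{Body: "ok"}, nil
}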

Since this kind of usage can be very common across domain teams, we prepared Grafana dashboards to catch large gaps between node memory usage and pod memory usage.

The tests to pin down the underlying bug on the containerd side are still in progress. If we make any further discoveries, we will share them in the next part of this article. Thank you for reading.

Special thanks to: Gokhan Karadas, Ahmet Can Güven, Gökçe Sürenkök, developer-guy, Furkan Türkal, Emin Aktaş

Want to work in this team?

Be a part of something great! Trendyol is currently hiring. Visit the pages below for more information and to apply.

Have a look at the roles we’re looking for!
