Klaytn State Trie Cache Series #1: Identifying the Cache Problem
See the list of articles here.
🇰🇷: Klaytn State Trie Cache Series #1: Cache 문제 원인 파악하기
Improving the blockchain platform has been one of Klaytn’s top priorities. In this series of articles, we would like to walk you through our journey of improving the performance of the state trie cache.
- Identifying the Cache Problem
- Searching for the Optimal Cache
- Calculating State Trie Cache Misses
- Tuning Cache Size
In this post, we will take a look at a problem we encountered while experimenting with Klaytn and dig into its cause: the Golang Garbage Collector (Go GC). The following is an incident that occurred during one of our Klaytn tests:
When a Klaytn node reaches about 3,500 TPS, it uses roughly 100GB of memory. To find out what was taking up so much space, we looked into the results of the memory profiling provided by Golang.
▶ go tool pprof cn-mem0.prof
File: kcn
Build ID: 7b45b11c163a99518095ffb64083e4aa61fd321f
Type: inuse_space
Time: Mar 26, 2020 at 8:56am (KST)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top
Showing nodes accounting for 41.91GB, 96.33% of 43.50GB total
Dropped 382 nodes (cum <= 0.22GB)
Showing top 10 nodes out of 77
flat flat% sum% cum cum%
30GB 68.97% 68.97% 30GB 68.97% github.com/allegro/bigcache/queue.NewBytesQueue
5.65GB 12.98% 81.95% 5.65GB 12.99% github.com/allegro/bigcache.(*cacheShard).set
1.53GB 3.52% 85.47% 1.53GB 3.52% reflect.New
1.25GB 2.87% 88.35% 2.60GB 5.97% github.com/klaytn/klaytn/ser/rlp.decodeBigInt
Memory profiling shows how much memory is being used and where it is being used. In the results above, the line `Showing nodes accounting for 41.91GB, 96.33% of 43.50GB total` tells us that the kcn binary is using 43.5GB in total, and the lines below it break down where that 96.33%, or 41.91GB, is going. In particular, `30GB 68.97% github.com/allegro/bigcache/queue.NewBytesQueue` indicates that 30GB of memory (68.97%) is being consumed by BigCache.
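For reference, a heap profile like the one analyzed above can be produced with Go's runtime/pprof package. Below is a minimal sketch; how the Klaytn node actually generates cn-mem0.prof may differ (for example, it could instead expose an HTTP endpoint via net/http/pprof):

```go
package main

import (
	"log"
	"os"
	"runtime"
	"runtime/pprof"
)

// dumpHeapProfile writes an inuse_space heap profile that can be inspected
// with `go tool pprof <binary> <file>`, like the cn-mem0.prof file above.
func dumpHeapProfile(path string) error {
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	defer f.Close()

	runtime.GC() // run a collection first so the statistics are up to date
	return pprof.WriteHeapProfile(f)
}

func main() {
	if err := dumpHeapProfile("mem.prof"); err != nil {
		log.Fatal(err)
	}
}
```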
We noticed something strange in these results. According to the metrics library provided by Prometheus, kcn is using about 100GB of memory, while the memory profile (`43.50GB total`) says the binary is using only 43.5GB in total. That means we cannot trace where the remaining 56.5GB (= 100GB - 43.5GB) is being used.
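This gap can also be observed from inside a Go process with runtime.ReadMemStats: HeapAlloc is close to what the inuse_space profile reports, while Sys is the total memory the runtime has obtained from the OS, which is closer to what top or Prometheus-based metrics show. A minimal sketch:

```go
package main

import (
	"fmt"
	"runtime"
)

func main() {
	var ms runtime.MemStats
	runtime.ReadMemStats(&ms)

	// HeapAlloc: bytes of live heap objects (roughly what the profiler sees).
	// Sys: total bytes obtained from the OS (closer to process-level metrics).
	fmt.Printf("HeapAlloc: %d MB\n", ms.HeapAlloc>>20)
	fmt.Printf("HeapSys:   %d MB\n", ms.HeapSys>>20)
	fmt.Printf("HeapIdle - HeapReleased (held but unused): %d MB\n",
		(ms.HeapIdle-ms.HeapReleased)>>20)
	fmt.Printf("Sys:       %d MB\n", ms.Sys>>20)
}
```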
We assumed that BigCache, which occupies the most memory, was the main cause of the memory leak. To check whether BigCache was indeed using more memory than allocated, we ran a test on two servers in the same environment, with cache sizes of 30GB and 0GB respectively (a sketch of how such a cache size can be configured follows the test setup below). The following are the results of the `top` command and of memory profiling on the two servers.
(The `top` command reports memory in GiB, whereas the Prometheus library reports it in GB; the difference is small enough that the two figures refer to essentially the same values.)
Cypress sync test
- AWS instance: m5.8xlarge
- Memory size: 128GB
- Cache size: 30GB, 0GB
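For reference, the 30GB cache size in this test corresponds to BigCache's hard size limit. The sketch below shows how such a limit could be configured with the bigcache library (using the v1 import path that appears in the profile above; the actual parameters Klaytn uses are not shown here and may differ):

```go
package main

import (
	"fmt"
	"time"

	"github.com/allegro/bigcache"
)

func main() {
	// Start from the library defaults and cap the cache at roughly 30GB.
	// HardMaxCacheSize is expressed in MB, so 30 * 1024 MB ≈ 30GB.
	cfg := bigcache.DefaultConfig(10 * time.Minute) // illustrative eviction window
	cfg.HardMaxCacheSize = 30 * 1024

	cache, err := bigcache.NewBigCache(cfg)
	if err != nil {
		panic(err)
	}

	// Entries are copied into large internal byte queues
	// (queue.NewBytesQueue in the profile above) rather than being stored
	// as many individual pointer-bearing objects.
	_ = cache.Set("block-hash", []byte("encoded trie node"))
	if v, err := cache.Get("block-hash"); err == nil {
		fmt.Println(string(v))
	}
}
```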
On the server with a 30GB BigCache, the `top` command and memory profiling reported 70GB and 35GB respectively, which indicates 35GB of seemingly leaked memory. On the server without BigCache, `top` and memory profiling reported 5GB and 2GB respectively, making the apparent leak 3GB.
In light of this experiment, we can conclude that BigCache uses more memory than it is allocated. We also found that even without BigCache, about 3GB of memory was unaccounted for. Moreover, because this gap is caused by the Go GC (Garbage Collector), it can occur in any Go program.
We found this article, which explains how using many pointers with a large heap for an extended period of time can consume a massive amount of memory.
The Garbage Collector periodically looks for dynamically allocated memory that is no longer reachable and frees it so that it can be used elsewhere. To do this, it scans pointers to find allocations that are no longer referenced. If a Go program uses more than a certain amount of memory and holds a lot of pointers, the Go GC ends up using a significant amount of memory (and CPU) just to scan them.
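To see why the number of pointers matters, consider the hypothetical micro-benchmark below: it builds a heap of tens of millions of small pointer-bearing objects and forces a collection, whose mark phase has to follow every one of those pointers (the object count and sizes are illustrative, not Klaytn's actual workload):

```go
package main

import (
	"fmt"
	"runtime"
	"time"
)

func main() {
	// A small object that contains a pointer the GC must trace.
	type node struct {
		next *node
		val  [32]byte
	}

	const n = 20_000_000 // ~20M pointer-bearing objects, roughly 1GB of heap
	head := &node{}
	cur := head
	for i := 0; i < n; i++ {
		cur.next = &node{}
		cur = cur.next
	}

	start := time.Now()
	runtime.GC() // force a full collection; the mark phase walks every pointer
	fmt.Println("GC over pointer-heavy heap took", time.Since(start))

	var ms runtime.MemStats
	runtime.ReadMemStats(&ms)
	fmt.Printf("heap in use: %d MB, GC CPU fraction: %.4f\n",
		ms.HeapInuse>>20, ms.GCCPUFraction)

	runtime.KeepAlive(head)
}
```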
Memory usage that would be about 44GB without the GC grows to roughly 100GB while the GC is running. Since memory profiling takes place after a GC cycle finishes, only the 44GB shows up in the profiling results. In a setup like Klaytn's, where a large BigCache is allocated, garbage collection consistently consumed far more memory than the profile suggested.
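One contributing factor to this headroom is the GC's pacing: with the default GOGC=100 setting, the runtime lets the heap grow to roughly twice the live set before triggering the next collection, so a ~44GB live heap can legitimately occupy much more memory between collections. The sketch below (illustrative sizes, not Klaytn's) prints the runtime's next GC target alongside the live heap:

```go
package main

import (
	"fmt"
	"runtime"
)

func main() {
	// Keep roughly 1GB of data live so the GC has a non-trivial target.
	live := make([][]byte, 0, 1024)
	for i := 0; i < 1024; i++ {
		live = append(live, make([]byte, 1<<20))
	}

	runtime.GC()
	var ms runtime.MemStats
	runtime.ReadMemStats(&ms)

	// NextGC is the heap size at which the next collection will run;
	// with GOGC=100 it is roughly twice HeapAlloc.
	fmt.Printf("live heap: %d MB, next GC target: %d MB, obtained from OS: %d MB\n",
		ms.HeapAlloc>>20, ms.NextGC>>20, ms.Sys>>20)

	runtime.KeepAlive(live)
}
```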
Such unexpected extra memory usage can lead to an out-of-memory failure. For a program like Klaytn, which needs to operate for long periods, a shutdown due to unintended memory overuse must be avoided. In our next post, we will show you how we managed to solve this memory problem.