Miscellaneous notes on Redis latency

fcamel

Published in

fcamel的程式開發心得

2 min read · Jul 22, 2020

Latency troubleshooting guide

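The official “Redis latency problems troubleshooting” guide is quite detailed. If you run your own Redis, be sure to read it through at least once: some of the settings you need are not the OS or Redis defaults.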

If you can't be bothered to read it, at least learn to turn on the latency monitor:

CONFIG SET latency-monitor-threshold 100

After collecting data for a while, run latency doctor:

127.0.0.1:6379> latency doctor

Dave, I have observed latency spikes in this Redis instance.
You don't mind talking about it, do you Dave?

1. command: 5 latency spikes (average 300ms, mean deviation 120ms,
    period 73.40 sec). Worst all time event 500ms.

I have a few advices for you:
...

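The chatty output is a rather silly design, but it is quite handy in practice and can probably solve a fair number of problems.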

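There is no need to worry about the latency monitor hurting performance. It is off by default because the Redis developers consider it a non-essential feature that uses a little memory, so unlike SLOWLOG it is not enabled out of the box.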

Counting command frequency

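Another cause of poor performance is not one slow command but too many commands being executed. You can capture the output of redis-cli monitor for a while, then write your own script to find the most frequently executed commands and keys.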
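As a rough sketch of such a script (not the author's own), the following counts commands and keys from captured monitor output. It assumes each line looks like `timestamp [db client-address] "command" "arg" ...`, and it treats the first argument as the key, which is a simplification that does not hold for every command. The function name `count_commands` is made up for illustration.

```python
import shlex
from collections import Counter


def count_commands(lines):
    """Count commands and first-argument keys in `redis-cli monitor` output."""
    commands, keys = Counter(), Counter()
    for line in lines:
        # Everything after the first "] " is the quoted command and arguments.
        _, sep, rest = line.partition("] ")
        if not sep:
            continue  # e.g. the initial "OK" line
        try:
            parts = shlex.split(rest)
        except ValueError:
            continue  # skip lines whose quoting shlex cannot parse
        if not parts:
            continue
        commands[parts[0].upper()] += 1
        if len(parts) > 1:
            keys[parts[1]] += 1  # assumption: the first argument is the key
    return commands, keys
```

To use it, redirect the monitor output to a file for a bounded period (for example `redis-cli monitor > monitor.log`), pass the open file to `count_commands`, and look at `commands.most_common(10)` and `keys.most_common(10)`.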

The fork problem

The way Redis implements snapshot isolation is interesting. Unlike the many databases that use MVCC, Redis keeps all of its data in RAM, so it forks and lets the child process do the backup.

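Linux's fork uses copy-on-write, so the child process sees exactly the same memory contents that the main process (the Redis server) had at the moment of the fork. Writing the child's in-memory data to disk completes the backup.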
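The idea can be illustrated with a toy sketch, assuming a Unix-like system where `os.fork` is available. This is not how Redis is implemented (Redis is written in C and writes RDB files); the function name `snapshot_via_fork` and the JSON output are stand-ins chosen for illustration.

```python
import json
import os


def snapshot_via_fork(data, path):
    """Write a point-in-time copy of `data` to `path` from a forked child."""
    pid = os.fork()
    if pid == 0:
        # Child: sees memory exactly as it was at fork time.
        status = 1
        try:
            with open(path, "w") as f:
                json.dump(data, f)
            status = 0
        finally:
            os._exit(status)  # exit without running the parent's cleanup
    # Parent: keeps accepting writes; they never reach the child's copy.
    data["counter"] = data.get("counter", 0) + 1
    _, wait_status = os.waitpid(pid, 0)
    return wait_status == 0
```

However the two processes are scheduled, the file ends up with the value of `counter` from before the parent's update, because the parent's write only touches the parent's own copy of the page.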

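It sounds like a clever and simple approach, but when memory usage is very large (say 100 GB), the fork itself stalls the Redis server for several seconds. For a service where even 10 ms feels too slow, a stall of several seconds is pretty bad. The only fix is to keep any single Redis node from using too much memory, that is, to use Redis Cluster and keep each node's memory small. In my own measurements, with Redis using 5 GB, a fork took roughly 0.1 s.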
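To check fork time on your own instance, one option is the `latest_fork_usec` field that Redis reports in `INFO stats`. The sketch below only parses the text of that output; the helper name `latest_fork_ms` is made up here, and how you obtain the text (for example `redis-cli info stats`) is left to you.

```python
def latest_fork_ms(info_text):
    """Return the duration of the last fork in milliseconds, or None."""
    for line in info_text.splitlines():
        name, sep, value = line.strip().partition(":")
        if sep and name == "latest_fork_usec":
            return int(value) / 1000.0  # microseconds to milliseconds
    return None  # field not present in the given text
```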

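The expire problem

Redis deletes expired keys on two occasions:

On access: before a key is accessed, Redis checks it and deletes it if it has expired.
Every 100 ms: Redis samples a few random keys and deletes the ones that have expired, repeating the sampling until fewer than 25% of the sampled keys are expired.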
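The second mechanism can be modelled with a small simulation. This is a simplified sketch, not Redis's actual code: the sample size of 20 follows the description in the Redis EXPIRE documentation rather than this article, the names are made up, and the real implementation has additional safeguards that this model ignores.

```python
import random

SAMPLE_SIZE = 20     # keys sampled per round (assumption, per the EXPIRE docs)
REPEAT_RATIO = 0.25  # repeat while more than this fraction was expired


def active_expire_cycle(expires, now, rng=random):
    """Run one simplified expire cycle; `expires` maps key -> expiry time.

    Deletes expired keys in place and returns the number of sampling rounds.
    """
    rounds = 0
    while expires:
        rounds += 1
        sample = rng.sample(list(expires), min(SAMPLE_SIZE, len(expires)))
        expired = [k for k in sample if expires[k] <= now]
        for k in expired:
            del expires[k]
        if len(expired) <= REPEAT_RATIO * len(sample):
            break  # few enough expired keys: stop until the next cycle
    return rounds
```

When almost nothing has expired, the cycle ends after a single round. When every key has expired at once, the loop keeps going until the keys are gone, which is the stall this section is about.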

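This approach is interesting. In theory you don't need to worry that setting an expire costs too much, which matters because it is such a commonly used command. It also spends a steady amount of CPU time on cleanup, so garbage never piles up and hogs memory.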

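But it also means that if you carelessly set a large number of keys to expire at the same time, Redis will stall for a long while at that moment, until the expired keys have been deleted down to below 25%.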
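A common way to avoid this (not something the article itself prescribes) is to add random jitter to each TTL so that keys written together do not all expire in the same second. A minimal sketch, with a hypothetical helper name and arbitrary example values:

```python
import random


def ttl_with_jitter(base_seconds, spread_seconds, rng=random):
    """Return the base TTL plus a random offset in [0, spread_seconds]."""
    return base_seconds + rng.randint(0, spread_seconds)
```

For example, `ttl_with_jitter(3600, 300)` gives a TTL between 60 and 65 minutes, which you would then pass as the expire time when setting the key.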

Reference notes

Some excerpts from my reading are attached below.

Even with AOF, Redis still forks (to rewrite the AOF)

https://redis.io/topics/latency#latency-generated-by-fork

In order to generate the RDB file in background, or to rewrite the Append Only File if AOF persistence is enabled, Redis has to fork background processes. The fork operation (running in the main thread) can induce latency by itself.

fork is slower on Xen, though newer AWS instance types have largely fixed this

https://redis.io/topics/latency#fork-time-in-different-systems

Modern hardware is pretty fast at copying the page table, but Xen is not. The problem with Xen is not virtualization-specific, but Xen-specific. For instance using VMware or Virtual Box does not result into slow fork time.
…
However the good news is that new types of EC2 HVM based instances are much better with fork times, almost on par with physical servers, so for example using m3.medium (or better) instances will provide good results.

Redis tries to keep fsync from blocking write, but it can still happen

https://redis.io/topics/latency#latency-due-to-aof-and-disk-io

When appendfsync is set to the value of everysec Redis performs an fsync every second. It uses a different thread, and if the fsync is still in progress Redis uses a buffer to delay the write(2) call up to two seconds (since write would block on Linux if an fsync is in progress against the same file). However if the fsync is taking too long Redis will eventually perform the write(2) call even if the fsync is still in progress, and this can be a source of latency.

Cleaning up expired keys can stall Redis

https://redis.io/topics/latency#latency-generated-by-expires

Basically this means that if the database has many many keys expiring in the same second, and these make up at least 25% of the current population of keys with an expire set, Redis can block in order to get the percentage of keys already expired below 25%.
…
In short: be aware that many keys expiring at the same moment can be a source of latency.