Redis Cluster mode and Key distribution

How Redis Cluster mode (cluster mode enabled) works, and the logic behind key distribution and key access.

Jerry’s Notes

Published in

What’s next?

8 min read · Mar 18, 2022

Redis cluster tutorial - Redis

This document is a gentle introduction to Redis Cluster, that does not use difficult to understand concepts of…

redis.io

Redis Cluster Specification - Redis

Welcome to the Redis Cluster Specification. Here you'll find information about algorithms and design rationales of…

redis.io

Redis — Key distribution

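Redis stores keys across a total of 16384 hash slots. With a single node, all slots live on that one node. With cluster mode enabled and multiple primary nodes, each primary manages a subset of the slots. When Redis accesses a key, it computes HASH_SLOT = CRC16(key) mod 16384 and sends the request to the node that manages that slot.
For example (key names are case-sensitive):
CRC16("foo") % 16384 = 12182 => put into Server-3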
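The slot calculation above can be sketched in a few lines of Python. This is a minimal illustration, assuming the CRC16 variant named in the Redis Cluster specification (XMODEM: polynomial 0x1021, initial value 0); the function names `crc16_xmodem` and `key_slot` are made up for this example, and hash tags are ignored here.

```python
def crc16_xmodem(data: bytes) -> int:
    """Bitwise CRC16 (XMODEM): polynomial 0x1021, initial value 0x0000."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ 0x1021) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc


def key_slot(key: str, slots: int = 16384) -> int:
    """HASH_SLOT = CRC16(key) mod 16384 (hash tags not handled in this sketch)."""
    return crc16_xmodem(key.encode("utf-8")) % slots


if __name__ == "__main__":
    print(key_slot("foo"))  # 12182
```

On a live cluster, `CLUSTER KEYSLOT foo` reports the same slot, which is a handy way to cross-check a client-side calculation.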

[Figure: Cluster mode enabled]

[Figure: Cluster mode disabled]

Key Distribution Model

How Redis stores data: a Redis cluster contains 16384 hash slots. When you write data, Redis computes CRC16(key) % 16384 to determine which slot the key belongs to, and stores the data in that slot. In cluster mode there are multiple shards, and each shard handles a subset of the slots. This is how Redis cluster mode allows the primary node of every shard to accept writes.
■ The commands ‘CLUSTER SLOTS’ and ‘CLUSTER NODES’ show which slots are allocated to which shards in the cluster.
■ By default, ElastiCache distributes slots equally across shards, though you can customize the distribution scheme if required.
■ The key space is split into 16384 slots.
■ Each shard in a cluster handles a subset of the 16384 hash slots.

Redis Cluster data sharding

Redis Cluster does not use consistent hashing, but a different form of sharding where every key is conceptually part of what we call a hash slot.
There are 16384 hash slots in Redis Cluster, and to compute what is the hash slot of a given key, we simply take the CRC16 of the key modulo 16384.
Every node in a Redis Cluster is responsible for a subset of the hash slots, so for example you may have a cluster with 3 nodes, where:
Node A contains hash slots from 0 to 5500.
Node B contains hash slots from 5501 to 11000.
Node C contains hash slots from 11001 to 16383.

Q: why 16384 slots rather than 65536?

In cluster mode, the heartbeat message header carries a bitmap that represents the sender's slot configuration (e.g., 0001101 means "I have slots 3, 4 and 6": each bit set to 1 marks a slot the node owns). With 65536 slots, the slot bitmap alone would take 65536 / 8 = 8192 bytes (8 KB) of every header; with 16384 slots it takes only 16384 / 8 = 2048 bytes (2 KB). Since the nodes in a cluster constantly send heartbeats to each other, this saves a lot of bandwidth.
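The arithmetic is easy to check. The snippet below is only a sketch of the idea: the helper names are hypothetical, and the bitmap is modelled as a string of '0'/'1' characters read left to right from slot 0, as in the example above, not as Redis's real binary header layout.

```python
def slot_bitmap_bytes(num_slots: int) -> int:
    """Bytes needed for a bitmap with one bit per slot."""
    return num_slots // 8


def owned_slots(bitmap: str) -> list:
    """Slot numbers whose bit is '1', counting from slot 0 on the left."""
    return [slot for slot, bit in enumerate(bitmap) if bit == "1"]


if __name__ == "__main__":
    print(slot_bitmap_bytes(65536))  # 8192
    print(slot_bitmap_bytes(16384))  # 2048
    print(owned_slots("0001101"))    # [3, 4, 6]
```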

Q: What is the difference between a single node, replication, and cluster mode enabled?

In ElastiCache, a Redis shard is a grouping of one to six related nodes. Replication is implemented by grouping two to six nodes in a shard. One of these nodes is the read/write primary node. All the other nodes are read-only replica nodes.
Each replica node maintains a copy of the data from the primary node. Replica nodes use asynchronous replication to stay synchronized with the primary node. Applications can read from any node in the cluster but can write only to primary nodes. Read replicas enhance scalability by spreading reads across multiple endpoints. Read replicas also improve fault tolerance by maintaining multiple copies of the data. Locating read replicas in multiple Availability Zones further improves fault tolerance.
So, when the primary node is replaced, one of the read-only replicas is promoted to primary. Because replication is asynchronous, the new primary holds most, though not necessarily all, of the previous data, and application downtime is drastically reduced. A cluster with only one node and no read-only replicas has nothing to promote, so replacing that node wipes the data completely.

Q: What’s Redis Cluster Bus port?

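The Redis Cluster bus port is a second service port of the Redis server. Every node in a Redis cluster needs two TCP ports open. One serves clients as usual. The other, the cluster bus, is used for node failure detection, configuration updates, and failover authorization.

!!! Originally the Redis cluster bus port was always service port + 10000, which restricted the service port to a range (1150 to 55535). Redis 7 removed this restriction: the cluster bus port can now be configured separately. Also note that the ElastiCache Redis cluster bus port is fixed (it cannot be changed), which differs from open-source Redis.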

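New in Redis 7: cluster-port

cluster-port: lets users define the port that the cluster bus binds to.
Previously the Redis cluster communication (bus) port always defaulted to port + 10000; with this setting it can be set to a specific port instead. ElastiCache Redis cluster port is 1122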
$ cat redis-7.0.4/redis.conf | grep cluster-port
# cluster-port 0

Redis Cluster TCP ports

Every Redis Cluster node requires two TCP connections open. The normal Redis TCP port used to serve clients, for example 6379, plus the second port named cluster bus port. The cluster bus port will be derived by adding 10000 to the data port, 16379 in this example, or by overriding it with the cluster-port config.
This second high port is used for the Cluster bus, that is a node-to-node communication channel using a binary protocol. The Cluster bus is used by nodes for failure detection, configuration update, failover authorization and so forth. Clients should never try to communicate with the cluster bus port, but always with the normal Redis command port, however make sure you open both ports in your firewall, otherwise Redis cluster nodes will be not able to communicate.
Note that for a Redis Cluster to work properly you need, for each node:
The normal client communication port (usually 6379) used to communicate with clients to be open to all the clients that need to reach the cluster, plus all the other cluster nodes (that use the client port for keys migrations).
The cluster bus port must be reachable from all the other cluster nodes.
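The port rule described above can be written down as a small helper. This is a sketch, not Redis code: `cluster_bus_port` is a hypothetical name, and it assumes that a `cluster-port` value of 0 (the commented-out default shown in redis.conf) means "derive the bus port from the data port".

```python
def cluster_bus_port(port: int, cluster_port: int = 0) -> int:
    """Return the cluster bus port for a node.

    cluster_port == 0 is treated as "not overridden": the bus port is then
    the data port + 10000, which must still fit in the TCP port range.
    """
    if cluster_port:
        return cluster_port
    bus_port = port + 10000
    if bus_port > 65535:
        raise ValueError("port + 10000 exceeds 65535; set cluster-port explicitly")
    return bus_port


if __name__ == "__main__":
    print(cluster_bus_port(6379))        # 16379
    print(cluster_bus_port(6379, 7000))  # 7000
```

Both the value returned here and the client port need to be opened in the firewall between cluster nodes.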

Cluster bus — Failure detection

■ Two flags are used for failure detection: PFAIL and FAIL.
■ PFAIL means "possible failure" and is a non-acknowledged failure type. A node flags a peer as PFAIL once the peer has been unreachable for longer than cluster-node-timeout.
■ FAIL means that a node is failing and that this condition was confirmed by a majority of primaries within a fixed amount of time (estimated here as cluster-node-timeout / 2).
■ redis.conf: cluster-node-timeout 15000 (15 seconds). Cluster node timeout is the number of milliseconds a node must be unreachable for it to be considered in failure state.
■ Failover time ~= PFAIL (cluster-node-timeout = 15 seconds) + FAIL (cluster-node-timeout / 2 = 7.5 seconds) + 1 second ~= 23.5 seconds (max).
■ If more than half of the primary nodes find that their health-check communication with a given primary has exceeded the configured cluster-node-timeout, that primary is judged to be down.
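To see how the estimate scales with cluster-node-timeout, the rule of thumb above can be coded directly. This reproduces the article's rough upper bound only; the function name and the 1-second election allowance are taken from the estimate above, not from the Redis specification.

```python
def estimated_failover_ms(cluster_node_timeout_ms: int = 15000,
                          election_overhead_ms: int = 1000) -> float:
    """Rough worst-case failover time, following the estimate above."""
    pfail_ms = cluster_node_timeout_ms      # time until the node is flagged PFAIL
    fail_ms = cluster_node_timeout_ms / 2   # time allowed for FAIL confirmation
    return pfail_ms + fail_ms + election_overhead_ms


if __name__ == "__main__":
    print(estimated_failover_ms())      # 23500.0
    print(estimated_failover_ms(5000))  # 8500.0
```

Lowering cluster-node-timeout shortens the estimate proportionally, at the cost of being more sensitive to brief network hiccups.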

[+] redis.conf:
https://raw.githubusercontent.com/redis/redis/3.2/redis.conf
---
# Cluster node timeout is the amount of milliseconds a node must be unreachable
# for it to be considered in failure state.
# Most other internal time limits are multiple of the node timeout.
# 
# cluster-node-timeout 15000
---
* The cluster in this example has only 2 shards.
If a primary node fails, the ElastiCache Redis backend triggers a recovery task. If the node still cannot respond to health checks, the backend triggers a primary failover and a node replacement.
redis.log:
---
5558:S 05 Sep 11:27:16.964 # Cluster state changed: fail
5558:S 05 Sep 11:27:50.428 # Taking over the master (user request).
5558:S 05 Sep 11:27:50.428 # New configEpoch set to 5
5558:M 05 Sep 11:27:50.428 # Setting secondary replication ID to 89452ec5b33d5c2797156f8cff4fbc910fa45e0c, valid up to offset: 3218926459121. New replication ID is c68830cde4f8734565525b5969c3c762a09b5770
5558:M 05 Sep 11:27:50.428 # Connection with master lost.
5558:M 05 Sep 11:27:50.428 * Caching the disconnected master state.
5558:M 05 Sep 11:27:50.428 * Discarding previously cached master state.
5558:M 05 Sep 11:27:50.429 * FAIL message received from 3bb484f80349ca460c809c41168ca86c9b8a9e0f about b94a6155ae4da87b288e3870b9653c8dc52b3dd2
5558:M 05 Sep 11:27:55.448 # Cluster state changed: ok
5558:M 05 Sep 11:33:35.909 * Slave 172.16.0.237:6379 asks for synchronization
---

Primary Election and promotion on Replicas

When a primary fails, election and promotion are handled by its replica nodes, with the help of the other primary nodes, which vote for the replica to promote.
■ Each replica broadcasts a FAILOVER_AUTH_REQUEST packet to every primary node of the cluster.
■ Once a primary has voted for a given replica by replying positively with a FAILOVER_AUTH_ACK, it can no longer vote for another replica of the same primary for a period of time.
■ Once a replica receives ACKs from the majority of primaries, it wins the election.
■ Note that a replica waits a short period of time before trying to get elected. The replica with the most up-to-date replication offset waits for the shortest time, so it is the most likely to win the election.
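Two numbers drive the election described above: how many ACKs count as a majority, and how long each replica waits before asking for votes. The sketch below uses the delay formula given in the Redis Cluster specification (500 ms + a random 0 to 500 ms + replica rank × 1000 ms, where rank 0 is the replica with the most up-to-date offset). The function names are hypothetical, and the random jitter is passed in as a parameter so the example stays deterministic.

```python
def votes_needed(num_primaries: int) -> int:
    """ACKs a replica needs: a strict majority of all primaries."""
    return num_primaries // 2 + 1


def election_delay_ms(replica_rank: int, jitter_ms: int = 0) -> int:
    """Delay before a replica requests votes.

    replica_rank 0 is the replica with the most up-to-date replication
    offset; jitter_ms stands in for the random 0..500 ms component.
    """
    if not 0 <= jitter_ms <= 500:
        raise ValueError("jitter_ms must be between 0 and 500")
    return 500 + jitter_ms + replica_rank * 1000


if __name__ == "__main__":
    print(votes_needed(3))       # 2
    print(election_delay_ms(0))  # 500
    print(election_delay_ms(1))  # 1500
```

Because each rank adds a full second, the best-synchronized replica normally starts its election before any other replica, even with the worst-case jitter.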

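Q. How to reduce the connection overhead of redirections?

Gather the slot mapping on the client side. The application should first build its own key-to-slot-to-node mapping, so that each command goes straight to the node that owns the slot instead of being redirected.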
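A client-side slot map can be as simple as a sorted list of slot ranges. The sketch below reuses the three-node layout from the data sharding section (Node A, B and C) as static sample data; in a real client the ranges would come from CLUSTER SLOTS or CLUSTER NODES and be refreshed after a MOVED reply. The names `SLOT_RANGES` and `node_for_slot` are made up for this example.

```python
import bisect

# (first_slot, last_slot, node) sorted by first_slot; sample layout only.
SLOT_RANGES = [
    (0, 5500, "Node A"),
    (5501, 11000, "Node B"),
    (11001, 16383, "Node C"),
]
_RANGE_STARTS = [start for start, _, _ in SLOT_RANGES]


def node_for_slot(slot: int) -> str:
    """Find the node whose slot range contains the given slot."""
    if not 0 <= slot <= 16383:
        raise ValueError("slot must be between 0 and 16383")
    index = bisect.bisect_right(_RANGE_STARTS, slot) - 1
    start, end, node = SLOT_RANGES[index]
    if slot > end:
        raise LookupError(f"slot {slot} is not assigned to any node")
    return node


if __name__ == "__main__":
    print(node_for_slot(12182))  # Node C
```

With the map cached, the client picks the right connection before sending the command, so a redirection only happens when the cached map is stale.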

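Q. How can a client read from replica nodes in a cluster-mode-enabled cluster?

Use the ‘READONLY’ command. On a replica node with cluster mode enabled, the client must issue READONLY on its connection to that replica to enable reads there; otherwise it receives a MOVED error (redirection).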

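Q. “CROSSSLOT Keys in request don't hash to the same slot” error

Use hash tags or the redis-py-cluster client. To implement multi-key operations in a cluster, the keys must hash to the same hash slot. When a multi-key operation touches keys that are spread across several slots, Redis returns the error “CROSSSLOT Keys in request don't hash to the same slot”. Wrapping the shared part of each key in a hash tag ”{…}” solves this, because only the text inside the braces is hashed.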
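The hash tag rule can be checked without a cluster. This sketch follows the rule in the Redis Cluster specification: if the key contains a '{' followed later by a '}' with at least one character in between, only that substring is hashed. It uses `binascii.crc_hqx` from the Python standard library with an initial value of 0 for the CRC16; the helper names and the sample keys are illustrative.

```python
from binascii import crc_hqx


def hash_tag_part(key: str) -> str:
    """Return the part of the key that is actually hashed."""
    start = key.find("{")
    if start != -1:
        end = key.find("}", start + 1)
        # An empty tag "{}" does not count; the whole key is hashed instead.
        if end != -1 and end != start + 1:
            return key[start + 1:end]
    return key


def key_slot(key: str) -> int:
    """Hash slot of a key, honouring hash tags."""
    return crc_hqx(hash_tag_part(key).encode("utf-8"), 0) % 16384


if __name__ == "__main__":
    # Both keys share the tag "user1000", so they land in the same slot
    # and can be used together in one multi-key command.
    print(key_slot("{user1000}.following") == key_slot("{user1000}.followers"))  # True
```

Keep tags meaningful: putting the same tag on every key would send the whole data set to a single slot and defeat the sharding.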

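Q: How do I get the cluster topology of a Redis cluster?

Your application can use the configuration endpoint to obtain the IP addresses of all nodes in the Redis cluster, connect to any one of them, and run the > cluster nodes command to retrieve the cluster topology (node primary/replica roles, IP addresses, slot assignments, and so on). It then opens connections to all primary and replica nodes, and the application side (the Redis client library) decides which node each record is written to or read from.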

$ redis-cli -c -h xxxx 
> cluster nodes
xxx 127.0.0.1:30004@31004 slave xxx 0 1426238317239 4 connected
xxx 127.0.0.1:30002@31002 master - 0 1426238316232 2 connected 5461-10922
xxx 127.0.0.1:30003@31003 master - 0 1426238318243 3 connected 10923-16383
xxx 127.0.0.1:30005@31005 slave xxx 0 1426238316232 5 connected
xxx 127.0.0.1:30006@31006 slave xxx 0 1426238317741 6 connected
xxx 127.0.0.1:30001@31001 myself,master - 0 0 1 connected 0-5460

[+] cluster-nodes:
https://redis.io/commands/cluster-nodes
---
Each node in a Redis Cluster has its view of the current cluster configuration, given by the set of known nodes, the state of the connection we have with such nodes, their flags, properties and assigned slots, and so forth.
---
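A client library turns that text output into a topology table. The parser below is a simplified sketch, assuming the line layout shown in the sample output above (id, ip:port@cport, flags, primary id, ping-sent, pong-recv, config-epoch, link-state, then slot ranges); `parse_cluster_nodes` is a hypothetical helper, and slots that are being migrated or imported (tokens starting with '[') are skipped.

```python
def parse_cluster_nodes(text: str) -> list:
    """Parse CLUSTER NODES output into a list of dicts, one per node."""
    nodes = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 8:
            continue  # not a node line
        node_id, addr, flags, primary_id = fields[:4]
        host_port, _, bus = addr.partition("@")
        host, _, port = host_port.rpartition(":")
        bus = bus.split(",")[0]  # newer versions may append ",hostname"
        slots = []
        for token in fields[8:]:
            if token.startswith("["):
                continue  # slot in migrating/importing state
            start, _, end = token.partition("-")
            slots.append((int(start), int(end or start)))
        nodes.append({
            "id": node_id,
            "host": host,
            "port": int(port),
            "bus_port": int(bus) if bus else None,
            "role": "primary" if "master" in flags.split(",") else "replica",
            "primary_id": None if primary_id == "-" else primary_id,
            "connected": fields[7] == "connected",
            "slots": slots,
        })
    return nodes


SAMPLE = """\
xxx 127.0.0.1:30004@31004 slave xxx 0 1426238317239 4 connected
xxx 127.0.0.1:30002@31002 master - 0 1426238316232 2 connected 5461-10922
xxx 127.0.0.1:30001@31001 myself,master - 0 0 1 connected 0-5460
"""

if __name__ == "__main__":
    for node in parse_cluster_nodes(SAMPLE):
        print(node["host"], node["port"], node["role"], node["slots"])
```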

Q: Why do I get the error ”(error) ERR SELECT is not allowed in cluster mode” when switching databases?

Redis Cluster Specification — Redis:
https://redis.io/topics/cluster-spec
---
Redis Cluster does not support multiple databases like the stand alone version of Redis. There is just database 0 and the SELECT command is not allowed.
---
You cannot select another database in cluster mode (cluster mode enabled). You will get the error ”(error) ERR SELECT is not allowed in cluster mode”, and this is expected behavior.

$ redis-cli -c -h xxx.clustercfg.usw2.cache.amazonaws.com
xxx.clustercfg.usw2.cache.amazonaws.com:6379> set key1 db1
-> Redirected to slot [9189] located at 10.0.101.236:6379
OK
10.0.101.236:6379> select 1
(error) ERR SELECT is not allowed in cluster mode
10.0.101.236:6379> select 0
OK

### Check with info Keyspace ###
10.0.100.151:6379> info Keyspace
# Keyspace
db0:keys=2,expires=0,avg_ttl=0

Managing Databases
> select 1
(error) ERR SELECT is not allowed in cluster mode

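Q: How do I convert Redis (cluster mode disabled) to Redis (cluster mode enabled)?

Normally we ask customers to do this with backup/restore. However, if the customer's Redis contains multiple databases, creating the cluster from the backup fails, because Redis in cluster mode supports only one database.
!!! This is because open-source Redis does not support multiple databases in cluster mode.
Solutions:
1. Edit the RDB file to merge the multiple databases into one.
2. Use the Redis [move](https://redis.io/commands/move) command to transfer keys between databases. (Limitation: When key already exists in the destination database, or it does not exist in the source database, it does nothing.)
The example below uses a Lua script to automate moving the keys. The idea is to scan the source database (sdb) with SCAN and move each key to the destination database (ddb) with MOVE.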

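# cat tt_sdb_ddb.lua
-- Move every key from the source database (sdb) to the destination database (ddb).
local cursor = '0'
local conflict = {}  -- keys that MOVE left in place (e.g. already present in ddb)
local moved = {}     -- keys that were moved successfully
local sdb = '1'
local ddb = '0'

-- SCAN is non-deterministic, so enable effects replication before any write.
redis.replicate_commands()
redis.call('SELECT', sdb)
repeat
  local result = redis.call('SCAN', cursor)
  cursor = result[1]
  for _, key in ipairs(result[2]) do
    -- MOVE returns 1 if the key was moved and 0 if it was not.
    if redis.call('MOVE', key, ddb) == 1 then
      moved[#moved + 1] = key
    else
      conflict[#conflict + 1] = key
    end
  end
until cursor == '0'
-- Report how many keys were moved and how many were left behind.
return {#moved, #conflict}

Run this Lua script with --eval: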
# redis-cli -c -h xxx.apne1.cache.amazonaws.com --eval tt_sdb_ddb.lua 0