Why do Redis clusters experience failover and node replacement?

When a Redis node instance fails, or when a scheduled event occurs in the ElastiCache Redis maintenance window, the ElastiCache Redis backend system triggers a node replacement to replace the affected node. How long Redis clients are affected when a node fails or must be replaced for maintenance depends on the situation. This article explains how each ElastiCache Redis architecture behaves and suggests mitigations that shorten the client impact.

30 min read, Sep 9, 2022

The sections below cover a single node, cluster mode disabled with one primary and multiple replicas (replication mode), and cluster mode enabled with multiple primaries and multiple replicas, and describe the impact on the Redis client in each situation.

Redis cluster types:
1) Single node
2) Cluster mode disabled, one primary with multiple replicas (replication mode)
3) Cluster mode enabled, multiple primaries with multiple replicas

Reasons for node replacement:
1) Scheduled maintenance (a scheduled event in the ElastiCache Redis maintenance window).
2) Underlying hardware failure that causes the node instance to be replaced.

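Ranked by how long clients are affected by a node replacement:

Single node > cluster mode disabled (replication mode) > cluster mode enabled
!!! When a single node instance fails, clients are affected the longest. The cost of managing client connections runs the other way: cluster mode enabled has the highest management cost and client code complexity.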

For the node replacement steps, see the official ElastiCache Redis documentation:

[+] Replacing nodes:
https://docs.aws.amazon.com/AmazonElastiCache/latest/red-ug/CacheNodes.NodeReplacement.html

“Type 1: Single node”: what happens when the node fails.

1) Scheduled maintenance (a scheduled event in the ElastiCache Redis maintenance window).

When a single node is replaced for maintenance (a scheduled event), ElastiCache Redis removes the old node from service, so it stops accepting new requests, and syncs its data to the new node. The new node is then brought online to handle requests and the old node is terminated. Clients are unavailable for a relatively long time during this process, but in principle the data on Redis is not lost.

[+] Replacing nodes — Amazon ElastiCache for Redis:
https://docs.aws.amazon.com/AmazonElastiCache/latest/red-ug/CacheNodes.NodeReplacement.html#ReplaceStandalone

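2) Underlying hardware failure that causes the node instance to be replaced.

ElastiCache Redis launches a new instance that directly replaces the old one. The data on that Redis node is lost, and it is "not" restored from the most recent backup.

Redis client impact time:

Clients are affected for as long as the node instance takes to recover. Because recovery includes creating a new EC2 instance, the impact usually lasts more than 5 minutes.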

Mitigation:

Add at least one replica node to your Redis cluster and enable Multi-AZ. When the failure above occurs, the ElastiCache Redis backend system first performs a failover and updates the DNS record of the primary endpoint, then replaces the faulty node instance. Once the new node instance has recovered, it syncs data from the current primary node.

The AWS documentation states that a cluster with Multi-AZ and automatic failover enabled minimizes service downtime, so this architecture is strongly recommended to increase the availability of your service.

[+] Minimizing downtime in ElastiCache for Redis with Multi-AZ:
https://docs.aws.amazon.com/AmazonElastiCache/latest/red-ug/AutoFailover.html

[+] Adding a read replica, for Redis (Cluster Mode Disabled) replication groups:
https://docs.aws.amazon.com/AmazonElastiCache/latest/red-ug/Replication.AddReadReplica.html

As your read traffic increases, you might want to spread those reads across more nodes and reduce the read pressure on any one node. In this topic, you can find how to add a read replica to a Redis (cluster mode disabled) cluster.

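“Type 2: Cluster mode disabled, one primary with multiple replicas (replication mode)”: what happens when a node fails.

1) Scheduled maintenance (a scheduled event in the ElastiCache Redis maintenance window).

Two cases are possible. If the primary node is replaced for maintenance (a scheduled event), a failover is performed first and the DNS record of the endpoint is updated, then the node instance is replaced. Once the new node instance has recovered, it syncs data from the current primary node.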

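Redis client impact time:

Clients are affected for as long as it takes them, after the failover, to re-resolve the DNS record of the primary endpoint and re-establish their connections.

!!! The TTL of the primary endpoint DNS record is 15 seconds. Adding DNS propagation and client-side DNS cache delays, clients are typically affected for 30 seconds to 1 minute.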

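If a replica node fails, only the faulty node instance is replaced. Once the new node instance has recovered, it syncs data from the current primary node.

Redis client impact time:

Only connections to the read replica nodes are disrupted. If the Redis cluster has multiple read replica nodes, the impact on clients is relatively low.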

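2) Underlying hardware failure that causes the node instance to be replaced.

In this case, whether the failed node is the primary node or a read replica node, the Redis client impact time is the same as above but longer, because it also includes the time the ElastiCache Redis backend system needs to detect the failure. ElastiCache Redis uses CloudWatch metrics together with backend monitoring to decide whether a node instance has failed, so with this detection delay clients are affected for at least 1 to 5 minutes.

!!! Note that this time does not include recovery of the failed node. Once the primary role has moved to a healthy node, clients can operate on the Redis cluster normally while the failed node is still recovering. During recovery, the ElastiCache backend system launches a new EC2 instance for the node, waits for its operating system to come up, starts the Redis service (with no data on it yet), and then syncs data from the current primary node. Because this involves creating an EC2 instance, it takes at least 3 to 5 minutes.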

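Mitigation:

Hardware failures are very hard to prevent or avoid. Giving the Redis client appropriate timeout and retry mechanisms, so that it re-resolves the DNS record of the endpoint and re-establishes the connection, effectively shortens the impact.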
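The timeout and retry advice above can be sketched roughly as follows. This is a minimal, standard-library-only illustration rather than any particular Redis client's API: `connect_with_retry`, `resolve`, and `connect` are hypothetical names, and the attempt count, delays, and timeout are assumed values to tune. The key idea is that every attempt asks the resolver for the endpoint address again instead of reusing a stored IP address (OS or runtime DNS caches may still apply).

```python
import socket
import time


def resolve(hostname, port):
    """Look up the endpoint name again instead of reusing a stored IP."""
    infos = socket.getaddrinfo(hostname, port, type=socket.SOCK_STREAM)
    return infos[0][4][0]


def connect(ip, port, timeout):
    """Open a TCP connection with a bounded connect timeout."""
    return socket.create_connection((ip, port), timeout=timeout)


def connect_with_retry(hostname, port, attempts=5, base_delay=0.5,
                       timeout=2.0, resolver=resolve, connector=connect,
                       sleep=time.sleep):
    """Retry with exponential backoff, re-resolving DNS on every attempt."""
    last_error = None
    for attempt in range(attempts):
        try:
            ip = resolver(hostname, port)
            return connector(ip, port, timeout)
        except OSError as exc:  # timeouts, refused connections, DNS errors
            last_error = exc
            if attempt < attempts - 1:
                sleep(base_delay * (2 ** attempt))  # 0.5s, 1s, 2s, ...
    raise ConnectionError(f"gave up after {attempts} attempts") from last_error
```

In a real application the returned socket would be replaced by whatever connection object your Redis client library creates; the resolver and connector are injectable here only to keep the sketch easy to test.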

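In addition, after a failover the IP address behind the endpoint's DNS record changes, which introduces DNS propagation and client DNS cache delays. To reduce that delay, consider switching to a Redis cluster with cluster mode enabled and adjusting the client configuration accordingly, which reduces the client impact time more effectively.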

Additional notes on the endpoint types and their differences in replication mode (cluster mode disabled):

Both the primary endpoint and the reader endpoint are DNS records. The primary endpoint resolves only to the IP address of the "current primary node", while the reader endpoint resolves to the IP address of one of the current "read replica nodes".

!!! Your Redis client can open separate connections to these two endpoints to implement read/write separation. This helps reduce the load on the primary node and also reduces how much the Redis client is affected.
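As a rough illustration of read/write separation, the sketch below picks an endpoint per command. It is only a sketch under assumptions: `ReadWriteRouter` and `READ_COMMANDS` are hypothetical names, the command set is a small illustrative subset, and the two hostnames are placeholders rather than real ElastiCache endpoints.

```python
# Commands treated as read-only in this sketch (an illustrative subset).
READ_COMMANDS = {"GET", "MGET", "EXISTS", "TTL", "HGET", "HGETALL", "SMEMBERS"}


class ReadWriteRouter:
    """Pick an endpoint per command: reads go to the reader endpoint."""

    def __init__(self, primary_endpoint, reader_endpoint):
        self.primary_endpoint = primary_endpoint
        self.reader_endpoint = reader_endpoint

    def endpoint_for(self, command):
        name = command.split()[0].upper()
        if name in READ_COMMANDS:
            return self.reader_endpoint
        return self.primary_endpoint  # writes and unknown commands


# Placeholder hostnames, not real endpoints.
router = ReadWriteRouter("primary.example.internal", "reader.example.internal")
print(router.endpoint_for("GET user:1"))    # reader.example.internal
print(router.endpoint_for("SET user:1 x"))  # primary.example.internal
```

Sending unknown commands to the primary endpoint is a deliberately conservative choice, since a write sent to a replica would be rejected.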

[+] Finding connection endpoints:
https://docs.aws.amazon.com/AmazonElastiCache/latest/red-ug/Endpoints.html

Primary Endpoint:
The primary endpoint is a DNS name that always resolves to the primary node in the cluster.
Reader Endpoint:
A reader endpoint will evenly split incoming connections to the endpoint between all read replicas in a ElastiCache for Redis cluster.

Your Redis client should therefore be configured in code with the DNS name of the primary endpoint or the reader endpoint. After a failover, the client will see timeouts or connection failures. The client code then needs to apply its timeout and retry logic, re-resolve the primary endpoint or the reader endpoint to obtain the IP address of the "current" primary or replica node, and reconnect to the correct node before continuing its operations.

!!! Note that when a node fails, the backend system performs a node replacement, and connections are always dropped during that process. Your Redis client code must therefore implement timeout and retry mechanisms that establish new connections, to lessen the impact.

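“Type 3: Cluster mode enabled, multiple primaries with multiple replicas”: what happens when a node fails.

1) Scheduled maintenance (a scheduled event in the ElastiCache Redis maintenance window).

Two cases are possible. If the primary node is replaced for maintenance (a scheduled event), a failover is performed first and then the node instance is replaced; once the new node instance has recovered, it syncs data from the current primary node. If a replica node fails, only the faulty node instance is replaced; once the new node instance has recovered, it syncs data from the current primary node.

!!! Note that the DNS record of the configuration endpoint is not updated in this case. With cluster mode enabled, the configuration endpoint DNS record resolves to the IP addresses of "all" nodes and does not change because of a failover.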

[+] Configuration Endpoint:
https://docs.aws.amazon.com/AmazonElastiCache/latest/red-ug/Endpoints.html

Redis (cluster mode enabled) clusters, use the cluster’s Configuration Endpoint for all operations that support cluster mode enabled commands. You must use a client that supports Redis Cluster (Redis 3.2). You can still read from individual node endpoints (In the API/CLI these are referred to as Read Endpoints)

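Mitigation:

A typical Redis client SDK first connects to the configuration endpoint: it resolves the DNS record to obtain the IP addresses of all nodes, connects to one of them, and runs the native Redis CLUSTER NODES command to obtain the cluster's node topology. It then connects to each node and issues commands.

!!! The Redis client impact time therefore depends on how often the node topology is refreshed, the timeout, the number of retries, and so on.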

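2) Underlying hardware failure that causes the node instance to be replaced.

!!! In this case the impact time differs depending on whether the failed node is a primary node or a read replica node, so the behavior, the impact, and the suggested mitigation are described separately below.

Primary node failure:

In native Redis Cluster, when a primary node becomes abnormal and keeps failing to respond to the other primary nodes, and more than half of the primary nodes fail their communication checks with it for longer than the configured cluster-node-timeout, that primary node is judged to be down, which triggers a failover (replica election and promotion).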

Failover-time ~= pfail(cluster-node-timeout=15 seconds) + fail(cluster-node-timeout/2=7.5 seconds) + 1000(1 second) ~= 23.5 seconds (max).
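The arithmetic in the formula above can be written out as a small sketch. The function name `estimate_failover_ms` and the parameter name `election_delay_ms` (for the fixed 1000 ms term) are labels chosen here for illustration; the default of 15000 ms comes from the redis.conf excerpt cited in this article, and the second call only shows how the estimate scales with a different, assumed timeout.

```python
def estimate_failover_ms(cluster_node_timeout_ms=15000, election_delay_ms=1000):
    """Upper-bound estimate that follows the formula above."""
    pfail_ms = cluster_node_timeout_ms       # time until the node is flagged PFAIL
    fail_ms = cluster_node_timeout_ms / 2    # time to escalate PFAIL to FAIL
    return pfail_ms + fail_ms + election_delay_ms


print(estimate_failover_ms() / 1000)      # 23.5
print(estimate_failover_ms(5000) / 1000)  # 8.5
```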

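!!! Note that native Redis Cluster needs at least 3 primary nodes so that at least 2 nodes remain to vote on which replica node becomes the new primary node. When your ElastiCache Redis cluster has fewer than three shards, or when two or more primary nodes fail at the same time, the ElastiCache Redis backend mechanism steps in to perform the health checks and voting needed to restore the Redis cluster.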
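The voting requirement can be illustrated with a simplified majority check. This is only a sketch of the majority rule described above, not how Redis or ElastiCache implements it; `majority` and `can_elect_without_help` are hypothetical helper names.

```python
def majority(total_primaries):
    """Smallest number of votes that is more than half of all primaries."""
    return total_primaries // 2 + 1


def can_elect_without_help(total_primaries, failed_primaries):
    """True if the surviving primaries alone still form a majority."""
    surviving = total_primaries - failed_primaries
    return surviving >= majority(total_primaries)


print(can_elect_without_help(3, 1))  # True: 2 of 3 primaries can still vote
print(can_elect_without_help(2, 1))  # False: 1 of 2 is not a majority
print(can_elect_without_help(3, 2))  # False: 1 of 3 is not a majority
```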

[+] redis.conf:
https://raw.githubusercontent.com/redis/redis/3.2/redis.conf

# Cluster node timeout is the amount of milliseconds a node must be unreachable
# for it to be considered in failure state.
# Most other internal time limits are multiple of the node timeout.
# 
# cluster-node-timeout 15000

[+] Redis cluster specification:
https://redis.io/docs/reference/cluster-spec/

Failure detection
Redis Cluster failure detection is used to recognize when a master or replica node is no longer reachable by the majority of nodes and then respond by promoting a replica to the role of master. When replica promotion is not possible the cluster is put in an error state to stop receiving queries from clients.

As already mentioned, every node takes a list of flags associated with other known nodes. There are two flags that are used for failure detection that are called PFAIL and FAIL. PFAIL means Possible failure, and is a non-acknowledged failure type. FAIL means that a node is failing and that this condition was confirmed by a majority of masters within a fixed amount of time.

PFAIL flag:
A node flags another node with the PFAIL flag when the node is not reachable for more than NODE_TIMEOUT time. Both master and replica nodes can flag another node as PFAIL, regardless of its type.

FAIL flag:
The PFAIL flag alone is just local information every node has about other nodes, but it is not sufficient to trigger a replica promotion. For a node to be considered down the PFAIL condition needs to be escalated to a FAIL condition.

[+] Redis cluster specification:
https://redis.io/docs/reference/cluster-spec/

Replica election and promotion
Replica election and promotion is handled by replica nodes, with the help of master nodes that vote for the replica to promote. A replica election happens when a master is in FAIL state from the point of view of at least one of its replicas that has the prerequisites in order to become a master.

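Meanwhile, if the abnormal node keeps failing the ElastiCache Redis health checks, ElastiCache Redis triggers a node replacement and substitutes a new node instance for the faulty one, the same as in an underlying hardware failure. If the node was abnormal only briefly (for example, a short rise in packet loss on the underlying hardware), it changes to the replica role but is not necessarily replaced by the backend.

!!! Note that replacing a node triggers native Redis replication, and the first sync after a new node joins is always a full sync. A full sync consumes primary node resources (CPU load and network bandwidth) and may increase operation latency, so if the original node can still run normally, it is kept as it is.

Mitigation for a primary node failure:

A typical Redis client SDK first connects to the configuration endpoint: it resolves the DNS record to obtain the IP addresses of all nodes, connects to one of them, and runs the native Redis CLUSTER NODES command to obtain the cluster's node topology. It then connects to each node and issues commands.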

[+] CLUSTER NODES
https://redis.io/commands/cluster-nodes/

Each node in a Redis Cluster has its view of the current cluster configuration, given by the set of known nodes, the state of the connection we have with such nodes, their flags, properties and assigned slots, and so forth.

> CLUSTER NODES
xxx 127.0.0.1:30004@31004 slave xxx 0 1426238317239 4 connected
xxx 127.0.0.1:30002@31002 master - 0 1426238316232 2 connected 5461-10922
xxx 127.0.0.1:30003@31003 master - 0 1426238318243 3 connected 10923-16383
xxx 127.0.0.1:30005@31005 slave xxx 0 1426238316232 5 connected
xxx 127.0.0.1:30006@31006 slave xxx 0 1426238317741 6 connected
xxx 127.0.0.1:30001@31001 myself,master - 0 0 1 connected 0-5460
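To make the output above more concrete, here is a minimal sketch of how a client could turn CLUSTER NODES text into a topology list. `parse_cluster_nodes` is a hypothetical helper, not part of any client library; the node IDs `aaa` and `bbb` in the sample are made up, and slots that are being migrated or imported (entries in square brackets) are simply ignored.

```python
def parse_cluster_nodes(text):
    """Parse CLUSTER NODES output into a list of node dictionaries."""
    nodes = []
    for line in text.strip().splitlines():
        fields = line.split()
        if len(fields) < 8:
            continue  # skip lines that do not look like node entries
        address = fields[1].split("@")[0]  # drop the cluster bus port
        host, _, port = address.rpartition(":")
        flags = fields[2].split(",")
        slots = []
        for item in fields[8:]:
            if item.startswith("["):
                continue  # slot being migrated or imported; ignored here
            start, _, end = item.partition("-")
            slots.append((int(start), int(end or start)))
        nodes.append({
            "id": fields[0],
            "host": host,
            "port": int(port),
            "role": "master" if "master" in flags else "replica",
            "master_id": None if fields[3] == "-" else fields[3],
            "connected": fields[7] == "connected",
            "slots": slots,
        })
    return nodes


# Made-up sample in the same shape as the output above.
SAMPLE = """\
aaa 127.0.0.1:30004@31004 slave bbb 0 1426238317239 4 connected
bbb 127.0.0.1:30001@31001 myself,master - 0 0 1 connected 0-5460
"""

for node in parse_cluster_nodes(SAMPLE):
    print(node["role"], node["host"], node["port"], node["slots"])
```

After a failover, the node that used to carry the `master` flag for a slot range shows up as a replica (or as failed), which is why a client has to run this refresh again to find the new primary.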

So when a primary node in a particular shard fails, a failover occurs, and the Redis client impact time depends on how often the node topology is refreshed, the timeout, the number of retries, and so on.

!!! Configuring an appropriate node topology refresh interval, timeout, and retry mechanism therefore effectively reduces the Redis client impact time after a failover.
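One way to picture the refresh interval is a small cache that refetches the topology on a timer and again right after an error. This is a sketch under assumptions: `TopologyCache` is a hypothetical class, `fetch_topology` stands for whatever your client uses to run CLUSTER NODES, and the 10-second interval is an arbitrary example value. Real cluster clients usually expose their own settings for this behavior.

```python
import time


class TopologyCache:
    """Cache the node topology; refresh on a timer or after an error."""

    def __init__(self, fetch_topology, refresh_interval=10.0,
                 clock=time.monotonic):
        self._fetch = fetch_topology       # hypothetical callable, e.g. runs CLUSTER NODES
        self._interval = refresh_interval  # assumed value, tune for your workload
        self._clock = clock
        self._topology = None
        self._fetched_at = None

    def get(self):
        """Return the cached topology, refetching it when it is stale."""
        now = self._clock()
        stale = (self._fetched_at is None
                 or now - self._fetched_at >= self._interval)
        if stale:
            self._topology = self._fetch()
            self._fetched_at = now
        return self._topology

    def invalidate(self):
        """Call after a timeout or connection error to force a refetch."""
        self._fetched_at = None
```

A shorter interval notices a failover sooner at the cost of more CLUSTER NODES calls; calling `invalidate()` on errors lets the interval stay longer without waiting out the full period after a failure.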

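Read replica node failure:

When a read replica node instance fails, the ElastiCache Redis backend system needs time to detect it: ElastiCache Redis uses CloudWatch metrics together with backend monitoring to decide whether the node instance has failed before it replaces the node. A replica node failure therefore affects clients for somewhat longer.

Mitigation for a read replica node failure:

If the client implements read/write separation and a replica node in a particular shard fails, read commands time out or fail. You can give each shard at least 2 read replica nodes and combine that with appropriate timeout and retry mechanisms to lessen the impact of a single replica node failure.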
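A read path that tolerates a single replica failure might look roughly like this. It is a sketch, not a client library feature: `read_with_fallback` is a hypothetical function, and each entry in `replicas` (and `primary`) is assumed to be a callable that reads a key from one node and raises `TimeoutError` or `ConnectionError` on failure. Falling back to the primary as a last resort is an added assumption, in line with the AWS advice quoted later in this article not to direct reads to replicas only.

```python
def read_with_fallback(key, replicas, primary, attempts_per_node=2):
    """Try each replica in turn, then the primary; return the first success."""
    last_error = None
    for reader in list(replicas) + [primary]:
        for _ in range(attempts_per_node):
            try:
                return reader(key)
            except (TimeoutError, ConnectionError) as exc:
                last_error = exc  # remember the failure and keep trying
    raise RuntimeError("all nodes failed") from last_error
```

With two or more replicas in the list, one failed replica only costs the retries spent on it before the next node answers.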

Additional information:

ElastiCache's description of node replacement, its impact, and recommendations.

https://aws.amazon.com/elasticache/elasticache-maintenance/?nc1=h_ls

Q: How long does a node replacement take?
A replacement typically completes within a few seconds. The replacement may take longer in certain instance configurations and traffic patterns. For example, Redis primary nodes may not have enough free memory, and may be experiencing high write traffic. When an empty replica syncs from this primary, the primary node may run out of memory trying to address the incoming writes as well as sync the replica. In that case, the master disconnects the replica and restarts the sync process. It may take multiple attempts for replica to sync successfully. It is also possible that replica may never sync if the incoming write traffic continues to remains high.
Memcached nodes do not need sync, so their replacement completes faster, irrespective of node sizes.

Q: How does a node replacement impact my application?
For Redis nodes, the replacement process is designed to make a best effort to retain your existing data and requires successful Redis replication. For single node Redis clusters, ElastiCache dynamically spins up a replica, replicates the data, and then fails over to it. For replication groups consisting of multiple nodes, ElastiCache replaces the existing replicas and syncs data from the primary to the new replicas. If Multi-AZ with autofailover is enabled, replacing the primary triggers a failover to a read replica. For Redis Cluster configurations that are set up to use Redis Cluster clients, and non-Cluster configurations with auto failover enabled, the planned node replacements complete while the cluster serves incoming write requests. If Multi-AZ is disabled, ElastiCache replaces the primary and then syncs the data from a read replica. The primary node is unavailable during this time, leading to longer write interruption.

Q: What best practices should I follow for a smooth replacement experience and minimize data loss?
For Redis nodes, the replacement process is designed to make a best effort to retain your existing data and requires successful Redis replication. We try to replace just enough nodes from the same cluster at a time to keep the cluster stable. You can provision primary and read replicas in different availability zones. In this case, when a node is replaced, the data will be synced from a peer node in a different availability zone. We also recommend that you upgrade your Redis engine version to 5.0.6 or higher as those engine versions have improved stability and enable your clusters to continuously serve incoming write requests during patching activities if they have auto-failover enabled. Finally, if your configuration includes only one primary and one single replica per shard, we recommend adding additional replicas prior to the patching. This will prevent reduced availability and risk during the patching process. For single node Redis clusters, we recommend that sufficient memory is available to Redis, as described here. For Redis replication groups with multiple nodes, we also recommend scheduling the replacement during a period with low incoming write traffic.

Q: What client configuration best practices should I follow to minimize application interruption during maintenance?
For Redis, Cluster mode configuration has the best availability during managed or unmanaged operations and it is always recommended to use a cluster mode supported client which connects to the cluster discovery endpoint. For cluster mode disabled, it is recommended to always use the primary endpoint for all the write operations. The individual node endpoints of the replica nodes can be used for all the read operations. If auto-failover is enabled in the cluster, primary node may change, therefore, the application should confirm the role of the node and update all the read endpoints to ensure that you aren't causing a major load on the master.
With auto failover disabled, the role of the node will not change, however the downtime in managed or unmanaged operations is higher as compared to clusters with auto failover enabled. Avoid directing read requests to read replicas only. If you configure your client to direct read requests to read replicas only, ensure that you have atleast two read replicas to avoid any read interruption during maintenance.

Written by Jerry’s Notes