Akemi

Redis ASK Mechanism and the cluster-node-timeout Parameter

2024/09/24

Special mechanism: ASK routing

Error demonstration

The Redis cluster used here:
192.168.10.116 redis1
192.168.10.117 redis2
192.168.10.118 redis3
Port 8001: master
Port 8002: replica

Test script:
#!/bin/bash
# Redis port
PORT=8001
# Loop from 1 to 1002
for (( i=1; i<=1002; i++ ))
do
    # Build the key and the value
    KEY="k_$i"
    VALUE="v_$i"
    # Set the key-value pair with redis-cli (no -c, so redirections are not followed)
    redis-cli -p "$PORT" set "$KEY" "$VALUE"
done

Running this script produces errors:
(error) MOVED 6468 192.168.10.117:8001
(error) MOVED 10535 192.168.10.117:8001
(error) MOVED 14598 192.168.10.118:8001
OK
(error) MOVED 6592 192.168.10.117:8001
(error) MOVED 10659 192.168.10.117:8001
(error) MOVED 14722 192.168.10.118:8001
OK

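These errors occur because every key hashes to a slot, and redis-cli was connected to an instance that does not serve that slot. The instance therefore rejects the write and replies with MOVED, giving the slot number and the address of the node that owns it.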
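
To make the slot numbers in the MOVED replies less opaque, here is a minimal sketch of how a key is mapped to a slot: CRC16 of the key modulo 16384, with hash tags honoured. The function name `key_slot` is made up for this illustration; `binascii.crc_hqx` with an initial value of 0 is used on the assumption that it matches the CRC16 variant (XMODEM) that Redis Cluster specifies.

```python
import binascii

HASH_SLOTS = 16384  # fixed number of hash slots in Redis Cluster


def key_slot(key: str) -> int:
    """Return the hash slot of a key: CRC16(key) mod 16384."""
    data = key.encode("utf-8")
    # Hash tags: if the key contains "{...}" with a non-empty body,
    # only the text between the first "{" and the next "}" is hashed.
    start = data.find(b"{")
    if start != -1:
        end = data.find(b"}", start + 1)
        if end > start + 1:
            data = data[start + 1:end]
    # crc_hqx with initial value 0 is CRC-16/XMODEM (polynomial 0x1021).
    return binascii.crc_hqx(data, 0) % HASH_SLOTS


if __name__ == "__main__":
    # Print the slots of the first few keys written by the test script.
    for i in (1, 2, 3):
        key = f"k_{i}"
        print(key, key_slot(key))
```

On a live cluster the result can be cross-checked with `CLUSTER KEYSLOT <key>`.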

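Solution: redirection in cluster mode (ASK/MOVED routing)

Adding -c to the redis-cli arguments enables cluster mode. redis-cli then follows MOVED and ASK redirections automatically and sends the write to the instance that owns the target slot. MOVED reports the permanent owner of a slot, as in the errors above; ASK is the temporary redirection used while a slot is being migrated.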

redis-cli -c -p $PORT set "$KEY" "$VALUE"
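
What -c automates can be sketched roughly as follows: parse the redirection reply, then resend the command to the node it names. This is only an illustration, not how redis-cli is implemented. `Redirect`, `parse_redirect` and `commands_to_resend` are hypothetical names, and the ASK line in the usage part is an invented example, not output from this cluster.

```python
import re
from typing import List, NamedTuple, Optional


class Redirect(NamedTuple):
    kind: str  # "MOVED" or "ASK"
    slot: int
    host: str
    port: int


# Accepts the raw reply ("-MOVED ...") as well as the redis-cli
# rendering ("(error) MOVED ...").
_REDIRECT_RE = re.compile(r"^(?:\(error\)\s*)?-?(MOVED|ASK) (\d+) (\S+):(\d+)$")


def parse_redirect(line: str) -> Optional[Redirect]:
    """Return the redirection described by a reply line, or None."""
    match = _REDIRECT_RE.match(line.strip())
    if match is None:
        return None
    kind, slot, host, port = match.groups()
    return Redirect(kind, int(slot), host, int(port))


def commands_to_resend(redirect: Redirect, command: str) -> List[str]:
    """Commands a client sends to the node named in the redirection."""
    if redirect.kind == "ASK":
        # ASK applies to the next query only and must be preceded by ASKING.
        return ["ASKING", command]
    # MOVED: the slot lives on the other node; just resend there
    # (a real client would also refresh its slot map).
    return [command]


if __name__ == "__main__":
    r = parse_redirect("(error) MOVED 6468 192.168.10.117:8001")
    print(r, commands_to_resend(r, "SET k_1 v_1"))
    a = parse_redirect("-ASK 3999 127.0.0.1:6381")  # invented example
    print(a, commands_to_resend(a, "GET x"))
```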

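Parameter: cluster-node-timeout

cluster-node-timeout 5000 (the unit is milliseconds)
Failover timing: a master that has been unresponsive for more than 5000 ms is flagged as possibly failed (PFAIL, shown as fail? in cluster nodes). Once a majority of masters agree, it is marked FAIL and its replica can be promoted.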

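Scenario: a second master fails during the waiting period

If the timeout is set too long, the following can happen:

The first master stops responding, and a second master fails while the cluster is still waiting out cluster-node-timeout. With two of the three masters unreachable, the remaining master cannot form a majority, so neither failure can be confirmed and no replica is promoted.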
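
Why the second failure is the critical one can be shown with a little arithmetic. The sketch below assumes the usual rule that failure confirmation and replica promotion both need agreement from a majority of masters; the function names are invented for this example.

```python
def majority(total_masters: int) -> int:
    """Smallest number of masters that forms a majority."""
    return total_masters // 2 + 1


def can_fail_over(total_masters: int, reachable_masters: int) -> bool:
    """True if enough masters remain to confirm a failure and vote."""
    return reachable_masters >= majority(total_masters)


if __name__ == "__main__":
    # Three masters, as in this cluster.
    print(can_fail_over(3, 2))  # one master down: True
    print(can_fail_over(3, 1))  # two masters down: False
```

With one master down, the two survivors still form a majority and failover proceeds. With two down, the single survivor can only suspect the others.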

The cluster state is then as follows:

cluster nodes
c35ba3d5788e974ad6830f5dbdeaa5fa0303bd2e 192.168.10.118:8002@18002 slave 68d676f1a14073941cfeb46e88f281bc85f61c1d 0 1727107349000 11 connected
c37cd7a35beb35c7cf648d54c88f58514fd8dbdd 192.168.10.116:8001@18001 master,fail? - 1727107328357 1727107327000 9 disconnected 0-5460
68d676f1a14073941cfeb46e88f281bc85f61c1d 192.168.10.117:8001@18001 master,fail? - 1727107328860 1727107328760 11 disconnected 6826-12287
68c3eb97a6a9d531f83690af8f780f4ed2af73c4 192.168.10.118:8001@18001 myself,master - 0 1727107349000 10 connected 5461-6825 12288-16383
8324825e7ecf81cf4ae4bf2604b022b8a8f2e068 192.168.10.116:8002@18002 slave 68c3eb97a6a9d531f83690af8f780f4ed2af73c4 0 1727107349347 10 connected
8091088d4c52b74d29185729ff820f1e821bdb30 192.168.10.117:8002@18002 slave c37cd7a35beb35c7cf648d54c88f58514fd8dbdd 0 1727107350354 9 connected

cluster info
cluster_state:fail
cluster_slots_assigned:16384
cluster_slots_ok:5461
cluster_slots_pfail:10923
cluster_slots_fail:0
cluster_known_nodes:6
cluster_size:3
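
The two outputs above can be reconciled with a short script. The sketch below parses the cluster nodes text, sums the slot ranges of each master, and groups them by failure flag. It assumes the standard field layout (flags in the third field, slot ranges from the ninth field on); `count_slots` and `slots_by_state` are names chosen for this example.

```python
NODES_OUTPUT = """\
c35ba3d5788e974ad6830f5dbdeaa5fa0303bd2e 192.168.10.118:8002@18002 slave 68d676f1a14073941cfeb46e88f281bc85f61c1d 0 1727107349000 11 connected
c37cd7a35beb35c7cf648d54c88f58514fd8dbdd 192.168.10.116:8001@18001 master,fail? - 1727107328357 1727107327000 9 disconnected 0-5460
68d676f1a14073941cfeb46e88f281bc85f61c1d 192.168.10.117:8001@18001 master,fail? - 1727107328860 1727107328760 11 disconnected 6826-12287
68c3eb97a6a9d531f83690af8f780f4ed2af73c4 192.168.10.118:8001@18001 myself,master - 0 1727107349000 10 connected 5461-6825 12288-16383
8324825e7ecf81cf4ae4bf2604b022b8a8f2e068 192.168.10.116:8002@18002 slave 68c3eb97a6a9d531f83690af8f780f4ed2af73c4 0 1727107349347 10 connected
8091088d4c52b74d29185729ff820f1e821bdb30 192.168.10.117:8002@18002 slave c37cd7a35beb35c7cf648d54c88f58514fd8dbdd 0 1727107350354 9 connected
"""


def count_slots(slot_fields):
    """Count the slots in fields such as '0-5460' or a single '42'."""
    total = 0
    for item in slot_fields:
        if item.startswith("["):
            continue  # skip migrating/importing markers
        start, _, end = item.partition("-")
        total += int(end or start) - int(start) + 1
    return total


def slots_by_state(nodes_output):
    """Sum the slots owned by masters, grouped by failure flag."""
    counts = {"ok": 0, "pfail": 0, "fail": 0}
    for line in nodes_output.splitlines():
        parts = line.split()
        if len(parts) < 8:
            continue
        flags = parts[2].split(",")
        if "master" not in flags:
            continue
        slots = count_slots(parts[8:])
        if "fail?" in flags:
            counts["pfail"] += slots
        elif "fail" in flags:
            counts["fail"] += slots
        else:
            counts["ok"] += slots
    return counts


if __name__ == "__main__":
    print(slots_by_state(NODES_OUTPUT))
```

For the output above this gives 5461 healthy slots and 10923 slots in the pfail state, which agrees with cluster_slots_ok and cluster_slots_pfail.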

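At this point no data has been lost.

Fix: restart the unresponsive Redis processes directly. The masters then rejoin the cluster: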

cluster nodes
c35ba3d5788e974ad6830f5dbdeaa5fa0303bd2e 192.168.10.118:8002@18002 slave 68d676f1a14073941cfeb46e88f281bc85f61c1d 0 1727107543178 11 connected
c37cd7a35beb35c7cf648d54c88f58514fd8dbdd 192.168.10.116:8001@18001 master - 0 1727107544181 9 connected 0-5460
68d676f1a14073941cfeb46e88f281bc85f61c1d 192.168.10.117:8001@18001 master - 0 1727107544081 11 connected 6826-12287
68c3eb97a6a9d531f83690af8f780f4ed2af73c4 192.168.10.118:8001@18001 myself,master - 0 1727107542000 10 connected 5461-6825 12288-16383
8324825e7ecf81cf4ae4bf2604b022b8a8f2e068 192.168.10.116:8002@18002 slave 68c3eb97a6a9d531f83690af8f780f4ed2af73c4 0 1727107543580 10 connected
8091088d4c52b74d29185729ff820f1e821bdb30 192.168.10.117:8002@18002 slave c37cd7a35beb35c7cf648d54c88f58514fd8dbdd 0 1727107543679 9 connected

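Scenario: cross-datacenter / cross-region communication

Here network transfer time becomes an additional variable.

The cluster-node-timeout value should be chosen after load-testing Redis, taking both the transfer time and the volume of data into account.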
