
[BUG] Data lost when injecting a network delay fault into a Redis cluster #5107

Closed
ahjing99 opened this issue Sep 13, 2023 · 3 comments · Fixed by #7206
Labels: kind/bug (Something isn't working), Stale
Milestone: Release 0.7.0
ahjing99 (Collaborator) commented:

kbcli version
Kubernetes: v1.27.3-gke.100
KubeBlocks: 0.7.0-alpha.8
kbcli: 0.7.0-alpha.8

Steps:

  1. Inject a network delay fault into the leader pod:
 `kbcli fault network delay --latency=15s -c=100 --jitter=0ms cluster-oqroov-redis-0 --ns-fault=default  --duration=2m`

NetworkChaos network-chaos-65g9m created
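For reference, the kbcli command above should produce a Chaos Mesh NetworkChaos resource roughly like the following (field values inferred from the flags; the exact manifest kbcli generates may differ):

```yaml
# Sketch of the NetworkChaos resource implied by the kbcli flags above.
# --latency -> delay.latency, -c -> delay.correlation, --jitter -> delay.jitter,
# --duration -> spec.duration, --ns-fault -> the target namespace.
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: network-chaos-65g9m
  namespace: default
spec:
  action: delay
  mode: all
  selector:
    namespaces:
      - default
    pods:
      default:
        - cluster-oqroov-redis-0
  delay:
    latency: "15s"
    correlation: "100"
    jitter: "0ms"
  duration: "2m"
```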
  2. The original leader pod cluster-oqroov-redis-0 is only 4/5 ready but can still accept writes, and its role is still primary, so there are now two primary pods:
➜  ~ kbcli cluster connect cluster-oqroov
Connect to instance cluster-oqroov-redis-0: out of cluster-oqroov-redis-0, cluster-oqroov-redis-1
Warning: Using a password with '-a' or '-u' option on the command line interface may not be safe.
127.0.0.1:6379> get mykey
"4"
127.0.0.1:6379> get mykey
"5"
127.0.0.1:6379> get mykey
"5"
127.0.0.1:6379> get mykey
"7"

kbcli cluster describe cluster-oqroov
Name: cluster-oqroov	 Created Time: Sep 13,2023 09:49 UTC+0800
NAMESPACE   CLUSTER-DEFINITION   VERSION       STATUS            TERMINATION-POLICY
default     redis                redis-7.0.6   ConditionsError   WipeOut

Endpoints:
COMPONENT        MODE        INTERNAL                                                        EXTERNAL
redis            ReadWrite   cluster-oqroov-redis.default.svc.cluster.local:6379             <none>
redis-sentinel   ReadWrite   cluster-oqroov-redis-sentinel.default.svc.cluster.local:26379   <none>

Topology:
COMPONENT        INSTANCE                          ROLE      STATUS    AZ              NODE                                                  CREATED-TIME
redis            cluster-oqroov-redis-0            primary   Running   us-central1-c   gke-yjtest-default-pool-47e27321-mvr4/10.128.15.201   Sep 13,2023 10:16 UTC+0800
redis            cluster-oqroov-redis-1            primary   Running   us-central1-c   gke-yjtest-default-pool-47e27321-rbkc/10.128.15.202   Sep 13,2023 09:49 UTC+0800
redis-sentinel   cluster-oqroov-redis-sentinel-0   <none>    Running   us-central1-c   gke-yjtest-default-pool-47e27321-mvr4/10.128.15.201   Sep 13,2023 09:49 UTC+0800
redis-sentinel   cluster-oqroov-redis-sentinel-1   <none>    Running   us-central1-c   gke-yjtest-default-pool-47e27321-h6tl/10.128.15.203   Sep 13,2023 09:50 UTC+0800
redis-sentinel   cluster-oqroov-redis-sentinel-2   <none>    Running   us-central1-c   gke-yjtest-default-pool-47e27321-rbkc/10.128.15.202   Sep 13,2023 09:50 UTC+0800

Resources Allocation:
COMPONENT        DEDICATED   CPU(REQUEST/LIMIT)   MEMORY(REQUEST/LIMIT)   STORAGE-SIZE   STORAGE-CLASS
redis            false       500m / 500m          1Gi / 1Gi               data:5Gi       kb-default-sc
redis-sentinel   false       500m / 500m          1Gi / 1Gi               data:5Gi       kb-default-sc

Images:
COMPONENT        TYPE             IMAGE
redis            redis            registry.cn-hangzhou.aliyuncs.com/apecloud/redis-stack-server:7.0.6-RC8
redis-sentinel   redis-sentinel   registry.cn-hangzhou.aliyuncs.com/apecloud/redis-stack-server:7.0.6-RC8

Data Protection:
AUTO-BACKUP   BACKUP-SCHEDULE   TYPE     BACKUP-TTL   LAST-SCHEDULE   RECOVERABLE-TIME
Disabled      <none>            <none>   7d           <none>          <none>

Show cluster events: kbcli cluster list-events -n default cluster-oqroov
  3. After the fault injection completes and cluster-oqroov-redis-0 recovers to 5/5 ready, its role changes to secondary:
kbcli cluster describe cluster-oqroov
Name: cluster-oqroov	 Created Time: Sep 13,2023 09:49 UTC+0800
NAMESPACE   CLUSTER-DEFINITION   VERSION       STATUS    TERMINATION-POLICY
default     redis                redis-7.0.6   Running   WipeOut

Endpoints:
COMPONENT        MODE        INTERNAL                                                        EXTERNAL
redis            ReadWrite   cluster-oqroov-redis.default.svc.cluster.local:6379             <none>
redis-sentinel   ReadWrite   cluster-oqroov-redis-sentinel.default.svc.cluster.local:26379   <none>

Topology:
COMPONENT        INSTANCE                          ROLE        STATUS    AZ              NODE                                                  CREATED-TIME
redis            cluster-oqroov-redis-0            secondary   Running   us-central1-c   gke-yjtest-default-pool-47e27321-mvr4/10.128.15.201   Sep 13,2023 10:16 UTC+0800
redis            cluster-oqroov-redis-1            primary     Running   us-central1-c   gke-yjtest-default-pool-47e27321-rbkc/10.128.15.202   Sep 13,2023 09:49 UTC+0800
redis-sentinel   cluster-oqroov-redis-sentinel-0   <none>      Running   us-central1-c   gke-yjtest-default-pool-47e27321-mvr4/10.128.15.201   Sep 13,2023 09:49 UTC+0800
redis-sentinel   cluster-oqroov-redis-sentinel-1   <none>      Running   us-central1-c   gke-yjtest-default-pool-47e27321-h6tl/10.128.15.203   Sep 13,2023 09:50 UTC+0800
redis-sentinel   cluster-oqroov-redis-sentinel-2   <none>      Running   us-central1-c   gke-yjtest-default-pool-47e27321-rbkc/10.128.15.202   Sep 13,2023 09:50 UTC+0800

Resources Allocation:
COMPONENT        DEDICATED   CPU(REQUEST/LIMIT)   MEMORY(REQUEST/LIMIT)   STORAGE-SIZE   STORAGE-CLASS
redis            false       500m / 500m          1Gi / 1Gi               data:5Gi       kb-default-sc
redis-sentinel   false       500m / 500m          1Gi / 1Gi               data:5Gi       kb-default-sc

Images:
COMPONENT        TYPE             IMAGE
redis            redis            registry.cn-hangzhou.aliyuncs.com/apecloud/redis-stack-server:7.0.6-RC8
redis-sentinel   redis-sentinel   registry.cn-hangzhou.aliyuncs.com/apecloud/redis-stack-server:7.0.6-RC8

Data Protection:
AUTO-BACKUP   BACKUP-SCHEDULE   TYPE     BACKUP-TTL   LAST-SCHEDULE   RECOVERABLE-TIME
Disabled      <none>            <none>   7d           <none>          <none>

Show cluster events: kbcli cluster list-events -n default cluster-oqroov
  4. The data written during the dual-primary period is lost:
➜  ~ kbcli cluster connect cluster-oqroov
Connect to instance cluster-oqroov-redis-0: out of cluster-oqroov-redis-0, cluster-oqroov-redis-1
Warning: Using a password with '-a' or '-u' option on the command line interface may not be safe.
127.0.0.1:6379> get mykey
"4"
127.0.0.1:6379> get mykey
"5"
127.0.0.1:6379> get mykey
"5"
127.0.0.1:6379> get mykey
"7"
127.0.0.1:6379> get mykey
Error: Server closed the connection
not connected> get mykey
"1"
127.0.0.1:6379>
ahjing99 added the kind/bug label and the Release 0.7.0 milestone on Sep 13, 2023.
github-actions commented:

This issue has been marked as stale because it has been open for 30 days with no activity.
nayutah (Collaborator) commented Apr 24, 2024:

This issue cannot be fully fixed under the sentinel + Redis architecture. When a network fault is injected into the master pod, neither sentinel nor the Redis replicas can reach the master, which produces a network partition. Sentinel detects the failure and promotes a replica to be the new master, but some writes succeed on the old master during the partition window and are later discarded; this is the expected outcome of a network partition. The dual primary/master state, however, needs to be fixed ASAP.
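The data-loss sequence described above can be sketched as a toy simulation (all names are illustrative, assuming sentinel-style failover where the promoted replica's state wins and the old master resyncs from it):

```python
# Toy simulation of the split-brain data loss described above.
# This is not KubeBlocks or Redis code; it only models the sequence of events.

def simulate_partition_data_loss():
    old_primary = {"mykey": "1"}   # replicated state before the fault
    replica = dict(old_primary)    # the replica is in sync when the partition starts

    # The network delay isolates the old primary: replication stops,
    # but clients that can still reach it keep writing successfully.
    for value in ["4", "5", "7"]:
        old_primary["mykey"] = value   # these writes never replicate

    # Sentinel detects the unreachable master and promotes the replica.
    new_primary = replica

    # When the partition heals, the old primary is demoted and resynced
    # from the new primary; its divergent writes are discarded.
    old_primary = dict(new_primary)

    return new_primary["mykey"], old_primary["mykey"]

print(simulate_partition_data_loss())  # ('1', '1') -- the writes "4".."7" are lost
```

This mirrors the session above, where `get mykey` returns "1" after the failover even though "7" had been acknowledged during the partition.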

nayutah commented Apr 24, 2024:

The dual primary/master state can be fixed the way Patroni handles it for PostgreSQL. Sentinel always holds fresh, authoritative information about the cluster, so when a failover completes, sentinel can emit a role-change event to the Lorry sidecar, which passes the message on to the KubeBlocks controller. The controller then rectifies the role label on the partitioned primary pod, and the services selecting on the 'primary' label converge to a consistent state. With this fix, writes routed to the partitioned primary pod during the dual-primary phase will fail with the reply "You can't write against a read only replica".
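The proposed flow can be sketched as a toy model (hypothetical names throughout; the real path goes sentinel → Lorry → KubeBlocks controller → pod label patch):

```python
# Sketch of the proposed fix: sentinel emits a role-change event, the
# controller rectifies the pods' role labels, and writes to the demoted
# pod are rejected. All names here are illustrative, not real APIs.

def handle_role_change(pods, event):
    """Rectify role labels after a sentinel failover event."""
    for name, pod in pods.items():
        pod["label"] = "primary" if name == event["new_primary"] else "secondary"

def write(pods, name, key, value):
    """Route a client write to a pod, honoring its role label."""
    pod = pods[name]
    if pod["label"] != "primary":
        # Mirrors Redis behavior on a demoted, read-only replica.
        return "(error) READONLY You can't write against a read only replica."
    pod["data"][key] = value
    return "OK"

pods = {
    "redis-0": {"label": "primary", "data": {}},  # partitioned old primary
    "redis-1": {"label": "primary", "data": {}},  # promoted by sentinel -> dual primary
}

# Sentinel reports the failover; the controller fixes the labels.
handle_role_change(pods, {"new_primary": "redis-1"})

print(write(pods, "redis-0", "mykey", "8"))  # rejected with the READONLY error
print(write(pods, "redis-1", "mykey", "8"))  # OK
```

The point of the design is that label rectification makes the Kubernetes services and the actual Redis roles agree, so misrouted writes fail fast instead of being silently lost.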
