Enhancing Hub Recovery: Reworking DRPC State Rebuilding Algorithm #1165

BenamarMk · 2023-12-19T21:29:19Z

This PR tackles hub recovery issues by reworking the algorithm responsible for rebuilding the DRPC state. The changes align with the following expectations:

1. Stop Condition for Both Failed Queries:
   If attempts to query 2 clusters result in failure for both, the process is halted.
2. Initial Deployment without VRGs:
   If 2 clusters are successfully queried, and no VRGs are found, proceed with the initial deployment.
3. Handling Failures with S3 Store Check:
   If 2 clusters are queried, 1 fails, and 0 VRGs are found, perform the following checks:
   - If the VRG is found in the S3 store, ensure that the DRPC action matches the VRG action. If not, stop until the action is corrected, allowing failover if necessary (set PeerReady).
   - If the VRG is not found in the S3 store and the failed cluster is not the destination cluster, continue with the initial deployment.
4. Verification and Failover for VRGs on Failover Cluster:
   If 2 clusters are queried, 1 fails, and 1 VRG is found on the failover cluster, check the action:
   - If the actions don't match, stop until corrected by the user.
   - If they match, also stop but allow failover if the VRG in-hand is a secondary. Otherwise, continue.
5. Handling VRGs on Destination Cluster:
   If 2 clusters are queried successfully and 1 or more VRGs are found, and one of the VRGs is on the destination cluster, perform the following checks:
   - Continue with the action only if the DRPC and the found VRG action match.
   - Stop until someone investigates if there is a mismatch, but allow failover to take place (set PeerReady).
6. Otherwise, default to allowing Failover:
   If none of the above conditions apply, allow failover (set PeerReady) but stop until someone makes the necessary change.
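
The expectations above can be read as a single decision function. The following is a minimal Go sketch of that reading, not the code in this PR: the `Input`, `VRG`, and `Decision` types, the `decide` function, and all field names are hypothetical. Two points are assumptions rather than statements from the description: when the S3 VRG action matches the DRPC action, the sketch continues; and when the actions mismatch on the failover cluster, the sketch leaves PeerReady unset because the description does not mention it for that branch.

```go
package main

import "fmt"

// Action mirrors the DRPC/VRG action. Names are illustrative only.
type Action string

const (
	ActionNone     Action = ""
	ActionFailover Action = "Failover"
	ActionRelocate Action = "Relocate"
)

// VRG is a hypothetical, trimmed-down view of a VRG.
type VRG struct {
	Cluster   string
	Action    Action
	Secondary bool
}

// Input collects what the hub knows after querying both managed clusters.
type Input struct {
	DRPCAction         Action
	DestinationCluster string
	FailoverCluster    string
	FailedClusters     []string // clusters whose query failed (0, 1, or 2 entries)
	VRGs               []VRG    // VRGs returned by the successful queries
	S3VRG              *VRG     // VRG found in the S3 store, nil if absent
}

// Decision says whether processing continues and, when it stops,
// whether PeerReady is set so that the user may fail over.
type Decision struct {
	Proceed   bool
	PeerReady bool
	Reason    string
}

func decide(in Input) Decision {
	failed := len(in.FailedClusters)

	switch {
	case failed == 2: // 1. both queries failed
		return Decision{Reason: "both cluster queries failed"}
	case failed == 0 && len(in.VRGs) == 0: // 2. nothing deployed yet
		return Decision{Proceed: true, Reason: "initial deployment"}
	case failed == 1 && len(in.VRGs) == 0: // 3. consult the S3 store
		if in.S3VRG != nil {
			if in.S3VRG.Action == in.DRPCAction {
				// Assumption: a matching action lets processing continue.
				return Decision{Proceed: true, Reason: "S3 VRG action matches"}
			}
			return Decision{PeerReady: true, Reason: "S3 VRG action mismatch"}
		}
		if in.FailedClusters[0] != in.DestinationCluster {
			return Decision{Proceed: true, Reason: "initial deployment"}
		}
		// Otherwise leave the switch and use the default below.
	case failed == 1 && len(in.VRGs) == 1 &&
		in.VRGs[0].Cluster == in.FailoverCluster: // 4. one VRG on failover cluster
		vrg := in.VRGs[0]
		if vrg.Action != in.DRPCAction {
			return Decision{Reason: "action mismatch on failover cluster"}
		}
		if vrg.Secondary {
			return Decision{PeerReady: true, Reason: "secondary VRG, failover allowed"}
		}
		return Decision{Proceed: true, Reason: "action matches on failover cluster"}
	case failed == 0: // 5. VRGs is non-empty here; look at the destination cluster
		for _, vrg := range in.VRGs {
			if vrg.Cluster != in.DestinationCluster {
				continue
			}
			if vrg.Action == in.DRPCAction {
				return Decision{Proceed: true, Reason: "action matches on destination"}
			}
			return Decision{PeerReady: true, Reason: "action mismatch on destination"}
		}
	}

	// 6. default: allow failover but wait for the user.
	return Decision{PeerReady: true, Reason: "default: failover allowed, waiting for user"}
}

func main() {
	d := decide(Input{
		DRPCAction:         ActionFailover,
		DestinationCluster: "dr2",
		FailoverCluster:    "dr2",
		FailedClusters:     []string{"dr1"},
		VRGs:               []VRG{{Cluster: "dr2", Action: ActionFailover}},
	})
	fmt.Printf("%+v\n", d)
}
```

Keeping the rules in one `switch` makes the fall-through to the default explicit: any combination that no branch returns from ends with PeerReady set and processing stopped.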

Testing: DRPC output in various states

oc get drpc -A -o wide
NAMESPACE           NAME           AGE     PREFERREDCLUSTER   FAILOVERCLUSTER   DESIREDSTATE   CURRENTSTATE   PROGRESSION      START TIME             DURATION   PEER READY
busybox-sample      busybox-drpc   6h33m   dr1                dr2               Relocate       Relocated      Cleaning Up                                        False
busybox-samples-1   busybox-drpc   149m    dr1                                                 Deployed       Completed                                          True
busybox-samples-2   busybox-drpc   140m    dr1                dr2               Failover                      Paused                                             True
busybox-samples-3   busybox-drpc   137m    dr2                                                                Paused                                             True
busybox-samples-4   busybox-drpc   137m    dr1                dr1                                             Paused                                             False
busybox-samples-5   busybox-drpc   130m    dr2                                                 Deployed       UpdatingPlRule   2023-12-20T03:52:09Z              True

In this test, we have 6 workloads: 3 on dr1 and 3 on dr2. On each cluster, each of the 3 workloads was in a different action.

busybox-sample recovered and is waiting for dr2 to come back online in order to finish the cleanup.
busybox-samples-1 is completed. It is already deployed on dr1.
busybox-samples-2 is Paused, waiting for the user to fail over to dr1.
busybox-samples-3 is Paused. Failover is allowed (PeerReady is set), but the state differs from what the VRG has, so it needs user intervention.
busybox-samples-4 is the same as busybox-samples-3, except that PeerReady is set to false. That's because dr2 is down, so you can't fail over to it.
busybox-samples-5 was deployed to dr2, so it stayed intact with the ability to fail over to dr1 (PeerReady is set).

Addresses Jira: [Hub Recovery] Add support for active hub co-situated with the managed cluster


ShyamsundarR

Discussed with @BenamarMk on various flows and states. Acking the PR based on the review.

BenamarMk force-pushed the reconstruct-drpc-state-after-hub-recovery branch 6 times, most recently from 034e9d6 to 04a1a64 on December 20, 2023 14:53

BenamarMk force-pushed the reconstruct-drpc-state-after-hub-recovery branch from 04a1a64 to 1107ea6 on December 20, 2023 15:21

Benamar Mekhissi added 2 commits December 20, 2023 10:46

Fix one place where drpcNamespace is used instead of vrgNamespace when using AppSet

72f7af9

Signed-off-by: Benamar Mekhissi <bmekhiss@ibm.com>

Fix unit test failures

f348ce1

Signed-off-by: Benamar Mekhissi <bmekhiss@ibm.com>

BenamarMk force-pushed the reconstruct-drpc-state-after-hub-recovery branch from a1a2467 to f348ce1 on December 20, 2023 20:11

BenamarMk requested a review from ShyamsundarR December 20, 2023 20:18

Add unit tests for hub recovery

ea6fdba

Signed-off-by: Benamar Mekhissi <bmekhiss@ibm.com>

BenamarMk force-pushed the reconstruct-drpc-state-after-hub-recovery branch 2 times, most recently from e9267e0 to b534f0e on December 22, 2023 20:27

Check access to VRG on a MC before deleting the MW

accaed3

Signed-off-by: Benamar Mekhissi <bmekhiss@ibm.com>

BenamarMk force-pushed the reconstruct-drpc-state-after-hub-recovery branch from b534f0e to accaed3 on December 23, 2023 22:05

ShyamsundarR approved these changes Dec 24, 2023


ShyamsundarR merged commit 206c862 into RamenDR:main Dec 24, 2023
13 of 14 checks passed

BenamarMk mentioned this pull request Jan 22, 2024

Fix Failover Confusion in DRPC Action Post Hub Recovery #1179

Merged
