Enhancing Hub Recovery: Reworking DRPC State Rebuilding Algorithm #1165

Merged

Conversation

BenamarMk
Member

@BenamarMk BenamarMk commented Dec 19, 2023

This PR tackles hub recovery issues by reworking the algorithm responsible for rebuilding the DRPC state. The changes align with the following expectations:

  1. Stop Condition for Both Failed Queries:
    If attempts to query 2 clusters result in failure for both, the process is halted.

  2. Initial Deployment without VRGs:
    If 2 clusters are successfully queried, and no VRGs are found, proceed with the initial deployment.

  3. Handling Failures with S3 Store Check:
    If 2 clusters are queried, 1 fails, and 0 VRGs are found, perform the following checks:

    • If the VRG is found in the S3 store, ensure that the DRPC action matches the VRG action. If not, stop until the action is corrected, allowing failover if necessary (set PeerReady).
    • If the VRG is not found in the S3 store and the failed cluster is not the destination cluster, continue with the initial deployment.

  4. Verification and Failover for VRGs on Failover Cluster:
    If 2 clusters are queried, 1 fails, and 1 VRG is found on the failover cluster, check the action:

    • If the actions don't match, stop until corrected by the user.
    • If they match, also stop but allow failover if the VRG in-hand is a secondary. Otherwise, continue.

  5. Handling VRGs on Destination Cluster:
    If 2 clusters are queried successfully and 1 or more VRGs are found, and one of the VRGs is on the destination cluster, perform the following checks:

    • Continue with the action only if the DRPC and the found VRG action match.
    • Stop until someone investigates if there is a mismatch, but allow failover to take place (set PeerReady).

  6. Otherwise, default to allowing Failover:
    If none of the above conditions apply, allow failover (set PeerReady) but stop until someone makes the necessary change.
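
The six expectations above can be read as a single decision function. The following is a minimal sketch in Go, not the code from this PR: the `Input`, `VRG`, and `Outcome` types and the `Decide` function are hypothetical names made up for illustration. Two points are assumptions because the description does not spell them out: that a matching action in the S3 case lets the process continue, and that the stop in case 1 and the mismatch stop in case 4 leave PeerReady unset.

```go
package main

import "fmt"

// Outcome is a hypothetical summary of what the rebuild step decides.
type Outcome string

const (
	Continue          Outcome = "continue"
	Stop              Outcome = "stop"
	StopAllowFailover Outcome = "stop, PeerReady set"
)

// VRG is a hypothetical, simplified view of a VRG found on a cluster or in S3.
type VRG struct {
	Cluster   string
	Action    string
	Secondary bool
}

// Input collects what is known after querying the 2 managed clusters.
type Input struct {
	DRPCAction      string
	FailoverCluster string
	DstCluster      string
	FailedClusters  []string // clusters whose query failed (0, 1, or 2 entries)
	VRGs            []VRG    // VRGs found on the clusters that answered
	S3VRG           *VRG     // VRG found in the S3 store, nil if none
}

// Decide walks the numbered expectations in order.
func Decide(in Input) Outcome {
	failed := len(in.FailedClusters)
	switch {
	case failed == 2: // 1. both queries failed
		return Stop
	case failed == 0 && len(in.VRGs) == 0: // 2. initial deployment
		return Continue
	case failed == 1 && len(in.VRGs) == 0: // 3. fall back to the S3 store
		if in.S3VRG != nil {
			if in.S3VRG.Action == in.DRPCAction {
				return Continue // assumption: a match lets the action proceed
			}
			return StopAllowFailover
		}
		if in.FailedClusters[0] != in.DstCluster {
			return Continue
		}
	case failed == 1 && len(in.VRGs) == 1 && in.VRGs[0].Cluster == in.FailoverCluster: // 4.
		v := in.VRGs[0]
		if v.Action != in.DRPCAction {
			return Stop
		}
		if v.Secondary {
			return StopAllowFailover
		}
		return Continue
	case failed == 0 && len(in.VRGs) >= 1: // 5. VRG on the destination cluster
		for _, v := range in.VRGs {
			if v.Cluster == in.DstCluster {
				if v.Action == in.DRPCAction {
					return Continue
				}
				return StopAllowFailover
			}
		}
	}
	// 6. nothing above applied: allow failover, wait for the user
	return StopAllowFailover
}

func main() {
	// One reachable cluster holds a secondary VRG whose action matches the DRPC.
	in := Input{
		DRPCAction:      "Failover",
		FailoverCluster: "dr2",
		DstCluster:      "dr2",
		FailedClusters:  []string{"dr1"},
		VRGs:            []VRG{{Cluster: "dr2", Action: "Failover", Secondary: true}},
	}
	fmt.Println(Decide(in))
}
```

Returning one of three outcomes, rather than a pair of booleans, keeps the sketch from claiming a PeerReady value in the cases where the description only says to continue.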

Testing: DRPC output in various states

oc get drpc -A -o wide
NAMESPACE           NAME           AGE     PREFERREDCLUSTER   FAILOVERCLUSTER   DESIREDSTATE   CURRENTSTATE   PROGRESSION      START TIME             DURATION   PEER READY
busybox-sample      busybox-drpc   6h33m   dr1                dr2               Relocate       Relocated      Cleaning Up                                        False
busybox-samples-1   busybox-drpc   149m    dr1                                                 Deployed       Completed                                          True
busybox-samples-2   busybox-drpc   140m    dr1                dr2               Failover                      Paused                                             True
busybox-samples-3   busybox-drpc   137m    dr2                                                                Paused                                             True
busybox-samples-4   busybox-drpc   137m    dr1                dr1                                             Paused                                             False
busybox-samples-5   busybox-drpc   130m    dr2                                                 Deployed       UpdatingPlRule   2023-12-20T03:52:09Z              True

This test used 6 workloads: 3 on dr1 and 3 on dr2. On each cluster, the 3 workloads were each in a different action.

  • busybox-sample recovered and is waiting for dr2 to come back online in order to finish the cleanup.
  • busybox-samples-1 is completed. It is already deployed on dr1.
  • busybox-samples-2 is Paused, waiting for the user to fail over to dr1.
  • busybox-samples-3 is Paused. Failover is allowed (PeerReady is set), but the state is different from what the VRG has, so it needs user intervention.
  • busybox-samples-4 is the same as busybox-samples-3, except that PeerReady is set to false. That's because you can't fail over to dr2: it is down.
  • busybox-samples-5 was deployed to dr2, so it stayed intact with the ability to fail over to dr1 (PeerReady is set).

Addresses Jira: [Hub Recovery] Add support for active hub co-situated with the managed cluster

@BenamarMk BenamarMk force-pushed the reconstruct-drpc-state-after-hub-recovery branch 6 times, most recently from 034e9d6 to 04a1a64 Compare December 20, 2023 14:53
Signed-off-by: Benamar Mekhissi <bmekhiss@ibm.com>
@BenamarMk BenamarMk force-pushed the reconstruct-drpc-state-after-hub-recovery branch from 04a1a64 to 1107ea6 Compare December 20, 2023 15:21
Benamar Mekhissi added 2 commits December 20, 2023 10:46
…n using AppSet

Signed-off-by: Benamar Mekhissi <bmekhiss@ibm.com>
Signed-off-by: Benamar Mekhissi <bmekhiss@ibm.com>
@BenamarMk BenamarMk force-pushed the reconstruct-drpc-state-after-hub-recovery branch from a1a2467 to f348ce1 Compare December 20, 2023 20:11
Signed-off-by: Benamar Mekhissi <bmekhiss@ibm.com>
@BenamarMk BenamarMk force-pushed the reconstruct-drpc-state-after-hub-recovery branch 2 times, most recently from e9267e0 to b534f0e Compare December 22, 2023 20:27
Signed-off-by: Benamar Mekhissi <bmekhiss@ibm.com>
@BenamarMk BenamarMk force-pushed the reconstruct-drpc-state-after-hub-recovery branch from b534f0e to accaed3 Compare December 23, 2023 22:05
Member

@ShyamsundarR ShyamsundarR left a comment

Discussed with @BenamarMk on various flows and states. Acking the PR based on the review.

@ShyamsundarR ShyamsundarR merged commit 206c862 into RamenDR:main Dec 24, 2023
13 of 14 checks passed