Enhancing Hub Recovery: Reworking DRPC State Rebuilding Algorithm #1165
Merged: ShyamsundarR merged 5 commits into RamenDR:main from BenamarMk:reconstruct-drpc-state-after-hub-recovery on Dec 24, 2023
Conversation
BenamarMk force-pushed the reconstruct-drpc-state-after-hub-recovery branch 6 times, most recently from 034e9d6 to 04a1a64 on December 20, 2023 14:53
This commit tackles hub recovery issues by reworking the algorithm responsible for rebuilding the DRPC state. The changes align with the following expectations:

1. Stop Condition for Both Failed Queries: If attempts to query 2 clusters result in failure for both, the process is halted.
2. Initial Deployment without VRGs: If 2 clusters are successfully queried and no VRGs are found, proceed with the initial deployment.
3. Handling Failures with S3 Store Check: If 2 clusters are queried, 1 fails, and 0 VRGs are found, perform the following checks:
   - If the VRG is found in the S3 store, ensure that the DRPC action matches the VRG action. If not, stop until the action is corrected, allowing failover if necessary (set PeerReady).
   - If the VRG is not found in the S3 store and the failed cluster is not the destination cluster, continue with the initial deployment.
4. Verification and Failover for VRGs on Failover Cluster: If 2 clusters are queried, 1 fails, and 1 VRG is found on the failover cluster, check the action:
   - If the actions don't match, stop until corrected by the user.
   - If they match, also stop, but allow failover if the VRG in hand is a secondary. Otherwise, continue.
5. Handling VRGs on Destination Cluster: If 2 clusters are queried successfully, 1 or more VRGs are found, and one of the VRGs is on the destination cluster, perform the following checks:
   - Continue with the action only if the DRPC and the found VRG action match.
   - Stop until someone investigates if there is a mismatch, but allow failover to take place (set PeerReady).
6. Otherwise, Default to Allowing Failover: If none of the above conditions apply, allow failover (set PeerReady) but stop until someone makes the necessary change.

Signed-off-by: Benamar Mekhissi <bmekhiss@ibm.com>
BenamarMk force-pushed the reconstruct-drpc-state-after-hub-recovery branch from 04a1a64 to 1107ea6 on December 20, 2023 15:21
…n using AppSet Signed-off-by: Benamar Mekhissi <bmekhiss@ibm.com>
Signed-off-by: Benamar Mekhissi <bmekhiss@ibm.com>
BenamarMk force-pushed the reconstruct-drpc-state-after-hub-recovery branch from a1a2467 to f348ce1 on December 20, 2023 20:11
Signed-off-by: Benamar Mekhissi <bmekhiss@ibm.com>
BenamarMk force-pushed the reconstruct-drpc-state-after-hub-recovery branch 2 times, most recently from e9267e0 to b534f0e on December 22, 2023 20:27
Signed-off-by: Benamar Mekhissi <bmekhiss@ibm.com>
BenamarMk force-pushed the reconstruct-drpc-state-after-hub-recovery branch from b534f0e to accaed3 on December 23, 2023 22:05
ShyamsundarR approved these changes on Dec 24, 2023
Discussed with @BenamarMk on various flows and states. Acking the PR based on the review.
This PR tackles hub recovery issues by reworking the algorithm responsible for rebuilding the DRPC state. The changes align with the following expectations:

1. Stop Condition for Both Failed Queries: If attempts to query 2 clusters result in failure for both, the process is halted.
2. Initial Deployment without VRGs: If 2 clusters are successfully queried and no VRGs are found, proceed with the initial deployment.
3. Handling Failures with S3 Store Check: If 2 clusters are queried, 1 fails, and 0 VRGs are found, check the S3 store: if a VRG is found there, the DRPC action must match the VRG action; on a mismatch, stop until the action is corrected, allowing failover if necessary (set PeerReady). If no VRG is in the S3 store and the failed cluster is not the destination cluster, continue with the initial deployment.
4. Verification and Failover for VRGs on Failover Cluster: If 2 clusters are queried, 1 fails, and 1 VRG is found on the failover cluster, check the action: if the actions don't match, stop until corrected by the user; if they match, also stop, but allow failover if the VRG in hand is a secondary. Otherwise, continue.
5. Handling VRGs on Destination Cluster: If 2 clusters are queried successfully, 1 or more VRGs are found, and one of the VRGs is on the destination cluster, continue with the action only if the DRPC and the found VRG action match; on a mismatch, stop until someone investigates, but allow failover to take place (set PeerReady).
6. Otherwise, Default to Allowing Failover: If none of the above conditions apply, allow failover (set PeerReady) but stop until someone makes the necessary change.
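The numbered expectations above amount to a decision table keyed on how many cluster queries failed and where VRGs were found. The sketch below condenses them into a single function; it is an illustrative simplification under assumed inputs, and all names (`decide`, `outcome`, the boolean parameters) are hypothetical, not Ramen's actual types or API.

```go
package main

import "fmt"

// outcome is a hypothetical summary of what the reconciler does next:
// halt the process, mark PeerReady (failover allowed), or start deployment.
type outcome struct {
	halt      bool
	peerReady bool
	deploy    bool
}

// decide sketches the six expectations. Inputs (all assumed/simplified):
//   failed         - number of failed cluster queries (out of 2)
//   vrgsFound      - VRGs found on reachable clusters
//   vrgOnDest      - a found VRG is on the destination/failover cluster
//   actionsMatch   - the DRPC action matches the found VRG's action
//   s3HasVRG       - a VRG exists in the S3 store (case 3 only)
//   failedIsDest   - the unreachable cluster is the destination (case 3 only)
//   vrgIsSecondary - the VRG on the failover cluster is a secondary (case 4 only)
func decide(failed, vrgsFound int, vrgOnDest, actionsMatch, s3HasVRG, failedIsDest, vrgIsSecondary bool) outcome {
	switch {
	case failed == 2:
		// 1. Both queries failed: halt.
		return outcome{halt: true}
	case failed == 0 && vrgsFound == 0:
		// 2. Both clusters reachable, no VRGs: initial deployment.
		return outcome{deploy: true}
	case failed == 1 && vrgsFound == 0:
		// 3. One failure, no VRGs on the reachable cluster: consult S3.
		if s3HasVRG {
			if !actionsMatch {
				return outcome{halt: true, peerReady: true}
			}
			return outcome{} // actions match: continue
		}
		if !failedIsDest {
			return outcome{deploy: true}
		}
	case failed == 1 && vrgsFound == 1 && vrgOnDest:
		// 4. VRG found on the failover cluster.
		if !actionsMatch {
			return outcome{halt: true} // stop until corrected by the user
		}
		if vrgIsSecondary {
			return outcome{halt: true, peerReady: true}
		}
		return outcome{} // continue
	case failed == 0 && vrgsFound >= 1 && vrgOnDest:
		// 5. VRG found on the destination cluster.
		if actionsMatch {
			return outcome{} // continue with the action
		}
		return outcome{halt: true, peerReady: true}
	}
	// 6. Default: allow failover (PeerReady) but stop for user intervention.
	return outcome{halt: true, peerReady: true}
}

func main() {
	fmt.Println(decide(2, 0, false, false, false, false, false)) // {true false false}: both queries failed
	fmt.Println(decide(0, 0, false, false, false, false, false)) // {false false true}: initial deployment
}
```

Note that in this reading, every path either continues, deploys, or halts, and PeerReady is set exactly on the halting paths where the PR text says failover must remain possible.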
Testing: DRPC output in various states

In this test, we have 6 workloads: 3 workloads in dr1 and another 3 in dr2. All 3 were in different actions for each cluster.
- busybox-sample recovered; waiting for dr2 to come back online in order to finish the cleanup.
- busybox-samples-1 is completed. It is already deployed on dr1.
- busybox-samples-2 is Paused, waiting for the user to fail over to dr1.
- busybox-samples-3 is Paused; failover is allowed (PeerReady is set), but the state is also different from what the VRG has, so it needs user intervention.
- busybox-samples-4 is the same as busybox-samples-3. The difference is that busybox-samples-4 has PeerReady set to false. That's because you can't fail over to dr2; it is down.
- busybox-samples-5 was deployed to dr2, so it stayed intact with the ability to fail over to dr1 (PeerReady is set).

Addresses Jira: [Hub Recovery] Add support for active hub co-situated with the managed cluster