Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix Failover Confusion in DRPC Action Post Hub Recovery #1179

Conversation

BenamarMk
Copy link
Member

If the DRPC Action is set to Failover, for instance, transitioning from C1 to C2, a subsequent failover request from C2 to C1 creates confusion in Ramen post hub recovery because the action didn't change and the only thing that changed is the destination cluster. The solution is to permit failover from C2 to C1 if C1 is accessible.

Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=2258351

@BenamarMk BenamarMk force-pushed the fix-failover-stuck-after-hub-recovery branch from 800e01a to f94cce6 Compare January 17, 2024 20:08
@BenamarMk BenamarMk changed the title If the DRPC Action is set to Failover, for instance, transitioning fr… If the DRPC Action is set to Failover, for instance, transitioning fr…Fix Failover Confusion in DRPC Action Post Hub Recovery Jan 17, 2024
@BenamarMk BenamarMk changed the title If the DRPC Action is set to Failover, for instance, transitioning fr…Fix Failover Confusion in DRPC Action Post Hub Recovery Fix Failover Confusion in DRPC Action Post Hub Recovery Jan 17, 2024
Copy link
Member

@nirs nirs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand this change or the algorithm to determine drpc state. Also the problem, why hub recovery changes the failover cluster? Is this about getting a stale drpc after a hub recovery because we use an old backup?

@@ -28,6 +28,10 @@ type DRState string

// These are the valid values for DRState
const (
// WaitForUser, state recorded in DRPC status to indicate that we are
// waiting for the user to take an action after hub recover.
WaitForUser = DRState("WaitForUser")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How ramen knows that there was a hub recovery? what is the difference between a hub recovery and a new install?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ramen does not know. Trying to rebuild its state based on what is out there.

progress, err := r.determineDRPCState(ctx, drpc, drPolicy, placementObj, dstCluster, log)
progress, msg, err := r.determineDRPCState(ctx, drpc, drPolicy, placementObj, dstCluster, log)

log.Info(msg)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why log here an not in determineDRPCState()? we pass the logger to the function so it can log whatever needed. Here we log something that it returned without any context.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It simply avoids logging before returning. We know we are returning a msg, so we log all rerturned messages in one place rather than doing this.

Log(msg)
return msg

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do agree that logging just the msg here without any context will make it hard to read the log line. Willing to let it be addressed in a different PR though.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, let's do that in a different PR.

@BenamarMk
Copy link
Member Author

@nirs I don't understand this change or the algorithm to determine drpc state. Also the problem, why hub recovery changes the failover cluster? Is this about getting a stale drpc after a hub recovery because we use an old backup?

Trying to cover all combinations. And there are many. Think of it like this. A hub was recovered, DRPC restored with either current state or an old state, and also the source of truth (managed cluster) is down.
This PR might help you understand what we are trying to solve.

@BenamarMk BenamarMk force-pushed the fix-failover-stuck-after-hub-recovery branch 2 times, most recently from a25db1d to 499e88a Compare January 23, 2024 16:36
If the DRPC Action is set to Failover, for instance, transitioning from C1 to C2, a subsequent
failover request from C2 to C1 creates confusion in Ramen post hub recovery because the action
didn't change and the only thing that changed is the destination cluster. The solution is to
permit failover from C2 to C1 if C1 is accessible.

Signed-off-by: Benamar Mekhissi <bmekhiss@ibm.com>
@BenamarMk BenamarMk force-pushed the fix-failover-stuck-after-hub-recovery branch from 499e88a to 502ac08 Compare January 23, 2024 17:19
@raghavendra-talur raghavendra-talur merged commit 8c3c5ee into RamenDR:main Jan 23, 2024
14 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants