Fix Failover Confusion in DRPC Action Post Hub Recovery #1179

BenamarMk · 2024-01-17T19:48:23Z

If the DRPC Action is set to Failover, for instance, transitioning from C1 to C2, a subsequent failover request from C2 to C1 creates confusion in Ramen post hub recovery because the action didn't change and the only thing that changed is the destination cluster. The solution is to permit failover from C2 to C1 if C1 is accessible.

Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=2258351

nirs

I don't understand this change or the algorithm to determine drpc state. Also the problem, why hub recovery changes the failover cluster? Is this about getting a stale drpc after a hub recovery because we use an old backup?

nirs · 2024-01-22T15:36:03Z

api/v1alpha1/drplacementcontrol_types.go

@@ -28,6 +28,10 @@ type DRState string

 // These are the valid values for DRState
 const (
+	// WaitForUser, state recorded in DRPC status to indicate that we are
+	// waiting for the user to take an action after hub recover.
+	WaitForUser = DRState("WaitForUser")


How ramen knows that there was a hub recovery? what is the difference between a hub recovery and a new install?

Ramen does not know. Trying to rebuild its state based on what is out there.

nirs · 2024-01-22T15:39:11Z

controllers/drplacementcontrol_controller.go

-	progress, err := r.determineDRPCState(ctx, drpc, drPolicy, placementObj, dstCluster, log)
+	progress, msg, err := r.determineDRPCState(ctx, drpc, drPolicy, placementObj, dstCluster, log)
+
+	log.Info(msg)


Why log here an not in determineDRPCState()? we pass the logger to the function so it can log whatever needed. Here we log something that it returned without any context.

It simply avoids logging before returning. We know we are returning a msg, so we log all rerturned messages in one place rather than doing this.

Log(msg) return msg

I do agree that logging just the msg here without any context will make it hard to read the log line. Willing to let it be addressed in a different PR though.

Ok, let's do that in a different PR.

BenamarMk · 2024-01-22T16:10:03Z

@nirs I don't understand this change or the algorithm to determine drpc state. Also the problem, why hub recovery changes the failover cluster? Is this about getting a stale drpc after a hub recovery because we use an old backup?

Trying to cover all combinations. And there are many. Think of it like this. A hub was recovered, DRPC restored with either current state or an old state, and also the source of truth (managed cluster) is down.
This PR might help you understand what we are trying to solve.

If the DRPC Action is set to Failover, for instance, transitioning from C1 to C2, a subsequent failover request from C2 to C1 creates confusion in Ramen post hub recovery because the action didn't change and the only thing that changed is the destination cluster. The solution is to permit failover from C2 to C1 if C1 is accessible. Signed-off-by: Benamar Mekhissi <bmekhiss@ibm.com>

BenamarMk force-pushed the fix-failover-stuck-after-hub-recovery branch from 800e01a to f94cce6 Compare January 17, 2024 20:08

BenamarMk changed the title ~~If the DRPC Action is set to Failover, for instance, transitioning fr…~~ If the DRPC Action is set to Failover, for instance, transitioning fr…Fix Failover Confusion in DRPC Action Post Hub Recovery Jan 17, 2024

BenamarMk changed the title ~~If the DRPC Action is set to Failover, for instance, transitioning fr…Fix Failover Confusion in DRPC Action Post Hub Recovery~~ Fix Failover Confusion in DRPC Action Post Hub Recovery Jan 17, 2024

BenamarMk requested a review from ShyamsundarR January 18, 2024 01:33

nirs reviewed Jan 22, 2024

View reviewed changes

rakeshgm approved these changes Jan 23, 2024

View reviewed changes

BenamarMk force-pushed the fix-failover-stuck-after-hub-recovery branch 2 times, most recently from a25db1d to 499e88a Compare January 23, 2024 16:36

raghavendra-talur approved these changes Jan 23, 2024

View reviewed changes

BenamarMk force-pushed the fix-failover-stuck-after-hub-recovery branch from 499e88a to 502ac08 Compare January 23, 2024 17:19

raghavendra-talur merged commit 8c3c5ee into RamenDR:main Jan 23, 2024
14 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix Failover Confusion in DRPC Action Post Hub Recovery #1179

Fix Failover Confusion in DRPC Action Post Hub Recovery #1179

BenamarMk commented Jan 17, 2024

nirs left a comment

nirs Jan 22, 2024

BenamarMk Jan 22, 2024

nirs Jan 22, 2024

BenamarMk Jan 22, 2024

raghavendra-talur Jan 23, 2024

BenamarMk Jan 23, 2024

BenamarMk commented Jan 22, 2024

Fix Failover Confusion in DRPC Action Post Hub Recovery #1179

Fix Failover Confusion in DRPC Action Post Hub Recovery #1179

Conversation

BenamarMk commented Jan 17, 2024

nirs left a comment

Choose a reason for hiding this comment

nirs Jan 22, 2024

Choose a reason for hiding this comment

BenamarMk Jan 22, 2024

Choose a reason for hiding this comment

nirs Jan 22, 2024

Choose a reason for hiding this comment

BenamarMk Jan 22, 2024

Choose a reason for hiding this comment

raghavendra-talur Jan 23, 2024

Choose a reason for hiding this comment

BenamarMk Jan 23, 2024

Choose a reason for hiding this comment

BenamarMk commented Jan 22, 2024