Support disable DR when cluster is unavailable #1147

Merged

merged 6 commits into RamenDR:main from drcluster-unavailable on Dec 13, 2023

Conversation

@nirs (Member) commented Nov 22, 2023

When a cluster is not available, disabling DR gets stuck waiting for the vrgs.

On the remaining cluster, the vrg is stuck trying to delete data from the s3 store on the unavailable cluster.

On the hub, the drpc is stuck waiting for the stuck vrg, and for the stale vrg reported by the managedclusterview of the unavailable cluster.

To fix the issue we require the admin to delete the failed drcluster after failing over to another cluster. The system then avoids accessing s3 profiles on the deleted drcluster and does not wait for VRGs on deleted clusters.

Disable DR flow when a cluster is not available

  1. For each application on the failed cluster, fail over to the secondary cluster

  2. Mark the failed drcluster for deletion on the hub

    kubectl delete drcluster clustername --wait=false --context hub
    

    Using --wait=false since the drcluster will not be deleted until the drpolicy referencing it is deleted, and the drpolicy will not be deleted before the drpc referencing it is deleted.

  3. For each application disable DR:

    1. Ensure Placement is pointing to failover cluster

    2. Wait until the vrg for the application has only one s3 profile:

      kubectl get vrg -n appnamespace --context cluster2 \
          -o jsonpath='{.items[0].spec.s3Profiles}' | jq
      

      This can take a few minutes after the drcluster was deleted (see the polling sketch after these steps).

    3. Delete the drpc resource for the OCM application on the hub

      kubectl delete drpc drpcname -n drpcnamespace --context hub
      

      If the drpc delete does not finish, check that the vrg does not have an s3 profile for the failed cluster. If it does, you need to edit the vrg and remove it manually (see the example under Known issues).

    4. Change the Placement to be reconciled by OCM by removing the scheduling-disable annotation (the trailing "-" in the command removes the annotation)

      kubectl annotate placement placementname -n drpcnamespace \
         cluster.open-cluster-management.io/experimental-scheduling-disable-
      
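The wait in step 3.2 can be scripted. A minimal sketch, using the placeholder names from the steps above (appnamespace, cluster2) and assuming jq is available:

    # Poll until the vrg on the failover cluster references a single s3 profile.
    while true; do
        count=$(kubectl get vrg -n appnamespace --context cluster2 \
            -o jsonpath='{.items[0].spec.s3Profiles}' | jq 'length')
        if [ "$count" = "1" ]; then
            echo "vrg references a single s3 profile, safe to delete the drpc"
            break
        fi
        echo "vrg references ${count:-unknown} s3 profiles, waiting..."
        sleep 10
    done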

Known issues

  • Docs on disabling DR needed

  • It can take a few minutes after deleting the drcluster until the vrg on the failover cluster is updated to exclude the s3 profile. If the drpc is deleted after the vrg is updated, the delete will complete quickly.

  • If disable DR is started before the vrg is updated on the failover cluster, disabling DR will not complete, since deleting a drpc deletes the manifestwork for the vrg, and after that no change in the manifestwork is propagated to the managed cluster.

    • You need to edit the vrg on the managed cluster and delete the s3 profile for the deleted drcluster (see the example below)
  • With volsync (cephfs), disabling DR also deletes the PVC: [VolSync] Ensure Application PVC is not Automatically Garbage Collected upon VRG Deletion #1150
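If the stale s3 profile has to be removed manually (see the known issue above about the drpc delete not completing), one option is a JSON patch against the vrg on the failover cluster. This is a sketch with placeholder names (vrgname, appnamespace, cluster2); check the index of the stale entry first, and note that if the vrg manifestwork still exists it may revert the change:

    # Show the s3 profiles to find the index of the stale entry.
    kubectl get vrg vrgname -n appnamespace --context cluster2 \
        -o jsonpath='{.spec.s3Profiles}' | jq

    # Remove the stale entry ("0" here is only an example index).
    kubectl patch vrg vrgname -n appnamespace --context cluster2 \
        --type json -p '[{"op": "remove", "path": "/spec/s3Profiles/0"}]'

Editing the vrg with kubectl edit on the managed cluster works as well.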

Testing

  1. Update the ramen-operator image in the CSV on the hub and drclusters:

    image: quay.io/nirsof/ramen-operator:drcluster-unavailable-v6
    
  2. If upgrading from an older version (e.g. 4.14), apply the CRDs from main:

    for cluster in hub cluster1 cluster2; do
        oc apply -k 'https://github.com/RamenDR/ramen.git/config/crd' --context $cluster
    done
    

Status

Tested using RamenDR/ocm-ramen-samples#41 on top of #1153

basic-test

drenv:

  • busybox-regional-rbd-deploy
  • busybox-regional-rbd-sts
  • busybox-regional-rbd-ds

ocp:

  • busybox-regional-rbd-deploy
  • busybox-regional-cephfs-deploy

Disable DR flow

Tested without the commit updating the vrg.

drenv:

  • busybox-regional-rbd-deploy

ocp:

  • rbd sub
  • rbd appset
  • cephfs sub
  • cephfs appset

Todo:

  • Test that VRG is regenerated when drcluster is deleted
    • took 8 minutes because we don't watch drcluster delete events
  • Test watching delete events - needs more work on filtering drcluster events

Fixes #1139

@nirs nirs force-pushed the drcluster-unavailable branch 3 times, most recently from 22dff67 to f404c45 Compare November 30, 2023 22:26
@nirs (Member, Author) commented Dec 5, 2023

Notes from a discussion with @ShyamsundarR:

  • Using an annotation (or label) allows going back and making the cluster available again, which can lead to stale data in the s3 store (not deleted during disable DR). We want a one-way change that cannot be undone.
  • Currently this works only when the annotation is applied before failover, since existing VRGs are not updated, so the user needs to add the annotation before failover. But failover is likely to be done quickly (to restore the service for users), before understanding that the cluster will never be available again. Since we don't want to be able to make a cluster available again after it was marked as unavailable, we want to do this before disable DR, or maybe only if disable DR got stuck.
  • We can delete the drcluster instead of adding an annotation - this cannot be undone, and does not require teaching users about the new annotation.
  • The code should behave correctly when a drcluster is deleted, but not sure this is tested
  • Ideally, when deleting the drcluster the drpolicy would become invalid, but it is not clear that this will be easy to do now
  • To update the VRG when a drcluster changes, we can call the code creating or updating the vrg on every reconcile (it should read state from the cache, so this should not be an issue)

Wanted behavior when drcluster is deleted:

  • Creating new drpc must fail
  • Relocate must fail
  • Failover must succeed
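A minimal sketch of the deletion check this behavior hinges on (simplified stand-in types, not the actual ramen code; the real check presumably looks at the drcluster deletion timestamp, since a drcluster with finalizers stays around after kubectl delete):

    package controllers

    import (
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    )

    // drCluster is a hypothetical stand-in for the DRCluster API type.
    type drCluster struct {
        metav1.TypeMeta
        metav1.ObjectMeta
    }

    // drClusterIsDeleted reports whether the drcluster was marked for
    // deletion. Unlike an annotation or label, deleting the drcluster is a
    // one-way signal that cannot be undone.
    func drClusterIsDeleted(c *drCluster) bool {
        return !c.GetDeletionTimestamp().IsZero()
    }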

@nirs nirs force-pushed the drcluster-unavailable branch 3 times, most recently from 8b29b1f to 355263f Compare December 11, 2023 21:39
@nirs nirs marked this pull request as ready for review December 11, 2023 21:39
@nirs (Member, Author) commented Dec 11, 2023

@ShyamsundarR @BenamarMk please review.

In the cluster replacement flow, we may need to disable DR when a
drcluster is not available, which breaks the disable DR flow in many places.

To allow cleanup to proceed on the hub and on the remaining managed
cluster, the user will have to delete the drcluster for the failed
cluster.

Signed-off-by: Nir Soffer <nsoffer@redhat.com>
When getting or updating vrgs from managed clusters, we passed the
drpolicy and used its drcluster names for fetching the vrg. To handle
an unavailable drcluster, we need the DRCluster type instead of its
name, but we don't have the required context and client to get it in
getVRGsFromManagedClusters().

Fixed by getting the drclusters in the caller when we have the context
and client, and pass down the drcluster list instead of the drpolicy.

Signed-off-by: Nir Soffer <nsoffer@redhat.com>
When a drcluster is unreachable, the ManagedClusterView may continue to
report the last status for the VRG and no error. This breaks disable DR
when we wait until all VRGs are deleted.

Fixed by ignoring VRGs on deleted drclusters. Once the VRG on the
available drcluster is deleted, the VRG count reaches zero and we can
complete drpc deletion.

Since this change is in getVRGsFromManagedClusters(), it also affects
creatingDRPCInstance and maintenance mode. I'm not sure about these
changes yet, but it seems a good idea not to handle a VRG on a deleted
drcluster.

Signed-off-by: Nir Soffer <nsoffer@redhat.com>
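Roughly, the shape of these two changes looks like the sketch below. The types and signatures are simplified placeholders (reusing the drCluster stand-in from the earlier sketch), not the real ramen code:

    // vrgView and getVRGViewFromCluster are hypothetical stand-ins for the
    // ManagedClusterView based VRG lookup.
    type vrgView struct{ Name string }

    func getVRGViewFromCluster(cluster, namespace, name string) *vrgView { return nil }

    // The caller, which has the context and client, fetches the DRCluster
    // objects and passes them down instead of the drpolicy, so the deletion
    // state can be checked here. Stale VRG views reported for a deleted
    // cluster are skipped; once the VRG on the surviving cluster is gone,
    // the count reaches zero and drpc deletion can complete.
    func getVRGsFromManagedClusters(drClusters []drCluster, namespace, name string) map[string]*vrgView {
        vrgs := map[string]*vrgView{}

        for i := range drClusters {
            c := &drClusters[i]
            if drClusterIsDeleted(c) {
                continue // ignore the VRG on a deleted drcluster
            }

            if v := getVRGViewFromCluster(c.Name, namespace, name); v != nil {
                vrgs[c.Name] = v
            }
        }

        return vrgs
    }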
@ShyamsundarR (Member) left a comment:

Reviewed the changes. Still reviewing the impact of the changes on other parts of the code, and other use-cases that may need to handle a DRCluster in a deleted state.

controllers/drplacementcontrol.go
@@ -1453,16 +1453,8 @@ func (d *DRPCInstance) createVRGManifestWork(homeCluster string, repState rmn.Re
return nil
}

// ensureVRGManifestWork ensures that the VRG manifestwork exists and matches the current VRG state.
A reviewer (Member) commented:

We now update MW for VRG from the cached MCV VRG replicationState when this function is invoked.

This is not a problem when called from ensureActionCompleted, where the callers ensure that the cached VRG is primary before we invoke ensureVRGManifestWork.

There are some gaps in the invocation from RunInitialDeployment that need closer inspection.

For example, the code that decides a VRG exists on one of the clusters does not check whether that VRG exists as primary; if that check used the replication state from the MCV it would be fine, as it would always be primary when this function is called.

This function ensureVRGManifestWork is currently only safe when VRG MCV is Primary and desired VRG MW is also Primary.

@nirs (Member, Author) replied:

I'll update the documentation and look at callers.

@nirs (Member, Author) replied:

Is this issue new? This change only adds calls when the vrg exists - this could have been called for the wrong vrg state before this change, if the vrg manifestwork was deleted, no?

for i := range d.drClusters {
	c := &d.drClusters[i]
	if !drClusterIsDeleted(c) {
		r = append(r, c.Spec.S3ProfileName)
@nirs (Member, Author) commented on Dec 12, 2023:

This works, but it took 8 minutes before the drpc controller detected that the cluster was deleted when updating the vrg. Looks like we have a caching issue.

  1. The drcluster controller finds that the drcluster was deleted:
2023-12-12T16:15:34.356Z	INFO	controllers.DRCluster	controllers/drcluster_controller.go:272	reconcile enter	{"name": "dr1", "rid": "bf896718-4c69-41a0-bb95-9182e04a42ac"}
2023-12-12T16:15:34.356Z	INFO	controllers.DRCluster	controllers/drcluster_controller.go:453	delete	{"name": "dr1", "rid": "bf896718-4c69-41a0-bb95-9182e04a42ac"}
2023-12-12T16:15:34.357Z	INFO	controllers.DRCluster	controllers/drcluster_controller.go:298	reconcile exit	{"name": "dr1", "rid": "bf896718-4c69-41a0-bb95-9182e04a42ac"}
2023-12-12T16:15:34.357Z	ERROR	controller/controller.go:329	Reconciler error	{"controller": "drcluster", "controllerGroup": "ramendr.openshift.io", "controllerKind": "DRCluster", "DRCluster": {"name":"dr1"}, "namespace": "", "name": "dr1", "reconcileID": "33b6081b-3d31-48a0-866c-88ab6a038e7b", "error": "drclusters undeploy: drcluster 'dr1' referenced in one or more existing drPolicy resources"}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.6/pkg/internal/controller/controller.go:329
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.6/pkg/internal/controller/controller.go:274
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.6/pkg/internal/controller/controller.go:235
  2. 7 minutes later - drplacement controller detects that the cluster was deleted:
2023-12-12T16:23:29.231Z	INFO	controllers.DRPlacementControl	controllers/drplacementcontrol_controller.go:594	Entering reconcile loop	{"DRPC": "busybox-regional-rbd-deploy/busybox-regional-rbd-deploy-drpc", "rid": "2e059f82-a46a-48bd-b975-a0619a608aa7"}
...
2023-12-12T16:23:29.234Z	INFO	controllers.DRPlacementControl	controllers/drplacementcontrol_controller.go:1486	Skipping VRG on deleted drcluster	{"DRPC": "busybox-regional-rbd-deploy/busybox-regional-rbd-deploy-drpc", "rid": "2e059f82-a46a-48bd-b975-a0619a608aa7", "drcluster": "dr1", "vrg": "busybox-regional-rbd-deploy-drpc"}
  3. In the same reconcile loop - it updates the vrg manifestwork - including s3 store on the deleted cluster:
2023-12-12T16:23:29.236Z	INFO	controllers.DRPlacementControl	controllers/drplacementcontrol.go:1606	ensureNamespaceExistsOnManagedCluster: namespace 'busybox-regional-rbd-deploy' exists on cluster dr2: true	{"DRPC": "busybox-regional-rbd-deploy/busybox-regional-rbd-deploy-drpc", "rid": "2e059f82-a46a-48bd-b975-a0619a608aa7"}
2023-12-12T16:23:29.236Z	INFO	controllers.DRPlacementControl	controllers/drplacementcontrol.go:1434	Creating VRG ManifestWork	{"DRPC": "busybox-regional-rbd-deploy/busybox-regional-rbd-deploy-drpc", "rid": "2e059f82-a46a-48bd-b975-a0619a608aa7", "Last State:": "FailedOver", "cluster": "dr2"}
2023-12-12T16:23:29.236Z	INFO	controllers.DRPlacementControl	util/mw_util.go:121	Create or Update manifestwork busybox-regional-rbd-deploy-drpc:busybox-regional-rbd-deploy:dr2:{TypeMeta:{Kind:VolumeReplicationGroup APIVersion:ramendr.openshift.io/v1alpha1} ObjectMeta:{Name:busybox-regional-rbd-deploy-drpc GenerateName: Namespace:busybox-regional-rbd-deploy SelfLink: UID: ResourceVersion: Generation:0 CreationTimestamp:0001-01-01 00:00:00 +0000 UTC DeletionTimestamp:<nil> DeletionGracePeriodSeconds:<nil> Labels:map[] Annotations:map[] OwnerReferences:[] Finalizers:[] ManagedFields:[]} Spec:{PVCSelector:{MatchLabels:map[appname:busybox] MatchExpressions:[]} ReplicationState:primary S3Profiles:[minio-on-dr2] Async:0xc000fe3a90 Sync:<nil> VolSync:{RDSpec:[] Disabled:false} PrepareForFinalSync:false RunFinalSync:false Action:Failover KubeObjectProtection:<nil>} Status:{State: ProtectedPVCs:[] Conditions:[] ObservedGeneration:0 LastUpdateTime:0001-01-01 00:00:00 +0000 UTC KubeObjectProtection:{CaptureToRecoverFrom:<nil>} PrepareForFinalSyncComplete:false FinalSyncComplete:false LastGroupSyncTime:<nil> LastGroupSyncDuration:nil LastGroupSyncBytes:<nil>}}	{"DRPC": "busybox-regional-rbd-deploy/busybox-regional-rbd-deploy-drpc", "rid": "2e059f82-a46a-48bd-b975-a0619a608aa7"}
2023-12-12T16:23:29.236Z	INFO	controllers.DRPlacementControl	util/mw_util.go:499	Updating Manifestwork	{"DRPC": "busybox-regional-rbd-deploy/busybox-regional-rbd-deploy-drpc", "rid": "2e059f82-a46a-48bd-b975-a0619a608aa7", "name": "busybox-regional-rbd-deploy-drpc-busybox-regional-rbd-deploy-vrg-mw", "namespace": "dr2"}
  4. A few milliseconds later - we update the VRG without the s3 profile of the failed cluster:
2023-12-12T16:23:29.236Z	INFO	controllers.DRPlacementControl	util/mw_util.go:121	Create or Update manifestwork busybox-regional-rbd-deploy-drpc:busybox-regional-rbd-deploy:dr2:{TypeMeta:{Kind:VolumeReplicationGroup APIVersion:ramendr.openshift.io/v1alpha1} ObjectMeta:{Name:busybox-regional-rbd-deploy-drpc GenerateName: Namespace:busybox-regional-rbd-deploy SelfLink: UID: ResourceVersion: Generation:0 CreationTimestamp:0001-01-01 00:00:00 +0000 UTC DeletionTimestamp:<nil> DeletionGracePeriodSeconds:<nil> Labels:map[] Annotations:map[] OwnerReferences:[] Finalizers:[] ManagedFields:[]} Spec:{PVCSelector:{MatchLabels:map[appname:busybox] MatchExpressions:[]} ReplicationState:primary S3Profiles:[minio-on-dr2] Async:0xc000fe3a90 Sync:<nil> VolSync:{RDSpec:[] Disabled:false} PrepareForFinalSync:false RunFinalSync:false Action:Failover KubeObjectProtection:<nil>} Status:{State: ProtectedPVCs:[] Conditions:[] ObservedGeneration:0 LastUpdateTime:0001-01-01 00:00:00 +0000 UTC KubeObjectProtection:{CaptureToRecoverFrom:<nil>} PrepareForFinalSyncComplete:false FinalSyncComplete:false LastGroupSyncTime:<nil> LastGroupSyncDuration:nil LastGroupSyncBytes:<nil>}}	{"DRPC": "busybox-regional-rbd-deploy/busybox-regional-rbd-deploy-drpc", "rid": "2e059f82-a46a-48bd-b975-a0619a608aa7"}

return false
log.Info("Delete event")

return true
@nirs (Member, Author) replied:

This does not work; it looks like FilterDRCluster is filtering only drclusters that are the failover clusters for some drpcs, including skipping for metro dr. This code is complex and we don't have enough time to modify it now.

I'll work on this later.
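For reference, watching drcluster delete events with a controller-runtime predicate might look roughly like this (a sketch, not the PR code):

    import (
        "sigs.k8s.io/controller-runtime/pkg/event"
        "sigs.k8s.io/controller-runtime/pkg/predicate"
    )

    // Let delete events through so a drcluster deletion triggers prompt
    // reconciliation instead of waiting for a periodic resync. The real
    // filtering also has to map the deleted drcluster to the affected drpcs.
    var drClusterPredicate = predicate.Funcs{
        DeleteFunc: func(e event.DeleteEvent) bool {
            return true
        },
    }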

controllers/util/mw_util.go
@ShyamsundarR (Member) left a comment:

From the use-case perspective of deleting a lost DRCluster, and prior to that failing over workloads from the lost cluster to the surviving cluster, the code changes are fine.

There is a set of things we need to take care of in the future:

  • (existing issue?) DRCluster delete should delete volsync addon created at the hub
  • We should fail to process newer DRPC instances created referencing a DRPolicy with a deleted DRCluster
    • e.g.
      // TODO: What if the DRCluster is deleted? If new DRPC fail reconciliation
      • We cannot error above, as existing DRPC would also detect the same and error out without processing newer actions (say failover from deleted cluster)
    • Need to do this at a point before we apply the finalizer to the DRPC instance, such that we refuse to process it at all
  • If the current DRPC is Primary on the deleted cluster and the DRPC is deleted in this state, then secondary cluster resources may not get garbage collected
    • RBD secondary images may get left behind
    • MetroDR case may have PVs left behind on the surviving primary cluster?
    • Volsync resources on Secondary should be cleaned up for such situations
    • Should we insist on failover before DRPC deletion? But how? The DRPC is already deleted when we detect this, so further spec changes are not allowed. (Unsure how to handle this at the moment.)
  • (good to have) On DRPC deletion we can also garbage collect the ManifestWork for the namespace on the deleted DRCluster, to leave a cleaner setup
  • (tidier status) cleanupSecondaries may keep waiting for MCV to report VRG is deleted on the deleted DRCluster (which will never happen)
    • This leads to a FailedOver state, but the DRPC is stuck in the cleaning-up state, which can be avoided by handling a deleted DRCluster in this scenario
    • Similarly cleanupForVolSync may also be stuck attempting to cleanup
    • When we handle the above we would need to also ensure we do not attempt to setup volsync again EnsureVolSyncReplicationSetup
  • Failover to deleted DRCluster should be prevented
  • Relocate to/from deleted cluster is currently allowed and should be prevented in both cases
  • createVRGManifestWorkAsPrimary MAY update the VRG as primary from secondary (volsync case), which is later updated with the correct S3 stores in ensureVRGManifestWork
    • Can be corrected, or marked as known for future code reading to avoid the question

controllers/drplacementcontrol.go
When creating s3 profile names, include only the s3 profiles from
available drclusters.

This affects generating a VRG for a manifestwork. When validating
failover prerequisites we already use only the failover cluster, and
this cluster cannot be deleted.

Another case is the drcluster deploy/undeploy flow, which still uses
all drclusters and may require more work.

Signed-off-by: Nir Soffer <nsoffer@redhat.com>
- More consistent logging - "Creating ManifestWork" or "Updating
  ManifestWork".
- Extract key variable to clean up the Client.Get() calls
- Eliminate unneeded temporary err and retryErr variables
- Remove commented code

Signed-off-by: Nir Soffer <nsoffer@redhat.com>
When a drcluster is deleted, the VRG s3 profiles list changes, so we
need to update the manifestwork.

In ensureVRGManifestWork() we skipped updating the manifestwork if the
VRG exists; now we always update the manifestwork. This requires a
little more work for every reconcile, but should be efficient enough
since the vrg and manifestwork are cached.

Issues:

- It is not clear if ensureVRGManifestWork() is safe only for a primary
  vrg, and whether its callers are using it correctly. But if we had
  wrong callers before, this change increases the impact of a wrong
  call, updating an existing VRG which was not updated before. Add a
  TODO to look at this later, so we can test the replace cluster flow now.

- The VRG is updated a few minutes after detecting a deleted drcluster,
  since we don't watch drcluster delete events, and we filter drcluster
  events only for drpcs failing over to updated clusters.

Signed-off-by: Nir Soffer <nsoffer@redhat.com>
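In sketch form, the behavior change described above (hypothetical helper names, not the real ramen code):

    // Hypothetical stand-ins for building the desired VRG and applying it
    // through a ManifestWork.
    type vrgSpec struct{ S3Profiles []string }

    func buildVRG(cluster string) vrgSpec                               { return vrgSpec{} }
    func createOrUpdateVRGManifestWork(cluster string, v vrgSpec) error { return nil }

    // Before: return early when the VRG manifestwork already exists, so later
    // spec changes were never pushed to the managed cluster.
    // After: rebuild the desired VRG and create or update the manifestwork on
    // every reconcile, so a shrinking s3Profiles list (after a drcluster
    // deletion) reaches the managed cluster.
    func ensureVRGManifestWork(cluster string) error {
        vrg := buildVRG(cluster) // includes only s3 profiles of non-deleted drclusters
        return createOrUpdateVRGManifestWork(cluster, vrg)
    }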
@ShyamsundarR ShyamsundarR merged commit b8bedb9 into RamenDR:main Dec 13, 2023
13 checks passed