ManifestWork Regeneration Bug and ACM Eviction Consequences
Upon hub recovery, it is essential to regenerate the VRG ManifestWorks. For CephFS workloads,
two VRG ManifestWorks are necessary—one for the primary cluster and another for the secondary
cluster. However, a bug prevented Ramen from regenerating the secondary VRG ManifestWork. The
condition that decided whether to regenerate the ManifestWork looked only at the number of
VRGs found on the managed clusters. If that count matched the expected two, regeneration was
skipped without checking how many VRG ManifestWorks actually existed on the hub. As a result,
ACM evicted the VRG from the secondary cluster after the default one-hour eviction time.

Signed-off-by: Benamar Mekhissi <bmekhiss@ibm.com>
Benamar Mekhissi committed Dec 4, 2023
1 parent d35f8bd commit 29cdd89
Showing 2 changed files with 17 additions and 1 deletion.
4 changes: 3 additions & 1 deletion controllers/drplacementcontrolvolsync.go
@@ -44,8 +44,10 @@ func (d *DRPCInstance) ensureVolSyncReplicationCommon(srcCluster string) error {
 	// Make sure we have Source and Destination VRGs - Source should already have been created at this point
 	d.setProgression(rmn.ProgressionEnsuringVolSyncSetup)

+	vrgMWCount := d.mwu.GetVRGManifestWorkCount(rmnutil.DrpolicyClusterNames(d.drPolicy))
+
 	const maxNumberOfVRGs = 2
-	if len(d.vrgs) != maxNumberOfVRGs {
+	if len(d.vrgs) != maxNumberOfVRGs || vrgMWCount != maxNumberOfVRGs {
 		// Create the destination VRG
 		err := d.createVolSyncDestManifestWork(srcCluster)
 		if err != nil {
14 changes: 14 additions & 0 deletions controllers/util/mw_util.go
@@ -530,6 +530,20 @@ func (mwu *MWUtil) createOrUpdateManifestWork(
 	return nil
 }

+func (mwu *MWUtil) GetVRGManifestWorkCount(drClusters []string) int {
+	count := 0
+	for _, clusterName := range drClusters {
+		_, err := mwu.FindManifestWorkByType(MWTypeVRG, clusterName)
+		if err != nil {
+			continue
+		}
+
+		count++
+	}
+
+	return count
+}
+
 func (mwu *MWUtil) DeleteManifestWorksForCluster(clusterName string) error {
 	// VRG
 	err := mwu.deleteManifestWorkWrapper(clusterName, MWTypeVRG)