Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Node agent restart enhancement #7130

Merged
merged 1 commit into from
Nov 28, 2023

Conversation

qiuming-best
Copy link
Contributor

@qiuming-best qiuming-best commented Nov 21, 2023

Thank you for contributing to Velero!

Please add a summary of your change

Enhance the recoverability of backup & restore in data-mover

  • We could redo the data upload or data download which CRs in the Prepared phases when the node-agent restarts instead of canceling directly.
  • For those accepted CRs, we can retrieve them by label velero.io/accepted-by and cancel it when node-agent restarts.

Does your change fix a particular issue?

Fixes #(issue)
#6727

Please indicate you've done the following:

  • Accepted the DCO. Commits without the DCO will delay acceptance.
  • Created a changelog file or added /kind changelog-not-required as a comment on this pull request.
  • Updated the corresponding documentation in site/content/docs/main.

@qiuming-best qiuming-best marked this pull request as draft November 21, 2023 02:30
@qiuming-best qiuming-best force-pushed the data-mover-recoverbility branch from 780edca to 51bb5f6 Compare November 21, 2023 07:34
@qiuming-best qiuming-best force-pushed the data-mover-recoverbility branch 2 times, most recently from 294d16a to cc9b9b1 Compare November 21, 2023 07:41
Copy link

codecov bot commented Nov 21, 2023

Codecov Report

Attention: 74 lines in your changes are missing coverage. Please review.

Comparison is base (e58a780) 61.19% compared to head (98a56eb) 61.70%.
Report is 21 commits behind head on main.

Files Patch % Lines
pkg/controller/data_download_controller.go 41.07% 28 Missing and 5 partials ⚠️
pkg/controller/data_upload_controller.go 40.00% 28 Missing and 5 partials ⚠️
pkg/cmd/cli/nodeagent/server.go 0.00% 8 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #7130      +/-   ##
==========================================
+ Coverage   61.19%   61.70%   +0.50%     
==========================================
  Files         258      258              
  Lines       27366    27740     +374     
==========================================
+ Hits        16747    17117     +370     
+ Misses       9434     9420      -14     
- Partials     1185     1203      +18     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@qiuming-best qiuming-best force-pushed the data-mover-recoverbility branch from cc9b9b1 to e7e5da4 Compare November 21, 2023 07:52
@qiuming-best qiuming-best marked this pull request as ready for review November 21, 2023 08:18
@qiuming-best qiuming-best force-pushed the data-mover-recoverbility branch from e7e5da4 to 3df177e Compare November 27, 2023 02:26
@anshulahuja98
Copy link
Collaborator

I am thinking if this could lead to other issues.

if you think restart of node agent is likely when it is running out of mem / cpu due to high load in backing up/restoring certain large volumes

Here since today we cancel it, we atleast terminate and fail early. With your changes if we re-put, we might again cause the node agent to restart. This might be a vicious cycle.

@qiuming-best
Copy link
Contributor Author

I am thinking if this could lead to other issues.

if you think restart of node agent is likely when it is running out of mem / cpu due to high load in backing up/restoring certain large volumes

Here since today we cancel it, we atleast terminate and fail early. With your changes if we re-put, we might again cause the node agent to restart. This might be a vicious cycle.

yes, What you considered is really a problem. we should cancel the in progress CR directly

@qiuming-best qiuming-best force-pushed the data-mover-recoverbility branch 2 times, most recently from d6807d4 to 56233a4 Compare November 27, 2023 07:42
err = UpdateDataDownloadWithRetry(ctx, cli, types.NamespacedName{Namespace: dd.Namespace, Name: dd.Name},
r.logger.WithField("dataupload", dd.Name), func(dataDownload *velerov2alpha1api.DataDownload) {
dataDownload.Spec.Cancel = true
dataDownload.Status.Message = fmt.Sprintf("found a datadownload %s/%s is being deleted, mark it as cancel", dd.Namespace, dd.Name)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Message is not correct?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, Updated

if err != nil {
r.logger.WithError(err).Errorf("failed to set cancel flag with error %s", err.Error())
}
r.logger.Warn(dd.Status.Message)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make this log ahead, otherwise, when UpdateDataDownloadWithRetry returns an error, failed to set cancel flag with error %s will be first printed, which is confusing.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated

})

if err != nil {
logger.WithError(errors.WithStack(err)).Errorf("failed to mark datadownload %q into prepared", dd.GetName())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The message, into prepared should be into cancelled?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated

r.logger.WithError(errors.WithStack(err)).Errorf("failed to mark dataupload %q cancel", du.GetName())
continue
}
r.logger.WithField("dataupload", du.GetName()).Warn(du.Status.Message)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The same, move it ahead

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated

})

if err != nil {
logger.WithError(errors.WithStack(err)).Errorf("failed to mark dataupload %q into prepared", du.GetName())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

into cancelled

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated

logger.WithError(errors.WithStack(err)).Errorf("failed to mark datadownload %q into prepared", dd.GetName())
continue
}
logger.WithField("datadownload", dd.GetName()).Debug("mark datadownload into prepared")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

into cancelled

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated

logger.WithError(errors.WithStack(err)).Errorf("failed to mark dataupload %q into prepared", du.GetName())
continue
}
logger.WithField("dataupload", du.GetName()).Debug("mark dataupload into prepared")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

into cancelled

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated

@qiuming-best qiuming-best force-pushed the data-mover-recoverbility branch 2 times, most recently from 4a01c2d to 17e3354 Compare November 28, 2023 01:35
// CancelAcceptedDataDownload will cancel the accepted data download
func (r *DataDownloadReconciler) CancelAcceptedDataDownload(ctx context.Context, cli client.Client, ns string) {
r.logger.Infof("Canceling accepted data for node %s", r.nodeName)
dataDownloads, err := r.findAcceptDataDownloadsByNodeLable(ctx, cli, ns)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: label

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks, updated

Signed-off-by: Ming Qiu <mqiu@vmware.com>
@qiuming-best qiuming-best force-pushed the data-mover-recoverbility branch from 17e3354 to 98a56eb Compare November 28, 2023 05:50
@ywk253100 ywk253100 merged commit 6ac7ff1 into vmware-tanzu:main Nov 28, 2023
23 of 24 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants