Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Perf improvements for existing resource restore #6723

Merged
merged 1 commit into from
Oct 12, 2023

Conversation

sseago
Copy link
Collaborator

@sseago sseago commented Aug 30, 2023

Use informer cache with dynamic client for Get calls on restore
When enabled, also make the Get call before create attempt.

Add server and install parameter to allow disabling this feature, but enable by default

This change results in significant restore performance improvements in cases where resources to be restored already exist in the cluster. As a test example, backed up a namespace with 31k secrets in it (the number was chosen to approximately match the total number of items in a real backup of a user who was concerned with restore performance numbers for the existing resource case).

Without this change:
restore into new namespace: 52 minutes
restore with all resources existing (existingResourcePolicy=none): 52 minutes
restore with all resources existing (existinResourcePolicy=update): 1 hour, 18 minutes

With this change:
restore into new namespace: 52 minutes
restore with all resources existing (existingResourcePolicy=none): 26 minutes
restore with all resources existing (existinResourcePolicy=update): 26 minutes

Does your change fix a particular issue?

Fixes #(issue)

Please indicate you've done the following:

  • [x ] Accepted the DCO. Commits without the DCO will delay acceptance.
  • [x ] Created a changelog file or added /kind changelog-not-required as a comment on this pull request.
  • Updated the corresponding documentation in site/content/docs/main.

@sseago sseago marked this pull request as draft August 30, 2023 16:01
@sseago sseago marked this pull request as ready for review August 30, 2023 16:08
@github-actions github-actions bot requested a review from blackpiglet August 30, 2023 16:08
@sseago sseago added the 1.13-candidate issue/pr that should be considered to target v1.13 minor release label Aug 30, 2023
@sseago sseago force-pushed the restore-get-perf branch 2 times, most recently from 7c07591 to e2f2d83 Compare August 30, 2023 16:45
@codecov
Copy link

codecov bot commented Aug 30, 2023

Codecov Report

Merging #6723 (7750e12) into main (b316101) will increase coverage by 0.11%.
The diff coverage is 88.72%.

@@            Coverage Diff             @@
##             main    #6723      +/-   ##
==========================================
+ Coverage   60.69%   60.81%   +0.11%     
==========================================
  Files         249      249              
  Lines       26493    26607     +114     
==========================================
+ Hits        16081    16180      +99     
- Misses       9268     9278      +10     
- Partials     1144     1149       +5     
Files Coverage Δ
pkg/controller/restore_controller.go 66.09% <100.00%> (+0.12%) ⬆️
pkg/install/deployment.go 88.79% <100.00%> (+0.33%) ⬆️
pkg/restore/request.go 100.00% <ø> (ø)
pkg/cmd/server/server.go 22.17% <66.66%> (+0.17%) ⬆️
pkg/client/dynamic.go 0.00% <0.00%> (ø)
pkg/install/resources.go 75.39% <0.00%> (-0.90%) ⬇️
pkg/restore/restore.go 65.93% <91.74%> (+1.54%) ⬆️

@weshayutin
Copy link
Contributor

@shubham-pampattiwar @kaovilai please review as you have time :)

pkg/test/fake_dynamic.go Outdated Show resolved Hide resolved
pkg/cmd/server/server.go Outdated Show resolved Hide resolved
pkg/cmd/server/server.go Outdated Show resolved Hide resolved
@Lyndon-Li
Copy link
Contributor

@sseago
Could you help to open an issue for this PR, since the 1.13 tasks are tracked by issues? We still need some time to review this PR, an issue could help us timely track the review status and priority.

@reasonerjt reasonerjt added this to the v1.13 milestone Sep 13, 2023
blackpiglet
blackpiglet previously approved these changes Sep 14, 2023
@sseago
Copy link
Collaborator Author

sseago commented Sep 19, 2023

@Lyndon-Li issue is here: #6845

@reasonerjt reasonerjt removed the 1.13-candidate issue/pr that should be considered to target v1.13 minor release label Sep 20, 2023
Lyndon-Li
Lyndon-Li previously approved these changes Sep 28, 2023
pkg/restore/restore.go Outdated Show resolved Hide resolved
pkg/restore/restore.go Outdated Show resolved Hide resolved
@@ -120,6 +121,7 @@ func (o *Options) BindFlags(flags *pflag.FlagSet) {
flags.BoolVar(&o.DefaultVolumesToFsBackup, "default-volumes-to-fs-backup", o.DefaultVolumesToFsBackup, "Bool flag to configure Velero server to use pod volume file system backup by default for all volumes on all backups. Optional.")
flags.StringVar(&o.UploaderType, "uploader-type", o.UploaderType, fmt.Sprintf("The type of uploader to transfer the data of pod volumes, the supported values are '%s', '%s'", uploader.ResticType, uploader.KopiaType))
flags.BoolVar(&o.DefaultSnapshotMoveData, "default-snapshot-move-data", o.DefaultSnapshotMoveData, "Bool flag to configure Velero server to move data by default for all snapshots supporting data movement. Optional.")
flags.BoolVar(&o.UseInformerCacheForGet, "use-informer-cache-for-get", o.UseInformerCacheForGet, "Use informer cache for Get calls on restore. This will speed up restore in cases where there are backup resources which already exist in the cluster, but for very large clusters this will increase velero memory usage. Default is true. Optional.")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we give a user friendly name for this flag?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any suggestions?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Lyndon-Li @kaovilai I agree it's a bit awkward. But I couldn't come up with a better name at the time. Is shortening it to "use-informer-cache" OK? Maybe the "for get" is implied, since that's really the only call where the informer cache helps us much. I'll make that change unless someone comes up with a better name before I get to it.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah use-informer-cache sounds good, or maybe enable-informer-cache ?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm. Actually, since we're defaulting it to enabled, maybe we should call it disable-informer-cache and if false, we use informer cache, if true we don't?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah works for me.

@sseago
Copy link
Collaborator Author

sseago commented Oct 3, 2023

Updated based on review feedback.

@sseago sseago force-pushed the restore-get-perf branch 2 times, most recently from cf7904f to 930f0dd Compare October 3, 2023 21:12
@@ -122,6 +123,7 @@ func (o *Options) BindFlags(flags *pflag.FlagSet) {
flags.BoolVar(&o.DefaultVolumesToFsBackup, "default-volumes-to-fs-backup", o.DefaultVolumesToFsBackup, "Bool flag to configure Velero server to use pod volume file system backup by default for all volumes on all backups. Optional.")
flags.StringVar(&o.UploaderType, "uploader-type", o.UploaderType, fmt.Sprintf("The type of uploader to transfer the data of pod volumes, the supported values are '%s', '%s'", uploader.ResticType, uploader.KopiaType))
flags.BoolVar(&o.DefaultSnapshotMoveData, "default-snapshot-move-data", o.DefaultSnapshotMoveData, "Bool flag to configure Velero server to move data by default for all snapshots supporting data movement. Optional.")
flags.BoolVar(&o.DisableInformerCache, "disable-informer-cache", o.DisableInformerCache, "Disable informer cache for Get calls on restore. With this enabled, it will speed up restore in cases where there are backup resources which already exist in the cluster, but for very large clusters this will increase velero memory usage. Default is false (don't disable). Optional.")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think cobra already append default info to help

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually no it doesnt. it just appends what type things are.. ie string or bool.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

right, so better to keep the default text, thanks @kaovilai !

@Lyndon-Li
Copy link
Contributor

@sseago Could you help to rebase from the main branch? The codespell errors have been fixed in the latest code.

Use informer cache with dynamic client for Get calls on restore
When enabled, also make the Get call before create.

Add server and install parameter to allow disabling this feature,
but enable by default

Signed-off-by: Scott Seago <sseago@redhat.com>
@sseago
Copy link
Collaborator Author

sseago commented Oct 11, 2023

I set DCO to pass because the errors that showed up weren't for my PR.

Commit sha: [7f73aca](https://github.com/vmware-tanzu/velero/pull/6723/commits/7f73acab1644f7170779b7225a5d7336b20f0868), Author: Guang Jiong Lou, Committer: GitHub; Can not find "Guang Jiong Lou [7991675+27149chen@users.noreply.github.com](mailto:7991675+27149chen@users.noreply.github.com)", in ["lou [alex1988@outlook.com](mailto:alex1988@outlook.com)", "lou [alex1988@outlook.com](mailto:alex1988@outlook.com)", "lou [alex1988@outlook.com](mailto:alex1988@outlook.com)", "lou [alex1988@outlook.com](mailto:alex1988@outlook.com)", "lou [alex1988@outlook.com](mailto:alex1988@outlook.com)", "lou [alex1988@outlook.com](mailto:alex1988@outlook.com)", "lou [alex1988@outlook.com](mailto:alex1988@outlook.com)"].
Commit sha: [e5e99c7](https://github.com/vmware-tanzu/velero/pull/6723/commits/e5e99c75a0cb2e0958fabd54707db82137073b99), Author: Yang Gang, Committer: GitHub; Expected "Yang Gang [gang.yang@daocloud.io](mailto:gang.yang@daocloud.io)", but got "yanggang [gang.yang@daocloud.io](mailto:gang.yang@daocloud.io)".

@sseago
Copy link
Collaborator Author

sseago commented Oct 11, 2023

@Lyndon-Li done

@shubham-pampattiwar shubham-pampattiwar merged commit ad114f8 into vmware-tanzu:main Oct 12, 2023
45 checks passed
@weshayutin
Copy link
Contributor

@sseago I believe we wanted to CP this to v.1.11 for oadp-1.2

@kaovilai
Copy link
Member

kaovilai commented Jan 8, 2024

@weshayutin @sseago keep in mind that there's an issue with the functionality noted in #7263

@sseago
Copy link
Collaborator Author

sseago commented Jan 8, 2024

@kaovilai Looks like the bug in #7263 was only introduced in 1.13 development, although that's been fixed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants