Support detect k8s resource dependency during backup #7199

Open
blackpiglet opened this issue Dec 11, 2023 · 11 comments
Labels
2024 Q1 reviewed backlog Icebox We see the value, but it is not slated for the next couple releases. Kubernetes Resources Pertains to backup/restoration of Kubernetes resources

Comments

@blackpiglet
Contributor

blackpiglet commented Dec 11, 2023

Describe the problem/challenge you have

It would help if the Velero server knew the dependencies among the k8s resources being backed up.
With that knowledge, it could detect invalid backups before running the backup process.
This feature could help resolve the scenario described in PR #7045.

This feature would also benefit:

  • Running multiple backups in parallel (checking whether the backups' resources overlap).
  • Supporting advanced resource restore sequences.
  • Pausing and resuming backups and restores.

Describe the solution you'd like

The Velero server can use a DAG (Directed Acyclic Graph) as the data structure to store the backup resources.

The DAG's content should be:

  • The DAG could have a root node, which is the backup itself.
  • The children of the DAG's root node should be the resources that do not rely on other k8s resources, e.g. CRDs, namespaces, StorageClasses, and VolumeSnapshotClasses.
  • The node of the DAG could have multiple parents and multiple children.
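As a rough illustration of the structure described above, here is a minimal Python sketch. It is not Velero code: the `ResourceDAG` class, the `ROOT` name, and the `kind/namespace/name` strings are all assumptions made up for this example.

```python
from collections import defaultdict

ROOT = "backup"  # hypothetical name for the root node (the backup itself)


class ResourceDAG:
    """Toy dependency graph; resources are plain strings for illustration."""

    def __init__(self):
        self.children = defaultdict(set)  # node -> resources that depend on it
        self.parents = defaultdict(set)   # node -> resources it depends on

    def add_resource(self, name, depends_on=()):
        # A resource with no dependencies hangs directly off the root node.
        for parent in (depends_on or (ROOT,)):
            self.children[parent].add(name)
            self.parents[name].add(parent)


dag = ResourceDAG()
dag.add_resource("crd/widgets.example.com")
dag.add_resource("namespace/demo")
dag.add_resource("storageclass/fast")
# A node may have several parents and several children.
dag.add_resource("pvc/demo/data", depends_on=["namespace/demo", "storageclass/fast"])
dag.add_resource("pod/demo/app", depends_on=["namespace/demo", "pvc/demo/data"])
```

Keeping both `children` and `parents` maps makes it cheap to answer the two questions the proposal needs: "what comes next?" and "are all of this resource's parents done?".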

Say this string represents a DAG. The resource backup sequence should be ordered from left to right.

e > f, g > h;

The DAG should be generated by existing rules:

  • The Velero server's high-priority and low-priority resource settings.
  • Owner Reference rule.
  • Potential user-provided rules (may need a new CRD here).

While the rules are being applied, if a later rule violates the resource hierarchy already in the DAG, fail the backup and warn the user that the rule is invalid.
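One way to read the rule check above is as cycle detection: a rule is invalid if adding its edge would contradict an ordering that earlier rules already established. The sketch below assumes each rule source is reduced to a list of `(before, after)` pairs; the rule names, the resource kinds, and `InvalidRuleError` are hypothetical.

```python
from collections import defaultdict


class InvalidRuleError(Exception):
    """Raised when a rule contradicts the ordering already in the graph."""


def reachable(children, start, target):
    """Return True if target can be reached from start along child edges."""
    stack, seen = [start], set()
    while stack:
        node = stack.pop()
        if node == target:
            return True
        if node in seen:
            continue
        seen.add(node)
        stack.extend(children[node])
    return False


def build_dag(rule_sets):
    """rule_sets is an ordered list of (rule_name, [(before, after), ...])."""
    children = defaultdict(set)
    for rule_name, edges in rule_sets:
        for before, after in edges:
            # The edge before -> after closes a cycle if `before` is
            # already reachable from `after`.
            if reachable(children, after, before):
                raise InvalidRuleError(
                    f"rule {rule_name!r}: {before} -> {after} violates the existing hierarchy"
                )
            children[before].add(after)
    return children


rules = [
    ("priority", [("namespace", "deployment"), ("namespace", "pod")]),
    ("owner-reference", [("deployment", "replicaset"), ("replicaset", "pod")]),
]
dag = build_dag(rules)

# A later user-provided rule that reverses an established ordering is rejected.
try:
    build_dag(rules + [("user", [("pod", "namespace")])])
    rejected = False
except InvalidRuleError:
    rejected = True
```

Because rules are applied in order, the earlier sources (priority settings, owner references) win, and the error names the later rule that should be reported to the user.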

When taking the backup, the Velero server should start from the root node and go through the root node's children, then traverse the children's children. If the backup reaches a resource whose parents are not all backed up yet, the Velero server should put that resource on hold and move on. It should then retry the on-hold resources before traversing the next layer of resources.
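The layer-by-layer traversal with an on-hold list could look roughly like the following. This is only a sketch under assumptions: `backup_order` and the `parents` mapping (resource to the set of resources it depends on) are invented for the example, and "backing up" a resource is reduced to appending its name to a list.

```python
def backup_order(parents, root="backup"):
    """Return (order, on_hold): the backup sequence and anything left unresolved."""
    # Derive the child edges from the parent sets.
    children = {}
    for node, deps in parents.items():
        for dep in deps:
            children.setdefault(dep, set()).add(node)

    done, order, on_hold = {root}, [], []
    layer = sorted(children.get(root, ()))
    while layer:
        finished = []
        for name in layer:
            if name in done or name in on_hold:
                continue
            if parents[name] <= done:      # all parents are backed up
                done.add(name)
                finished.append(name)
            else:                          # put it on hold and go on
                on_hold.append(name)
        # Retry the on-hold resources before moving to the next layer.
        progressed = True
        while progressed:
            progressed = False
            for name in list(on_hold):
                if parents[name] <= done:
                    on_hold.remove(name)
                    done.add(name)
                    finished.append(name)
                    progressed = True
        order.extend(finished)
        # The next layer is the children of what was just backed up.
        next_layer = set()
        for name in finished:
            next_layer.update(children.get(name, ()))
        layer = sorted(next_layer - done)
    return order, on_hold


parents = {
    "namespace/demo": {"backup"},
    "storageclass/fast": {"backup"},
    "pvc/demo/data": {"namespace/demo", "storageclass/fast"},
    "pod/demo/app": {"namespace/demo", "pvc/demo/data"},
}
order, unresolved = backup_order(parents)
```

In this example the pod is reached in the same layer as the PVC it depends on, so it is put on hold and then backed up by the retry pass once the PVC is done. A non-empty `unresolved` list would indicate a resource whose parents were never backed up.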

Vote on this issue!

This is an invitation to the Velero community to vote on issues. You can see the project's top-voted issues listed here.
Use the "reaction smiley face" up to the right of this comment to vote.

  • 👍 for "The project would be better with this feature added"
  • 👎 for "This feature will not enhance the project in a meaningful way"
@blackpiglet blackpiglet added pv-backup-info 1.14-candidate Kubernetes Resources Pertains to backup/restoration of Kubernetes resources and removed pv-backup-info labels Dec 11, 2023
@blackpiglet
Contributor Author

I gave some thought to the BIA (BackupItemAction) scenario here.
BIAs are different: Velero collects all the resources it cares about at the start of the backup, but the BIAs are executed during the backup process and can return additional resources that are also included in the backup.

That means the Velero server cannot determine the whole scope of a backup until the backup finishes. This makes supporting parallel backups impossible, because the Velero server cannot detect overlap and potential conflicts between backups.
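To make the scope problem concrete, here is a small hypothetical sketch. The resource names and the idea of modelling a backup's scope as a set of strings are assumptions for illustration, not how Velero tracks items.

```python
def overlap(scope_a, scope_b):
    """Resources that appear in both backup scopes."""
    return scope_a & scope_b


# Scopes as known at the resource-collection stage.
collected_a = {"namespace/a", "pod/a/web"}
collected_b = {"namespace/b", "pod/b/db"}
overlap_at_start = overlap(collected_a, collected_b)

# Hypothetical additional item returned by a BIA while backup A is running.
returned_by_bia = {"pod/b/db"}
final_a = collected_a | returned_by_bia
overlap_at_end = overlap(final_a, collected_b)
```

A check that runs before the backups start sees an empty `overlap_at_start`, while `overlap_at_end` contains `pod/b/db`, which is only visible once the BIA has returned its additional items.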

There has also been some discussion about adding a new method to the BIA that returns the additional resources during the backup's resource-collection stage. I don't think that can resolve the issue.

  • First, many of the additional resources the BIAs care about are created while the BIA is running, so it's not possible to know about them beforehand. I think this is safe for the parallel backup scenario, because such resources will not cause any resource overlap.
  • Second, the Velero server should do nothing with the additional resources other than archive their YAML into the metadata file. Even if an additional resource already existed in the metadata before the BIA returned it, that should not do any harm to the other parallel backups.

I think the real problem with BIAs is that the Velero server cannot know what they do. If a BIA freezes the filesystem of a pod that is not included in the backup (IMO this shouldn't happen), it will affect parallel filesystem backups.

Unfortunately, because plugins are external binaries, it's not possible to regulate their behavior.
IMO, we can only provide guidelines on how plugins should behave so that parallel backups work.

@reasonerjt reasonerjt changed the title Support detect k8s resource dependency Support detect k8s resource dependency during backup Feb 6, 2024
@reasonerjt
Contributor

I think how to define "dependency" is a topic that may cause a lot of debate, and it becomes very complicated once custom resources are considered.
As for the data structure to track the dependency, there's a design that has been merged:
https://github.com/vmware-tanzu/velero/blob/main/design/graph-manifest.md
We may consider using this data structure to solve specific problems, instead of trying to introduce a generic approach to handle all resources.

Marking this as "ice-box", since we may need more concrete use cases and should handle them separately.

github-actions bot commented Apr 8, 2024

This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days. If a Velero team member has requested log or more information, please provide the output of the shared commands.

@github-actions github-actions bot added the staled label Apr 8, 2024
@blackpiglet
Contributor Author

Not stale.

@blackpiglet blackpiglet removed the staled label Apr 8, 2024
github-actions bot commented Jun 9, 2024

This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days. If a Velero team member has requested log or more information, please provide the output of the shared commands.

@github-actions github-actions bot added the staled label Jun 9, 2024
@kaovilai
Member

unstale

@github-actions github-actions bot removed the staled label Jun 11, 2024
@blackpiglet
Contributor Author

unstale

@github-actions github-actions bot removed the staled label Aug 13, 2024
@kaovilai
Member

unstale

@github-actions github-actions bot removed the staled label Oct 15, 2024
3 participants