Restore fails due to insufficient ephemeral node storage #7499
The DataDownload has not actually completed at the time of cancellation. Most probably, the node-agent was OOM killed, so you can try to increase the memory limit of node-agent and retry. If the DataDownload is not completed, the restore should not be marked as Completed. Could you double confirm this?
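A minimal sketch of such a limit bump, assuming Velero's default `node-agent` daemonset in the `velero` namespace and an illustrative 4Gi value (adjust to your workload):

```bash
# Strategic merge patch: merges into the existing container by name,
# so the node-agent container's other fields are left untouched.
kubectl -n velero patch daemonset node-agent --type strategic -p \
  '{"spec":{"template":{"spec":{"containers":[{"name":"node-agent","resources":{"limits":{"memory":"4Gi"}}}]}}}}'
```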
Hi @Lyndon-Li! So technically this is the status of the restore:

```
Warnings:
  Velero:     <none>
  Cluster:    could not restore, CustomResourceDefinition "sealedsecrets.bitnami.com" already exists. Warning: the in-cluster version is different than the backed-up version
  Namespaces:
    my-namespace:  could not restore, ConfigMap "kube-root-ca.crt" already exists. Warning: the in-cluster version is different than the backed-up version

Errors:
  Velero:     <none>
  Cluster:    <none>
  Namespaces: <none>
```
My bad, actually there's a line at the end of the describe output:

```
Restore Item Operations: 2 of 3 completed successfully, 1 failed (specify --details for more information)
```

However, I still think this should also be surfaced in the restore's warnings or errors. But you are probably right about the node-agent; how can I debug this further to make sure that's the root cause? Mind you, this restore process worked the very first time, but never since: it always fails on the same PV while restoring every other PV.
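For reference, the failed operation can be inspected with the `--details` flag the output hints at (restore name taken from this thread):

```bash
velero restore describe my-backup-20240307151012 --details
```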
Okay, I've checked the node agents, and indeed I could observe OOMKilled statuses. I've increased the memory limit to a much higher value and I'm rerunning the restore; will let you know how it worked out.
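A quick sketch of how such OOM kills can be spotted, assuming the `name=node-agent` label the Velero daemonset applies by default:

```bash
# Prints each node-agent pod with the termination reason of its last
# container restart (e.g. OOMKilled), if any.
kubectl -n velero get pods -l name=node-agent \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[0].lastState.terminated.reason}{"\n"}{end}'
```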
After increasing the resource limits the node-agent was not OOM killed, but the restore failed again. I checked the logs and could only see these lines:

```
time="2024-03-07T14:10:13Z" level=info msg="Reconcile my-backup-20240307151012-cl925" controller=datadownload datadownload=velero/my-backup-20240307151012-cl925 logSource="pkg/controller/data_download_controller.go:102"
time="2024-03-07T14:10:13Z" level=warning msg="failed to add finalizer for velero/my-backup-20240307151012-cl925 and will requeue later" controller=datadownload datadownload=velero/my-backup-20240307151012-cl925 logSource="pkg/controller/data_download_controller.go:134"
time="2024-03-07T14:10:13Z" level=info msg="Reconcile my-backup-20240307151012-cl925" controller=datadownload datadownload=velero/my-backup-20240307151012-cl925 logSource="pkg/controller/data_download_controller.go:102"
time="2024-03-07T14:10:13Z" level=info msg="Data download starting" controller=datadownload datadownload=velero/my-backup-20240307151012-cl925 logSource="pkg/controller/data_download_controller.go:151"
time="2024-03-07T14:10:13Z" level=info msg="Accepting data download my-backup-20240307151012-cl925" controller=DataDownload logSource="pkg/controller/data_download_controller.go:659"
time="2024-03-07T14:10:13Z" level=info msg="This datadownload has been accepted by others" DataDownload=my-backup-20240307151012-cl925 controller=DataDownload logSource="pkg/controller/data_download_controller.go:689"
time="2024-03-07T14:10:13Z" level=info msg="Reconcile my-backup-20240307151012-cl925" controller=datadownload datadownload=velero/my-backup-20240307151012-cl925 logSource="pkg/controller/data_download_controller.go:102"
time="2024-03-07T14:10:30Z" level=info msg="Reconcile my-backup-20240307151012-cl925" controller=datadownload datadownload=velero/my-backup-20240307151012-cl925 logSource="pkg/controller/data_download_controller.go:102"
time="2024-03-07T14:10:31Z" level=info msg="Reconcile my-backup-20240307151012-cl925" controller=datadownload datadownload=velero/my-backup-20240307151012-cl925 logSource="pkg/controller/data_download_controller.go:102"
time="2024-03-07T14:10:31Z" level=info msg="Data download is prepared" controller=datadownload datadownload=velero/my-backup-20240307151012-cl925 logSource="pkg/controller/data_download_controller.go:227"
time="2024-03-07T14:20:33Z" level=info msg="Reconcile my-backup-20240307151012-cl925" controller=datadownload datadownload=velero/my-backup-20240307151012-cl925 logSource="pkg/controller/data_download_controller.go:102"
time="2024-03-07T14:20:33Z" level=info msg="Data download is in progress" controller=datadownload datadownload=velero/my-backup-20240307151012-cl925 logSource="pkg/controller/data_download_controller.go:286"
time="2024-03-07T14:20:45Z" level=info msg="Reconcile my-backup-20240307151012-cl925" controller=datadownload datadownload=velero/my-backup-20240307151012-cl925 logSource="pkg/controller/data_download_controller.go:102"
time="2024-03-07T14:20:45Z" level=info msg="Data download is in progress" controller=datadownload datadownload=velero/my-backup-20240307151012-cl925 logSource="pkg/controller/data_download_controller.go:286"
time="2024-03-07T14:20:59Z" level=info msg="Reconcile my-backup-20240307151012-cl925" controller=datadownload datadownload=velero/my-backup-20240307151012-cl925 logSource="pkg/controller/data_download_controller.go:102"
time="2024-03-07T14:20:59Z" level=info msg="Data download is in progress" controller=datadownload datadownload=velero/my-backup-20240307151012-cl925 logSource="pkg/controller/data_download_controller.go:286"
time="2024-03-07T14:21:09Z" level=info msg="Reconcile my-backup-20240307151012-cl925" controller=datadownload datadownload=velero/my-backup-20240307151012-cl925 logSource="pkg/controller/data_download_controller.go:102"
time="2024-03-07T14:21:09Z" level=info msg="Data download is in progress" controller=datadownload datadownload=velero/my-backup-20240307151012-cl925 logSource="pkg/controller/data_download_controller.go:286"
time="2024-03-07T14:21:38Z" level=info msg="Reconcile my-backup-20240307151012-cl925" controller=datadownload datadownload=velero/my-backup-20240307151012-cl925 logSource="pkg/controller/data_download_controller.go:102"
time="2024-03-07T14:21:38Z" level=info msg="Data download is in progress" controller=datadownload datadownload=velero/my-backup-20240307151012-cl925 logSource="pkg/controller/data_download_controller.go:286"
time="2024-03-07T14:21:50Z" level=info msg="Reconcile my-backup-20240307151012-cl925" controller=datadownload datadownload=velero/my-backup-20240307151012-cl925 logSource="pkg/controller/data_download_controller.go:102"
time="2024-03-07T14:21:50Z" level=info msg="Data download is in progress" controller=datadownload datadownload=velero/my-backup-20240307151012-cl925 logSource="pkg/controller/data_download_controller.go:286"
time="2024-03-07T14:22:00Z" level=info msg="Reconcile my-backup-20240307151012-cl925" controller=datadownload datadownload=velero/my-backup-20240307151012-cl925 logSource="pkg/controller/data_download_controller.go:102"
time="2024-03-07T14:22:00Z" level=info msg="Data download is in progress" controller=datadownload datadownload=velero/my-backup-20240307151012-cl925 logSource="pkg/controller/data_download_controller.go:286"
time="2024-03-07T14:22:34Z" level=info msg="Reconcile my-backup-20240307151012-cl925" controller=datadownload datadownload=velero/my-backup-20240307151012-cl925 logSource="pkg/controller/data_download_controller.go:102"
time="2024-03-07T14:22:34Z" level=info msg="Data download is in progress" controller=datadownload datadownload=velero/my-backup-20240307151012-cl925 logSource="pkg/controller/data_download_controller.go:286"
time="2024-03-07T14:22:52Z" level=info msg="Reconcile my-backup-20240307151012-cl925" controller=datadownload datadownload=velero/my-backup-20240307151012-cl925 logSource="pkg/controller/data_download_controller.go:102"
time="2024-03-07T14:22:52Z" level=info msg="Data download is in progress" controller=datadownload datadownload=velero/my-backup-20240307151012-cl925 logSource="pkg/controller/data_download_controller.go:286"
time="2024-03-07T14:23:06Z" level=info msg="Reconcile my-backup-20240307151012-cl925" controller=datadownload datadownload=velero/my-backup-20240307151012-cl925 logSource="pkg/controller/data_download_controller.go:102"
time="2024-03-07T14:23:06Z" level=info msg="Data download is in progress" controller=datadownload datadownload=velero/my-backup-20240307151012-cl925 logSource="pkg/controller/data_download_controller.go:286"
time="2024-03-07T14:23:39Z" level=info msg="Reconcile my-backup-20240307151012-cl925" controller=datadownload datadownload=velero/my-backup-20240307151012-cl925 logSource="pkg/controller/data_download_controller.go:102"
time="2024-03-07T14:23:39Z" level=info msg="Data download is in progress" controller=datadownload datadownload=velero/my-backup-20240307151012-cl925 logSource="pkg/controller/data_download_controller.go:286"
time="2024-03-07T14:24:33Z" level=info msg="Reconcile my-backup-20240307151012-cl925" controller=datadownload datadownload=velero/my-backup-20240307151012-cl925 logSource="pkg/controller/data_download_controller.go:102"
time="2024-03-07T14:24:33Z" level=info msg="Data download is in progress" controller=datadownload datadownload=velero/my-backup-20240307151012-cl925 logSource="pkg/controller/data_download_controller.go:286"
time="2024-03-07T14:24:50Z" level=info msg="Reconcile my-backup-20240307151012-cl925" controller=datadownload datadownload=velero/my-backup-20240307151012-cl925 logSource="pkg/controller/data_download_controller.go:102"
time="2024-03-07T14:24:50Z" level=info msg="Data download is in progress" controller=datadownload datadownload=velero/my-backup-20240307151012-cl925 logSource="pkg/controller/data_download_controller.go:286"
time="2024-03-07T14:25:03Z" level=info msg="Reconcile my-backup-20240307151012-cl925" controller=datadownload datadownload=velero/my-backup-20240307151012-cl925 logSource="pkg/controller/data_download_controller.go:102"
time="2024-03-07T14:25:03Z" level=info msg="Data download is in progress" controller=datadownload datadownload=velero/my-backup-20240307151012-cl925 logSource="pkg/controller/data_download_controller.go:286"
time="2024-03-07T14:25:48Z" level=info msg="Reconcile my-backup-20240307151012-cl925" controller=datadownload datadownload=velero/my-backup-20240307151012-cl925 logSource="pkg/controller/data_download_controller.go:102"
time="2024-03-07T14:25:48Z" level=info msg="Data download is in progress" controller=datadownload datadownload=velero/my-backup-20240307151012-cl925 logSource="pkg/controller/data_download_controller.go:286"
time="2024-03-07T14:26:51Z" level=info msg="Reconcile my-backup-20240307151012-cl925" controller=datadownload datadownload=velero/my-backup-20240307151012-cl925 logSource="pkg/controller/data_download_controller.go:102"
time="2024-03-07T14:26:51Z" level=info msg="Data download is in progress" controller=datadownload datadownload=velero/my-backup-20240307151012-cl925 logSource="pkg/controller/data_download_controller.go:286"
time="2024-03-07T14:27:10Z" level=info msg="Reconcile my-backup-20240307151012-cl925" controller=datadownload datadownload=velero/my-backup-20240307151012-cl925 logSource="pkg/controller/data_download_controller.go:102"
time="2024-03-07T14:27:10Z" level=info msg="Data download is in progress" controller=datadownload datadownload=velero/my-backup-20240307151012-cl925 logSource="pkg/controller/data_download_controller.go:286"
time="2024-03-07T14:36:02Z" level=info msg="Reconcile my-backup-20240307151012-cl925" controller=datadownload datadownload=velero/my-backup-20240307151012-cl925 logSource="pkg/controller/data_download_controller.go:102"
time="2024-03-07T14:36:02Z" level=info msg="Data download is in progress" controller=datadownload datadownload=velero/my-backup-20240307151012-cl925 logSource="pkg/controller/data_download_controller.go:286"
time="2024-03-07T14:36:02Z" level=info msg="Data download is being canceled" controller=datadownload datadownload=velero/my-backup-20240307151012-cl925 logSource="pkg/controller/data_download_controller.go:288"
time="2024-03-07T14:36:02Z" level=warning msg="Async fs backup data path canceled" controller=DataDownload datadownload=my-backup-20240307151012-cl925 logSource="pkg/controller/data_download_controller.go:399"
time="2024-03-07T14:36:02Z" level=warning msg="failed to delete pvc velero/my-backup-20240307151012-cl925 with err persistentvolumeclaims \"my-backup-20240307151012-cl925\" not found" logSource="pkg/util/kube/pvc_pv.go:73"
time="2024-03-07T14:36:02Z" level=info msg="Reconcile my-backup-20240307151012-cl925" controller=datadownload datadownload=velero/my-backup-20240307151012-cl925 logSource="pkg/controller/data_download_controller.go:102" I cannot figure why it is being cancelled. I'll increase resources to do a last attempt, but according to metrics the usage is within limits, so can you please suggest what can go wrong here? |
By digging deeper into the related DataDownload resource:

```
$ k describe datadownloads.velero.io my-backup-20240307151012-cl925
Name: my-backup-20240307151012-cl925
Namespace: velero
Labels: velero.io/accepted-by=d6c37003-a9f8-45db-ba7e-e965bd8136d8
velero.io/async-operation-id=dd-36e7ca5e-3ff9-48f8-ac4c-42483370ec30.cdcd4260-b15a-4148daa0e
velero.io/restore-name=my-backup-20240307151012
velero.io/restore-uid=36e7ca5e-3ff9-48f8-ac4c-42483370ec30
Annotations: <none>
API Version: velero.io/v2alpha1
Kind: DataDownload
Metadata:
Creation Timestamp: 2024-03-07T14:10:13Z
Generate Name: my-backup-20240307151012-
Generation: 22
Owner References:
API Version: velero.io/v1
Controller: true
Kind: Restore
Name: my-backup-20240307151012
UID: 36e7ca5e-3ff9-48f8-ac4c-42483370ec30
Resource Version: 37706165
UID: dacd13c6-e0f2-4c53-bdb8-addbc1b28592
Spec:
Backup Storage Location: default
Cancel: true
Operation Timeout: 10m0s
Snapshot ID: 6f9d4868fec1571c92b2f9c54a1e4562
Source Namespace: my-namespace
Target Volume:
Namespace: my-namespace
Pv:
Pvc: elasticsearch-data-pvc
Status:
Completion Timestamp: 2024-03-07T14:36:02Z
Message: found a datadownload with status "InProgress" during the node-agent starting, mark it as cancel
Node: 2cefba45-3cae-46c4-bbde-a737212e14d3
Phase: Canceled
Progress:
Bytes Done: 42874466070
Total Bytes: 45149812450
Start Timestamp: 2024-03-07T14:10:13Z
Events:                <none>
```

The line that stands out is in the status:

> "found a datadownload with status "InProgress" during the node-agent starting, mark it as cancel"

Thoughts?
Okay, I checked the node-agents again and it seems like they get evicted one by one and then recreated. So I checked the status of the evicted ones:

```
Status:  Failed
Reason:  Evicted
Message: Pod The node had condition: [DiskPressure].
```

This made me think about the ephemeral storage size of my nodes, which is normally 50G, while the PV I'm trying to restore is close to 50G. I'm going to double the ephemeral storage and retry the restore. The weird thing is that the very first restore worked without an issue, yet it has kept failing consistently ever since for the very same PV, so there might be some cleanup issue associated with failed restores?
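For reference, a sketch of how the node-side numbers can be checked (`<node-name>` is a placeholder):

```bash
# Allocatable ephemeral storage per node, then the DiskPressure
# condition of the node that evicted the agent.
kubectl get nodes -o custom-columns=NAME:.metadata.name,EPHEMERAL:.status.allocatable.ephemeral-storage
kubectl describe node <node-name> | grep -i -A 2 DiskPressure
```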
Alright, so the root cause was that the node running the node-agent was running out of ephemeral storage, which is weird, as I wouldn't expect 50 GB to be insufficient for the restore operation. Is it required to have at least as much temporary node storage as the size of the backup being restored? It's just a guess, but I think the restore process is leaking storage somehow. Can you please check?
The answer is no. It is known that kopia persists some data locally as cache, so the disk usage can vary based on the number of namespaces the node-agent handles.
Below is some information about the disk cache:
@Kamforka As described above, the disk space required has nothing to do with the volume size to be restored or the total size of the node storage. It is only related to the free space of the node storage: if the free space is not enough, the node storage runs out and pods on the node get evicted. To further troubleshoot, please check which of the four points above your case falls into.
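A way to watch that free space directly, using the same kubelet summary API as the pod-level checks below (`<node-name>` is a placeholder):

```bash
kubectl get --raw "/api/v1/nodes/<node-name>/proxy/stats/summary" \
  | jq '.node.fs | {capacityBytes, availableBytes, usedBytes}'
```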
Hi everyone, thanks for the information. Sorry for not following up on it immediately, but I was doing a couple of test restores today, and I still think something is not quite right about the ephemeral storage consumption of the node agents. I was running a single restore of a single namespace and monitoring the node-agent that was restoring my main PV. This is the historical ephemeral storage usage of the pod:

```
Every 1.0s: kubectl get --raw "/api/v1/nodes/6bb767d7-05b0-4467-b2c7-dd61e2d6af47/proxy/stats/summary" | jq '.pods[] | select(.podRef.name == "node-agent-pvx47")."ephemeral-storage"'
{
{
"time": "2024-03-11T10:33:44Z",
"availableBytes": 95836561408,
"capacityBytes": 105490616320,
"usedBytes": 19115122688,
"inodesFree": 6522280,
"inodes": 6553600,
"inodesUsed": 13894
}
Every 1.0s: kubectl get --raw "/api/v1/nodes/6bb767d7-05b0-4467-b2c7-dd61e2d6af47/proxy/stats/summary" | jq '.pods[] | select(.podRef.name == "node-agent-pvx47")."ephemeral-storage"'
{
"time": "2024-03-11T10:37:33Z",
"availableBytes": 95836184576,
"capacityBytes": 105490616320,
"usedBytes": 28821803008,
"inodesFree": 6522280,
"inodes": 6553600,
"inodesUsed": 17565
}
Every 1.0s: kubectl get --raw "/api/v1/nodes/6bb767d7-05b0-4467-b2c7-dd61e2d6af47/proxy/stats/summary" | jq '.pods[] | select(.podRef.name == "node-agent-pvx47")."ephemeral-storage"'
{
"time": "2024-03-11T10:38:29Z",
"availableBytes": 95836176384,
"capacityBytes": 105490616320,
"usedBytes": 38970085376,
"inodesFree": 6522278,
"inodes": 6553600,
"inodesUsed": 20513
}
Every 1.0s: kubectl get --raw "/api/v1/nodes/6bb767d7-05b0-4467-b2c7-dd61e2d6af47/proxy/stats/summary" | jq '.pods[] | select(.podRef.name == "node-agent-pvx47")."ephemeral-storage"'
{
"time": "2024-03-11T10:39:54Z",
"availableBytes": 95836143616,
"capacityBytes": 105490616320,
"usedBytes": 50268405760,
"inodesFree": 6522277,
"inodes": 6553600,
"inodesUsed": 18073
}
```

As you can see, minute by minute the consumption is skyrocketing: 19 GB -> 28.8 GB -> 38.9 GB -> 50 GB. This time it did not trigger a warning, but previously I could observe the events below on the corresponding nodes:

```
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning EvictionThresholdMet 42m kubelet Attempting to reclaim ephemeral-storage
Normal NodeHasDiskPressure 42m kubelet Node 6bb767d7-05b0-4467-b2c7-dd61e2d6af47 status is now: NodeHasDiskPressure
  Normal   NodeHasNoDiskPressure  37m (x3 over 3d2h)  kubelet  Node 6bb767d7-05b0-4467-b2c7-dd61e2d6af47 status is now: NodeHasNoDiskPressure
```

This doesn't seem right to me, any thoughts?
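A rough way to confirm where that usage lives inside the pod, assuming the node-agent image ships a shell (`node-agent-pvx47` as above; the survey is a guess, not a documented cache location):

```bash
# Show overall filesystem usage, then the largest directories up to
# two levels deep, to find what is growing.
kubectl -n velero exec node-agent-pvx47 -- \
  sh -c 'df -h; du -x -d 2 / 2>/dev/null | sort -n | tail -n 20'
```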
There is an issue in Velero where the repo's data cache limit is not set correctly. As a result, the cache control falls back to sweeping cached data only once it is older than 10 minutes, so within that 10-minute window the cached data may accumulate to a high value without any control. We will fix this problem.
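For context, these are the cache knobs involved on the kopia side; the commands below apply to a standalone kopia repository and are shown only to illustrate the limits in question, not as something to run against Velero's managed repo:

```bash
# Inspect the current cache directory and limits.
kopia cache info
# Cap the content and metadata caches (soft limits enforced by kopia's
# periodic cache sweep).
kopia cache set --content-cache-size-mb=1024 --metadata-cache-size-mb=1024
```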
Here is the conclusion after a further discussion:

Let us know your feedback. cc @Kamforka
What steps did you take and what happened:

```
velero restore create --from-backup my-backup --restore-volumes=true --wait
```

then

```
velero restore describe my-backup-20240305145411 --details
```
What did you expect to happen:

I expected to take a backup of a namespace (which was successful), delete the namespace, and then restore it.
When I run the restore, it finishes after some time as Completed, but the restored namespace has a PV that will never be created, because its data download was cancelled for an unknown reason (as you can see from the detailed describe).
This corrupts the recreation/binding of the restored PVC and thus the whole namespace.
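A minimal check for that symptom, using the resource names that appear earlier in the thread:

```bash
# With the DataDownload cancelled, the restored PVC is expected to sit
# Pending with no PV ever bound to it.
kubectl -n my-namespace get pvc elasticsearch-data-pvc
kubectl -n my-namespace describe pvc elasticsearch-data-pvc
```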
The following information will help us better understand what's going on:
Unfortunately, I cannot provide the logs due to confidentiality reasons.
Anything else you would like to add:
Environment:
- Velero version (use `velero version`): v1.13.0
- Velero features (use `velero client config get features`): features:
- Kubernetes version (use `kubectl version`): client/v1.29.1; server/v1.24.14+vmware.1
- OS (e.g. from `/etc/os-release`): Ubuntu 22.04.3 LTS