Restic volume restore fails with ‘interrupted system call’ #3423
Replies: 1 comment
-
I'm seeing something similar with Velero 1.5.3 on an AKS cluster using k8s 1.19.7:

time="2021-03-01T17:56:31Z" level=error msg="unable to successfully complete restic restores of pod's volumes" error="pod volume restore failed: error restoring volume: error writing done file: open /host_pods/7a3a9075-138b-4227-9dae-4c37bc2841c0/volumes/kubernetes.io~azure-file/pvc-d89bdd43-1cfd-4e95-9bcf-59d504a73fac/.velero/1a02afd1-9612-4827-ae10-ea5742c3b83e: interrupted system call" logSource="pkg/restore/restore.go:1296" restore=...

I ran the restore several times, and on the fourth attempt the restore completed without errors. The failed restores had different pods failing each time. My observations suggest this only happens for pods using RWX PVs.

Is this a race condition caused by simultaneous attempts to restore a given volume on behalf of multiple pods? Is it a benign error (the volume will have been restored successfully on behalf of one of the pods), or does it prevent non-shared PVs from being restored? E.g. is it the case that: pod1: shared-data "shared-data" is restored on behalf of pod1 |
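For reference, "interrupted system call" is the EINTR errno: the open() of the .velero done file was interrupted by a signal before it completed, which SMB-backed mounts such as Azure File can surface on slow operations. Below is a minimal Go sketch of one way a caller could retry an open that returns EINTR; the writeDoneFile helper and the retry limit are hypothetical illustrations, not Velero's actual implementation.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"syscall"
)

// writeDoneFile creates a sentinel ("done") file, retrying if the open call
// is interrupted by a signal (EINTR). Hypothetical sketch only; Velero's
// real restore code may handle this differently.
func writeDoneFile(path string) error {
	const maxRetries = 5 // arbitrary retry budget for illustration
	for i := 0; i < maxRetries; i++ {
		f, err := os.OpenFile(path, os.O_CREATE|os.O_WRONLY|os.O_TRUNC, 0644)
		if err == nil {
			return f.Close()
		}
		// Retry only when the underlying syscall reports EINTR.
		if errors.Is(err, syscall.EINTR) {
			continue
		}
		return err
	}
	return fmt.Errorf("open %s: still interrupted after %d attempts", path, maxRetries)
}

func main() {
	if err := writeDoneFile("/tmp/.velero-done-example"); err != nil {
		fmt.Println("error writing done file:", err)
	}
}
```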
-
We have been using Velero successfully for a while with a handful of Azure File volumes, but we ran into problems when we increased our cluster to 78 nodes. More specifically, velero restore is failing.
Here are a few facts:
Error restoring volume: error writing done file: open /host_pods/<someid>/volumes/kubernetes.io~azure-file/<some-pvc-name>/.velero/<someid>: interrupted system call
I tried some troubleshooting using the instructions from the Restic Integration documentation, but was not able to obtain a more descriptive message than the error above. Increasing the debug level is not an option given the sheer amount of log output (combined with the large number of nodes in our cluster).
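Short of raising the log level, one lower-noise way to see which volumes fail is to inspect the PodVolumeRestore custom resources that Velero creates in the velero.io/v1 API group. Below is a rough sketch using the Kubernetes dynamic client; the "velero" namespace and the default kubeconfig path are assumptions that may need adjusting for your environment.

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Load the local kubeconfig (assumed default path; adjust as needed).
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client, err := dynamic.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	// PodVolumeRestore custom resources live in the velero.io/v1 API group.
	gvr := schema.GroupVersionResource{Group: "velero.io", Version: "v1", Resource: "podvolumerestores"}
	list, err := client.Resource(gvr).Namespace("velero").List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}

	// Print each restore's phase and message so failed volumes are easy to spot.
	for _, item := range list.Items {
		phase, _, _ := unstructured.NestedString(item.Object, "status", "phase")
		msg, _, _ := unstructured.NestedString(item.Object, "status", "message")
		fmt.Printf("%s\tphase=%s\t%s\n", item.GetName(), phase, msg)
	}
}
```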
Any help would be greatly appreciated.
Correction to the above post:
The failure has nothing to do with the cluster size. The only other significant change that occurred at the same time as the scaling up of our cluster was the k8s upgrade from 1.17.13 to 1.18.10.
Still looking for a solution…