Restic volume restore fails with ‘interrupted system call’ #3423
Replies: 1 comment
-
I'm seeing something similar with Velero 1.5.3 on an AKS cluster using k8s 1.19.7:

time="2021-03-01T17:56:31Z" level=error msg="unable to successfully complete restic restores of pod's volumes" error="pod volume restore failed: error restoring volume: error writing done file: open /host_pods/7a3a9075-138b-4227-9dae-4c37bc2841c0/volumes/kubernetes.io~azure-file/pvc-d89bdd43-1cfd-4e95-9bcf-59d504a73fac/.velero/1a02afd1-9612-4827-ae10-ea5742c3b83e: interrupted system call" logSource="pkg/restore/restore.go:1296" restore=...

I ran the restore several times, and on the fourth attempt the restore completed without errors. The failed restores had different pods failing each time. My observations suggest this only happens for pods using RWX PVs.

Is this a race condition caused by simultaneous attempts to restore a given volume on behalf of multiple pods? Is it a benign error (the volume will have been restored successfully on behalf of one of the pods), or does it prevent non-shared PVs from being restored? E.g. is it the case that: pod1: shared-data "shared-data" is restored on behalf of pod1 |
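For reference, "interrupted system call" is the EINTR errno: the open() of the .velero done file was interrupted by a signal before it completed, which SMB-backed mounts such as Azure File can surface on slow operations. Below is a minimal Go sketch of one way a caller could retry an open that returns EINTR; the writeDoneFile helper and the retry limit are hypothetical illustrations, not Velero's actual implementation.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"syscall"
)

// writeDoneFile creates a sentinel ("done") file, retrying if the open call
// is interrupted by a signal (EINTR). Hypothetical sketch only; Velero's
// real restore code may handle this differently.
func writeDoneFile(path string) error {
	const maxRetries = 5 // arbitrary retry budget for illustration
	for i := 0; i < maxRetries; i++ {
		f, err := os.OpenFile(path, os.O_CREATE|os.O_WRONLY|os.O_TRUNC, 0644)
		if err == nil {
			return f.Close()
		}
		// Retry only when the underlying syscall reports EINTR.
		if errors.Is(err, syscall.EINTR) {
			continue
		}
		return err
	}
	return fmt.Errorf("open %s: still interrupted after %d attempts", path, maxRetries)
}

func main() {
	if err := writeDoneFile("/tmp/.velero-done-example"); err != nil {
		fmt.Println("error writing done file:", err)
	}
}
```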
-
We have been using Velero successfully for a while with a handful of Azure File volumes, but we ran into problems when we increased our cluster to 78 nodes. More specifically, velero restore is failing.
Here are a few facts:
Error restoring volume: error writing done file: open /host_pods/<someid>/volumes/kubernetes.io~azure-file/<some-pvc-name>/.velero/<someid>: interrupted system call
I tried some troubleshooting using the instructions from the Restic Integration documentation, but was not able to obtain a more descriptive message than the error above. Increasing the debug level is not an option given the sheer amount of log output (combined with the large number of nodes in our cluster).
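Short of raising the log level, one lower-noise way to see which volumes fail is to inspect the PodVolumeRestore custom resources that Velero creates in the velero.io/v1 API group. Below is a rough sketch using the Kubernetes dynamic client; the "velero" namespace and the default kubeconfig path are assumptions that may need adjusting for your environment.

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Load the local kubeconfig (assumed default path; adjust as needed).
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client, err := dynamic.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	// PodVolumeRestore custom resources live in the velero.io/v1 API group.
	gvr := schema.GroupVersionResource{Group: "velero.io", Version: "v1", Resource: "podvolumerestores"}
	list, err := client.Resource(gvr).Namespace("velero").List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}

	// Print each restore's phase and message so failed volumes are easy to spot.
	for _, item := range list.Items {
		phase, _, _ := unstructured.NestedString(item.Object, "status", "phase")
		msg, _, _ := unstructured.NestedString(item.Object, "status", "message")
		fmt.Printf("%s\tphase=%s\t%s\n", item.GetName(), phase, msg)
	}
}
```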
Any help would be greatly appreciated.
Correction to the above post:
The failure has nothing to do with the cluster size. The only other significant change that occurred at the same time as the scaling up of our cluster was the k8s upgrade from 1.17.13 to 1.18.10.
Still looking for a solution…