Velero pod gets evicted because the node is running low on ephemeral-storage #7718

Closed
jeremyvdveen opened this issue Apr 22, 2024 · 6 comments

@jeremyvdveen

What steps did you take and what happened:

I'm facing a problem where the Velero pod gets evicted because the node it was running on ran low on ephemeral-storage.
When I describe the pod:

The node was low on resource: ephemeral-storage. Threshold quantity: 7775265486, available: 7549392Ki. Container velero was using 4579368Ki, request is 0, has larger consumption of ephemeral-storage.
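
The "request is 0" part stands out: the velero container declares no ephemeral-storage request at all, so it is an early candidate for eviction once the node comes under disk pressure. A rough sketch of what I think would at least make the usage visible to the scheduler (untested; it assumes the chart's resources value is passed straight through to the server Deployment, and the sizes are placeholders):

    helm upgrade velero vmware-tanzu/velero \
      --namespace velero \
      --reuse-values \
      --set resources.requests.ephemeral-storage=2Gi \
      --set resources.limits.ephemeral-storage=20Gi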

What did you expect to happen:
I would expect that the pod keeps running.

Anything else you would like to add:

I'm using version 6.0.0 of the Velero Helm chart.
I've done some investigation into this issue myself and found that the Restic cache stored in the scratch directory is causing the problem.
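
For reference, this is roughly how I measured the cache from outside the pod; it assumes du is available in the velero image (if it isn't, the emptyDir can also be inspected from the node's kubelet directory):

    kubectl -n velero exec deploy/velero -c velero -- du -sh /scratch/.cache/restic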
When I describe my BackupRepository, I see the following error message:

error running command=restic prune --repo=s3:s3-eu-west-1.amazonaws.com/company-x/restic/application-x --password-file=/tmp/credentials/velero/velero-repo-credentials-repository-password --cache-dir=/scratch/.cache/restic, stdout=, stderr=unable to create lock in backend: repository is already locked exclusively by PID 156300 on velero-58b69fbd4f-78jkr by cnb (UID 1002, GID 1000)
    lock was created at 2024-03-13 23:27:01 (29m59.784499528s ago)
    storage ID 4174ac16
    the `unlock` command can be used to remove stale locks
    : exit status 1
  phase: Ready
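
As the error itself suggests, the leftover lock can be cleared with restic unlock once it's confirmed that no other prune is actually running. A sketch of the command, reusing the same flags as the logged prune (it assumes the restic binary is present in the velero container and that the S3 credentials are already set in its environment):

    kubectl -n velero exec deploy/velero -c velero -- restic unlock \
      --repo=s3:s3-eu-west-1.amazonaws.com/company-x/restic/application-x \
      --password-file=/tmp/credentials/velero/velero-repo-credentials-repository-password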

We are running Velero for a lot of our customers, but we only see this behavior in two of our customer environments. It's also worth mentioning that the backups are still succeeding, but this is still undesired behavior.

I've also tried running the restic prune command manually and noticed a very rapid increase in cache size, resulting in the pod getting evicted again.
Before starting the prune command the cache size was 132K; the pod got evicted when the cache had grown to between 20G and 25G.
I would have liked to use a PVC instead of an emptyDir for the scratch volume, as mentioned in #2087, but the Helm chart doesn't allow for that. I also tried pruning without using the cache, but as the command itself warns, that is very slow.
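
For anyone else hitting this, a sketch of the two manual prune variants I see, based on the command Velero logs: skipping the cache entirely (works, but very slow), or pointing the cache at a larger mount (the path below is hypothetical and only works if such a mount exists in the pod):

    # slow, but bounded: no local cache at all
    restic prune --no-cache \
      --repo=s3:s3-eu-west-1.amazonaws.com/company-x/restic/application-x \
      --password-file=/tmp/credentials/velero/velero-repo-credentials-repository-password

    # hypothetical: cache on a larger, PVC-backed mount instead of the emptyDir scratch volume
    restic prune \
      --repo=s3:s3-eu-west-1.amazonaws.com/company-x/restic/application-x \
      --password-file=/tmp/credentials/velero/velero-repo-credentials-repository-password \
      --cache-dir=/cache-pvc/restic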
When going through the GitHub issues for this project, I've found a few related issues:
#7177
#2087

Environment:

  • Velero version: v1.13.0
  • Kubernetes version: v1.28.6
  • Cloud provider or hardware configuration: AWS - m6i.4xlarge
  • OS: Ubuntu 22.04.3

Vote on this issue!

This is an invitation to the Velero community to vote on issues; you can see the project's top-voted issues listed here.
Use the "reaction smiley face" up to the right of this comment to vote.

  • 👍 for "I would like to see this bug fixed as soon as possible"
  • 👎 for "There are more important bugs to focus on right now"
@Lyndon-Li
Contributor

I have opened another issue for the Kopia path: #7725. However, we probably would not add this enhancement for the Restic path, since it will be deprecated soon.

@jeremyvdveen
Author

Can you share what will be replacing it? I don't mind waiting a bit for a solution, but if there are no plans to fix this in the long term, our company might have to look at another solution.

@Lyndon-Li
Contributor

The Restic path for fs-backup will be deprecated and removed; only the Kopia path will be kept. So if you want to wait for a solution, you need to switch to the Kopia path.
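
If you deploy via the Helm chart, switching the uploader should be a values change only; a minimal sketch, assuming the chart exposes configuration.uploaderType (backups already taken with the Restic path are still restored through Restic):

    helm upgrade velero vmware-tanzu/velero \
      --namespace velero \
      --reuse-values \
      --set configuration.uploaderType=kopia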

@jeremyvdveen
Author

Thanks for the insight. In that case I guess we'll have to start moving to the Kopia path.


This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days. If a Velero team member has requested log or more information, please provide the output of the shared commands.


github-actions bot commented Jul 8, 2024

This issue was closed because it has been stalled for 14 days with no activity.

github-actions bot closed this as not planned on Jul 8, 2024