With DataMover restored PV is Empty #7189

Closed
hofq opened this issue Dec 7, 2023 · 21 comments

@hofq

hofq commented Dec 7, 2023

What steps did you take and what happened:

  • I made a backup of an application for testing, using this schedule:
schedule.yaml
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: prod-30d-eu-central-1
  namespace: velero
  annotations:
spec: 
  paused: false
  schedule: "0 3 * * *" # once per day at 3 AM
  useOwnerReferencesInBackup: false
  template:
    ttl: 730h # 30 days
    datamover: velero
    snapshotMoveData: true
    volumeSnapshotLocations:
    - csi-gp3
    storageLocation: default
  • I tried restoring it using the following command:
    velero create restore --from-backup <name of backup> --include-namespaces <namespace name>
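
For anyone reproducing this, the restore's progress and per-resource results can be inspected with the standard CLI commands (the backup/restore names below are placeholders):

velero restore describe <restore-name> --details
velero restore logs <restore-name>
velero backup describe <name-of-backup> --details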

What did you expect to happen:

  • After everything finished successfully, I expected the data to be in the volume; that was not the case. The volume was just empty.

The following information will help us better understand what's going on:

If you are using velero v1.7.0+:
Please use velero debug --backup <backupname> --restore <restorename> to generate the support bundle and attach it to this issue. For more options, refer to velero debug --help.

bundle-2023-12-07-18-18-49.tar.gz

Anything else you would like to add:

  • The DataUpload and DataDownload looked good. The file size was realistic and both showed no errors.
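
A quick way to re-check those data mover CRs, assuming Velero is installed in the velero namespace:

kubectl -n velero get datauploads,datadownloads
kubectl -n velero describe datadownload <datadownload-name>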

Environment:

  • Velero version (use velero version): v1.12.2 (also tried v1.12.1 and v1.12.2-rc2)
  • Velero features (use velero client config get features):
  • Kubernetes version (use kubectl version): Client: v1.28.4
  • Kubernetes installer & version: v1.27.7-eks-4f4795d
  • Cloud provider or hardware configuration: EKS with CSI Plugin
  • OS (e.g. from /etc/os-release): Amazon Linux 2

Vote on this issue!

This is an invitation to the Velero community to vote on issues; you can see the project's top-voted issues listed here.
Use the "reaction smiley face" at the top right of this comment to vote.

  • 👍 for "I would like to see this bug fixed as soon as possible"
  • 👎 for "There are more important bugs to focus on right now"
@Lyndon-Li
Contributor

Lyndon-Li commented Dec 8, 2023

Is the volume you are checking the one for the PVC test-pv-claim?
Could you check the directory below on your node (it should be on the same node where the pod is running) and share what you see there?
/var/lib/kubelet/pods/<restored pod's UID>/volumes/kubernetes.io~csi/<restored PVC's UID>/mount
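
A sketch for locating that directory; the segment under kubernetes.io~csi is normally the PV name (pvc-<uid>), and the names below are placeholders:

POD_UID=$(kubectl -n <namespace> get pod <pod-name> -o jsonpath='{.metadata.uid}')
PV_NAME=$(kubectl -n <namespace> get pvc <pvc-name> -o jsonpath='{.spec.volumeName}')
ls -la /var/lib/kubelet/pods/${POD_UID}/volumes/kubernetes.io~csi/${PV_NAME}/mount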

@hofq
Author

hofq commented Dec 8, 2023

/var/lib/kubelet/pods/afe222fb-aad9-4d28-9e79-7be40b5504ff
[root@node afe222fb-aad9-4d28-9e79-7be40b5504ff]# tree ./
./
├── containers
│   └── nginx
│       └── b46faa53
├── etc-hosts
├── plugins
│   └── kubernetes.io~empty-dir
│       └── wrapped_kube-api-access-lwl8s
│           └── ready
└── volumes
    ├── kubernetes.io~csi
    │   └── pvc-ab225afb-ac33-4399-bd8d-22c42fdc4126
    │       ├── mount
    │       │   └── lost+found
    │       └── vol_data.json
    └── kubernetes.io~projected
        └── kube-api-access-lwl8s
            ├── ca.crt -> ..data/ca.crt
            ├── namespace -> ..data/namespace
            └── token -> ..data/token

I cannot see a mount directory on the node the pod runs on, so it does not look like it.

@Lyndon-Li
Contributor

@hofq Please do the check after the restore, when you see the problem. Don't remove the restored pod.

@hofq
Author

hofq commented Dec 8, 2023

The pod was not removed, but I will re-trigger the restore.

@Lyndon-Li
Contributor

If the pod was not removed, you should be able to see /var/lib/kubelet/pods/<restored pod's UID>/volumes/kubernetes.io~csi; if so, please share the dir tree you see there.

@hofq
Author

hofq commented Dec 8, 2023

./
└── pvc-8630c694-199e-4a53-aa13-e10530b8cdf9
    ├── mount
    │   └── lost+found
    └── vol_data.json

This is the tree directly after the restore.

@Lyndon-Li
Contributor

What files are in the volume? Did you create them manually or are they owned by any application?

@Lyndon-Li
Contributor

I see you were backing up the default namespace; can you move the workload to a dedicated namespace and test the backup/restore?
This is not a general problem, so let's first make the case simple and see what happens in your env step by step.

@hofq
Author

hofq commented Dec 11, 2023

Okay, so I will back up the namespace prod-backup-test using my schedule "prod-30d-eu-central-1", which backs up the namespace in question and a few others. After that I will restore the namespace prod-backup-test into prod-backup-test2.

prod-backup-test only carries the pod with the volume and PVC. prod-backup-test2 does not exist before the restore.

Here are the commands I ran:

velero create backup --from-schedule prod-30d-eu-central-1
velero create restore --from-backup prod-30d-eu-central-1-20231207135032 --include-namespaces prod-backup-test --namespace-mappings prod-backup-test:prod-backup-test2

After checking the volume, it was empty again.

The schedule in question is this one:

apiVersion: velero.io/v1
kind: Schedule
metadata:
  annotations:
  labels:
    app: prod-backup-schedules
    app.kubernetes.io/instance: prod-backup-schedules
    type: backup
  name: prod-30d-eu-central-1
  namespace: velero
spec:
  paused: false
  schedule: 0 3 * * *
  template:
    datamover: velero
    includedNamespaces:
    - kube-system
    - monitoring
    - grafana
    - logging
    - prod-backup-test
    snapshotMoveData: true
    storageLocation: default
    ttl: 730h
    volumeSnapshotLocations:
    - csi-gp3
  useOwnerReferencesInBackup: false
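One way to confirm whether a restored volume actually has data, assuming the pod mounts it at a known path (pod name and mount path below are placeholders):

kubectl -n prod-backup-test exec <pod-name> -- ls -la <mount-path>
kubectl -n prod-backup-test2 exec <pod-name> -- ls -la <mount-path>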

@Lyndon-Li
Contributor

@hofq Not sure if this works for you: please find me in the velero-user Slack channel, and we can have a live session to troubleshoot the problem.

@Lyndon-Li
Contributor

@hofq
I have made a fix; please help verify it using the velero/velero:main image from the Velero main branch.
Note: if you have used velero/velero:main previously, remember to set the imagePullPolicy to Always for the Velero server and node-agent pods.
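
A sketch of switching the images and pull policy, assuming a default 1.12-style install (deployment velero and daemonset node-agent in the velero namespace, containers named after their resources):

kubectl -n velero set image deployment/velero velero=velero/velero:main
kubectl -n velero set image daemonset/node-agent node-agent=velero/velero:main
kubectl -n velero patch deployment velero --type=json \
  -p='[{"op":"replace","path":"/spec/template/spec/containers/0/imagePullPolicy","value":"Always"}]'
kubectl -n velero patch daemonset node-agent --type=json \
  -p='[{"op":"replace","path":"/spec/template/spec/containers/0/imagePullPolicy","value":"Always"}]'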

@hofq
Author

hofq commented Dec 12, 2023

On the main tag I have the issue that the restore won't even do anything:

velero create restore --from-backup test10 --namespace-mappings prod-backup-test:prod-backup-test10 --wait
Restore request "test10-20231212095232" submitted successfully.
Waiting for restore to complete. You may safely press ctrl-c to stop waiting - your restore will continue in the background.
....................W1212 10:02:12.237005    9459 reflector.go:347] k8s.io/client-go@v0.25.6/tools/cache/reflector.go:169: watch of *v1.Restore ended with: an error on the server ("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding
.....................I1212 10:02:33.200924    9459 trace.go:205] Trace[561098295]: "Reflector ListAndWatch" name:k8s.io/client-go@v0.25.6/tools/cache/reflector.go:169 (12-Dec-2023 10:02:13.578) (total time: 19621ms):
Trace[561098295]: ---"Objects listed" error:<nil> 19621ms (10:02:33.200)
Trace[561098295]: [19.621754666s] [19.621754666s] END
....................

The restore just runs forever without anything happening in the cluster. Even the namespace was not created.

@Lyndon-Li
Contributor

Did you see the restore CR created and being processed?

@hofq
Author

hofq commented Dec 12, 2023

It is there, but no DataDownload was initiated.

@Lyndon-Li
Contributor

Please help collect a Velero log bundle; we will troubleshoot further.

@hofq
Author

hofq commented Dec 12, 2023

@Lyndon-Li
Contributor

Probably I need to make the fix in the 1.12 branch. The CRDs have changed in 1.13 (the main image), so your client doesn't match them.
Alternatively, you can download Velero's main branch code and use the make local command to compile a client. Not sure if you can do this; I will change the 1.12 branch anyway, but it takes some time.

@hofq
Author

hofq commented Dec 12, 2023

Thank you

@Lyndon-Li
Contributor

@hofq
Fix in 1.12 is ready; please use the image velero/velero:release-1.12-dev for the Velero server pod and node-agent pods.

Note: if you have used velero/velero:release-1.12-dev previously, remember to set the imagePullPolicy to Always for the Velero server and node-agent pods.

@hofq
Author

hofq commented Dec 13, 2023

I can confirm a successful restore!

Thanks for your help!

@Lyndon-Li
Contributor

Lyndon-Li commented Dec 14, 2023

This problem is similar to #7027 and will happen in all EKS environments that use AWS IAM roles for service accounts (IRSA).
Under this configuration, a volume called aws-iam-token is injected as the first volume of every pod, including the backup/restore pods created by the Velero exposer for data movement. However, Velero always assumes the first volume is the backup/restore volume; as a result, the Velero data mover reads/writes data to the wrong volume.
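
A quick way to see that ordering on one of the data mover pods while a backup/restore is running (pod name is a placeholder; with IRSA enabled, aws-iam-token would be printed first):

kubectl -n velero get pod <data-mover-pod-name> -o jsonpath='{range .spec.volumes[*]}{.name}{"\n"}{end}'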
