Velero fails to expose correct backup metrics after a pod restart #6936

Open
Ahmad-Faizan opened this issue Oct 10, 2023 · 20 comments
Assignees
Labels
backlog Metrics Related to prometheus metrics

Comments

@Ahmad-Faizan
Contributor

Ahmad-Faizan commented Oct 10, 2023

What steps did you take and what happened:
The metric velero_backup_last_status exposes the status of the latest backups.
Once a backup has been taken, the metric gets updated.
However, a pod restart between any two scheduled backups resets the value exposed by velero_backup_last_status.
After the restart, the metric is only updated again once a new backup is created.

What did you expect to happen:
Ideally, the metric should read the list of backups and set the velero_backup_last_status metric.
So if a backup happens at 12:00 and the velero pod is restarted or killed at 12:30, the metric
should not be set to 0 (which indicates no backup has been taken).

The following information will help us better understand what's going on:

Environment:

  • Velero version (use velero version): v1.11.0
  • Velero features (use velero client config get features):
  • Kubernetes version (use kubectl version): v1.26.8
  • Kubernetes installer & version: kops v1.26.5
  • Cloud provider or hardware configuration: AWS
  • OS (e.g. from /etc/os-release): Ubuntu 20.04.5 LTS

Vote on this issue!

This is an invitation to the Velero community to vote on issues, you can see the project's top voted issues listed here.
Use the "reaction smiley face" up to the right of this comment to vote.

  • 👍 for "I would like to see this bug fixed as soon as possible"
  • 👎 for "There are more important bugs to focus on right now"
@yanggangtony
Contributor

Are you using version 1.11?
The default was changed to 1 in https://github.com/vmware-tanzu/velero/pull/6838/files

@yanggangtony
Contributor

I checked release 1.12; the PR is not included in it.
Maybe it will be included in release 1.13?

@allenxu404
Contributor

Ideally, the metric should read the list of backups and set the velero_backup_last_status metric.

Do you mean the velero_backup_last_status metric should read the most recently completed backup from before the Velero pod restart to determine the metric's value for scheduled backups?

The metric is reset to 0 when the Velero pod restarts because its default value is 0. This has been changed in PR #6838, as @yanggangtony mentioned.

@jkroepke

Do you mean the velero_backup_last_status metric should read the most recently completed backup from before the Velero pod restart to determine the metric's value for scheduled backups?

Yes. A default value of one seems pointless to me.

If a backup fails and Velero then restarts, the metric reflects the wrong state. My personal feeling is that using a default as the initial value results in unexpected behavior.

@yanggangtony
Contributor

yanggangtony commented Oct 14, 2023

@jkroepke

A default value of one seems pointless to me.

In issue issues/6809, we observed that when Velero restarts, the schedule continues with a new cron run.

So the default value changes to 0 when a backup hits an error.

It sounds like you are suggesting that 'velero_backup_last_status' should not be initialized to a default value, but instead be computed from the most recently completed backup.

This probably needs input and discussion from the maintainers, such as @allenxu404 @sseago @ywk253100

@jkroepke

Yes, I'm expecting the same behavior as velero_backup_last_successful_timestamp, where the timestamp is the real timestamp of the latest backup and not a default value.

@reasonerjt reasonerjt added the Metrics Related to prometheus metrics label Oct 16, 2023
@Ahmad-Faizan
Contributor Author

Other Velero metrics expose non-default values after a pod restart.
Similar behaviour is expected from this metric too. As @jkroepke mentioned, in the case of velero_backup_last_successful_timestamp, Velero can calculate the timestamp of the last successful backup and expose the correct metric even after a pod restart.

@weshayutin
Contributor

@mpryc FYI


This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days. If a Velero team member has requested log or more information, please provide the output of the shared commands.

@yanggangtony
Copy link
Contributor

not stale

@github-actions github-actions bot removed the staled label Dec 21, 2023

This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days. If a Velero team member has requested log or more information, please provide the output of the shared commands.

@jkroepke

I feel this is still relevant

@mpryc
Contributor

mpryc commented Feb 20, 2024

velero_backup_last_status could be re-read on Velero restart, as could some of the other metrics in https://github.com/vmware-tanzu/velero/blob/main/pkg/metrics/metrics.go#L31-L86

This may have side effects on the recorded time of such events, which need to be checked. When Velero restarts and the metrics re-read information about backups and their states, the time of such an event will be the time of the Velero restart, not the time of the actual backup. This does not apply to all metrics, but metrics such as backupLastSuccessfulTimestamp need to be handled carefully.

Another solution would be to not expose any metrics from before the restart and only show those for events that happen after it. This would, however, require changing the Prometheus queries to gather information about past events.


This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days. If a Velero team member has requested log or more information, please provide the output of the shared commands.

@jkroepke

For weekly backups, it also takes a while until the status is reported correctly.

@github-actions github-actions bot removed the staled label Apr 22, 2024
@vinayan3

vinayan3 commented Jun 1, 2024

We are hitting this issue as well: the pod gets restarted because we replace nodes on a regular cadence. This causes backup metrics to be misreported. We have held off putting up alerts because of this.

@jkroepke

jkroepke commented Jun 1, 2024

@vinayan3 we are using velero_backup_last_successful_timestamp, since it works as expected.

velero_backup_last_status is just unusable for now.


This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days. If a Velero team member has requested log or more information, please provide the output of the shared commands.

@vinayan3

This isn't stale.

@kaovilai
Member

unstalev2

@github-actions github-actions bot removed the staled label Aug 1, 2024
9 participants