
Example Prometheus Rule to monitor Velero seems bad #7132

Closed
savar opened this issue Nov 21, 2023 · 11 comments
Labels: Metrics (Related to prometheus metrics)


savar commented Nov 21, 2023

We saw different reporting in our clusters even though Velero was affected in the same way in all three of them.

Checking the implemented PrometheusRule again, it matches what is described in #2725 and also the example from #397, but is this the right choice? The metric velero_backup_attempt_total only ever grows as long as no restart happens. So using a ratio of failed attempts to an ever-growing total effectively makes you blind to failures after a long period of "all is good".

Assuming you run 20 backups per day for a specific schedule, this works fine for a year, and your pod doesn't restart in that time (which is possible), you would have 7300 successful attempts. If you then used the example query and alerted on a failure rate above 25%, you would need 2434 failed attempts before crossing that mark (failures also increment the attempt counter, so you need f / (7300 + f) > 0.25, i.e. f >= 2434), or ~122 days before you even realize that your backups aren't working anymore.

I am not sure what the best approach would be, but it might be either using an increase() over a shorter time window instead of "the whole time the pod is running", or using velero_backup_last_successful_timestamp or something similar (rough sketch of the latter below).
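As an illustration only, the timestamp-based variant could look roughly like this; the alert name and the one-day threshold are placeholders I made up, and the threshold would need to match the actual schedule frequency:

    # sketch only: fire when a schedule has had no successful backup within the assumed window (86400s = 1 day)
    - alert: VeleroScheduleMissingSuccessfulBackup
      annotations:
        message: Velero schedule {{ $labels.schedule }} has not had a successful backup for more than a day.
      expr: |-
        (time() - velero_backup_last_successful_timestamp{schedule!=""}) > 86400
      for: 15m
      labels:
        severity: warning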

My bug here is mainly: should we change the example (even though it is just a comment) to keep people from simply copying and pasting it and using it as a way to monitor Velero backups?


yanggangtony commented Nov 21, 2023

If you used the example query and alerted on a failure rate above 25%, you would need 2434 failed attempts before crossing that mark, or ~122 days before you even realize that your backups aren't working anymore.

In your description, I think the user should learn about this through get backup, describe backup, or describe schedule, and not only after 122 days.

That rule is just a demo for the general PrometheusRule guide. I think it is okay.

/velero-helm-charts/charts/velero/values.yaml
 
    # - alert: VeleroBackupPartialFailures
    #   annotations:
    #     message: Velero backup {{ $labels.schedule }} has {{ $value | humanizePercentage }} partialy failed backups.
    #   expr: |-
    #     velero_backup_partial_failure_total{schedule!=""} / velero_backup_attempt_total{schedule!=""} > 0.25
    #   for: 15m
    #   labels:
    #     severity: warning
    # - alert: VeleroBackupFailures
    #   annotations:
    #     message: Velero backup {{ $labels.schedule }} has {{ $value | humanizePercentage }} failed backups.
    #   expr: |-
    #     velero_backup_failure_total{schedule!=""} / velero_backup_attempt_total{schedule!=""} > 0.25
    #   for: 15m
    #   labels:
    #     severity: warning


savar commented Nov 21, 2023

Thanks for coming back to me!

In your description, I think the user should learn about this through get backup, describe backup, or describe schedule, and not only after 122 days.

I am not sure I understand what you mean by "the user should know". Of course, if someone actively checks manually, they would see it. But the example rule was interpreted (at least here at our company, where I found it) as "if 25% of the backups fail, I get alerted". Without mentioning that this ratio is measured "since the beginning of time (pod start)", it is assumed to be something that actually helps in detecting issues and getting alerted in a timely manner (at least that was the case where it was used at our company).

So in my example, if you rely on this PrometheusRule (the example), you would only see the problem after 122 days, unless you check manually, which you normally only do when you actually need a backup... and then it is too late, isn't it?

That rule is just a demo for the general PrometheusRule guide. I think it is okay.

But what I mean is: this rule is risky because it does not really provide valuable information. From my example you can see that it would not alert you for 122 days, and I would argue that such an alert is not useful in the first place.


yanggangtony commented Nov 21, 2023

@savar
Thanks for explaining your concern.

But:

1. This is not just about the metric velero_backup_attempt_total. velero_backup_partial_failure_total and velero_backup_failure_total behave the same way; all three are counted, as you say, "since the beginning of time (pod start)".

2. velero_backup_partial_failure_total, velero_backup_failure_total, and velero_backup_attempt_total are statistics counted over the whole schedule lifecycle, and it is correct for them to be counted over that lifecycle.

I mean, I just think this situation is quite abnormal.

run 20 backups per day, have 7300 successful attempts; you would need 2434 failed attempts before you hit that mark, or ~122 days before you even realize that your backups aren't working anymore.

Needing 122 days to notice still seems weird to me. What about a task that has been running normally for 365 days and suddenly stops working as it did before? If that happens, I think there must be other monitoring metrics that would catch it, for example: is the machine healthy? the storage? the network?

reasonerjt added the Metrics (Related to prometheus metrics) label Nov 23, 2023

savar commented Nov 23, 2023

Yeah, that is what I mean. I found this example query by searching for "how to monitor velero". Maybe it is a bad assumption on my end, but when I saw it, I assumed it was about "are Velero and the backups working", whereas it seems to be more of an "is Velero in general having trouble or not" kind of check. Maybe it is enough to state more clearly in the example that it is not meant for monitoring individual schedules/backups but for the overall health of Velero.

But even if you stated that, I still think this example metric is not good even as an overall Velero check. I would limit it to a time frame by using an increase() over a window, maybe 24h, and compute the ratio on that level instead of "over all time". Like I said, the metric gets less valuable the longer the pod has been running, and I would not know what the metric/alert actually tells me in the end.
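Something along these lines, as a rough sketch only; the 24h window and the 25% threshold are arbitrary values for illustration, not a recommendation:

    # sketch only: failure ratio over a rolling 24h window instead of "since pod start"
    - alert: VeleroBackupFailures
      annotations:
        message: Velero backup {{ $labels.schedule }} has {{ $value | humanizePercentage }} failed backups over the last 24h.
      expr: |-
        increase(velero_backup_failure_total{schedule!=""}[24h])
          / increase(velero_backup_attempt_total{schedule!=""}[24h]) > 0.25
      for: 15m
      labels:
        severity: warning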

zanoni23 commented

I agree that more should be made of the fact that the example monitoring is not fit for monitoring production backups. I blindly assumed that failed backups were being monitored, because at the start I was getting alerted.


This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days. If a Velero team member has requested log or more information, please provide the output of the shared commands.


savar commented Feb 12, 2024

@yanggangtony any chance of updating the example query and its documentation?

yanggangtony commented

@allenxu404 Do you think it is ok to update the docs?

github-actions bot removed the staled label Feb 13, 2024
allenxu404 commented

Correct me if I'm wrong, but it seems the existing PrometheusRule for monitoring backup partial failures provided by the Velero helm-charts is not accurate for all use cases. So it would be better to raise this issue on vmware-tanzu/helm-charts instead. Either adding a comment to the rule or modifying it with a more accurate expression could be a potential improvement.

yanggangtony commented

Yes, I think it is a good idea to open an issue in vmware-tanzu/helm-charts.


savar commented Apr 4, 2024

Opened an issue on the helm chart side, therefore closing this one now.

savar closed this as completed Apr 4, 2024