
Example Prometheus Rule to monitor Velero seems bad #7132

Closed
savar opened this issue Nov 21, 2023 · 11 comments
Labels: Metrics (Related to prometheus metrics)


savar commented Nov 21, 2023

We saw different reporting in our clusters even though Velero was affected in the same way in all three of them.

Checking the implemented PrometheusRule again, it matches what is described in #2725 and also the example from #397, but is this the right choice? The metric velero_backup_attempt_total only ever grows as long as no restart happens. So using a ratio of failed attempts to an ever-growing total effectively makes you blind to failures after a long period of "all is good".

Assuming you run 20 backups per day for a specific schedule, this works fine for a year, and your pod doesn't restart in that time (which is possible), you would have 7300 successful attempts. If you then used the example query and alerted on a failure rate above 25%, you would need 2434 failed attempts before crossing that mark (failures also increment the attempt counter, so you need f / (7300 + f) > 0.25, i.e. f >= 2434), or ~122 days before you even realize that your backups aren't working anymore.

I am not sure what the best approach would be, but it might be either using an increase() over a shorter time window instead of "the whole time the pod is running", or using velero_backup_last_successful_timestamp or something similar (rough sketch of the latter below).
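As an illustration only, the timestamp-based variant could look roughly like this; the alert name and the one-day threshold are placeholders I made up, and the threshold would need to match the actual schedule frequency:

    # sketch only: fire when a schedule has had no successful backup within the assumed window (86400s = 1 day)
    - alert: VeleroScheduleMissingSuccessfulBackup
      annotations:
        message: Velero schedule {{ $labels.schedule }} has not had a successful backup for more than a day.
      expr: |-
        (time() - velero_backup_last_successful_timestamp{schedule!=""}) > 86400
      for: 15m
      labels:
        severity: warning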

My bug here is mainly: should we change the example (even though it is just a comment) to keep people from simply copying and pasting it and using it as a way to monitor Velero backups?


yanggangtony commented Nov 21, 2023

If you used the example query and alerted on a failure rate above 25%, you would need 2434 failed attempts before crossing that mark, or ~122 days before you even realize that your backups aren't working anymore.

In your description, I think the user should learn about this through get backup, describe backup, or describe schedule, and not only after 122 days.

That rule is just a demo for the general PrometheusRule guide. I think it is okay.

/velero-helm-charts/charts/velero/values.yaml
 
    # - alert: VeleroBackupPartialFailures
    #   annotations:
    #     message: Velero backup {{ $labels.schedule }} has {{ $value | humanizePercentage }} partialy failed backups.
    #   expr: |-
    #     velero_backup_partial_failure_total{schedule!=""} / velero_backup_attempt_total{schedule!=""} > 0.25
    #   for: 15m
    #   labels:
    #     severity: warning
    # - alert: VeleroBackupFailures
    #   annotations:
    #     message: Velero backup {{ $labels.schedule }} has {{ $value | humanizePercentage }} failed backups.
    #   expr: |-
    #     velero_backup_failure_total{schedule!=""} / velero_backup_attempt_total{schedule!=""} > 0.25
    #   for: 15m
    #   labels:
    #     severity: warning


savar commented Nov 21, 2023

Thanks for coming back to me!

In your description, I think the user should learn about this through get backup, describe backup, or describe schedule, and not only after 122 days.

I am not sure I understand what you mean by "the user should know". Of course, if someone actively checks manually, they would see it. But the example rule was interpreted (at least here at our company, where I found it) as "if 25% of the backups fail, I get alerted". Without mentioning that this ratio is measured "since the beginning of time (pod start)", it is assumed to be something that actually helps in detecting issues and getting alerted in a timely manner (at least that was the case where it was used at our company).

So in my example, if you rely on this PrometheusRule (the example), you would only see the problem after 122 days, unless you check manually, which you normally only do when you actually need a backup... and then it is too late, isn't it?

That rule is just a demo for the general PrometheusRule guide. I think it is okay.

But what I mean is: this rule is risky because it does not really provide valuable information. From my example you can see that it would not alert you for 122 days, and I would argue that such an alert is not useful in the first place.


yanggangtony commented Nov 21, 2023

@savar
Thanks for explaining your concern.

But:

1. This is not just about the metric velero_backup_attempt_total. velero_backup_partial_failure_total and velero_backup_failure_total behave the same way; all three are counted, as you say, "since the beginning of time (pod start)".

2. velero_backup_partial_failure_total, velero_backup_failure_total, and velero_backup_attempt_total are statistics counted over the whole schedule lifecycle, and it is correct for them to be counted over that lifecycle.

I mean, I just think this situation is quite abnormal.

run 20 backups per day, have 7300 successful attempts; you would need 2434 failed attempts before you hit that mark, or ~122 days before you even realize that your backups aren't working anymore.

Needing 122 days to notice still seems weird to me. What about a task that has been running normally for 365 days and suddenly stops working as it did before? If that happens, I think there must be other monitoring metrics that would catch it, for example: is the machine healthy? the storage? the network?

reasonerjt added the Metrics (Related to prometheus metrics) label Nov 23, 2023

savar commented Nov 23, 2023

Yeah, that is what I mean. I found this example query by searching for "how to monitor velero". Maybe it is a bad assumption on my end, but when I saw it, I assumed it was about "are Velero and the backups working", whereas it seems to be more of an "is Velero in general having trouble or not" kind of check. Maybe it is enough to state more clearly in the example that it is not meant for monitoring individual schedules/backups but for the overall health of Velero.

But even if you stated that, I still think this example metric is not good even as an overall Velero check. I would limit it to a time frame by using an increase() over a window, maybe 24h, and compute the ratio on that level instead of "over all time". Like I said, the metric gets less valuable the longer the pod has been running, and I would not know what the metric/alert actually tells me in the end.
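Something along these lines, as a rough sketch only; the 24h window and the 25% threshold are arbitrary values for illustration, not a recommendation:

    # sketch only: failure ratio over a rolling 24h window instead of "since pod start"
    - alert: VeleroBackupFailures
      annotations:
        message: Velero backup {{ $labels.schedule }} has {{ $value | humanizePercentage }} failed backups over the last 24h.
      expr: |-
        increase(velero_backup_failure_total{schedule!=""}[24h])
          / increase(velero_backup_attempt_total{schedule!=""}[24h]) > 0.25
      for: 15m
      labels:
        severity: warning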

zanoni23 commented

I agree that more should be made of the fact that the example monitoring is not fit for monitoring production backups. I blindly assumed that failed backups were being monitored, because at the start I was getting alerted.


This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days. If a Velero team member has requested log or more information, please provide the output of the shared commands.


savar commented Feb 12, 2024

@yanggangtony any chance of updating the example query and its documentation?

yanggangtony commented

@allenxu404 Do you think it is ok to update the docs?

github-actions bot removed the staled label Feb 13, 2024
allenxu404 commented

Correct me if I'm wrong, but it seems the existing PrometheusRule for monitoring backup partial failures provided by the Velero helm-charts is not accurate for all use cases. So it would be better to raise this issue on vmware-tanzu/helm-charts instead. Either adding a comment to the rule or modifying it with a more accurate expression could be a potential improvement.

yanggangtony commented

Yes, I think it is a good idea to open an issue in vmware-tanzu/helm-charts.


savar commented Apr 4, 2024

Opened an issue on the helm chart side, therefore closing this one now.

savar closed this as completed Apr 4, 2024