Example Prometheus Rule to monitor Velero seems bad #7132
Comments
Regarding your description: I think the user should know that this is just a demo for the normal PrometheusRule guide. I think it is ok.
Thanks for coming back to me!

I am not sure I understand what you mean by "the user should know". Of course, if they are actively checking manually, they would see it. But the example rule was interpreted (at least here at our company, where I found it) as "if 25% of the backups fail, I get alerted". Without mentioning that the ratio is measured "since the beginning of time (pod start)", it is assumed to be something that would actually help in detecting issues and alerting in a timely manner. So in my example, if you rely on this PrometheusRule, you would only see the problem after 122 days, unless you check manually, which you normally only do when you actually need a backup... and then it is too late, isn't it?

But what I mean is: this rule is very risky, as it does not really give valuable information. From my example you can see that it would not alert you for 122 days, and I would argue that such an alert is not useful in the first place.
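For illustration, the rule I am talking about is shaped roughly like this (paraphrased from memory, not the exact chart contents; `velero_backup_failure_total` and `velero_backup_attempt_total` are real Velero metrics, and the 25% threshold is the one from the example):

```yaml
# Both counters only ever grow until the pod restarts, so a long streak
# of successes makes this ratio nearly insensitive to new failures.
groups:
  - name: velero
    rules:
      - alert: VeleroBackupFailures
        expr: velero_backup_failure_total / velero_backup_attempt_total > 0.25
        for: 15m
        labels:
          severity: warning
```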
@savar But regarding point 2: I mean, I just think this situation is very abnormal.
Yeah, that is what I mean. I found this example query by searching for "how to monitor velero". Maybe it is a bad assumption on my end, but when I saw it, I assumed it was about "are velero and the backups working", while it seems to be more of an "is velero in general having trouble or not" thing. Maybe it is enough to state that more clearly in the example: that this is not for monitoring individual schedules/backups but for the overall health of velero. But even if you stated that, I still think this example metric is not good even for the overall velero check. I would limit it to a timeframe by using an `increase()` over a shorter window instead; see the sketch below.
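Something along these lines, as a sketch only (the 24h window and the threshold would have to be tuned; `clamp_min` just avoids a division by zero when there were no attempts in the window):

```yaml
# Look only at the last 24h instead of "since pod start", so the alert
# stays sensitive even after long stretches of healthy backups.
- alert: VeleroBackupFailuresRecent
  expr: |
    increase(velero_backup_failure_total[24h])
      / clamp_min(increase(velero_backup_attempt_total[24h]), 1)
    > 0.25
  for: 15m
```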
I agree that more should be made of the fact that the example monitoring is not fit for monitoring production backups. I blindly assumed that failed backups were being monitored, because I was getting alerted at the start.
This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days. If a Velero team member has requested log or more information, please provide the output of the shared commands.
@yanggangtony any chance to update the example query and the documentation for it?
@allenxu404 Do you think it is ok to update the docs?
Correct me if I'm wrong. It seems that the existing example rule comes from the helm chart rather than from the Velero docs.
Yes, I think it is good to raise the issue in the helm chart repo.
Opened an issue on the helm chart side, therefore closing this one here.
We saw different reporting in our different clusters, even though Velero was affected in the same way in all three of them.

Checking again on the implemented PrometheusRule, it is as described in #2725 and also the example from #397, but is this the right choice? The metric `velero_backup_attempt_total`, as long as no restart happened, only ever grows. So using a ratio of failed attempts to an ever-growing total will make you almost blind to failures after a long period of "all is good".

Assuming you run 20 backups per day for a specific `schedule`, this works fine for a year, and your pod isn't restarting in that time (and that is possible), you would have 7300 successful attempts. If you then used the example query and checked for a failure rate of more than 25%, you would need 2434 failed attempts before you hit that mark (since f / (7300 + f) > 0.25 requires f > 7300 / 3 ≈ 2434), i.e. ~122 days before you even realize that your backups aren't working anymore.

I am not sure what the best approach would be, but it might be either using an `increase()` over a shorter time window instead of "the whole time the pod is running", or using `velero_backup_last_successful_timestamp` (see the sketch below), or something else.

My bug report here is mainly: should we change the example (even though it is just a comment) to avoid people simply copy-pasting it and using it as a way to monitor velero backups?
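For example, an alert based on the last-success timestamp could look roughly like this (just a sketch: `velero_backup_last_successful_timestamp` is an existing Velero gauge per `schedule`; the 26-hour threshold is an assumption for a daily schedule and would need tuning):

```yaml
# Fire when a schedule has not produced a successful backup recently,
# independent of how long the counters have been accumulating.
- alert: VeleroBackupTooOld
  expr: time() - velero_backup_last_successful_timestamp{schedule!=""} > 26 * 3600
  for: 15m
```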