use keep_firing_for to prevent flappy alerts #63

Merged
merged 4 commits into from
Jan 30, 2024

a-june
Contributor

@a-june a-june commented Nov 12, 2023

Proposition to use keep_firing_for in missing replicas alerts to reduce alert flapping.

Addresses situations where replica sets become available only for short periods of time.

Default time: 10m

Example: [image: flappy]

@@ -6,6 +6,7 @@ groups:
    rules:
      - alert: DeploymentMissingReplicas
        expr: (kube_deployment_spec_replicas != kube_deployment_status_replicas_available) * ON (deployment, namespace) group_left(annotation_app_uw_systems_tier, annotation_app_uw_systems_system, annotation_app_uw_systems_owner) kube_deployment_annotations{}
        keep_firing_for: 5m
Member

This is great, thanks Anna - should the value be the same for both for and keep_firing_for?

They are doing the same thing, right? One applies before the alert triggers and the other once it's triggered. But they serve the same purpose?

Contributor Author

I considered aligning for and keep_firing_for because it seems less arbitrary, but I think that might defeat the purpose: we still want to alert quickly while avoiding alerts being reopened frequently.
Example:
A deployment is missing replicas for 6 minutes. You want to receive an alert within 5 minutes (for: 5m) and prevent the alert from reopening for the next 10 minutes (keep_firing_for: 10m).
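
The trade-off described above can be sketched as a small simulation. This is a simplified model, not the actual Prometheus or Thanos implementation: it assumes one rule evaluation per minute, and the function names (simulate, count_firings) and the sample timeline are hypothetical, chosen only for illustration.

```python
from typing import List, Optional


def simulate(active: List[bool], for_min: int, keep_firing_min: int) -> List[str]:
    """Return the alert state at each one-minute evaluation.

    active[t] says whether the alert expression returned a result at minute t.
    States are 'inactive', 'pending' or 'firing'. Simplified model only.
    """
    states: List[str] = []
    active_since: Optional[int] = None  # minute the expression first became true
    keep_since: Optional[int] = None    # minute it first went false while firing
    firing = False
    for t, is_active in enumerate(active):
        if is_active:
            keep_since = None  # expression is back, so the keep-firing clock resets
            if active_since is None:
                active_since = t
            if t - active_since >= for_min:
                firing = True
        elif firing and keep_firing_min > 0:
            # Expression stopped matching: hold the alert open for keep_firing_min.
            if keep_since is None:
                keep_since = t
            if t - keep_since >= keep_firing_min:
                firing, active_since, keep_since = False, None, None
        else:
            # Pending alerts, or firing alerts without keep_firing_for, drop at once.
            firing, active_since = False, None
        if firing:
            states.append("firing")
        elif active_since is not None:
            states.append("pending")
        else:
            states.append("inactive")
    return states


def count_firings(states: List[str]) -> int:
    """Count transitions into 'firing', i.e. how many times the alert opens."""
    return sum(
        1
        for i, s in enumerate(states)
        if s == "firing" and (i == 0 or states[i - 1] != "firing")
    )


# Hypothetical flappy deployment: missing replicas for 6 minutes, healthy for 2,
# repeated three times, then healthy for good.
TIMELINE = ([True] * 6 + [False] * 2) * 2 + [True] * 6 + [False] * 15

print(count_firings(simulate(TIMELINE, for_min=5, keep_firing_min=0)))
print(count_firings(simulate(TIMELINE, for_min=5, keep_firing_min=10)))
```

With this timeline, the rule without keep_firing_for opens three separate alerts, while keep_firing_for: 10m keeps a single alert open across the short recoveries. In both cases the first alert fires after the same 5 minutes, so responsiveness is unchanged.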

@george-angel
Member

I forgot about this, thanks to HH for the reminder: thanos-io/thanos#6165

@a-june
Contributor Author

a-june commented Dec 1, 2023

> I forgot about this, thanks to HH for reminder: thanos-io/thanos#6165

thanos-io/thanos#6943 🤞

@a-june a-june marked this pull request as ready for review December 1, 2023 13:51
@a-june
Contributor Author

a-june commented Dec 20, 2023

Not included in the latest release 😞
https://github.com/thanos-io/thanos/releases/tag/v0.33.0

@a-june
Contributor Author

a-june commented Jan 18, 2024

👀 It's coming @george-angel https://github.com/thanos-io/thanos/releases/tag/v0.34.0-rc.0
Will dust off PR and get it ready.

@george-angel george-angel merged commit 37c5264 into main Jan 30, 2024
1 check passed
@george-angel george-angel deleted the replicas-keep-firing branch January 30, 2024 09:48
@a-june
Contributor Author

a-june commented Jan 30, 2024

In the end, I set for and keep_firing_for to the same value to match the responsiveness of the alert and the frequency of expression change.

MissingDaemonSetReplicas alerts have moved and I haven't included keep_firing_for there.

@george-angel
Member

MissingDaemonSetReplicas - we figured no teams have DaemonSets, so no sense exposing that alert.
