Fix reconciliation failure monitor. #555
Closed
Previously the expr tried to include the timeframe check itself. The problem with this is that the last_success timestamp is only updated when the Reconcile method is triggered. The Reconcile method can go untriggered for over an hour (no changes to ClusterPolicy, DaemonSets, or GPU Nodes). When this happens and a new GPU Node is added, an alert fires immediately, because the last_success timestamp hasn't been updated in over an hour and the reconciliation status changes to "0"/"NotReady" while the new GPU is being processed.
Resolve this by using the timeframe tracking of Alertmanager and only watching for reconciliation_status not being Ready (1).
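To illustrate the difference, here is a minimal Python sketch that models both behaviours on a made-up timeline. It is not the actual rule from this PR: the `Sample` type, the function names, the sample timestamps, and the use of one hour as the hold duration are all assumptions for illustration, and the "new" model assumes the timeframe tracking behaves like a hold period that resets whenever the status returns to Ready.

```python
from dataclasses import dataclass

HOUR = 3600  # seconds; the one-hour window comes from the description above


@dataclass
class Sample:
    now: float           # evaluation time, in seconds
    status: int          # 1 = Ready, 0 = NotReady
    last_success: float  # time of the last successful Reconcile


def old_rule_fires(s: Sample, window: float = HOUR) -> bool:
    """Hypothetical model of the old expr: NotReady AND last success older than `window`."""
    return s.status != 1 and (s.now - s.last_success) > window


def new_rule_fires(samples: list[Sample], hold: float = HOUR) -> bool:
    """Hypothetical model of the new approach: fire only once the status
    has been NotReady continuously for at least `hold` seconds."""
    not_ready_since = None
    for s in samples:
        if s.status == 1:
            not_ready_since = None  # back to Ready: reset the pending timer
            continue
        if not_ready_since is None:
            not_ready_since = s.now  # first NotReady sample starts the timer
        if s.now - not_ready_since >= hold:
            return True
    return False


# Idle for two hours (no Reconcile), then a new GPU Node is added at t=7200
# and is processed within ten minutes.
idle_then_new_node = [
    Sample(now=0, status=1, last_success=0),
    Sample(now=3600, status=1, last_success=0),
    Sample(now=7200, status=0, last_success=0),      # new node being processed
    Sample(now=7500, status=0, last_success=0),
    Sample(now=7800, status=1, last_success=7800),   # reconciliation succeeded
]

# Same start, but reconciliation stays NotReady for a full hour.
stuck_not_ready = [
    Sample(now=7200, status=0, last_success=0),
    Sample(now=9000, status=0, last_success=0),
    Sample(now=10800, status=0, last_success=0),
]

old_false_alarm = any(old_rule_fires(s) for s in idle_then_new_node)
new_false_alarm = new_rule_fires(idle_then_new_node)
new_real_alarm = new_rule_fires(stuck_not_ready)
```

Under these assumptions the old model fires as soon as the new node appears, because the stale last_success timestamp already exceeds the window, while the new model only fires when the NotReady state actually persists.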
Hello!
Thanks for making this contribution! When contributing to this repository please keep in mind the following: