Exposed Prometheus metrics for unreachable workload clusters #1281
Comments
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
@perithompson This seems like something that will sit in the CAPI repo. I am happy to keep this one around if CAPV would need to do something specifically, but I think the entire change will rest directly in CAPI.
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
/remove-lifecycle rotten
This may be resolved, or partially resolved, by #2061.
I think this should be closed in favor of a solution on the CAPI side. There's a related issue here: kubernetes-sigs/cluster-api#5510
+1
/close
@sbueringer: Closing this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/kind feature
Describe the solution you'd like
When a management cluster cannot reach a workload cluster, it may be necessary to pause reconciliation of that cluster until connectivity is restored, so that CAPI does not keep trying to reconcile something it cannot reach. There is an issue (#5394) that outlines some gaps in the documentation around this.
One problem operators will face is knowing when this state occurs. A simple way to monitor for it is a metric covering workload clusters. Having the count of workload clusters, the count of paused clusters, and the count of unreachable clusters exposed to Prometheus would allow for an alert on a change to the unreachable cluster count. Operators could then implement automation or SOPs to check and pause clusters that cannot reconcile for any reason.
Anything else you would like to add:
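A minimal sketch of how the three counts described above could be derived and rendered in the Prometheus text exposition format. This is only an illustration in standard-library Python, not how CAPI or CAPV implements metrics. The `Cluster` record, the `reachable` flag (standing in for whatever connectivity check the controller would perform), and the metric names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Cluster:
    """Hypothetical view of a workload cluster as seen by the management cluster."""
    name: str
    paused: bool = False     # reconciliation has been paused for this cluster
    reachable: bool = True   # assumed result of a connectivity probe


def count_clusters(clusters):
    """Return the three counts proposed in this issue."""
    return {
        "total": len(clusters),
        "paused": sum(1 for c in clusters if c.paused),
        "unreachable": sum(1 for c in clusters if not c.reachable),
    }


def render_metrics(counts):
    """Render the counts as gauges in the Prometheus text exposition format."""
    # Metric names below are made up for illustration only.
    specs = [
        ("capi_workload_clusters", "Number of workload clusters.", counts["total"]),
        ("capi_workload_clusters_paused", "Number of paused workload clusters.", counts["paused"]),
        ("capi_workload_clusters_unreachable", "Number of unreachable workload clusters.", counts["unreachable"]),
    ]
    lines = []
    for name, help_text, value in specs:
        lines.append(f"# HELP {name} {help_text}")
        lines.append(f"# TYPE {name} gauge")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    sample = [
        Cluster("wl-a"),
        Cluster("wl-b", paused=True),
        Cluster("wl-c", reachable=False),
    ]
    print(render_metrics(count_clusters(sample)), end="")
```

With gauges like these scraped by Prometheus, an alert could fire when the unreachable count rises above zero, prompting an operator or automation to check the affected clusters and pause them.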
Environment: