Exposed Prometheus metrics for unreachable workload clusters #1281

perithompson · 2021-10-27T09:31:02Z

/kind feature

Describe the solution you'd like
When a management cluster cannot reach a workload cluster, it may be necessary to pause reconciliation of that cluster until connectivity can be restored, so that CAPI does not keep trying to reconcile something it cannot reach. There is an issue (#5394) that outlines some gaps in the documentation around this.

One problem operators will face is knowing when this state occurs. A simple way to detect it would be a set of metrics for workload clusters. Exposing the total number of workload clusters, the number of paused clusters, and the number of unreachable clusters to Prometheus would allow alerting on a change in the unreachable count. Operators could then implement automation or SOPs to check and pause clusters that cannot reconcile for any reason.
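
To make the request concrete, here is a rough sketch of the three counts and of the follow-up automation. It is illustrative only and is not how CAPI or CAPV implements metrics: the `ClusterState` type, the function names, and the `example_workload_clusters*` metric names are all assumptions made up for this example. The output follows the Prometheus text exposition format, with each count rendered as a gauge.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusterState:
    """Hypothetical snapshot of one workload cluster, as seen from the management cluster."""
    name: str
    paused: bool
    reachable: bool


def count_clusters(clusters):
    """Return the total, paused, and unreachable counts for the given clusters."""
    return {
        "total": len(clusters),
        "paused": sum(1 for c in clusters if c.paused),
        "unreachable": sum(1 for c in clusters if not c.reachable),
    }


def render_metrics(counts):
    """Render the counts as gauges in the Prometheus text exposition format."""
    # Metric names are assumptions for illustration, not existing CAPI metrics.
    specs = [
        ("example_workload_clusters", "total", "Number of workload clusters."),
        ("example_workload_clusters_paused", "paused", "Number of paused workload clusters."),
        ("example_workload_clusters_unreachable", "unreachable", "Number of unreachable workload clusters."),
    ]
    lines = []
    for metric, key, help_text in specs:
        lines.append(f"# HELP {metric} {help_text}")
        lines.append(f"# TYPE {metric} gauge")
        lines.append(f"{metric} {counts[key]}")
    return "\n".join(lines) + "\n"


def clusters_to_pause(clusters):
    """Return the names of clusters that are unreachable but not yet paused."""
    return [c.name for c in clusters if not c.reachable and not c.paused]
```

An alert could then fire whenever the unreachable gauge rises above zero, and `clusters_to_pause` gives the list an operator or automation would act on.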

k8s-triage-robot · 2022-01-25T10:23:49Z

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue or PR as fresh with /remove-lifecycle stale
Mark this issue or PR as rotten with /lifecycle rotten
Close this issue or PR with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

srm09 · 2022-01-28T22:25:28Z

@perithompson This seems like something that will sit in the CAPI repo. I am happy to keep this one around if CAPV would need to do something specifically, but I think the entire change will rest directly in CAPI.

k8s-triage-robot · 2022-02-27T23:13:25Z

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue or PR as fresh with /remove-lifecycle rotten
Close this issue or PR with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

srm09 · 2022-03-01T01:41:50Z

/remove-lifecycle rotten
/lifecycle frozen

chrischdi · 2023-08-17T17:36:38Z

This may be resolved, or partially resolved, by #2061.

killianmuldoon · 2023-08-17T18:10:08Z

I think this should be closed in favor of a solution on the CAPI side - there's a related issue here: kubernetes-sigs/cluster-api#5510

sbueringer · 2023-08-17T18:22:08Z

+1

/close

k8s-ci-robot · 2023-08-17T18:22:12Z

@sbueringer: Closing this issue.

In response to this:

+1

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot added the kind/feature Categorizes issue or PR as related to a new feature. label Oct 27, 2021

perithompson mentioned this issue Oct 27, 2021

Raise a metric whenever CAPI cannot see a remote cluster client kubernetes-sigs/cluster-api#5510

Closed

k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 25, 2022

k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Feb 27, 2022

k8s-ci-robot added lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. and removed lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. labels Mar 1, 2022

k8s-ci-robot closed this as completed Aug 17, 2023
