Prow cluster autoscaler stuck occasionally #881

lentzi90 · 2024-10-21T08:21:43Z

Due to the upstream issue kubernetes/autoscaler#6490 (comment), the cluster autoscaler sometimes gets stuck in a loop where it thinks it doesn't have enough privileges to continue.
Deleting the pod gets it going again, but this depends on someone noticing it.

I propose that we either check the autoscaler daily to detect when it gets stuck, or add the RBAC rules the controller thinks it needs until the upstream issue is fixed.
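
For the monitoring option, a rough sketch of an automated check could look like the following. It assumes that a stuck autoscaler keeps logging permission errors; the marker strings and the threshold are hypothetical placeholders, not taken from the real log output, and would need adjusting against what the pod actually prints.

```python
from typing import Iterable

# Assumption: a stuck autoscaler repeatedly logs permission errors.
# These marker strings are hypothetical; adjust them to the real log output.
PERMISSION_MARKERS = ("forbidden", "cannot list", "cannot get")


def looks_stuck(log_lines: Iterable[str], threshold: int = 5) -> bool:
    """Return True once `threshold` lines contain a permission-error marker."""
    hits = 0
    for line in log_lines:
        lowered = line.lower()
        if any(marker in lowered for marker in PERMISSION_MARKERS):
            hits += 1
            if hits >= threshold:
                return True
    return False
```

The input could be the recent output of `kubectl logs --since=1h` for the autoscaler pod. If the check returns True, deleting the pod (the manual fix described above) could be done by whoever gets the alert, or automated.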

tuminoid · 2024-10-21T09:04:21Z

Unless the RBAC it thinks it needs is very invasive, I think that is a better workaround than constant manual monitoring. If we go with RBAC, let's make sure a revert PR, or an issue tracking the revert, is opened right after the merge.

/triage accepted

metal3-io-bot added the needs-triage label · Oct 21, 2024

metal3-io-bot added the triage/accepted label and removed the needs-triage label · Oct 21, 2024
