Prow cluster autoscaler stuck occasionally #881

Open
lentzi90 opened this issue Oct 21, 2024 · 1 comment
Labels
triage/accepted Indicates an issue is ready to be actively worked on.

Comments

@lentzi90
Member

Due to the upstream issue kubernetes/autoscaler#6490 (comment), the cluster autoscaler sometimes gets stuck in a loop where it thinks it doesn't have enough privileges to continue.
Deleting the pod gets it going again, but this depends on someone noticing it.

I propose that we either monitor the autoscaler every day to detect when it gets stuck OR we add the RBAC the controller thinks it needs until the upstream issue is fixed.
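
For the monitoring option, a minimal sketch of an automated check could look like the following. It assumes the stuck state shows up in the autoscaler log as repeated Kubernetes API "forbidden" errors. The exact message text is not quoted in this issue, so the marker string, the threshold, and the function name `looks_stuck` are all illustrative assumptions rather than anything taken from the autoscaler itself.

```python
from collections.abc import Iterable

# Assumption: the permission failures appear in the log as Kubernetes API
# "forbidden" errors. The real message text is not given in this issue.
FORBIDDEN_MARKER = "is forbidden"


def looks_stuck(log_lines: Iterable[str], min_hits: int = 5) -> bool:
    """Return True when at least `min_hits` lines contain the marker.

    `min_hits` is a hypothetical threshold; tune it to the size of the
    log window that is passed in.
    """
    hits = sum(1 for line in log_lines if FORBIDDEN_MARKER in line)
    return hits >= min_hits
```

Something like this could be fed the output of `kubectl logs --since=1h` for the autoscaler pod from a scheduled job, which would then alert someone or delete the pod when it returns `True`.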

@metal3-io-bot metal3-io-bot added the needs-triage Indicates an issue lacks a `triage/foo` label and requires one. label Oct 21, 2024
@tuminoid
Member

Unless the RBAC it thinks it needs is very invasive, I think that is a better workaround than constant manual monitoring. If we go with the RBAC change, let's make sure a revert PR, or an issue tracking the revert, is opened right after the merge.

/triage accepted

@metal3-io-bot metal3-io-bot added triage/accepted Indicates an issue is ready to be actively worked on. and removed needs-triage Indicates an issue lacks a `triage/foo` label and requires one. labels Oct 21, 2024