[bug] Current cleanup.yaml ensure-subresources-deleted fails on incomplete installs #145

Open
mallardduck opened this issue Jan 9, 2025 · 0 comments

Per the title, the ensure-subresources-deleted job in the cleanup manifest is prone to errors when an install runs into issues. I observed this in the job logs:

Ensuring HelmCharts and HelmReleases are deleted from cattle-monitoring-system...
waiting for HelmCharts and HelmReleases to be deleted from cattle-monitoring-system... sleeping 3 seconds
waiting for HelmCharts and HelmReleases to be deleted from cattle-monitoring-system... sleeping 3 seconds
waiting for HelmCharts and HelmReleases to be deleted from cattle-monitoring-system... sleeping 3 seconds

This output is expected, but the job should normally complete quickly. So I checked the initial kubectl command the script uses to enumerate the resources to clean up, and found:

# kubectl get helmcharts,helmreleases
error: the server doesn't have a resource type "helmreleases"

So it appears the cleanup script is failing because the helmreleases resource type was never installed. That makes sense, since the PromFed container never started, and in this case the CRDs are managed by the container/operator rather than by a CRD-specific chart.

We should adjust the cleanup script so it does not fail when it encounters these edge cases. While they should be uncommon, failing here only makes it harder for a customer to recover from whatever initial issue broke the install in the first place. One possible approach is sketched below.
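The following is only a minimal sketch of the idea, not the actual script: build the list of resource types from what the API server actually serves before getting/deleting them. The variable names and the delete strategy here are hypothetical and would need to be adapted to the real cleanup job.

types=""
for kind in helmcharts helmreleases; do
  # Both kinds are CRD-backed, so they only appear in `kubectl api-resources`
  # output (as "<plural>.<group>") when their CRD has been installed.
  if kubectl api-resources --no-headers -o name | grep -q "^${kind}\."; then
    types="${types:+${types},}${kind}"
  fi
done

if [ -n "${types}" ]; then
  echo "Ensuring ${types} are deleted from cattle-monitoring-system..."
  kubectl delete "${types}" -n cattle-monitoring-system --all --wait=true
else
  echo "No HelmChart/HelmRelease resource types are served; nothing to clean up."
fi

With a guard like this, a cluster that never got the helmreleases CRD would simply skip that type instead of erroring out of the whole job.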

Further, the add-cleanup-annotations job is subject to a similar problem, though in that case the error appears to be authentication related. It is not a consistent error; if I see it again I will report back and update this issue. I suspect it was a "race condition" (for lack of a better term) where the ServiceAccount was removed before the cleanup script finished, so the script lost access, or something similar.
