
CAPI cluster stuck in failed after infrastructure cluster has failureMessage #10991

Closed
cwrau opened this issue Aug 2, 2024 · 7 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. needs-priority Indicates an issue lacks a `priority/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one.

Comments

@cwrau
Contributor

cwrau commented Aug 2, 2024

What steps did you take and what happened?

As seen in `internal/controllers/cluster/cluster_controller_phases.go`:

if failureReason != "" {
	clusterStatusError := capierrors.ClusterStatusError(failureReason)
	cluster.Status.FailureReason = &clusterStatusError
}
if failureMessage != "" {
	cluster.Status.FailureMessage = ptr.To(
		fmt.Sprintf("Failure detected from referenced resource %v with name %q: %s",
			obj.GroupVersionKind(), obj.GetName(), failureMessage),
	)
}
when the referenced infrastructure cluster reports any error, it gets copied onto the CAPI cluster. Then, in

if cluster.Status.FailureReason != nil || cluster.Status.FailureMessage != nil {
	cluster.Status.SetTypedPhase(clusterv1.ClusterPhaseFailed)
}

these fields are used to set the phase to Failed.

But I can't find any code path that sets failureMessage and failureReason back to nil.
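To illustrate why this matters, here is a minimal, self-contained Go sketch of the behavior described above. The types and function names are simplified stand-ins, not the real clusterv1 API: because the copy step only ever sets the fields, a transient infrastructure error leaves the cluster stuck in Failed even after the infrastructure cluster recovers.

```go
package main

import "fmt"

// ClusterStatus is an illustrative stand-in for clusterv1.ClusterStatus.
type ClusterStatus struct {
	FailureReason  *string
	FailureMessage *string
	Phase          string
}

// reconcileExternal mirrors the copy logic: non-empty values are copied,
// but empty values never clear the fields.
func reconcileExternal(s *ClusterStatus, failureReason, failureMessage string) {
	if failureReason != "" {
		s.FailureReason = &failureReason
	}
	if failureMessage != "" {
		s.FailureMessage = &failureMessage
	}
}

// reconcilePhase mirrors the phase logic: any failure field forces Failed.
func reconcilePhase(s *ClusterStatus) {
	s.Phase = "Provisioned"
	if s.FailureReason != nil || s.FailureMessage != nil {
		s.Phase = "Failed"
	}
}

func main() {
	s := &ClusterStatus{}

	// First reconcile: the infra cluster reports a transient error.
	reconcileExternal(s, "CreateError", "quota exceeded")
	reconcilePhase(s)
	fmt.Println(s.Phase) // Failed

	// Second reconcile: the infra cluster has recovered, but the copied
	// fields are never reset, so the phase stays Failed.
	reconcileExternal(s, "", "")
	reconcilePhase(s)
	fmt.Println(s.Phase) // Failed
}
```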

What did you expect to happen?

That the failureMessage and failureReason get reset at some point.

Cluster API version

1.6.3, but the same code is in main as well

Kubernetes version

Client Version: v1.30.3
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.27.14

Anything else you would like to add?

If no one objects, I'd open a PR with the following changes:

diff --git a/internal/controllers/cluster/cluster_controller_phases.go b/internal/controllers/cluster/cluster_controller_phases.go
index 4afa3976f..df0335218 100644
--- a/internal/controllers/cluster/cluster_controller_phases.go
+++ b/internal/controllers/cluster/cluster_controller_phases.go
@@ -135,12 +135,16 @@ func (r *Reconciler) reconcileExternal(ctx context.Context, cluster *clusterv1.C
 	if failureReason != "" {
 		clusterStatusError := capierrors.ClusterStatusError(failureReason)
 		cluster.Status.FailureReason = &clusterStatusError
+	} else {
+		cluster.Status.FailureReason = nil
 	}
 	if failureMessage != "" {
 		cluster.Status.FailureMessage = ptr.To(
 			fmt.Sprintf("Failure detected from referenced resource %v with name %q: %s",
 				obj.GroupVersionKind(), obj.GetName(), failureMessage),
 		)
+	} else {
+		cluster.Status.FailureMessage = nil
 	}
 
 	return external.ReconcileOutput{Result: obj}, nil
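The effect of the proposed else branches can be sketched in isolation (again with hypothetical simplified types, not the real controller code): once empty values reset the fields, a reconcile that no longer sees an infra error lets the phase leave Failed.

```go
package main

import "fmt"

// status is an illustrative stand-in for the relevant cluster status fields.
type status struct {
	failureReason, failureMessage *string
	phase                         string
}

// reconcile applies the patched logic: empty values now reset the fields,
// and the phase is recomputed from them.
func reconcile(s *status, reason, message string) {
	if reason != "" {
		s.failureReason = &reason
	} else {
		s.failureReason = nil
	}
	if message != "" {
		s.failureMessage = &message
	} else {
		s.failureMessage = nil
	}
	if s.failureReason != nil || s.failureMessage != nil {
		s.phase = "Failed"
	} else {
		s.phase = "Provisioned"
	}
}

func main() {
	s := &status{}
	reconcile(s, "CreateError", "quota exceeded")
	fmt.Println(s.phase) // Failed
	reconcile(s, "", "")
	fmt.Println(s.phase) // Provisioned
}
```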

Label(s) to be applied

/kind bug
One or more /area label. See https://github.com/kubernetes-sigs/cluster-api/labels?q=area for the list of labels.

@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. needs-priority Indicates an issue lacks a `priority/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Aug 2, 2024
@k8s-ci-robot
Contributor

This issue is currently awaiting triage.

If CAPI contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@sbueringer
Member

sbueringer commented Aug 2, 2024

failureMessage / failureReason are supposed to be used to signal terminal failures. Terminal in the sense that they cannot be recovered from. That is why there are no code paths to unset them again.

(we've been trying to get rid of the concept of terminal failure for a while, looks like with v1beta2 we'll get around to actually doing it, see: #10897)

@cwrau
Contributor Author

cwrau commented Aug 2, 2024

failureMessage / failureReason are supposed to be used to signal terminal failures. Terminal in the sense that they cannot be recovered from. That is why there are no code paths to unset them again.

Ok, so this is an issue with Cluster API Provider OpenStack (with which this is happening)? They shouldn't set these fields if it's not a terminal failure, then?

@sbueringer
Member

Correct!

@sbueringer
Member

(manual workaround is ~ kubectl edit --subresource=status)

@fabriziopandini
Member

Looking forward to getting #10997 implemented and getting rid of failureMessage / failureReason.

/close
(based on comments above)

@k8s-ci-robot
Contributor

@fabriziopandini: Closing this issue.

In response to this:

Looking forward to getting #10997 implemented and getting rid of failureMessage / failureReason.

/close
(based on comments above)

