
CAPI cluster stuck in failed after infrastructure cluster has failureMessage #10991

Closed
cwrau opened this issue Aug 2, 2024 · 7 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. needs-priority Indicates an issue lacks a `priority/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one.

Comments

@cwrau
Contributor

cwrau commented Aug 2, 2024

What steps did you take and what happened?

As seen in `internal/controllers/cluster/cluster_controller_phases.go`:

if failureReason != "" {
	clusterStatusError := capierrors.ClusterStatusError(failureReason)
	cluster.Status.FailureReason = &clusterStatusError
}
if failureMessage != "" {
	cluster.Status.FailureMessage = ptr.To(
		fmt.Sprintf("Failure detected from referenced resource %v with name %q: %s",
			obj.GroupVersionKind(), obj.GetName(), failureMessage),
	)
}
when the referenced infrastructure cluster reports any error, it gets copied onto the CAPI cluster. Then, in

if cluster.Status.FailureReason != nil || cluster.Status.FailureMessage != nil {
	cluster.Status.SetTypedPhase(clusterv1.ClusterPhaseFailed)
}

these fields are used to set the phase to Failed.

But I can't find any code path that sets failureMessage and failureReason back to nil.
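To illustrate why this matters, here is a minimal, self-contained Go sketch of the behavior described above. The types and function names are simplified stand-ins, not the real clusterv1 API: because the copy step only ever sets the fields, a transient infrastructure error leaves the cluster stuck in Failed even after the infrastructure cluster recovers.

```go
package main

import "fmt"

// ClusterStatus is an illustrative stand-in for clusterv1.ClusterStatus.
type ClusterStatus struct {
	FailureReason  *string
	FailureMessage *string
	Phase          string
}

// reconcileExternal mirrors the copy logic: non-empty values are copied,
// but empty values never clear the fields.
func reconcileExternal(s *ClusterStatus, failureReason, failureMessage string) {
	if failureReason != "" {
		s.FailureReason = &failureReason
	}
	if failureMessage != "" {
		s.FailureMessage = &failureMessage
	}
}

// reconcilePhase mirrors the phase logic: any failure field forces Failed.
func reconcilePhase(s *ClusterStatus) {
	s.Phase = "Provisioned"
	if s.FailureReason != nil || s.FailureMessage != nil {
		s.Phase = "Failed"
	}
}

func main() {
	s := &ClusterStatus{}

	// First reconcile: the infra cluster reports a transient error.
	reconcileExternal(s, "CreateError", "quota exceeded")
	reconcilePhase(s)
	fmt.Println(s.Phase) // Failed

	// Second reconcile: the infra cluster has recovered, but the copied
	// fields are never reset, so the phase stays Failed.
	reconcileExternal(s, "", "")
	reconcilePhase(s)
	fmt.Println(s.Phase) // Failed
}
```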

What did you expect to happen?

That the failureMessage and failureReason get reset at some point.

Cluster API version

1.6.3, but the same code is in main as well

Kubernetes version

Client Version: v1.30.3
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.27.14

Anything else you would like to add?

If no one objects, I'd open a PR with the following changes:

diff --git a/internal/controllers/cluster/cluster_controller_phases.go b/internal/controllers/cluster/cluster_controller_phases.go
index 4afa3976f..df0335218 100644
--- a/internal/controllers/cluster/cluster_controller_phases.go
+++ b/internal/controllers/cluster/cluster_controller_phases.go
@@ -135,12 +135,16 @@ func (r *Reconciler) reconcileExternal(ctx context.Context, cluster *clusterv1.C
 	if failureReason != "" {
 		clusterStatusError := capierrors.ClusterStatusError(failureReason)
 		cluster.Status.FailureReason = &clusterStatusError
+	} else {
+		cluster.Status.FailureReason = nil
 	}
 	if failureMessage != "" {
 		cluster.Status.FailureMessage = ptr.To(
 			fmt.Sprintf("Failure detected from referenced resource %v with name %q: %s",
 				obj.GroupVersionKind(), obj.GetName(), failureMessage),
 		)
+	} else {
+		cluster.Status.FailureMessage = nil
 	}
 
 	return external.ReconcileOutput{Result: obj}, nil
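The effect of the proposed else branches can be sketched in isolation (again with hypothetical simplified types, not the real controller code): once empty values reset the fields, a reconcile that no longer sees an infra error lets the phase leave Failed.

```go
package main

import "fmt"

// status is an illustrative stand-in for the relevant cluster status fields.
type status struct {
	failureReason, failureMessage *string
	phase                         string
}

// reconcile applies the patched logic: empty values now reset the fields,
// and the phase is recomputed from them.
func reconcile(s *status, reason, message string) {
	if reason != "" {
		s.failureReason = &reason
	} else {
		s.failureReason = nil
	}
	if message != "" {
		s.failureMessage = &message
	} else {
		s.failureMessage = nil
	}
	if s.failureReason != nil || s.failureMessage != nil {
		s.phase = "Failed"
	} else {
		s.phase = "Provisioned"
	}
}

func main() {
	s := &status{}
	reconcile(s, "CreateError", "quota exceeded")
	fmt.Println(s.phase) // Failed
	reconcile(s, "", "")
	fmt.Println(s.phase) // Provisioned
}
```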

Label(s) to be applied

/kind bug
One or more /area label. See https://github.com/kubernetes-sigs/cluster-api/labels?q=area for the list of labels.

@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. needs-priority Indicates an issue lacks a `priority/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Aug 2, 2024
@k8s-ci-robot
Contributor

This issue is currently awaiting triage.

If CAPI contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@sbueringer
Member

sbueringer commented Aug 2, 2024

failureMessage / failureReason are supposed to be used to signal terminal failures. Terminal in the sense that they cannot be recovered from. That is why there are no code paths to unset them again.

(we've been trying to get rid of the concept of terminal failure for a while, looks like with v1beta2 we'll get around to actually doing it, see: #10897)

@cwrau
Contributor Author

cwrau commented Aug 2, 2024

failureMessage / failureReason are supposed to be used to signal terminal failures. Terminal in the sense that they cannot be recovered from. That is why there are no code paths to unset them again.

Ok, so this is an issue with Cluster API Provider OpenStack (with which this is happening)? They shouldn't set these fields if it's not a terminal failure, then?

@sbueringer
Member

Correct!

@sbueringer
Member

(manual workaround is ~ kubectl edit --subresource=status)

@fabriziopandini
Member

Looking forward to getting #10997 implemented and getting rid of failureMessage / failureReason.

/close
(based on comments above)

@k8s-ci-robot
Contributor

@fabriziopandini: Closing this issue.

In response to this:

Looking forward to getting #10997 implemented and getting rid of failureMessage / failureReason.

/close
(based on comments above)

