From ab02db37fbc2bc19d04979c900d348c3118513fa Mon Sep 17 00:00:00 2001 From: Fabrizio Pandini Date: Thu, 9 May 2024 11:21:00 +0200 Subject: [PATCH] =?UTF-8?q?=F0=9F=93=96=20Document=20failureReason=20and?= =?UTF-8?q?=20Message=20are=20considered=20terminal=20errors=20(#10561)?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit * Document failureReason and Message are considered terminal errors * Address comments * Clarify what cannot be restored anymore means --- .../src/developer/architecture/controllers/cluster.md | 3 +++ .../developer/architecture/controllers/control-plane.md | 3 +++ .../developer/architecture/controllers/machine-pool.md | 8 +++++++- .../src/developer/architecture/controllers/machine.md | 8 ++++++++ docs/book/src/developer/providers/bootstrap.md | 4 ++++ .../src/developer/providers/cluster-infrastructure.md | 3 +++ .../src/developer/providers/machine-infrastructure.md | 3 +++ .../tasks/automated-machine-management/healthchecking.md | 2 +- 8 files changed, 32 insertions(+), 2 deletions(-) diff --git a/docs/book/src/developer/architecture/controllers/cluster.md b/docs/book/src/developer/architecture/controllers/cluster.md index 22f9394477f9..9a36d1b22696 100644 --- a/docs/book/src/developer/architecture/controllers/cluster.md +++ b/docs/book/src/developer/architecture/controllers/cluster.md @@ -50,6 +50,9 @@ is a map, defined as `map[string]FailureDomainSpec`. A unique key must be used f - `controlPlane` (bool): indicates if failure domain is appropriate for running control plane instances. - `attributes` (`map[string]string`): arbitrary attributes for users to apply to a failure domain. +Note: once either `failureReason` or `failureMessage` surfaces on the cluster that references the infrastructureCluster object, +the cluster cannot be restored (this is considered a terminal error; the only way to recover is to delete and recreate the cluster). 
+ Example: ```yaml kind: MyProviderCluster diff --git a/docs/book/src/developer/architecture/controllers/control-plane.md b/docs/book/src/developer/architecture/controllers/control-plane.md index 94637ef610a4..64084f33c5b8 100644 --- a/docs/book/src/developer/architecture/controllers/control-plane.md +++ b/docs/book/src/developer/architecture/controllers/control-plane.md @@ -234,6 +234,9 @@ The `status` object **may** define several fields: exist in the cluster. For example, managed control plane providers for AKS, EKS, GKE, etc, should set this to `true`. Leaving the field undefined is equivalent to setting the value to `false`. +Note: once either `failureReason` or `failureMessage` surfaces on the cluster that references the control plane object, +the cluster cannot be restored (this is considered a terminal error; the only way to recover is to delete and recreate the cluster). + ## Example usage ```yaml diff --git a/docs/book/src/developer/architecture/controllers/machine-pool.md b/docs/book/src/developer/architecture/controllers/machine-pool.md index 6dc717bff70c..0ce38cb05d20 100644 --- a/docs/book/src/developer/architecture/controllers/machine-pool.md +++ b/docs/book/src/developer/architecture/controllers/machine-pool.md @@ -61,6 +61,9 @@ The `status` object **may** define several fields that do not affect functionali * `failureReason` - a string field explaining why a fatal error has occurred, if possible. * `failureMessage` - a string field that holds the message contained by the error. +Note: once either `failureReason` or `failureMessage` surfaces on the machine pool that references the bootstrap config object, +the machine pool cannot be restored (this is considered a terminal error; the only way to recover is to delete and recreate the machine pool). + Example: ```yaml @@ -97,7 +100,10 @@ The `status` object **may** define several fields that do not affect functionali * `failureMessage` - is a string that holds the message contained by the error. 
* `infrastructureMachineKind` - the kind of the InfraMachines. This should be set if the InfrastructureMachinePool plans to support MachinePool Machines. -**Note:** Infrastructure providers can support MachinePool Machines by having the InfraMachinePool set the `infrastructureMachineKind` to the kind of their InfrastructureMachines. The InfrastructureMachinePool will be responsible for creating InfrastructureMachines as the MachinePool is scaled up, and the MachinePool controller will create Machines for each InfrastructureMachine and set the ownerRef. The InfrastructureMachinePool will be responsible for deleting the Machines as the MachinePool is scaled down in order for the Machine deletion workflow to function properly. In addition, the InfrastructureMachines must also have the following labels set by the InfrastructureMachinePool: `cluster.x-k8s.io/cluster-name` and `cluster.x-k8s.io/pool-name`. The `MachinePoolNameLabel` must also be formatted with `capilabels.MustFormatValue()` so that it will not exceed character limits. +Note: once either `failureReason` or `failureMessage` surfaces on the machine pool that references the InfrastructureMachinePool object, +the machine pool cannot be restored (this is considered a terminal error; the only way to recover is to delete and recreate the machine pool). + +Note: Infrastructure providers can support MachinePool Machines by having the InfraMachinePool set the `infrastructureMachineKind` to the kind of their InfrastructureMachines. The InfrastructureMachinePool will be responsible for creating InfrastructureMachines as the MachinePool is scaled up, and the MachinePool controller will create Machines for each InfrastructureMachine and set the ownerRef. The InfrastructureMachinePool will be responsible for deleting the Machines as the MachinePool is scaled down in order for the Machine deletion workflow to function properly. 
In addition, the InfrastructureMachines must also have the following labels set by the InfrastructureMachinePool: `cluster.x-k8s.io/cluster-name` and `cluster.x-k8s.io/pool-name`. The `MachinePoolNameLabel` must also be formatted with `capilabels.MustFormatValue()` so that it will not exceed character limits. Example ```yaml diff --git a/docs/book/src/developer/architecture/controllers/machine.md b/docs/book/src/developer/architecture/controllers/machine.md index 679a9cb6f5cc..bd318a381593 100644 --- a/docs/book/src/developer/architecture/controllers/machine.md +++ b/docs/book/src/developer/architecture/controllers/machine.md @@ -61,6 +61,10 @@ The `status` object **may** define several fields that do not affect functionali * `failureReason` - a string field explaining why a fatal error has occurred, if possible. * `failureMessage` - a string field that holds the message contained by the error. +Note: once either `failureReason` or `failureMessage` surfaces on the machine that references the bootstrap config object, +the machine cannot be restored (this is considered a terminal error; the only way to recover is to delete and recreate the machine). +Also, if the machine is under the control of a MachineHealthCheck instance, it will be automatically remediated. + Example: ```yaml @@ -105,6 +109,10 @@ defined as: - `type` (string): one of `Hostname`, `ExternalIP`, `InternalIP`, `ExternalDNS`, `InternalDNS` - `address` (string) +Note: once either `failureReason` or `failureMessage` surfaces on the machine that references the infrastructureMachine object, +the machine cannot be restored (this is considered a terminal error; the only way to recover is to delete and recreate the machine). +Also, if the machine is under the control of a MachineHealthCheck instance, it will be automatically remediated. 
+ Example: ```yaml kind: MyMachine diff --git a/docs/book/src/developer/providers/bootstrap.md b/docs/book/src/developer/providers/bootstrap.md index 21ef14cf4d8a..55354e6edae5 100644 --- a/docs/book/src/developer/providers/bootstrap.md +++ b/docs/book/src/developer/providers/bootstrap.md @@ -27,6 +27,10 @@ A bootstrap provider must define an API type for bootstrap resources. The type: 2. `failureMessage` (string): indicates there is a fatal problem reconciling the bootstrap data; meant to be a more descriptive value than `failureReason` +Note: once either `failureReason` or `failureMessage` surfaces on the machine/machine pool that references the bootstrap config object, +the machine/machine pool cannot be restored (this is considered a terminal error; the only way to recover is to delete and recreate the machine/machine pool). +Also, if the machine is under the control of a MachineHealthCheck instance, it will be automatically remediated. + Note: because the `dataSecretName` is part of `status`, this value must be deterministically recreatable from the data in the `Cluster`, `Machine`, and/or bootstrap resource. If the name is randomly generated, it is not always possible to move the resource and its associated secret from one management cluster to another. diff --git a/docs/book/src/developer/providers/cluster-infrastructure.md b/docs/book/src/developer/providers/cluster-infrastructure.md index 907ca246fd59..249b2abc247a 100644 --- a/docs/book/src/developer/providers/cluster-infrastructure.md +++ b/docs/book/src/developer/providers/cluster-infrastructure.md @@ -36,6 +36,9 @@ A cluster infrastructure provider must define an API type for "infrastructure cl - `controlPlane` (bool): indicates if failure domain is appropriate for running control plane instances. - `attributes` (`map[string]string`): arbitrary attributes for users to apply to a failure domain. 
+Note: once either `failureReason` or `failureMessage` surfaces on the cluster that references the infrastructureCluster object, +the cluster cannot be restored (this is considered a terminal error; the only way to recover is to delete and recreate the cluster). + ### InfraClusterTemplate Resources For a given InfraCluster resource, you should also add a corresponding InfraClusterTemplate resources: diff --git a/docs/book/src/developer/providers/machine-infrastructure.md b/docs/book/src/developer/providers/machine-infrastructure.md index db4cae023a14..75177468c8eb 100644 --- a/docs/book/src/developer/providers/machine-infrastructure.md +++ b/docs/book/src/developer/providers/machine-infrastructure.md @@ -45,6 +45,9 @@ A machine infrastructure provider must define an API type for "infrastructure ma 7. Should have a conditions field with the following: 1. A Ready condition to represent the overall operational state of the component. It can be based on the summary of more detailed conditions existing on the same object, e.g. instanceReady, SecurityGroupsReady conditions. +Note: once either `failureReason` or `failureMessage` surfaces on the machine that references the infrastructureMachine object, +the machine cannot be restored (this is considered a terminal error; the only way to recover is to delete and recreate the machine). +Also, if the machine is under the control of a MachineHealthCheck instance, it will be automatically remediated. 
### InfraMachineTemplate Resources diff --git a/docs/book/src/tasks/automated-machine-management/healthchecking.md b/docs/book/src/tasks/automated-machine-management/healthchecking.md index 1f1af85772af..117b3bdb0744 100644 --- a/docs/book/src/tasks/automated-machine-management/healthchecking.md +++ b/docs/book/src/tasks/automated-machine-management/healthchecking.md @@ -20,7 +20,7 @@ A MachineHealthCheck is a resource within the Cluster API which allows users to A MachineHealthCheck is defined on a management cluster and scoped to a particular workload cluster. When defining a MachineHealthCheck, users specify a timeout for each of the conditions that they define to check on the Machine's Node. -If any of these conditions are met for the duration of the timeout, the Machine will be remediated. +If any of these conditions are met for the duration of the timeout, the Machine will be remediated. Also, Machines with `failureReason` or `failureMessage` (terminal failures) are automatically remediated. By default, the action of remediating a Machine should trigger a new Machine to be created to replace the failed one, but providers are allowed to plug in more sophisticated external remediation solutions. ## Creating a MachineHealthCheck