📖 Document failureReason and Message are considered terminal errors (#…
…10561)

* Document failureReason and Message are considered terminal errors

* Address comments

* Clarify what cannot be restored anymore means
fabriziopandini authored May 9, 2024
1 parent 8e72a0e commit ab02db3
Showing 8 changed files with 32 additions and 2 deletions.
3 changes: 3 additions & 0 deletions docs/book/src/developer/architecture/controllers/cluster.md
@@ -50,6 +50,9 @@ is a map, defined as `map[string]FailureDomainSpec`. A unique key must be used f
- `controlPlane` (bool): indicates if failure domain is appropriate for running control plane instances.
- `attributes` (`map[string]string`): arbitrary attributes for users to apply to a failure domain.

Note: once either `failureReason` or `failureMessage` surfaces on the cluster that references the infrastructureCluster object,
the failure cannot be reverted (it is considered a terminal error; the only way to recover is to delete and recreate the cluster).
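
As a rough illustration of the note above, the sketch below shows what an infrastructure cluster object might look like once a terminal failure has been reported. Only the field names `failureReason` and `failureMessage` come from the text; the `apiVersion`, the object name, and both field values are hypothetical placeholders rather than values defined by any specific provider.

```yaml
# Hypothetical infrastructure cluster reporting a terminal failure.
kind: MyProviderCluster
apiVersion: infrastructure.example.com/v1alpha1  # placeholder group/version (assumption)
metadata:
  name: my-cluster  # placeholder name (assumption)
status:
  # Both values below are illustrative only.
  failureReason: CreateError
  failureMessage: "failed to create the network: quota exceeded"
```

Per the note, once these fields surface on the referencing cluster, recovery means deleting and recreating the cluster.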

Example:
```yaml
kind: MyProviderCluster
@@ -234,6 +234,9 @@ The `status` object **may** define several fields:
exist in the cluster. For example, managed control plane providers for AKS, EKS, GKE, etc, should
set this to `true`. Leaving the field undefined is equivalent to setting the value to `false`.

Note: once either `failureReason` or `failureMessage` surfaces on the cluster that references the control plane object,
the failure cannot be reverted (it is considered a terminal error; the only way to recover is to delete and recreate the cluster).

## Example usage

```yaml
@@ -61,6 +61,9 @@ The `status` object **may** define several fields that do not affect functionali
* `failureReason` - a string field explaining why a fatal error has occurred, if possible.
* `failureMessage` - a string field that holds the message contained by the error.

Note: once either `failureReason` or `failureMessage` surfaces on the machine pool that references the bootstrap config object,
the failure cannot be reverted (it is considered a terminal error; the only way to recover is to delete and recreate the machine pool).

Example:

```yaml
@@ -97,7 +100,10 @@ The `status` object **may** define several fields that do not affect functionali
* `failureMessage` - is a string that holds the message contained by the error.
* `infrastructureMachineKind` - the kind of the InfraMachines. This should be set if the InfrastructureMachinePool plans to support MachinePool Machines.

**Note:** Infrastructure providers can support MachinePool Machines by having the InfraMachinePool set the `infrastructureMachineKind` to the kind of their InfrastructureMachines. The InfrastructureMachinePool will be responsible for creating InfrastructureMachines as the MachinePool is scaled up, and the MachinePool controller will create Machines for each InfrastructureMachine and set the ownerRef. The InfrastructureMachinePool will be responsible for deleting the Machines as the MachinePool is scaled down in order for the Machine deletion workflow to function properly. In addition, the InfrastructureMachines must also have the following labels set by the InfrastructureMachinePool: `cluster.x-k8s.io/cluster-name` and `cluster.x-k8s.io/pool-name`. The `MachinePoolNameLabel` must also be formatted with `capilabels.MustFormatValue()` so that it will not exceed character limits.
Note: once either `failureReason` or `failureMessage` surfaces on the machine pool that references the InfrastructureMachinePool object,
the failure cannot be reverted (it is considered a terminal error; the only way to recover is to delete and recreate the machine pool).

Note: Infrastructure providers can support MachinePool Machines by having the InfraMachinePool set the `infrastructureMachineKind` to the kind of their InfrastructureMachines. The InfrastructureMachinePool will be responsible for creating InfrastructureMachines as the MachinePool is scaled up, and the MachinePool controller will create Machines for each InfrastructureMachine and set the ownerRef. The InfrastructureMachinePool will be responsible for deleting the Machines as the MachinePool is scaled down in order for the Machine deletion workflow to function properly. In addition, the InfrastructureMachines must also have the following labels set by the InfrastructureMachinePool: `cluster.x-k8s.io/cluster-name` and `cluster.x-k8s.io/pool-name`. The `MachinePoolNameLabel` must also be formatted with `capilabels.MustFormatValue()` so that it will not exceed character limits.

Example
```yaml
8 changes: 8 additions & 0 deletions docs/book/src/developer/architecture/controllers/machine.md
@@ -61,6 +61,10 @@ The `status` object **may** define several fields that do not affect functionali
* `failureReason` - a string field explaining why a fatal error has occurred, if possible.
* `failureMessage` - a string field that holds the message contained by the error.

Note: once either `failureReason` or `failureMessage` surfaces on the machine that references the bootstrap config object,
the failure cannot be reverted (it is considered a terminal error; the only way to recover is to delete and recreate the machine).
Also, if the machine is under the control of a MachineHealthCheck, it will be automatically remediated.

Example:

```yaml
@@ -105,6 +109,10 @@ defined as:
- `type` (string): one of `Hostname`, `ExternalIP`, `InternalIP`, `ExternalDNS`, `InternalDNS`
- `address` (string)

Note: once either `failureReason` or `failureMessage` surfaces on the machine that references the infrastructureMachine object,
the failure cannot be reverted (it is considered a terminal error; the only way to recover is to delete and recreate the machine).
Also, if the machine is under the control of a MachineHealthCheck, it will be automatically remediated.

Example:
```yaml
kind: MyMachine
4 changes: 4 additions & 0 deletions docs/book/src/developer/providers/bootstrap.md
@@ -27,6 +27,10 @@ A bootstrap provider must define an API type for bootstrap resources. The type:
2. `failureMessage` (string): indicates there is a fatal problem reconciling the bootstrap data;
meant to be a more descriptive value than `failureReason`

Note: once either `failureReason` or `failureMessage` surfaces on the machine or machine pool that references the bootstrap config object,
the failure cannot be reverted (it is considered a terminal error; the only way to recover is to delete and recreate the machine or machine pool).
Also, if the machine is under the control of a MachineHealthCheck, it will be automatically remediated.

Note: because the `dataSecretName` is part of `status`, this value must be deterministically recreatable from the data in the
`Cluster`, `Machine`, and/or bootstrap resource. If the name is randomly generated, it is not always possible to move
the resource and its associated secret from one management cluster to another.
3 changes: 3 additions & 0 deletions docs/book/src/developer/providers/cluster-infrastructure.md
@@ -36,6 +36,9 @@ A cluster infrastructure provider must define an API type for "infrastructure cl
- `controlPlane` (bool): indicates if failure domain is appropriate for running control plane instances.
- `attributes` (`map[string]string`): arbitrary attributes for users to apply to a failure domain.

Note: once either `failureReason` or `failureMessage` surfaces on the cluster that references the infrastructureCluster object,
the failure cannot be reverted (it is considered a terminal error; the only way to recover is to delete and recreate the cluster).

### InfraClusterTemplate Resources

For a given InfraCluster resource, you should also add a corresponding InfraClusterTemplate resource:
3 changes: 3 additions & 0 deletions docs/book/src/developer/providers/machine-infrastructure.md
@@ -45,6 +45,9 @@ A machine infrastructure provider must define an API type for "infrastructure ma
7. Should have a conditions field with the following:
1. A Ready condition to represent the overall operational state of the component. It can be based on the summary of more detailed conditions existing on the same object, e.g. instanceReady, SecurityGroupsReady conditions.

Note: once either `failureReason` or `failureMessage` surfaces on the machine that references the infrastructureMachine object,
the failure cannot be reverted (it is considered a terminal error; the only way to recover is to delete and recreate the machine).
Also, if the machine is under the control of a MachineHealthCheck, it will be automatically remediated.

### InfraMachineTemplate Resources

@@ -20,7 +20,7 @@ A MachineHealthCheck is a resource within the Cluster API which allows users to
A MachineHealthCheck is defined on a management cluster and scoped to a particular workload cluster.

When defining a MachineHealthCheck, users specify a timeout for each of the conditions that they define to check on the Machine's Node.
If any of these conditions are met for the duration of the timeout, the Machine will be remediated.
If any of these conditions are met for the duration of the timeout, the Machine will be remediated. Also, Machines with `failureReason` or `failureMessage` (terminal failures) are automatically remediated.
By default, the action of remediating a Machine should trigger a new Machine to be created to replace the failed one, but providers are allowed to plug in more sophisticated external remediation solutions.

## Creating a MachineHealthCheck