focus on reasons
Signed-off-by: Thibault Mange <22740367+thibaultmg@users.noreply.github.com>
thibaultmg committed Jun 19, 2024
1 parent 78c4738 commit d4a60f2
Showing 1 changed file with 57 additions and 70 deletions.
127 changes: 57 additions & 70 deletions operators/endpointmetrics/doc/design/addon-status.md
@@ -13,22 +13,15 @@ The status is reported through the standard `status.Conditions` structure in the
- **reason**: A short, machine-readable string that gives the reason for the condition's last transition.
- **message**: A human-readable message providing details about the condition.

In our case, we use the following types of conditions:
[Three standard types](https://open-cluster-management.io/concepts/addon#add-on-healthiness) of conditions are used to report the status of the addon:

- **Available**: The addon is available and running, including the UWL metrics collector if configured.
- **Degraded**: The addon is not working as expected. Either metrics are not being forwarded or the addon update has failed.
- **Disabled**: The addon is disabled.
- **NotSupported**: The addon is not supported on the current platform. This happens when prometheus is not being installed but the cluster is not OCP.
- **Progressing**: The addon is being installed or has been updated.

Finally, the types can be associated with the following reasons:
Only one of these standard types can have the `True` status at a time. They are mutually exclusive.

- **DisabledMetrics**: The metrics collector is disabled. It leads to the `Disabled` status.
- **ForwardFailure**: The metrics collector failed to forward metrics. It leads to the `Degraded` status.
- **ForwardSuccessful**: The metrics collector successfully forwarded metrics. It leads to the `Available` status.
- **NotSupported**: The platform is not supported. It leads to the `NotSupported` status.
- **UpdateFailure**: The endpoint operator failed to update the metrics collector. It leads to the `Degraded` status.
- **UpdateSuccesful**: The endpoint operator successfully updated the metrics collector. It leads to the `Progressing` status.
Each type is accompanied by a standard reason that explains the condition's last transition. The reasons are defined later in this document.

## Condition List Management

@@ -38,24 +31,30 @@ If the most recent condition is identical to the new condition, the status contr

## Status Synchronization

The following diagram shows the actors involved in the status synchronization process and their high-level interactions:
The diagram below shows the actors involved in the status synchronization process and their high-level interactions.

The hub cluster also deploys the metrics-collector, but it does not rely on the addon feature. Thus, the addon status reporting described here does not apply to the hub cluster addon.

```mermaid
sequenceDiagram
participant EndpointController
participant MetricsCollector
participant UwlMetricsCollector
participant StatusController
participant SpokeAddon
participant HubAddon
EndpointController--)SpokeAddon: Updates
MetricsCollector--)SpokeAddon: Updates
UwlMetricsCollector--)SpokeAddon: Updates
StatusController--)SpokeAddon: Watches and updates
StatusController--)HubAddon: Replicates spoke addon status
participant SpokeObsAddon
participant HubObsAddon
participant MCO
participant ManagedClusterAddOn
EndpointController--)SpokeObsAddon: Updates
MetricsCollector--)SpokeObsAddon: Updates
UwlMetricsCollector--)SpokeObsAddon: Updates
StatusController--)SpokeObsAddon: Watches and updates
StatusController--)HubObsAddon: Replicates spoke addon status
MCO--)HubObsAddon: Watches
MCO--)ManagedClusterAddOn: Replicates HubObsAddon status
```

The hub cluster also deploys the metrics-collector, but it does not rely on the addon feature. Thus, addon status reporting does not apply to the hub cluster addon.
Ultimately, the status displayed in the hub cluster for the observability addon is the status of the **ManagedClusterAddOn** CR named **observability-controller**.

## Addon State Management

@@ -72,29 +71,25 @@ The addon state is an aggregation of the states of the metrics collector and the

### Defining the Aggregated State Accurately

To accurately reflect the aggregated state, we introduce two additional condition types specific to the collectors: **MetricsCollectorStatus** and **UwlMetricsCollectorStatus**. These conditions are updated by the metrics collector, the UWL metrics collector, and the endpoint controller. The status controller then aggregates these specific conditions to update the standard conditions of the addon status. A predicate function ensures that the status controller is not triggered by its own updates.

These special condition types store the latest reasons for the metrics collector and the UWL metrics collector. These reasons can be mapped to the standard types, which the status controller uses to update the addon status.

In this context, we ignore the **Disabled** and **NotSupported** states, which are set solely by the endpoint operator and do not require aggregation.
To accurately reflect the aggregated state, we introduce two additional condition types specific to the collectors: **MetricsCollectorStatus** and **UwlMetricsCollectorStatus**. These conditions are updated by the metrics collector, the UWL metrics collector, and the endpoint controller. The status controller then aggregates these specific conditions to update the standard condition types of the addon status. A predicate function ensures that the status controller is not triggered by its own updates.

Below are the individual reasons for the metrics collector and the UWL metrics collector, along with the resulting aggregated reasons:

| Reasons | ForwardFailure | ForwardSuccessful | UpdateFailure | UpdateSuccesful |
|--------------------------|----------------|--------------|-------------------|-------------------|
| ForwardFailure | ForwardFailure | x | x | x |
| ForwardSuccessful | ForwardFailure | ForwardSuccessful | x | x |
| UpdateFailure | ForwardFailure | UpdateFailure | UpdateFailure | x |
| UpdateSuccesful | ForwardFailure | UpdateSuccesful | UpdateFailure | UpdateSuccesful |

Then, the aggregated reasons are mapped to the standard types:
These special condition types store the latest reasons for the metrics collector and the UWL metrics collector. These reasons can be mapped to the standard types, which the status controller uses to update the addon status:

- **ForwardFailure** reason corresponds to **Degraded** status type.
- **ForwardSuccessful** reason corresponds to **Available** status type.
- **UpdateFailure** reason corresponds to **Degraded** status type.
- **UpdateSuccesful** reason corresponds to **Progressing** status type.

Finally, individual state details of each collector can be explicitly set in the message field of the aggregated condition.
We set a priority for the reasons to ensure that the most critical reason is reflected in the aggregated state. The priority is as follows:

1. **UpdateFailure**
2. **ForwardFailure**
3. **UpdateSuccesful**
4. **ForwardSuccessful**

The aggregated state is then determined by the highest priority reason. For example, if the reason of the **MetricsCollectorStatus** condition is **ForwardFailure** and the reason of the **UwlMetricsCollectorStatus** condition is **ForwardSuccessful**, the aggregated reason is **ForwardFailure** and the aggregated type is **Degraded**.
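
The priority rule and the reason-to-type mapping above can be sketched in Go as follows. This is a minimal illustration, not the operator's actual code: the `Reason` type, the `priority` and `reasonToType` tables, and the `Aggregate` function are hypothetical names, and the sketch assumes both collectors report one of the four reasons listed above.

```go
package main

import "fmt"

// Reason is a hypothetical type for the condition reasons named in this document.
// The spelling "UpdateSuccesful" follows the document.
type Reason string

const (
	UpdateFailure     Reason = "UpdateFailure"
	ForwardFailure    Reason = "ForwardFailure"
	UpdateSuccesful   Reason = "UpdateSuccesful"
	ForwardSuccessful Reason = "ForwardSuccessful"
)

// priority ranks the reasons as in the list above: a lower number wins.
var priority = map[Reason]int{
	UpdateFailure:     1,
	ForwardFailure:    2,
	UpdateSuccesful:   3,
	ForwardSuccessful: 4,
}

// reasonToType maps each reason to the standard condition type it leads to.
var reasonToType = map[Reason]string{
	UpdateFailure:     "Degraded",
	ForwardFailure:    "Degraded",
	UpdateSuccesful:   "Progressing",
	ForwardSuccessful: "Available",
}

// Aggregate returns the highest priority reason of the two collectors and
// the standard type it maps to. It assumes both inputs are known reasons.
func Aggregate(metricsCollector, uwlMetricsCollector Reason) (Reason, string) {
	aggregated := metricsCollector
	if priority[uwlMetricsCollector] < priority[metricsCollector] {
		aggregated = uwlMetricsCollector
	}
	return aggregated, reasonToType[aggregated]
}

func main() {
	reason, condType := Aggregate(ForwardFailure, ForwardSuccessful)
	fmt.Println(reason, condType) // prints "ForwardFailure Degraded"
}
```

Keeping the ranking in a lookup table rather than in nested conditionals means a new reason would only need one more entry in each map.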

Finally, individual state details of each collector can be explicitly set in the message field of the aggregated condition by the status controller.

Sequentially, at the boot stage with both collectors, this would result in:

Expand All @@ -108,9 +103,10 @@ sequenceDiagram
participant HubAddon
participant ObservatoriumAPI
EndpointController->>MetricsCollector: Deploys
EndpointController->>SpokeAddon: Updates MetricsCollectorStatus condition with UpdateSuccesful
EndpointController->>UwlMetricsCollector: Deploys
EndpointController->>SpokeAddon: Creates MetricsCollectorStatus and UwlMetricsCollectorStatus conditions with UpdateSuccesful
StatusController->>SpokeAddon: Updates the Progressing condition with UpdateSuccesful
EndpointController->>SpokeAddon: Updates UwlMetricsCollectorStatus condition with UpdateSuccesful
StatusController->>SpokeAddon: Updates the standard Progressing condition with UpdateSuccesful
StatusController->>HubAddon: Replicates spoke addon status
MetricsCollector->>ObservatoriumAPI: Forwards metrics successfully
MetricsCollector->>SpokeAddon: Updates MetricsCollectorStatus condition with ForwardSuccessful
@@ -126,51 +122,42 @@ Excluded solutions:

- Making the metrics collector aware of the UWL metrics collector. This approach leads to additional complexity with more reasons, more transition rules between states, and tighter coupling between the two components. Additionally, it would cause the metrics collector to restart with each configuration change of the UWL metrics collector (activation/deactivation).

### Reason Managers

### Ensuring Consistent Individual States
The following table maps the reasons to the actors that manage them and the condition types they update:

At this stage, individual states are still subject to flapping, as is the aggregated state. This is especially true when the endpoint operator fails to update a collector while it is still forwarding metrics, causing the state to flip between **Degraded** and **Available**.
| Condition Type \ Actor | Endpoint Operator | Status Controller | Metrics Collector | UWL Metrics Collector |
|----------------|-------------------|-------------------|-------------------|-----------------------|
| MetricsCollectorStatus | UpdateSuccesful <br />UpdateFailure | | ForwardSuccessful <br />ForwardFailure | |
| UwlMetricsCollectorStatus | UpdateSuccesful <br />UpdateFailure | | | ForwardSuccessful <br />ForwardFailure |
| Progressing | | UpdateSuccesful | | |
| Available | | ForwardSuccessful | | |
| Degraded | Disabled <br /> NotSupported | ForwardFailure <br /> UpdateFailure | | |

To ensure a consistent state, we apply strict transition rules between states. Transitioning to a new state requires verifying the reason for the current state. If the reason is not compatible with the new reason/state, the transition is not allowed. For example, if the reason for the current state is **Degraded** due to **UpdateFailure**, the transition to **Available** is not permitted. The endpoint operator must first update the collector successfully and transition to **Progressing** with **UpdateSuccesful** before the collector can transition to **Available** with **ForwardSuccessful**.
### Ensuring Consistent Individual States

This is illustrated in the state diagrams below with following notations:
At this stage, individual states are still subject to flapping, as is the aggregated state. This is especially true when the endpoint operator fails to update a collector while it is still forwarding metrics, causing the reason to flip between **UpdateFailure** and **ForwardSuccessful**. The aggregated state would then flip between **Degraded** and **Available**.

- Reasons for the transitions are the edges.
- States are the nodes.
- When a transition is only possible under certain conditions, the condition is in square brackets. e.g. `UpdateSuccesful [UpdateFailure]` means that the transition is only possible if the reason for the preceding transition is `UpdateFailure`.
To ensure a consistent state, we apply strict transition rules between reasons. If the current reason is not compatible with the new reason/state, the transition is not allowed. For example, if the current reason for the condition type **MetricsCollectorStatus** is **UpdateFailure**, the transition to **ForwardSuccessful** is not permitted. The endpoint operator must first update the collector successfully and transition to **UpdateSuccesful** before the collector can transition to **ForwardSuccessful**.

### Endpoint metrics operator
The following state diagram illustrates the allowed transitions between reasons:

```mermaid
stateDiagram-v2
[*] --> Progressing: endpoint-operator starts
Progressing --> Degraded: UpdateFailure
Progressing --> Disabled: DisabledMetrics
Available --> Progressing: UpdateSuccesful
Degraded --> Disabled: DisabledMetrics
Degraded --> Progressing: UpdateSuccesful
Progressing --> NotSupported: NotSupported
Available --> Degraded: UpdateFailure
Available --> Disabled: DisabledMetrics
UpdateSuccesful --> ForwardSuccessful
ForwardSuccessful --> UpdateSuccesful
UpdateFailure --> UpdateSuccesful
UpdateSuccesful --> UpdateFailure
ForwardSuccessful --> UpdateFailure
ForwardFailure --> UpdateFailure
ForwardFailure --> ForwardSuccessful
ForwardFailure --> UpdateSuccesful
ForwardSuccessful --> ForwardFailure
```

Notes:

- The transition `Available -> Progressing` is only triggered when the **deployment** resource is updated or created. For example, we avoid flipping the state if only the service resource is updated, to prevent unnecessary and confusing state changes.
- To keep the state diagram readable, we have omitted unlocking transitions from the supposed final state `NotSupported` to `Disabled`, `Progressing` and `Degraded`. In practice, this state may transition if the cluster type is incorrectly identified and subsequently fixed. Also transitions from `Disabled` to `Progressing` or `Degraded` are omitted for the same reason.
- The transitions toward the reason **UpdateSuccesful** are only triggered when the **deployment** resource is updated or created. For example, we avoid flipping the state if only the service resource is updated, to prevent unnecessary and confusing state changes.
- To keep the state diagram readable, we have omitted the **NotSupported** and **Disabled** reasons. The endpoint operator sets them directly on the standard **Degraded** type. In that case, the collectors are not deployed, so there are no individual collector states to aggregate. Transitions toward these reasons are unrestricted. Transitions from these reasons are restricted to **UpdateSuccesful** or **UpdateFailure**.
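
The reason-to-reason edges of the state diagram above (for example `UpdateSuccesful --> ForwardSuccessful`) can be encoded as a lookup table. The sketch below is only an illustration under assumptions: `Reason`, `allowed`, and `CanTransition` are hypothetical names, only the nine edges drawn in the diagram are encoded, and the omitted **NotSupported** and **Disabled** reasons are left out.

```go
package main

import "fmt"

// Reason is a hypothetical type for the collector reasons named in this document.
type Reason string

const (
	UpdateFailure     Reason = "UpdateFailure"
	ForwardFailure    Reason = "ForwardFailure"
	UpdateSuccesful   Reason = "UpdateSuccesful"
	ForwardSuccessful Reason = "ForwardSuccessful"
)

// allowed lists, for each current reason, the reasons it may transition to.
// It mirrors the edges of the state diagram and nothing more.
var allowed = map[Reason]map[Reason]bool{
	UpdateSuccesful:   {ForwardSuccessful: true, UpdateFailure: true},
	UpdateFailure:     {UpdateSuccesful: true},
	ForwardSuccessful: {UpdateSuccesful: true, UpdateFailure: true, ForwardFailure: true},
	ForwardFailure:    {UpdateFailure: true, ForwardSuccessful: true, UpdateSuccesful: true},
}

// CanTransition reports whether the diagram contains an edge from the
// current reason to the new one. Unknown reasons are never allowed.
func CanTransition(current, next Reason) bool {
	return allowed[current][next]
}

func main() {
	// A collector cannot report success while the last update failed.
	fmt.Println(CanTransition(UpdateFailure, ForwardSuccessful)) // prints "false"
	// The endpoint operator must first update the collector successfully.
	fmt.Println(CanTransition(UpdateFailure, UpdateSuccesful)) // prints "true"
}
```

A writer of a collector condition would call such a check before replacing the current reason, and skip the update when it returns `false`.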

### Both Metrics collectors

```mermaid
stateDiagram-v2
[*] --> Progressing: metrics-collector starts
Progressing --> Degraded: ForwardFailure
Progressing --> Available: ForwardSuccessful
Degraded --> Available: ForwardSuccessful [ForwardFailure]
Available --> Degraded: ForwardFailure
```

Notes:

- The transition `Degraded -> Available` can only happen if the metrics collector is responsible for the degraded state, i.e. the reason for the degraded state is `ForwardFailure`.
- The transition `Available -> Degraded` only happens if the metrics collector fails to forward metrics over a certain period of time to avoid flapping state.
