focus on reasons

Signed-off-by: Thibault Mange <22740367+thibaultmg@users.noreply.github.com>
stolostron · Jun 19, 2024 · d4a60f2
1 parent 78c4738
Showing 1 changed file with 57 additions and 70 deletions.
diff --git a/operators/endpointmetrics/doc/design/addon-status.md b/operators/endpointmetrics/doc/design/addon-status.md
@@ -13,22 +13,15 @@ The status is reported through the standard `status.Conditions` structure in the
 - **reason**: A short, machine-readable string that gives the reason for the condition's last transition.
 - **message**: A human-readable message providing details about the condition.
 
-In our case, we use the following types of conditions:
+[Three standard types](https://open-cluster-management.io/concepts/addon#add-on-healthiness) of conditions are used to report the status of the addon:
 
 - **Available**: The addon is available and running, including the UWL metrics collector if configured.
 - **Degraded**: The addon is not working as expected. Either metrics are not being forwarded or the addon update has failed.
-- **Disabled**: The addon is disabled.
-- **NotSupported**: The addon is not supported on the current platform. This happens when prometheus is not being installed but the cluster is not OCP.
 - **Progressing**: The addon is being installed or has been updated.
 
-Finally, the types can be associated to the following reasons:
+Only one of these standard types can have the `True` status at a time. They are mutually exclusive. 
 
-- **DisabledMetrics**: The metrics collector is disabled. It leads to the `Disabled` status. 
-- **ForwardFailure**: The metrics collector failed to forward metrics. It leads to the `Degraded` status.
-- **ForwardSuccessful**: The metrics collector successfully forwarded metrics. It leads to the `Available` status.
-- **NotSupported**: The platform is not supported. It leads to the `NotSupported` status.
-- **UpdateFailure**: The endpoint operator failed to update the metrics collector. It leads to the `Degraded` status.
-- **UpdateSuccesful**: The endpoint operator successfully updated the metrics collector. It leads to the `Progressing` status.
+Each of these types is accompanied by a standard reason that explains the condition's last transition. The reasons are defined later in this document.
 
 ## Condition List Management
 
@@ -38,24 +31,30 @@ If the most recent condition is identical to the new condition, the status contr
 
 ## Status Synchronization
 
-The following diagram shows the actors involved in the status synchronization process and their high level interactions:
+The diagram below shows the actors involved in the status synchronization process and their high-level interactions.
+
+The hub cluster also deploys the metrics-collector, but it does not rely on the addon feature. Thus, the addon status reporting described here does not apply to the hub cluster addon.
 
 ```mermaid
 sequenceDiagram
     participant EndpointController
     participant MetricsCollector
     participant UwlMetricsCollector
     participant StatusController
-    participant SpokeAddon
-    participant HubAddon
-    EndpointController--)SpokeAddon: Updates 
-    MetricsCollector--)SpokeAddon: Updates 
-    UwlMetricsCollector--)SpokeAddon: Updates 
-    StatusController--)SpokeAddon: Watches and updates 
-    StatusController--)HubAddon: Replicates spoke addon status
+    participant SpokeObsAddon
+    participant HubObsAddon
+    participant MCO
+    participant ManagedClusterAddOn
+    EndpointController--)SpokeObsAddon: Updates 
+    MetricsCollector--)SpokeObsAddon: Updates 
+    UwlMetricsCollector--)SpokeObsAddon: Updates 
+    StatusController--)SpokeObsAddon: Watches and updates 
+    StatusController--)HubObsAddon: Replicates spoke addon status
+    MCO--)HubObsAddon: Watches
+    MCO--)ManagedClusterAddOn: Replicates HubObsAddon status
 ```
 
-The hub cluster also deploys the metrics-collector, but it does not rely on the addon feature. Thus, addon status reporting does not apply for the hub cluster addon.
+Ultimately, the status displayed in the hub cluster for the observability addon is the status of the **ManagedClusterAddOn** CR named **observability-controller**.
 
 ## Addon State Management
 
@@ -72,29 +71,25 @@ The addon state is an aggregation of the states of the metrics collector and the
 
 ### Defining the Aggregated State Accurately
 
-To accurately reflect the aggregated state, we introduce two additional condition types specific to the collectors: **MetricsCollectorStatus** and **UwlMetricsCollectorStatus**. These conditions are updated by the metrics collector, the UWL metrics collector, and the endpoint controller. The status controller then aggregates these specific conditions to update the standard conditions of the addon status. A predicate function ensures that the status controller is not triggered by its own updates.
-
-These special condition types store the latest reasons for the metrics collector and the UWL metrics collector. These reasons can be mapped to the standard types, which the status controller uses to update the addon status.
-
-In this context, we ignore the **Disabled** and **NotSupported** states, which are set solely by the endpoint operator and do not require aggregation.
+To accurately reflect the aggregated state, we introduce two additional condition types specific to the collectors: **MetricsCollectorStatus** and **UwlMetricsCollectorStatus**. These conditions are updated by the metrics collector, the UWL metrics collector, and the endpoint controller. The status controller then aggregates these specific conditions to update the standard condition types of the addon status. A predicate function ensures that the status controller is not triggered by its own updates.
 
-Below are the individual reasons for the metrics collector and the UWL metrics collector, along with the resulting aggregated reasons:
-
-| Reasons                  | ForwardFailure | ForwardSuccessful | UpdateFailure | UpdateSuccesful | 
-|--------------------------|----------------|--------------|-------------------|-------------------|
-| ForwardFailure           | ForwardFailure | x | x    | x    | 
-| ForwardSuccessful        | ForwardFailure | ForwardSuccessful | x    | x    |
-| UpdateFailure            | ForwardFailure | UpdateFailure | UpdateFailure    | x    |
-| UpdateSuccesful          | ForwardFailure | UpdateSuccesful | UpdateFailure    | UpdateSuccesful    |
-
-Then, the aggregated reasons are mapped to the standard types:
+These special condition types store the latest reasons for the metrics collector and the UWL metrics collector. These reasons can be mapped to the standard types, which the status controller uses to update the addon status:
 
 - **ForwardFailure** reason corresponds to **Degraded** status type.
 - **ForwardSuccessful** reason corresponds to **Available** status type.
 - **UpdateFailure** reason corresponds to **Degraded** status type.
 - **UpdateSuccesful** reason corresponds to **Progressing** status type.
 
-Finally, individual state details of each collector can be explicitly set in the message field of the aggregated condition.
+We set a priority for the reasons to ensure that the most critical reason is reflected in the aggregated state. The priority is as follows:
+
+1. **UpdateFailure**
+2. **ForwardFailure**
+3. **UpdateSuccesful**
+4. **ForwardSuccessful**
+
+The aggregated state is then determined by the highest priority reason. For example, if the reason of the **MetricsCollectorStatus** condition is **ForwardFailure** and the reason of the **UwlMetricsCollectorStatus** condition is **ForwardSuccessful**, the aggregated reason is **ForwardFailure** and the aggregated type is **Degraded**.
+
+Finally, individual state details of each collector can be explicitly set in the message field of the aggregated condition by the status controller.
 
 Sequentially, at the boot stage with both collectors, this would result in:
 
@@ -108,9 +103,10 @@ sequenceDiagram
     participant HubAddon
     participant ObservatoriumAPI
     EndpointController->>MetricsCollector: Deploys
+    EndpointController->>SpokeAddon: Updates MetricsCollectorStatus condition with UpdateSuccesful
     EndpointController->>UwlMetricsCollector: Deploys
-    EndpointController->>SpokeAddon: Creates MetricsCollectorStatus and UwlMetricsCollectorStatus conditions with UpdateSuccesful
-    StatusController->>SpokeAddon: Updates the Progressing condition with UpdateSuccesful
+    EndpointController->>SpokeAddon: Updates UwlMetricsCollectorStatus condition with UpdateSuccesful
+    StatusController->>SpokeAddon: Updates the standard Progressing condition with UpdateSuccesful
     StatusController->>HubAddon: Replicates spoke addon status
     MetricsCollector->>ObservatoriumAPI: Forwards metrics successfully
     MetricsCollector->>SpokeAddon: Updates MetricsCollectorStatus condition with ForwardSuccessful
@@ -126,51 +122,42 @@ Excluded solutions:
 
 - Making the metrics collector aware of the UWL metrics collector. This approach leads to additional complexity with more reasons, more transition rules between states, and tighter coupling between the two components. Additionally, it would cause the metrics collector to restart with each configuration change of the UWL metrics collector (activation/deactivation).
 
+### Reasons managers
 
-### Ensuring Consistent Individual States
+The following table maps the reasons to the actors that manage them and the condition types they update:
 
-At this stage, individual states are still subject to flapping, as is the aggregated state. This is especially true when the endpoint operator fails to update a collector while it is still forwarding metrics, causing the state to flip between **Degraded** and **Available**.
+| Condition Type \ Actor | Endpoint Operator | Status Controller | Metrics Collector | UWL Metrics Collector |
+|----------------|-------------------|-------------------|-------------------|-----------------------|
+| MetricsCollectorStatus | UpdateSuccesful <br />UpdateFailure | | ForwardSuccessful <br />ForwardFailure | | 
+| UwlMetricsCollectorStatus | UpdateSuccesful <br />UpdateFailure | | | ForwardSuccessful <br />ForwardFailure |
+| Progressing | | UpdateSuccesful | | |
+| Available | | ForwardSuccessful | | |
+| Degraded | Disabled <br /> NotSupported | ForwardFailure <br /> UpdateFailure | | |
 
-To ensure a consistent state, we apply strict transition rules between states. Transitioning to a new state requires verifying the reason for the current state. If the reason is not compatible with the new reason/state, the transition is not allowed. For example, if the reason for the current state is **Degraded** due to **UpdateFailure**, the transition to **Available** is not permitted. The endpoint operator must first update the collector successfully and transition to **Progressing** with **UpdateSuccesful** before the collector can transition to **Available** with **ForwardSuccessful**.
+### Ensuring Consistent Individual States
 
-This is illustrated in the state diagrams below with following notations:
+At this stage, individual states are still subject to flapping, as is the aggregated state. This is especially true when the endpoint operator fails to update a collector while it is still forwarding metrics, causing the reason to flip between **UpdateFailure** and **ForwardSuccessful**. The aggregated state would then flip between **Degraded** and **Available**.
 
-- Reasons for the transitions are the edges.
-- States are the nodes.
-- When a transition is only possible under certain conditions, the condition is in square brackets. e.g. `UpdateSuccesful [UpdateFailure]` means that the transition is only possible if the reason for the preceding transition is `UpdateFailure`.
+To ensure a consistent state, we apply strict transition rules between reasons. If the current reason is not compatible with the new reason/state, the transition is not allowed. For example, if the current reason for the condition type **MetricsCollectorStatus** is **UpdateFailure**, the transition to **ForwardSuccessful** is not permitted. The endpoint operator must first update the collector successfully and transition to **UpdateSuccesful** before the collector can transition to **ForwardSuccessful**.
 
-### Endpoint metrics operator
+The following state diagram illustrates the allowed transitions between reasons:
 
 ```mermaid
 stateDiagram-v2
-    [*] --> Progressing: endpoint-operator starts
-    Progressing --> Degraded: UpdateFailure
-    Progressing --> Disabled: DisabledMetrics
-    Available --> Progressing: UpdateSuccesful
-    Degraded --> Disabled: DisabledMetrics
-    Degraded --> Progressing: UpdateSuccesful
-    Progressing --> NotSupported: NotSupported
-    Available --> Degraded: UpdateFailure
-    Available --> Disabled: DisabledMetrics
+    UpdateSuccesful --> ForwardSuccessful
+    ForwardSuccessful --> UpdateSuccesful
+    UpdateFailure --> UpdateSuccesful
+    UpdateSuccesful --> UpdateFailure
+    ForwardSuccessful --> UpdateFailure
+    ForwardFailure --> UpdateFailure
+    ForwardFailure --> ForwardSuccessful
+    ForwardFailure --> UpdateSuccesful
+    ForwardSuccessful --> ForwardFailure
 ```
 
 Notes:
 
-- The transition `Available -> Progressing` is only triggered when the **deployement** resource is updated or created. For example, we avoid flipping the state if only the service resource is updated, to prevent unnecessary and confusing state changes.
-- To keep the state diagram readable, we have omitted unlocking transitions from the supposed final state `NotSupported` to  `Disabled`, `Progressing` and `Degraded`. In practice, this state may transition if the cluster type is incorrectly identified and subsequently fixed. Also transitions from `Disabled` to `Progressing` or `Degraded` are omitted for the same reason.
+- The transitions toward the reason **UpdateSuccesful** are only triggered when the **deployment** resource is updated or created. For example, we avoid flipping the state if only the service resource is updated, to prevent unnecessary and confusing state changes.
+- To keep the state diagram readable, we have omitted the **NotSupported** and **Disabled** reasons. The endpoint operator sets them directly on the standard **Degraded** type. In that case, the collectors are not deployed, so there is no aggregation of individual collector states. Transitions toward these reasons are unrestricted, while transitions from these reasons are restricted to **UpdateSuccesful** or **UpdateFailure**.
 
-### Both Metrics collectors
-
-```mermaid
-stateDiagram-v2
-    [*] --> Progressing: metrics-collector starts
-    Progressing --> Degraded: ForwardFailure
-    Progressing --> Available: ForwardSuccessful
-    Degraded --> Available: ForwardSuccessful [ForwardFailure]
-    Available --> Degraded: ForwardFailure
-```
-
-Notes:
 
-- The transition `Degraded -> Available` can only happen if the metrics collector is responsible for the degraded state, i.e. the reason for the degraded state is `ForwardFailure`.
-- The transition `Available -> Degraded` only happens if the metrics collector fails to forward metrics over a certain period of time to avoid flapping state.
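
The reason priority and the allowed transitions introduced by this change can be sketched in Go as below. This is only an illustrative sketch, not the operator's actual code: the names `Aggregate`, `CanTransition`, `priority`, `reasonToType` and `allowed` are hypothetical, and reason strings are spelled exactly as in the design document. The `allowed` table encodes only the edges drawn in the new state diagram. Same-reason updates and the **Disabled** and **NotSupported** reasons are deliberately left out.

```go
package main

import "fmt"

// Reason names, spelled as in the design document.
const (
	UpdateFailure     = "UpdateFailure"
	ForwardFailure    = "ForwardFailure"
	UpdateSuccesful   = "UpdateSuccesful"
	ForwardSuccessful = "ForwardSuccessful"
)

// priority lists the reasons from most to least critical.
var priority = []string{UpdateFailure, ForwardFailure, UpdateSuccesful, ForwardSuccessful}

// reasonToType maps each reason to the standard condition type it leads to.
var reasonToType = map[string]string{
	UpdateFailure:     "Degraded",
	ForwardFailure:    "Degraded",
	UpdateSuccesful:   "Progressing",
	ForwardSuccessful: "Available",
}

// allowed encodes the edges of the state diagram as allowed[from][to].
var allowed = map[string]map[string]bool{
	UpdateSuccesful:   {ForwardSuccessful: true, UpdateFailure: true},
	ForwardSuccessful: {UpdateSuccesful: true, UpdateFailure: true, ForwardFailure: true},
	UpdateFailure:     {UpdateSuccesful: true},
	ForwardFailure:    {UpdateFailure: true, ForwardSuccessful: true, UpdateSuccesful: true},
}

// Aggregate returns the highest-priority reason among the collectors'
// reasons, together with the standard condition type it maps to.
// It returns empty strings when no known reason is given.
func Aggregate(reasons ...string) (reason, condType string) {
	for _, p := range priority {
		for _, r := range reasons {
			if r == p {
				return p, reasonToType[p]
			}
		}
	}
	return "", ""
}

// CanTransition reports whether a collector condition may move from one
// reason to another according to the state diagram.
func CanTransition(from, to string) bool {
	return allowed[from][to]
}

func main() {
	// MetricsCollectorStatus fails to forward while the UWL collector succeeds.
	reason, condType := Aggregate(ForwardFailure, ForwardSuccessful)
	fmt.Println(reason, condType) // ForwardFailure Degraded

	// A collector cannot report success while the last update failed.
	fmt.Println(CanTransition(UpdateFailure, ForwardSuccessful)) // false
}
```

Keeping the priority and the transition rules as plain data makes each rule easy to compare against the document's list and diagram.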