diff --git a/docs/rfcs/component-status-reporting-example.png b/docs/rfcs/component-status-reporting-example.png new file mode 100644 index 00000000000..f8fc154f4e5 Binary files /dev/null and b/docs/rfcs/component-status-reporting-example.png differ diff --git a/docs/rfcs/component-status-reporting-fsm.png b/docs/rfcs/component-status-reporting-fsm.png new file mode 100644 index 00000000000..a81ac4b10d7 Binary files /dev/null and b/docs/rfcs/component-status-reporting-fsm.png differ diff --git a/docs/rfcs/component-status-reporting-runtime-updated.png b/docs/rfcs/component-status-reporting-runtime-updated.png new file mode 100644 index 00000000000..b3635243fba Binary files /dev/null and b/docs/rfcs/component-status-reporting-runtime-updated.png differ diff --git a/docs/rfcs/component-status-reporting-runtime.png b/docs/rfcs/component-status-reporting-runtime.png new file mode 100644 index 00000000000..556d443eb93 Binary files /dev/null and b/docs/rfcs/component-status-reporting-runtime.png differ diff --git a/docs/rfcs/component-status-reporting-update.png b/docs/rfcs/component-status-reporting-update.png new file mode 100644 index 00000000000..0407bd1e7f1 Binary files /dev/null and b/docs/rfcs/component-status-reporting-update.png differ diff --git a/docs/rfcs/component-status-reporting.md b/docs/rfcs/component-status-reporting.md index e43d444a9bb..d7ca9892551 100644 --- a/docs/rfcs/component-status-reporting.md +++ b/docs/rfcs/component-status-reporting.md @@ -19,170 +19,154 @@ How to get from the current to desired behavior is also considered out of scope ## The Collector’s Historical method of reporting component health Until recently, the Collector relied on four ways to report health. -First was the `error` returned by the Component’s Start method. During startup, if any component decided to return an error, the Collector would stop gracefully. +1. The `error` returned by the Component’s Start method. 
During startup, if any component decided to return an error, the Collector would stop gracefully. +2. The `Host.ReportFatalError` method. This method let components tell the Host that something bad happened and the collector needed to shut down. While this method could be used anywhere in the component, it was primarily used with a Component’s Start method to report errors in async work, such as starting a server. + ```golang + if errHTTP := fmr.server.Serve(listener); errHTTP != nil && !errors.Is(errHTTP, http.ErrServerClosed) { + host.ReportFatalError(errHTTP) + } + ``` +3. The error returned by `Shutdown`. This error was indicative that the collector did not cleanly shut down, but did not prevent the shutdown process from moving forward. -Second was `Host.ReportFatalError`. This method let components tell the Host that something bad happened and the collector needed to shut down. While this method could be used anywhere in the component, it was primarily used with a Component’s Start method to report errors in async work, such as starting a server. - -```golang -if errHTTP := fmr.server.Serve(listener); errHTTP != nil && !errors.Is(errHTTP, http.ErrServerClosed) { - host.ReportFatalError(errHTTP) -} -``` -Third was the error returned by `Shutdown`. This error was indicative that the collector did not cleanly shut down, but did not prevent the shutdown process from moving forward. - -Fourth is panicking. During runtime, if the collector experienced an unhandled error, it crashes. +4. Panicking. During runtime, if the collector experienced an unhandled error, it crashes. These are all the way the components in a collector could report that they were unhealthy. -There are two major gaps in the Collector’s historic reporting of component health. First, when a component experienced a transient error, such as an endpoint suddenly not working, the component would simply log the error and return it up the pipeline. 
There was no mechanism for the component to tell the Host or anything else that something was going wrong. Second, when a component experienced an issue it would never be able to recover from, such as receiving a 404 response from an endpoint, the component would log the error and return it up the pipeline. This situation was handle the same as the transient error, which means the component could not tell the Host or anything else that something was wrong, but worse is that the issue would never get better. - -Edit of above: - There are several major gaps in the Collector’s historic reporting of component health. First, many components return recoverable errors from Start, causing the collector to shut down, even though it could recover if it were allowed to keep running. Second, when a component experienced a transient error, such as an endpoint suddenly not working, the component would simply log the error and return it up the pipeline. There was no mechanism for the component to tell the Host or anything else that something was going wrong. Last, when a component experienced an issue it would never be able to recover from, such as receiving a 404 response from an endpoint, the component would log the error and return it up the pipeline. This situation was handled the same way as the transient error, which means the component could not tell the Host or anything else that something was wrong, and worse, the issue would never get better. -Current State of Component Health Reporting -In Collector version v0.87.0 a new feature, component status reporting, was released. This feature created a standard mechanism for Components to report their health to extensions. - -Component status reporting is a collector feature that allows components to report their status (aka health) via status events to extensions. In order for an extension to receive these events it must implement the StatusWatcher interface. 
The collector provides the communication between components and extensions, but does not use or interpret the events itself. -Status Definitions +## Current State of Component Health Reporting -The system defines seven statuses, listed in the table below: +In Collector version v0.87.0 a new feature, component status reporting, was released. This feature created a standard mechanism for Components to report their health to extensions. +Component status reporting is a collector feature that allows components to report their status (aka health) via status events to extensions. In order for an extension to receive these events it must implement the [StatusWatcher interface](https://github.com/open-telemetry/opentelemetry-collector/blob/f05f556780632d12ef7dbf0656534d771210aa1f/extension/extension.go#L54-L63). The collector provides the communication between components and extensions, but does not use or interpret the events itself. -Status -Meaning -Starting -The component is starting. -OK -The component is running without issue. -RecoverableError -The component has experienced a transient error and may recover. -PermanentError -The component has detected a condition at runtime that will need human intervention to fix. The collector will continue to run in a degraded mode. -FatalError -The collector has experienced a fatal error and will shutdown. This is intended to be used as a way to fail-fast during startup. -Stopping -The component is in the process of shutting down. -Stopped -The component has completed shutdown. +### Status Definitions +The system defines seven statuses, listed in the table below: +| **Status** | **Meaning** | +|:----------------:|:--------------------------------------------------------------------------------------------------------------------------------------------------:| +| Starting | The component is starting. | +| OK | The component is running without issue. 
| +| RecoverableError | The component has experienced a transient error and may recover. | +| PermanentError | The component has detected a condition at runtime that will need human intervention to fix. The collector will continue to run in a degraded mode. | +| FatalError | The collector has experienced a fatal error and will shutdown. This is intended to be used as a way to fail-fast during startup. | +| Stopping | The component is in the process of shutting down. | +| Stopped | The component has completed shutdown. | Statuses can be categorized into two groups: lifecycle and runtime. -Lifecycle Statuses -Starting -Stopping -Stopped +**Lifecycle Statuses** +- Starting +- Stopping +- Stopped -Runtime Statuses -OK -RecoverableError -FatalError -PermanentError -Transitioning Between Statuses +**Runtime Statuses** +- OK +- RecoverableError +- FatalError +- PermanentError -There is a finite state machine underlying the status reporting API that governs the allowable state transitions. See the state diagram below: +### Transitioning Between Statuses +There is a finite state machine underlying the status reporting API that governs the allowable state transitions. See the state diagram below: +![component-status-reporting-fsm](component-status-reporting-fsm.png) The finite state machine ensures that components progress through the lifecycle properly and it manages transitions through runtime states so that components do not need to track their state internally. Only changes in status result in new events being generated; repeat reports of the same status are ignored. PermanentError and FatalError are permanent runtime states. A component in these states cannot make any further state transitions. +![component-status-reporting-example](component-status-reporting-example.png) -Automation +### Automation The collector’s service implementation is responsible for starting and stopping components. 
Since it knows when these events occur and their outcomes, it can automate status reporting of lifecycle events for components. -Start +#### Start The collector will report a Starting event when starting a component. If an error is returned from Start, the collector will report a PermanentError event. If start returns without an error and the collector hasn't reported status itself, the collector will report an OK event. -Shutdown +#### Shutdown The collector will report a Stopping event when shutting down a component. If Shutdown returns an error, the collector will report a PermanentError event. If Shutdown completes without an error, the collector will report a Stopped event. Best Practices -Start +#### Start Under most circumstances, a component does not need to report explicit status during component.Start. An exception to this rule is components that start async work (e.g. spawn a go routine). This is because async work may or may not complete before start returns and timing can vary between executions. A component can halt startup by returning an error from start. If start returns an error, automated status reporting will report a PermanentError on behalf of the component. If a component wishes to halt startup from work spawned in a go routine, it can report a FatalError. If start returns without an error automated status reporting will report OK, as long as the component hasn't already reported for itself. -Runtime - +#### Runtime +![component-status-reporting-runtime](component-status-reporting-runtime.png) During runtime a component should not have to keep track of its state. A component should report status as operations succeed or fail and the finite state machine will handle the rest. Changes in status will result in new status events being emitted. Repeat reports of the same status will no-op. Similarly, attempts to make an invalid state transition, such as PermanentError to OK, will have no effect. 
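The runtime behavior described above (repeat reports are no-ops, and invalid transitions such as PermanentError to OK have no effect) can be sketched as a small transition table. This is a minimal illustration, not the collector's implementation: the `Status` constants, the `fsm` type, the `Report` method, and the exact transition set are assumptions drawn from the prose in this document, and the state diagram remains the authoritative source.

```go
package main

// Status is a hypothetical stand-in for the collector's component status type.
type Status int

const (
	StatusNone Status = iota // no status reported yet
	StatusStarting
	StatusOK
	StatusRecoverableError
	StatusPermanentError
	StatusFatalError
	StatusStopping
	StatusStopped
)

// transitions lists the allowed next statuses for each current status.
// Assumption: this set is inferred from the text; PermanentError, FatalError,
// and Stopped have no outgoing transitions and so have no entry.
var transitions = map[Status]map[Status]bool{
	StatusNone: {StatusStarting: true},
	StatusStarting: {
		StatusOK: true, StatusRecoverableError: true, StatusPermanentError: true,
		StatusFatalError: true, StatusStopping: true,
	},
	StatusOK: {
		StatusRecoverableError: true, StatusPermanentError: true,
		StatusFatalError: true, StatusStopping: true,
	},
	StatusRecoverableError: {
		StatusOK: true, StatusPermanentError: true,
		StatusFatalError: true, StatusStopping: true,
	},
	StatusStopping: {
		StatusRecoverableError: true, StatusPermanentError: true,
		StatusFatalError: true, StatusStopped: true,
	},
}

// fsm tracks the current status so that the component does not have to.
type fsm struct {
	current  Status
	onChange func(Status) // called once per accepted status change
}

// Report attempts a transition and returns true if an event was emitted.
// Repeat reports and disallowed transitions leave the state untouched.
func (m *fsm) Report(next Status) bool {
	if next == m.current {
		return false // same status: no new event
	}
	if !transitions[m.current][next] {
		return false // invalid transition: no effect
	}
	m.current = next
	if m.onChange != nil {
		m.onChange(next)
	}
	return true
}
```

In this sketch, reporting `StatusOK` twice in a row emits a single event, and once `StatusPermanentError` has been accepted, a later `Report(StatusOK)` returns false and emits nothing.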
-We intend to define guidelines to help component authors distinguish between recoverable and permanent errors on a per-component type basis and we'll update this document as we make decisions. See this issue for current thoughts and discussions. +We intend to define guidelines to help component authors distinguish between recoverable and permanent errors on a per-component type basis and we'll update this document as we make decisions. See [this issue](https://github.com/open-telemetry/opentelemetry-collector/issues/9957) for current thoughts and discussions. -Shutdown +#### Shutdown A component should never have to report explicit status during shutdown. Automated status reporting should handle all cases. To recap, the collector will report Stopping before Shutdown is called. If a component returns an error from shutdown the collector will report a PermanentError and it will report Stopped if Shutdown returns without an error. -In the Weeds + +### In the Weeds There are a couple of implementation details that are worth discussing for those who work on or wish to understand the collector internals. -component.TelemetrySettings +**component.TelemetrySettings** The API for components to report status is the ReportStatus method on the component.TelemetrySettings instance that is part of the CreateSettings passed to a component's factory during creation. It takes a single argument, a status event. The StatusWatcher interface takes both a component instance ID and a status event. The ReportStatus function is customized for each component and passes along the instance ID with each event. A component doesn't know its instance ID, but its ReportStatus method does. -servicetelemetry.TelemetrySettings +**servicetelemetry.TelemetrySettings** The service gets a slightly different TelemetrySettings object, a servicetelemetry.TelemetrySettings, which references the ReportStatus method on a status.Reporter. 
Unlike the ReportStatus method on component.TelemetrySettings, this version takes two arguments, a component instance ID and a status event. The service uses this function to report status on behalf of the components it manages. This is what the collector uses for the automated status reporting of lifecycle events. -sharedcomponent +**sharedcomponent** -The collector has the concept of a shared component. A shared component is represented as a single component to the collector, but represents multiple logical components elsewhere. The most common usage of this is the OTLP receiver, where a single shared component represents a logical instance for each signal: traces, metrics, and logs (although this can vary based on configuration). When a shared component reports status it must report an event for each of the logical instances it represents. In the current implementation, shared component reports status for all its logical instances during Start and Shutdown. It also modifies the ReportStatus method on component.TelemetrySettings to report status for each logical instance when called. -Open Questions and Proposed Changes -Remove Fatal Error +The collector has the concept of a shared component. A shared component is represented as a single component to the collector, but represents multiple logical components elsewhere. The most common usage of this is the OTLP receiver, where a single shared component represents a logical instance for each signal: traces, metrics, and logs (although this can vary based on configuration). When a shared component reports status it must report an event for each of the logical instances it represents. 
In the current implementation, shared component reports status for all its logical instances during [Start](https://github.com/open-telemetry/opentelemetry-collector/blob/31ac3336d956d93abede6db76453730613e1f076/internal/sharedcomponent/sharedcomponent.go#L89-L98) and [Shutdown](https://github.com/open-telemetry/opentelemetry-collector/blob/31ac3336d956d93abede6db76453730613e1f076/internal/sharedcomponent/sharedcomponent.go#L105-L117). It also [modifies the ReportStatus method](https://github.com/open-telemetry/opentelemetry-collector/blob/31ac3336d956d93abede6db76453730613e1f076/internal/sharedcomponent/sharedcomponent.go#L34-L44) on component.TelemetrySettings to report status for each logical instance when called. -There have been discussions about removing FatalError. The FatalError functionality predates the component status system and was incorporated into it as it provides related functionality. Component Start allows a component to terminate the startup of the collector by returning an error. The FatalError system augments this capability by allowing a component that starts async work in Start to terminate the collector if the async work fails. This is because the async work spawned in Start is indeterminate and not guaranteed to complete before Start returns. We have discussed removing FatalError and using PermanentError in its place. This would be a change in behavior as a PermanentError will not terminate collector execution, but it will result in a status event sent to registered status watchers, which can act on this information in a meaningful way. This would simplify the considerations during Start for component authors and promote more uniformity in outcomes from Start. If we were to remove FatalError before 1.0 we could always decide to add it back in a non-breaking way afterwards. - -Allow Transition from PermanentError to Stopping - -In the absence of a FatalError, this change would allow us to ensure a component always completes its lifecycle. 
A component that is in a PermanentError state will be stopped by the collector and its state should reflect that. This change will allow StatusWatchers to distinguish between a component that is not functioning due to an error vs one that is unavailable because it is in the process of starting or stopping. This was discussed briefly in this issue. - -Remove Status Aggregation from Core - -Also discussed in #10058 was the possibility of removing status aggregation from core. It was initially added to core with the assumption that it could be beneficial to StatusWatchers that wish to use it. It did not end up meeting the needs of the healthcheck v2 extension, which needed to reimplement the logic with additional flexibility. In the extension, aggregation is configuration dependent, leading one to question whether or not there is a one size fits all approach to status aggregation. It makes sense for us to consider removing status aggregation from core for now and adding it back in later after we have a better idea of how it should work. See this issue: https://github.com/open-telemetry/opentelemetry-collector/issues/10058 for more discussion. -Diagram Changes - -In order to visually understand the proposed changes, the diagrams have been redrawn to show what they would look like if we both remove FatalError and allow a component to transition from PermanentError to Stopping. +## Open Questions and Proposed Changes +### Remove Fatal Error +There have been [discussions](https://github.com/open-telemetry/opentelemetry-collector/issues/9823) about removing FatalError. The FatalError functionality predates the component status system and was incorporated into it as it provides related functionality. Component Start allows a component to terminate the startup of the collector by returning an error. The FatalError system augments this capability by allowing a component that starts async work in Start to terminate the collector if the async work fails. 
This is because the async work spawned in Start is indeterminate and not guaranteed to complete before Start returns. We have discussed removing FatalError and using PermanentError in its place. This would be a change in behavior as a PermanentError will not terminate collector execution, but it will result in a status event sent to registered status watchers, which can act on this information in a meaningful way. This would simplify the considerations during Start for component authors and promote more uniformity in outcomes from Start. If we were to remove FatalError before 1.0 we could always decide to add it back in a non-breaking way afterwards. +### Allow Transition from PermanentError to Stopping +In the absence of a FatalError, this change would allow us to ensure a component always completes its lifecycle. A component that is in a PermanentError state will be stopped by the collector and its state should reflect that. This change will allow StatusWatchers to distinguish between a component that is not functioning due to an error vs one that is unavailable because it is in the process of starting or stopping. This was discussed briefly in this [issue](https://github.com/open-telemetry/opentelemetry-collector/issues/10058). +### Remove Status Aggregation from Core +Also discussed in [#10058](https://github.com/open-telemetry/opentelemetry-collector/issues/10058) was the possibility of removing status aggregation from core. It was initially added to core with the assumption that it could be beneficial to StatusWatchers that wish to use it. It did not end up meeting the needs of the healthcheck v2 extension, which needed to reimplement the logic with additional flexibility. In the extension, aggregation is configuration dependent, leading one to question whether or not there is a one size fits all approach to status aggregation. 
It makes sense for us to consider removing status aggregation from core for now and adding it back in later after we have a better idea of how it should work. See this issue: https://github.com/open-telemetry/opentelemetry-collector/issues/10058 for more discussion. +### Diagram Changes +In order to visually understand the proposed changes, the diagrams have been redrawn to show what they would look like if we both remove FatalError and allow a component to transition from PermanentError to Stopping. State Diagram (complete) - +![component-status-reporting-update](component-status-reporting-update.png) State Diagram (runtime) +![component-status-reporting-runtime-updated](component-status-reporting-runtime-updated.png) - -Should component health reporting be an opt-in for collector host & service implementations? +### Should component health reporting be an opt-in for collector host & service implementations? This brings about the risk of implementations not being compatible with status watcher extensions. As an example, when a processor in a pipeline is reporting a permanent error, the Collector Host must, by default, be allowed to continue using that pipeline. The Collector may be configured to handle this situation differently, such as disabling the pipeline and/or stopping a component. -The Goals the Component Health Reporting Should Achieve +## The Goals the Component Health Reporting Should Achieve The following are the goals, as of June 2024 and with Collector 1.0 looming, for a component health reporting system. -Runtime component health reporting, which cannot be automatically reported, must be opt-in for components. Collector components must not be required to use the component health reporting system. This keeps component as compatible as possible with the Collector’s framework as we approach 1.0 -The consumers of the health reporting system must be able to identify which components are and are not using the health reporting system. 
-Component health reporting must be opt-in for collector users. While the underlying components are always allowed to report their health via the system, the Collector Host or any other listener may only take action when the user has configured the collector accordingly. -As one example of compliance, the current health reporting system is dependent on the user configuring an extension that can watch for status updates. -Component health must be representable as a finite state machine with clear transitions between states. -The Collector Host may report statuses Starting, Ok, Stopping and PermanentError on behalf of components as documented in the automation section. -Additional status may be reported in the future -Component health reporting must only be a mechanism for reporting health - it should have no mechanisms for taking actions on the health it reports. How consumers of the health reporting system respond to component updates is not a concern of the health reporting system. +1. Runtime component health reporting, which cannot be automatically reported, must be opt-in for components. Collector components must not be required to use the component health reporting system. This keeps component as compatible as possible with the Collector’s framework as we approach 1.0 + 2. The consumers of the health reporting system must be able to identify which components are and are not using the health reporting system. +2. Component health reporting must be opt-in for collector users. While the underlying components are always allowed to report their health via the system, the Collector Host or any other listener may only take action when the user has configured the collector accordingly. + - As one example of compliance, the current health reporting system is dependent on the user configuring an extension that can watch for status updates. +3. Component health must be representable as a finite state machine with clear transitions between states. +4. 
The Collector Host may report statuses Starting, OK, Stopping and PermanentError on behalf of components as documented in the automation section. + 2. Additional status may be reported in the future +5. Component health reporting must only be a mechanism for reporting health - it should have no mechanisms for taking actions on the health it reports. How consumers of the health reporting system respond to component updates is not a concern of the health reporting system. -Existing deviations from those goals +## Existing deviations from those goals TODO: express how we’ve tied the ability for a component to stop the collector to the existing status reporting system by removing `Host.ReportFatalError` - this is in conflict with the goals that status reporting should be optional Is the goal wrong or is the implementation wrong? @@ -192,15 +176,11 @@ TODO: the collector host is currently always reporting status for the automatic TODO: we don’t currently have an implementation for “The consumers of the health reporting system must be able to identify which components are and are not using the health reporting system.” - -Desired Behavior for 1.0 +## Desired Behavior for 1.0 Component status reporting should not impact our ability to release a 1.0 collector - - - -Reference +## Reference Remove FatalError? 
Looking for opinions either way: https://github.com/open-telemetry/opentelemetry-collector/issues/9823 In order to prioritize lifecycle events over runtime events for status reporting, allow a component to transition from PermanentError -> Stopping: https://github.com/open-telemetry/opentelemetry-collector/issues/10058 Runtime status reporting for components in core: https://github.com/open-telemetry/opentelemetry-collector/issues/9957 @@ -218,7 +198,7 @@ https://github.com/open-telemetry/opentelemetry-collector/pull/6560 Merged https://github.com/open-telemetry/opentelemetry-collector/pull/8169 -Little did I know this comment would change my life forever: https://github.com/open-telemetry/opentelemetry-collector-contrib/issues/8816#issuecomment-1083380460 +Matt: Little did I know this comment would change my life forever: https://github.com/open-telemetry/opentelemetry-collector-contrib/issues/8816#issuecomment-1083380460