Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[component] Move component status features to its own module #10725

Conversation

TylerHelmuth
Copy link
Member

@TylerHelmuth TylerHelmuth commented Jul 25, 2024

Description

This PR is an attempt are removing component status reporting features from the component module. The technique was to make a new componentstatus.StatusReporter interface that component.Host could optionally implement. This has impacts:

  1. We no longer need a servicetelemetry.TelemetrySettings struct.
  2. Components no longer get access to a ReportStatus function until the component.Start method provides the host.
  3. Components that use sharedcomponent will now only report a status for instances that have been started when an error in start occurs. As an example if an otlpreceiver is used in a traces, metrics, and logs pipeline, when the host make the initial call to Start, say for the traces pipeline, sharedcomponent will only report the error returned from Start for the traces instance, and then continue on to shutdown.

If this solution's general idea is tenable, then I believe it could be broken down into slightly smaller PRs, such as:

  1. Remove the need for servicetelemetry.TelemetrySettings: [service] Remove servicetelemetry.TelemetrySettings #10728
  2. Move status stuff into its own module: [component] Move component status reporting public API to new componentstatus module #10730
  3. Remove ReportStatus from TelemetrySettings and implement the StatusReporter interface: [component] Remove ReportStatus from component.TelemetrySettings #10777

Link to tracking issue

Related to #10413

@TylerHelmuth TylerHelmuth force-pushed the component-remove-reporting-from-struct branch 3 times, most recently from 2ea4302 to 6f486c7 Compare July 25, 2024 19:05
Copy link

codecov bot commented Jul 25, 2024

Codecov Report

Attention: Patch coverage is 81.27660% with 44 lines in your changes missing coverage. Please review.

Project coverage is 92.12%. Comparing base (95902c1) to head (900cebe).
Report is 1 commits behind head on main.

Files Patch % Lines
service/internal/graph/graph.go 77.41% 21 Missing ⚠️
service/service.go 69.56% 0 Missing and 7 partials ⚠️
receiver/otlpreceiver/otlp.go 0.00% 6 Missing ⚠️
internal/sharedcomponent/sharedcomponent.go 87.87% 4 Missing ⚠️
extension/zpagesextension/zpagesextension.go 0.00% 3 Missing ⚠️
processor/processortest/unhealthy_processor.go 50.00% 1 Missing and 1 partial ⚠️
service/internal/status/status.go 97.05% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #10725      +/-   ##
==========================================
- Coverage   92.18%   92.12%   -0.06%     
==========================================
  Files         403      400       -3     
  Lines       18792    18816      +24     
==========================================
+ Hits        17323    17335      +12     
- Misses       1109     1121      +12     
  Partials      360      360              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@TylerHelmuth TylerHelmuth force-pushed the component-remove-reporting-from-struct branch from 6f486c7 to 900cebe Compare July 25, 2024 19:15
Comment on lines 77 to 88
if c.hostWrapper == nil {
c.hostWrapper = &hostWrapper{
host: host,
sources: make([]componentstatus.StatusReporter, 0),
previousEvents: make([]*componentstatus.StatusEvent, 0),
}
}

statusReporter, isStatusReporter := host.(componentstatus.StatusReporter)
if isStatusReporter {
c.hostWrapper.addSource(statusReporter)
}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new technique for sharedcomponent is to build the chain or reporting sources as Start is getting called. Previously reported events are remembered so that if Start is called multiple times, each subsequent StatusReporter and get the previously emitted StatusEvents.

Copy link
Member

@mx-psi mx-psi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This approach seems reasonable to me, I think the ergonomics are acceptable and it allows us to separate the status reporting mechanism from the rest of the component module

extension/extension.go Show resolved Hide resolved
receiver/otlpreceiver/otlp.go Outdated Show resolved Hide resolved
Copy link
Member

@djaglowski djaglowski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The approach looks good to me overall.

component/componentstatus/status.go Outdated Show resolved Hide resolved
internal/sharedcomponent/sharedcomponent.go Outdated Show resolved Hide resolved
internal/sharedcomponent/sharedcomponent.go Outdated Show resolved Hide resolved
@djaglowski
Copy link
Member

One other note - component.InstanceID was added for the sake of component status (though in theory it may be useful otherwise). I think we may want to move it out of component as well.

@TylerHelmuth
Copy link
Member Author

@djaglowski I've moved InstanceID to componentstatus (which in the actually implementation I'll handle by adding it when the content is moved to its own package and deprecating the one in component).

@mx-psi the fact that this move is possible makes me think maybe #10222 is back on the table, but I need to think about it more.

Copy link
Member

@mwear mwear left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The sharedcomponent changes worry me a little bit. I'd like to see if I can test drive it with the health check extension to make sure it is still working as intended. Otherwise, this generally looks good to me.

@TylerHelmuth
Copy link
Member Author

@mwear the key change happens when a component.Start returns a synchronous error, which gets back to the host and ends up terminating the startup process. The other signals for the sharedcomponent won't receive the NewPermanent error, but also they never receiver a Starting status either. They get not statuses reported for them, like another component further down the line that also did not get started.

@mwear
Copy link
Member

mwear commented Jul 26, 2024

What about the recoverable error case? When reporting status from within the component (I guess this is via host now), it should report status for all of the components it represents. All components represented by the shared component must be in the Starting state before the API can be from called from "within" the shared component. If you report RecoverableError during start, but some instances are not Starting, you will get an invalid state transition for the unstarted instances, and those instances will be stuck reporting an invalid status. I'm not sure if this makes sense, but this is my concern with this change. It's a little hard to test this end to end with the health check extension to verify because there are so many changes. In any case, if these words make any sense, and are accurate, we may have to handle this another way.

@mwear
Copy link
Member

mwear commented Jul 26, 2024

It's possible that the way you are handling previous events solves my concerns. I'd still like to find a way to verify this end to end.

mx-psi pushed a commit that referenced this pull request Jul 29, 2024
#### Description
Reorganizes service to not require `servicetelemetry.TelemetrySettings`
and instead depend directly on `component.TelemetrySettings`

Whether or not we move forward with
#10725 I
think this is a useful change for service.

#### Testing
Unit tests
@TylerHelmuth
Copy link
Member Author

@mwear I'll see if I can add a test case in this PR with that example.

@TylerHelmuth
Copy link
Member Author

TylerHelmuth commented Jul 29, 2024

@mwear I've added a very rough e2e test in internal/e2e/status_test.go that tests that a component that is using sharedcomponent does end up eventually reporting all statuses for each instance and that an extension watching for statuses does receive all the events.

This gives me confidence that the statuses do end up in the extensions correctly. How the extension handles the statuses is up to the extension right?

One thing of note is that the existing (in main) graph/sharedinstance implementation runs into an invalid state in the graph's fsm because the graph reports StatusStarting for the current pipeline and the sharedinstance implementation also reports StatusStarting for all instances, one of which was already reported as StatusStarting.

I highlight this as a datapoint that shows that status reporting still needs lots of maturing, and while this PR changes a lot about how status reporting works, I think that is ok since it is still so young. I feel confident that if we do find issues with this implementation later on that we'll be able to make fixes via graph/sharedinstance/componentstatus without needing to break user APIs or component behavior.

@mwear
Copy link
Member

mwear commented Jul 29, 2024

Thanks for the test @TylerHelmuth, this gives me more confidence. I'm aware of the repeat status reporting that was part of the original implementation. The finite state machine is meant to protect and enforce the lifecycle of a component, and invalid state transitions are not necessarily errors, in many cases should be considered a no-op. I know there is some logging around this that I would be in favor or removing because it gives the impression something unexpected happened, when that's not exactly the case. This is discussed in the status reporting documentation under the runtime section.

During runtime a component should not have to keep track of its state. A component should report status as operations succeed or fail and the finite state machine will handle the rest. Changes in status will result in new status events being emitted. Repeat reports of the same status will no-op. Similarly, attempts to make an invalid state transition, such as PermanentError to OK, will have no effect.

@TylerHelmuth
Copy link
Member Author

@mwear do you feel comfortable enough to move forward with this approach via #10730?

@mwear
Copy link
Member

mwear commented Jul 29, 2024

My biggest reservation is that we need the collector to be observable for it to be 1.0. Part of this means having a functioning health check extension. The current replacement requires component status, so I don't think we can call the collector 1.0 without having component status enabled by default. If we are moving this code for organization purposes, that's fine, but I don't think we can skip this work for 1.0 altogether.

@mx-psi
Copy link
Member

mx-psi commented Jul 30, 2024

Part of this means having a functioning health check extension

The healthcheck extension is not in our GA roadmap. I think it's important that we keep working on it, but our current plans don't include this for 1.0 as defined in the GA roadmap

bogdandrutu pushed a commit that referenced this pull request Jul 31, 2024
…ntstatus module (#10730)

#### Description

Duplicates component status reporting features from `component` into a
separate module, `componentstatus`. In a future PR, when
`component.TelemetrySettings.ReportStatus` is removed, I'll update Core
to depend on `componentstatus`.

This work isolates component status public API from `component` and
`extensions`, which will allow us to move forward with their 1.0 work
while component status reporting matures.

<!-- Issue number if applicable -->
#### Link to tracking issue
Related to
#10725

---------

Co-authored-by: Matthew Wear <matthew.wear@gmail.com>
@TylerHelmuth
Copy link
Member Author

Ive created #10777 to complete the transition in core.

Copy link
Contributor

This PR was marked stale due to lack of activity. It will be closed in 14 days.

@github-actions github-actions bot added the Stale label Aug 15, 2024
mx-psi added a commit that referenced this pull request Aug 16, 2024
)

#### Description

This PR removes `ReportStatus` from `component.TelemetrySettings` and
instead expects components to check if their `component.Host` implements
a new `componentstatus.Reporter` interface.

<!-- Issue number if applicable -->
#### Link to tracking issue
Related to
#10725
Related to
#10413

<!--Describe what testing was performed and which tests were added.-->
#### Testing
unit tests and a sharedinstance e2e test.

The contrib tests will fail because this is a breaking change. If we
merge this I and @mwear can commit to updating contrib before the next
release.

---------

Co-authored-by: Pablo Baeyens <pablo.baeyens@datadoghq.com>
Copy link
Contributor

Closed as inactive. Feel free to reopen if this PR is still being worked on.

@github-actions github-actions bot closed this Aug 29, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants