Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Startup deadlock when component uses host.ReportFatalError() #8116

Open
jmacd opened this issue Jul 20, 2023 · 6 comments
Open

Startup deadlock when component uses host.ReportFatalError() #8116

jmacd opened this issue Jul 20, 2023 · 6 comments
Assignees
Labels
area:service bug Something isn't working

Comments

@jmacd
Copy link
Contributor

jmacd commented Jul 20, 2023

Describe the bug

If a component tries to call host.ReportFatalError() during startup, the caller will block on the asynchronous error channel.

The asynchronous error channel does not have a consumer at this point, because Start() hasn't finished and we're not running.
So the start is blocked, but many of the components are started and running by now and logs are being written which makes it easy to miss the fact that start never finished. The signal handlers have not been installed, adds additional complication to the user who is trying to understand this scenario.

The documentation says not to use ReportFatalError() in Start(), so this is somewhat mitigated by telling the user to be careful.

Steps to reproduce
Use host.ReportFatalError() during start.

What did you expect to see?
The collector would fail during startup.

What did you see instead?
As described above, a deadlock happens.

What version did you use?
v0.80.0

What config did you use?
Won't paste it, since we understand the cause.

The offending component's incorrect use:
https://github.com/lightstep/telemetry-generator/blob/e86897152cf38767d2538da629858e7680f12894/generatorreceiver/generator_receiver.go#L50

@jmacd
Copy link
Contributor Author

jmacd commented Oct 25, 2023

@mwear This is related to your current work.

@atoulme
Copy link
Contributor

atoulme commented Dec 19, 2023

ReportFatalError is now deprecated in favor of TelemetrySettings.ReportComponentStatus(...), fwiw, so maybe we don't need to fix this problem.

@mwear
Copy link
Member

mwear commented Dec 19, 2023

I think the same underlying issue still exists. I can look into this more and either fix or close this when I have more details. Feel free to assign me.

@yurishkuro
Copy link
Member

Why does a component need to report fatal status during startup via a separate API when it can already return error from Start?

@mwear
Copy link
Member

mwear commented Jan 29, 2024

If a component wishes to return a fatal error from start, and it can do so synchronously, it should just return an error from start. There are a number of components that start async work in go routines and if they want to report a fatal error, they need a separate API. We should audit components for non-async fatals. Now that we have component status reporting, and the notion of a permanent error, we can consider if some fatals should be permanent. The difference is that a permanent error will allow a collector to continue running in a degraded mode, but will require human intervention to fix.

@atoulme
Copy link
Contributor

atoulme commented Jan 29, 2024

I filed #9324 as a related issue - it's not clear to me when it's admissible to return an error, and what type of error, in the Start function.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area:service bug Something isn't working
Projects
None yet
Development

No branches or pull requests

5 participants