enhancement: honor span limits when collecting events #93
base: v0.1.x
Conversation
@@ -1022,6 +1022,20 @@ where
let mut otel_data = span.extensions_mut().remove::<OtelData>();
let span_builder = otel_data.as_mut().map(|data| &mut data.builder);

let limits = self.tracer.span_limits();
Could be worth looking at the benchmarks to see what the perf impact of this is. Could we alternatively load this when the layer is initialized, or otherwise lazy-load this value once?
Sure, I'll try to run the benchmarks and get back to you hopefully tomorrow.
> Could we alternatively load this when the layer is initialized, or otherwise lazy-load this value once?

Right now, probably. I'm not sure in the long term, though, because the limits may be changed through a `TracerProvider`, and all `Tracer`s should honor that change even if they were created earlier. Currently, the limits cannot be changed after a provider is created in opentelemetry-sdk, but that may be added later. I do not know of any other `Tracer` implementation, but that could also be a consideration.
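As a rough illustration of the "load once" alternative (the `CachedLimits` name and shape below are hypothetical, not code from this PR), the layer could cache the value on first use; whether that is correct depends on the question above, i.e. whether limits can change after the provider is built:

```rust
use std::sync::OnceLock;

use opentelemetry_sdk::trace::SpanLimits;

// Hypothetical per-layer cache: query the tracer's limits only on first use
// and reuse the value for every later event. This trades a lookup per event
// for the assumption that limits never change after the provider is built.
struct CachedLimits {
    cell: OnceLock<SpanLimits>,
}

impl CachedLimits {
    fn new() -> Self {
        Self { cell: OnceLock::new() }
    }

    fn get(&self, load: impl FnOnce() -> SpanLimits) -> &SpanLimits {
        self.cell.get_or_init(load)
    }
}
```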
I've added a benchmark with events, see this PR for details.
Current HEAD
otel_many_events/spec_baseline
time: [19.951 µs 19.969 µs 19.991 µs]
change: [-0.2111% +0.0843% +0.3784%] (p = 0.58 > 0.05)
No change in performance detected.
Found 11 outliers among 100 measurements (11.00%)
3 (3.00%) high mild
8 (8.00%) high severe
otel_many_events/no_data_baseline
time: [48.757 µs 48.803 µs 48.859 µs]
change: [+0.1993% +0.4747% +0.7119%] (p = 0.00 < 0.05)
Change within noise threshold.
Found 14 outliers among 100 measurements (14.00%)
4 (4.00%) high mild
10 (10.00%) high severe
otel_many_events/data_only_baseline
time: [68.263 µs 68.286 µs 68.313 µs]
change: [+0.2342% +0.5605% +0.8355%] (p = 0.00 < 0.05)
Change within noise threshold.
Found 7 outliers among 100 measurements (7.00%)
4 (4.00%) high mild
3 (3.00%) high severe
otel_many_events/full_filtered
time: [251.47 µs 251.64 µs 251.83 µs]
change: [+82.121% +82.453% +82.717%] (p = 0.00 < 0.05)
Performance has regressed.
Found 11 outliers among 100 measurements (11.00%)
3 (3.00%) high mild
8 (8.00%) high severe
otel_many_events/full_not_filtered
time: [280.82 µs 280.92 µs 281.02 µs]
change: [+2.0043% +2.1354% +2.2589%] (p = 0.00 < 0.05)
Performance has regressed.
Found 15 outliers among 100 measurements (15.00%)
2 (2.00%) low severe
3 (3.00%) low mild
6 (6.00%) high mild
4 (4.00%) high severe
This PR
otel_many_events/spec_baseline
time: [19.504 µs 19.520 µs 19.540 µs]
change: [-2.4723% -2.1766% -1.8365%] (p = 0.00 < 0.05)
Performance has improved.
Found 18 outliers among 100 measurements (18.00%)
8 (8.00%) high mild
10 (10.00%) high severe
otel_many_events/no_data_baseline
time: [48.471 µs 48.558 µs 48.669 µs]
change: [-0.7518% -0.6127% -0.4556%] (p = 0.00 < 0.05)
Change within noise threshold.
Found 16 outliers among 100 measurements (16.00%)
3 (3.00%) high mild
13 (13.00%) high severe
otel_many_events/data_only_baseline
time: [67.843 µs 67.891 µs 67.947 µs]
change: [-0.6862% -0.5205% -0.3061%] (p = 0.00 < 0.05)
Change within noise threshold.
Found 10 outliers among 100 measurements (10.00%)
1 (1.00%) low mild
5 (5.00%) high mild
4 (4.00%) high severe
otel_many_events/full_filtered
time: [136.64 µs 136.72 µs 136.81 µs]
change: [-45.679% -45.580% -45.464%] (p = 0.00 < 0.05)
Performance has improved.
Found 13 outliers among 100 measurements (13.00%)
2 (2.00%) high mild
11 (11.00%) high severe
otel_many_events/full_not_filtered
time: [292.62 µs 292.72 µs 292.82 µs]
change: [+4.1160% +4.2252% +4.3375%] (p = 0.00 < 0.05)
Performance has regressed.
Found 7 outliers among 100 measurements (7.00%)
1 (1.00%) low severe
1 (1.00%) low mild
3 (3.00%) high mild
2 (2.00%) high severe
The benchmarks do not seem to be very consistent, though. It might just be due to background workload on my machine, but the first benchmarks, which do not depend on the changes at all, show wildly different results, so it would be great if someone else could also run them in a more controlled environment.
The default limit is 128 events and the benchmark generates 1000, so the filtered result has much better performance because it does not need to process the events that would be discarded anyway. The non-filtered benchmark raises the limit to 1000 so nothing is filtered; there, the change costs about a 4% penalty.
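For context, here is a minimal sketch of how the event limit used by these benchmarks could be raised, assuming an opentelemetry-sdk release whose `trace::Config` exposes `with_max_events_per_span` (builder names have shifted between SDK versions):

```rust
use opentelemetry_sdk::trace::Config;

// Raise the per-span event limit from the default of 128 to 1000 so that,
// as in the "not filtered" benchmark, no events are discarded.
fn benchmark_like_config() -> Config {
    Config::default().with_max_events_per_span(1000)
}
```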
I've converted this into a draft. I've found out there is an issue with the events (#94) and this might be better to finish on top of that, since that moves stuff around. I've also realized that simply ignoring the events will lead to incorrect information in the number of dropped events, so we should at least keep a count somewhere and then inflate the builder with dummy events, or decide that losing this information is worth the potential performance cost, but that remains to be seen.
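To make the dropped-count idea concrete, a hypothetical shape (nothing like this exists in the crate yet) could track how many events the layer discarded so the exported span could still report it in some form:

```rust
// Hypothetical bookkeeping for events discarded by the layer, so that the
// finished span could still report an accurate dropped-events count later.
#[derive(Default)]
struct DroppedEvents {
    count: u32,
}

impl DroppedEvents {
    fn record_one(&mut self) {
        self.count = self.count.saturating_add(1);
    }
}
```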
Motivation
#94 - Multi-threaded tracing drops most of the events
Solution
I have reorganized the code so that the lock on extensions is held, but not while `record` is called, which should solve the deadlock originally solved in #59.
Benchmark
I've used the benchmark from #93; please ignore the "filtered" vs "non_filtered" distinction, for the purposes of this PR they should be identical cases. See results in a later comment, as the code got changed.
It should be ready for review now. The code now discards all events, attributes, and links which would be discarded by otel anyway. One thing I'm particularly unhappy with is the fact that we do not signal that some events have been dropped, but that would require either changes on the otel side so that we could construct a span and set the value directly, or creating dummy values in the events or attributes vectors, where there is a risk that …
Benchmark
It seems that this change has a negative impact on performance when filtering does not kick in. I'll try to see why that is and whether I can fix it with some inlining, or whether I can work better with the optional limits. I feel like I must be doing something wrong, because the checks themselves should not amount to 10 %, even though the benchmarks do seem to fluctuate a lot. I seem to remember that you suggested we load the span limits just once (I can't find that comment in this PR); I'll probably try that and see if that has any effect.
Motivation
The tracing-opentelemetry library uses `SpanBuilder` to create spans and puts events in there, but if there are more events than the configured limit, the extra ones are kept around, ignored by the `Tracer` when the span is actually built, and only discarded then.
This can be an issue especially for long-running spans with a much larger number of events than the limit, as they are effectively just leaking memory.
Solution
I've tried to ask on the OpenTelemetry side (issue) whether this can be achieved without relying on the SDK, but then I found out this dependency already exists and is used for filtering spans.
I've added a method to `PreSampledTracer` to obtain the limits, with a default implementation, since tracers are not required to support limits at all. It might be better not to provide the default implementation and let implementors explicitly declare that they do not support any limits; I'll be happy to change it if you think that would be better. I am not sure if this is the correct place for it, but it felt more correct than creating a new `LimitedTracer` trait and also requiring that trait in `OpenTelemetryLayer`. If this goes through as is, the current trait could be renamed in a breaking change to better convey the meaning.
The limits are then loaded when an event is to be stored in a `SpanBuilder`, and if a limit has been reached, the layer discards the event. I am not sure if there are side-effects later in the function (namely in `Event::record`) which should run regardless of the limits, so the place where the event handling is aborted may need to be changed.
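To make the shape of the change concrete, here is a minimal sketch rather than the PR's actual code: `LimitAwareTracer` and `should_store_event` are hypothetical names, while `SpanLimits` is the opentelemetry-sdk type. A tracer that does not override the default reports no limits, so nothing is ever discarded:

```rust
use opentelemetry_sdk::trace::SpanLimits;

// Hypothetical accessor with a "no limits known" default, mirroring the idea
// of extending the existing tracer trait rather than introducing a new one.
trait LimitAwareTracer {
    fn span_limits(&self) -> Option<SpanLimits> {
        None
    }
}

// Hypothetical check the layer could run before pushing another event into
// the SpanBuilder: keep the event only while the limit has not been reached.
fn should_store_event(limits: Option<&SpanLimits>, events_so_far: usize) -> bool {
    match limits {
        Some(limits) => events_so_far < limits.max_events_per_span as usize,
        None => true,
    }
}
```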