[apm] Add examples using trace_continuation_strategy to sampling docs (#4167)

* first attempt

* update diagrams

* reframe, restructure

* address feedback

* add titles to images, reference in text
colleenmcginnis authored Aug 30, 2024
1 parent a311710 commit 363af38
Showing 6 changed files with 60 additions and 11 deletions.
Binary file modified docs/en/observability/apm/images/dt-sampling-example-1.png
Binary file modified docs/en/observability/apm/images/dt-sampling-example-2.png
Binary file modified docs/en/observability/apm/images/dt-sampling-example-3.png
71 changes: 60 additions & 11 deletions docs/en/observability/apm/sampling.asciidoc
@@ -29,23 +29,69 @@ data might be discarded purely due to chance.

See <<apm-configure-head-based-sampling>> to get started.

**Distributed tracing with head-based sampling**
[float]
[[distributed-tracing-examples]]
===== Distributed tracing

In a distributed trace, the sampling decision is still made when the trace is initiated.
Each subsequent service respects the initial service's sampling decision, regardless of its configured sample rate;
the result is a sampling percentage that matches the initiating service.

In this example, `Service A` initiates four transactions and has a sample rate of `.5` (`50%`).
The sample rates of `Service B` and `Service C` are ignored.
In the example in _Figure 1_, `Service A` initiates four transactions and has a sample rate of `.5` (`50%`).
The upstream sampling decision is respected, so even if the sample rate is defined and is a different
value in `Service B` and `Service C`, the sample rate will be `.5` (`50%`) for all services.

.Upstream sampling decision is respected
image::./images/dt-sampling-example-1.png[Distributed tracing and head based sampling example one]

In this example, `Service A` initiates four transactions and has a sample rate of `1` (`100%`).
Again, the sample rates of `Service B` and `Service C` are ignored.
In the example in _Figure 2_, `Service A` initiates four transactions and has a sample rate of `1` (`100%`).
Again, the upstream sampling decision is respected, so the sample rate for all services will
be `1` (`100%`).

.Upstream sampling decision is respected
image::./images/dt-sampling-example-2.png[Distributed tracing and head based sampling example two]
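
The behavior in _Figure 1_ and _Figure 2_ can be approximated with a short simulation.
This is a minimal sketch, not agent code: the function name, the dictionary of downstream rates, and the fixed seed are all hypothetical and exist only for illustration.

```python
import random


def simulate_head_based_sampling(initiating_rate, downstream_rates, n_traces, seed=0):
    """Count how many of n_traces each service records as sampled.

    Hypothetical helper: the initiating service is labeled "Service A",
    and downstream_rates maps service names to their configured rates.
    """
    rng = random.Random(seed)
    counts = {"Service A": 0}
    counts.update({name: 0 for name in downstream_rates})
    for _ in range(n_traces):
        # The sampling decision is made once, when the trace is initiated.
        sampled = rng.random() < initiating_rate
        if sampled:
            counts["Service A"] += 1
            # Downstream services reuse the upstream decision;
            # their own configured rates are never consulted.
            for name in downstream_rates:
                counts[name] += 1
    return counts


# Configured downstream rates differ from Service A's rate but have no effect.
downstream = {"Service B": 1.0, "Service C": 0.1}
counts = simulate_head_based_sampling(0.5, downstream, n_traces=10_000)
print(counts)
```

Every service reports the same number of sampled traces, and that number tracks the initiating service's rate rather than any downstream rate.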

**OpenTelemetry with head-based sampling**
[float]
===== Trace continuation strategies with distributed tracing

In addition to setting the sample rate, you can also specify which _trace continuation strategy_ to use.
There are three trace continuation strategies: `continue`, `restart`, and `restart_external`.

The *`continue`* trace continuation strategy is the default and behaves similarly to the examples in
the <<distributed-tracing-examples,Distributed tracing section>>.

Use the *`restart_external`* trace continuation strategy on an Elastic-monitored service to start
a new trace if the previous service did not have a `traceparent` header with `es` vendor data.
This can be helpful if a transaction includes an Elastic-monitored service that is receiving requests
from an unmonitored service.

In the example in _Figure 3_, `Service A` is an Elastic-monitored service that initiates four transactions
with a sample rate of `.25` (`25%`). Because `Service B` is unmonitored, the traces started in
`Service A` will end there. `Service C` is an Elastic-monitored service that initiates four transactions
that start new traces with a new sample rate of `.5` (`50%`). Because `Service D` is also an
Elastic-monitored service, the upstream sampling decision defined in `Service C` is respected.
The end result will be three sampled traces.

.Using the `restart_external` trace continuation strategy
image::./images/dt-sampling-continuation-strategy-restart_external.png[Distributed tracing and head based sampling with restart_external continuation strategy]

Use the *`restart`* trace continuation strategy on an Elastic-monitored service to start
a new trace regardless of whether the previous service had a `traceparent` header.
This can be helpful if an Elastic-monitored service is publicly exposed, and you do not
want tracing data to possibly be spoofed by user requests.
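
The three strategies can be summarized as a single decision: continue the incoming trace, or start a new one.
The sketch below is an illustration only and is not taken from any agent. The function name is hypothetical, header names are assumed to be lowercase, and it assumes that the `es` vendor data travels as an `es=` entry in the `tracestate` header that accompanies `traceparent`.

```python
def should_continue_trace(strategy, headers):
    """Return True to continue the incoming trace, False to start a new one.

    Hypothetical helper; `headers` is a dict of lowercase header names.
    """
    has_traceparent = "traceparent" in headers
    # Assumption: Elastic vendor data is an `es=` entry in `tracestate`.
    tracestate = headers.get("tracestate", "")
    has_es_vendor_data = any(
        entry.strip().startswith("es=") for entry in tracestate.split(",")
    )
    if strategy == "continue":
        # Default: respect whatever trace context arrives.
        return has_traceparent
    if strategy == "restart":
        # Always start a new trace, whatever the incoming headers say.
        return False
    if strategy == "restart_external":
        # Continue only when the upstream service is Elastic-monitored.
        return has_traceparent and has_es_vendor_data
    raise ValueError(f"unknown trace continuation strategy: {strategy!r}")


# Example requests (header values are illustrative).
from_elastic = {
    "traceparent": "00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01",
    "tracestate": "es=s:0.25",
}
from_unmonitored = {
    "traceparent": "00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01",
}
print(should_continue_trace("restart_external", from_elastic))
print(should_continue_trace("restart_external", from_unmonitored))
```

Under these assumptions, `restart_external` differs from `continue` only for requests that carry a `traceparent` header without Elastic vendor data.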

In the example in _Figure 4_, `Service A` and `Service B` are Elastic-monitored services that use the
default trace continuation strategy. `Service A` has a sample rate of `.25` (`25%`), and that
sampling decision is respected in `Service B`. `Service C` is an Elastic-monitored service that
uses the `restart` trace continuation strategy and has a sample rate of `1` (`100%`).
Because it uses `restart`, the upstream sample rate is _not_ respected in `Service C` and all four
traces will be sampled as new traces in `Service C`. The end result will be five sampled traces.

.Using the `restart` trace continuation strategy
image::./images/dt-sampling-continuation-strategy-restart.png[Distributed tracing and head based sampling with restart continuation strategy]
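
The totals in _Figure 3_ and _Figure 4_ follow from multiplying the number of traces each service starts by that service's sample rate.
The helper below is a hypothetical worked calculation of the expected counts; real sampling is probabilistic, so observed counts can differ for small numbers of traces.

```python
def expected_sampled(traces_started, sample_rate):
    """Expected number of sampled traces for a service that starts new traces."""
    return traces_started * sample_rate


# Figure 3: Service A starts 4 traces at 25%; Service C restarts 4 traces at 50%.
figure_3_total = expected_sampled(4, 0.25) + expected_sampled(4, 0.5)

# Figure 4: Service A starts 4 traces at 25%; Service C restarts 4 traces at 100%.
figure_4_total = expected_sampled(4, 0.25) + expected_sampled(4, 1.0)

print(figure_3_total, figure_4_total)  # 3.0 5.0
```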

[float]
===== OpenTelemetry

Head-based sampling is implemented directly in the APM agents and SDKs.
The sample rate must be propagated between services and the managed intake service in order to produce accurate metrics.
@@ -54,13 +100,16 @@ OpenTelemetry offers multiple samplers. However, most samplers do not propagate
This results in inaccurate span-based metrics, like APM throughput, latency, and error metrics.

For accurate span-based metrics when using head-based sampling with OpenTelemetry, you must use
a [consistent probability sampler](https://opentelemetry.io/docs/specs/otel/trace/tracestate-probability-sampling/).
a https://opentelemetry.io/docs/specs/otel/trace/tracestate-probability-sampling/[consistent probability sampler].
These samplers propagate the sample rate between services and the managed intake service, resulting in accurate metrics.
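
To see why the propagated sample rate matters, consider how a metric such as throughput could be extrapolated from sampled events.
The following is a minimal sketch under the assumption that each sampled event stands in for `1 / sample_rate` original events; the function name and event shape are hypothetical and do not describe the managed intake service's implementation.

```python
def estimate_throughput(sampled_events, default_rate=1.0):
    """Estimate the original event count from sampled events.

    Hypothetical helper: an event sampled at rate r represents 1 / r events.
    Events without a propagated rate fall back to default_rate.
    """
    return sum(
        1.0 / event.get("sample_rate", default_rate) for event in sampled_events
    )


# Five sampled events, each recorded with a propagated 25% sample rate.
with_rate = [{"name": "GET /", "sample_rate": 0.25} for _ in range(5)]
# The same five events when the sampler does not propagate its rate.
without_rate = [{"name": "GET /"} for _ in range(5)]

print(estimate_throughput(with_rate))     # 20.0
print(estimate_throughput(without_rate))  # 5.0
```

Without the propagated rate, the estimate counts only the events that were kept, which underreports throughput by the sampling factor.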

NOTE: OpenTelemetry does not offer consistent probability samplers in all languages.
[NOTE]
====
OpenTelemetry does not offer consistent probability samplers in all languages.
OpenTelemetry users should consider using tail-based sampling instead.
+
Refer to the documentation of your favorite OpenTelemetry agent or SDK for more information on the availability of consistent probability samplers.
====

[float]
[[apm-tail-based-sampling]]
@@ -99,7 +148,7 @@ and will work with traces sent by either Elastic APM agents or OpenTelemetry SDK
Due to <<apm-open-telemetry-tbs,OpenTelemetry tail-based sampling limitations>> when using https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/processor/tailsamplingprocessor[tailsamplingprocessor], we recommend using APM Server tail-based sampling instead.

[float]
=== Sampled data and visualizations
==== Sampled data and visualizations

A sampled trace retains all data associated with it.
A non-sampled trace drops all <<apm-data-model-spans,span>> and <<apm-data-model-transactions,transaction>> data^1^.
@@ -125,7 +174,7 @@ The {kib} apps that utilize RUM data depend on transaction events,
so non-sampled RUM traces retain transaction data -- only span data is dropped.

[float]
=== Sample rates
==== Sample rates

What's the best sampling rate? Unfortunately, there isn't one.
Sampling is dependent on your data, the throughput of your application, data retention policies, and other factors.