span sampling rules (#231)
* undo non-test related removals from #218

* fix bug

* add additional "sample" test

* clang-format

* glob.{h,cpp}

* class SpanSampler, without using or testing it

* small (premature?) optimization

* TODO: AMEND

* checkpoint: added span rules to Tracer, but not elsewhere

* checkpoint: add span rules to SpanBuffer, but don't use them

* checkpoint: written but untested

* test rule parsing

* obnoxious compiler

* test rule matching

* test span sampling

* clang-format

* test environment variables

* repo-level documentation

* attempt at fixing doc links

* describe glob in the repo-level docs

* unit tests catch bugs!

* a little const correctness

* code comment typo

* temporary files are temporary

* review: trim some docs and add an example

* review: describe globs more concretely

* review: make a future diff smaller
dgoffredo authored Jul 15, 2022
1 parent 83a9d6a commit 3db282f
Showing 23 changed files with 1,253 additions and 38 deletions.
2 changes: 2 additions & 0 deletions BUILD.bazel
@@ -6,6 +6,8 @@ cc_library(
"src/clock.h",
"src/encoder.cpp",
"src/encoder.h",
"src/glob.cpp",
"src/glob.h",
"src/limiter.cpp",
"src/limiter.h",
"src/logger.cpp",
25 changes: 24 additions & 1 deletion doc/configuration.md
@@ -112,7 +112,7 @@ specified as a JSON array of objects.
For more information about the configuration of trace sampling, see
[sampling.md][6].

- **TracerOptions member**: `std::string sampling_rules`
- **TracerOptions member**: `std::string sampling_rules` _(JSON)_
- **JSON property**: `"sampling_rules"` _(array of objects)_
- **Environment variable**: `DD_TRACE_SAMPLING_RULES` _(JSON)_
- **Default value**: `[]`
@@ -253,6 +253,28 @@ made, is propagated between services along the trace in the form of the
order to prevent rejection by peers or other HTTP header policies. This
configuration option is that limit, in bytes.

### Span Sampling Rules
Span sampling rules allow spans to be sent to Datadog that otherwise would be
dropped due to trace sampling.

For more information about the configuration of span sampling, see the [Span
Sampling][11] section of [sampling.md][6].

- **TracerOptions member**: `std::string span_sampling_rules` _(JSON)_
- **JSON property**: `"span_sampling_rules"` _(array of objects)_
- **Environment variable**: `DD_SPAN_SAMPLING_RULES` _(JSON)_
- **Default value**: `[]`

### Span Sampling Rules File
Span sampling rules (see above) can be specified in their own file. The value
of the `DD_SPAN_SAMPLING_RULES_FILE` environment variable is the path to a file
whose contents are the span sampling rules JSON array.

- **Environment variable**: `DD_SPAN_SAMPLING_RULES_FILE`

Note that `DD_SPAN_SAMPLING_RULES_FILE` is ignored when
`DD_SPAN_SAMPLING_RULES` is also in the environment.

- **TracerOptions member**: `uint64_t tags_header_size`
- **JSON property**: `tags_header_size` _(number)_
- **Environment variable**: `DD_TRACE_TAGS_PROPAGATION_MAX_LENGTH`
@@ -267,3 +289,4 @@ configuration option is that limit, in bytes.
[7]: https://github.com/openzipkin/b3-propagation
[8]: https://pubs.opengroup.org/onlinepubs/9699919799/
[9]: https://docs.datadoghq.com/getting_started/tagging/unified_service_tagging
[11]: sampling.md#span-sampling
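
The precedence described in configuration.md (`DD_SPAN_SAMPLING_RULES_FILE` is ignored when `DD_SPAN_SAMPLING_RULES` is also in the environment) can be sketched as follows. This is a minimal illustration, not the tracer's code: `resolve_span_sampling_rules` and `check_precedence` are hypothetical names, the environment lookup is injected so the logic can be exercised without touching the real environment, and falling back to `[]` when the file cannot be read is an assumption.

```cpp
#include <cassert>
#include <fstream>
#include <functional>
#include <sstream>
#include <string>

// Hypothetical helper, not part of dd-opentracing-cpp. Returns the span
// sampling rules JSON, preferring DD_SPAN_SAMPLING_RULES over
// DD_SPAN_SAMPLING_RULES_FILE. `getenv_fn` behaves like std::getenv.
std::string resolve_span_sampling_rules(
    const std::function<const char*(const char*)>& getenv_fn,
    const std::string& fallback = "[]") {
  if (const char* rules = getenv_fn("DD_SPAN_SAMPLING_RULES")) {
    return rules;  // inline rules win; the file variable is ignored
  }
  if (const char* path = getenv_fn("DD_SPAN_SAMPLING_RULES_FILE")) {
    std::ifstream file(path);
    if (file) {
      std::ostringstream contents;
      contents << file.rdbuf();  // the file's contents are the JSON array
      return contents.str();
    }
    // Assumption: an unreadable file falls back to the default value.
  }
  return fallback;
}

// Self-check with a fake environment in which both variables are set.
void check_precedence() {
  const auto fake_env = [](const char* name) -> const char* {
    return std::string(name) == "DD_SPAN_SAMPLING_RULES"
               ? "[{\"service\": \"router\"}]"
               : "/tmp/ignored.json";
  };
  assert(resolve_span_sampling_rules(fake_env) ==
         "[{\"service\": \"router\"}]");
}
```

To consult the real environment, pass a lambda that forwards to `std::getenv` (from `<cstdlib>`), and call `check_precedence()` from a `main` of your own to run the self-check.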
50 changes: 50 additions & 0 deletions doc/sampling.md
@@ -120,5 +120,55 @@ This configuration option has the same meaning as the `DD_TRACE_RATE_LIMIT`
environment variable. Note that the environment variable overrides the
`TracerOptions` field if both are specified.

Span Sampling
-------------
Span sampling is used to select spans to keep even when the enclosing
trace is dropped.

Similar to _trace_ sampling rules, _span_ sampling rules are configured as a
JSON array of objects, where each object may contain the following properties:
```
[{
  "service": <matches the span's service name, or any if absent>,
  "name": <matches the span's operation name, or any if absent>,
  "sample_rate": <the probability of sampling matching spans, or 1.0 if absent>,
  "max_per_second": <limit in spans sampled by this rule each second, or unlimited if absent>
}, ...]
```

The `service` and `name` are glob patterns, where "glob" here means:
- `*` matches any substring, including the empty string,
- `?` matches exactly one of any character, and
- any other character matches exactly one of itself.

Span sampling rules are examined only when the enclosing trace is to be
dropped.

The first span sampling rule that matches a span is used to make a span
sampling decision for that span. If the decision is "keep," then the span is
sent to Datadog despite the enclosing trace having been dropped.

Span sampling rules can be configured [directly][3] or [in a file][4].

For example, consider the following span sampling rules:
```shell
export DD_SPAN_SAMPLING_RULES='[
  {"service": "router", "name": "rack.request", "max_per_second": 2000},
  {"service": "classic-mysql", "name": "mysql2.*"},
  {"service": "authn?", "sample_rate": 0.5}
]'
```
These rules state:

- When a trace is dropped, keep spans whose service name is `router` and whose
operation name is `rack.request`, but keep at most 2000 such spans per
second.
- When a trace is dropped, keep spans whose service name is `classic-mysql` and
whose operation name begins with `mysql2.`.
- When a trace is dropped, keep 50% of spans whose service name is `authn`
followed by another character, e.g. `authny`, `authnj`.

[1]: https://docs.datadoghq.com/tracing/trace_ingestion/mechanisms/?tab=environmentvariables#in-the-agent
[2]: https://docs.datadoghq.com/tracing/setup_overview/proxy_setup/?tab=nginx
[3]: configuration.md#span-sampling-rules
[4]: configuration.md#span-sampling-rules-file
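
The "first matching rule wins" behaviour described in sampling.md can be sketched with the three example rules from that document. This is an illustration under stated assumptions: `SpanRule`, `first_matching_rule`, `example_rules`, and `check_example_rules` are hypothetical names, the matcher is a compact recursive stand-in for the iterative `glob_match` added in src/glob.cpp, and `std::string_view` (C++17) replaces `ot::string_view`.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <string>
#include <string_view>
#include <vector>

// Compact recursive matcher for `*` and `?`, for illustration only.
bool glob_match(std::string_view pattern, std::string_view subject) {
  if (pattern.empty()) return subject.empty();
  if (pattern[0] == '*') {
    // `*` matches the empty string, or absorbs one more subject character.
    return glob_match(pattern.substr(1), subject) ||
           (!subject.empty() && glob_match(pattern, subject.substr(1)));
  }
  if (subject.empty()) return false;
  if (pattern[0] != '?' && pattern[0] != subject[0]) return false;
  return glob_match(pattern.substr(1), subject.substr(1));
}

// Hypothetical rule type mirroring the documented properties and defaults.
struct SpanRule {
  std::string service = "*";
  std::string name = "*";
  double sample_rate = 1.0;
  // NaN stands for "no limit" in this sketch.
  double max_per_second = std::numeric_limits<double>::quiet_NaN();
};

// Return the index of the first rule matching the span, or -1 if none does.
int first_matching_rule(const std::vector<SpanRule>& rules,
                        std::string_view service, std::string_view name) {
  for (std::size_t i = 0; i < rules.size(); ++i) {
    if (glob_match(rules[i].service, service) &&
        glob_match(rules[i].name, name)) {
      return static_cast<int>(i);
    }
  }
  return -1;
}

// The three rules from the DD_SPAN_SAMPLING_RULES example.
std::vector<SpanRule> example_rules() {
  std::vector<SpanRule> rules(3);
  rules[0].service = "router";
  rules[0].name = "rack.request";
  rules[0].max_per_second = 2000;
  rules[1].service = "classic-mysql";
  rules[1].name = "mysql2.*";
  rules[2].service = "authn?";
  rules[2].sample_rate = 0.5;
  return rules;
}

void check_example_rules() {
  const std::vector<SpanRule> rules = example_rules();
  assert(first_matching_rule(rules, "router", "rack.request") == 0);
  assert(first_matching_rule(rules, "classic-mysql", "mysql2.query") == 1);
  assert(first_matching_rule(rules, "authny", "login") == 2);
  assert(first_matching_rule(rules, "authn", "login") == -1);  // `?` needs one character
}
```

Because rules are tried in order, a more specific rule should be listed before a broader one that would also match the same spans.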
74 changes: 61 additions & 13 deletions include/datadog/opentracing.h
@@ -76,19 +76,19 @@ struct TracerOptions {
double sample_rate = std::nan("");
// This option is deprecated, and may be removed in future releases.
bool priority_sampling = true;
// Rules sampling is applied when initiating traces to determine the sampling
// rate. Configuration is specified as a JSON array of objects. Each object
// must have a "sample_rate", while the "name" and "service" fields are
// optional. The "sample_rate" value must be between 0.0 and 1.0 (inclusive).
// Rules are checked in order, so a more specific rule should be specified
// before a less specific rule. Note that if the `sample_rate` field of this
// `TracerOptions` has a non-NaN value, then there is an implicit rule at the
// end of the list that matches any trace unmatched by other rules, and
// applies a sampling rate of `sample_rate`. If no rule matches a trace,
// then "priority sampling" is applied instead, where the sample rate is
// determined by the Datadog trace agent. If any rules are invalid, they are
// ignored. This option is also configurable as the environment variable
// DD_TRACE_SAMPLING_RULES.
// Rule-based trace sampling is applied when initiating traces to determine
// the sampling rate. Configuration is specified as a JSON array of objects.
// Each object must have a "sample_rate", while the "name" and "service"
// fields are optional. The "sample_rate" value must be between 0.0 and 1.0
// (inclusive). Rules are checked in order, so a more specific rule should
// be specified before a less specific rule. Note that if the `sample_rate`
// field of this `TracerOptions` has a non-NaN value, then there is an
// implicit rule at the end of the list that matches any trace unmatched by
// other rules, and applies a sampling rate of `sample_rate`. If no rule
// matches a trace, then "priority sampling" is applied instead, where the
// sample rate is determined by the Datadog trace agent. If any rules are
// invalid, they are ignored. This option is also configurable as the
// environment variable DD_TRACE_SAMPLING_RULES.
std::string sampling_rules = "[]";
// Max amount of time to wait between sending traces to agent, in ms. Agent discards traces older
// than 10s, so that is the upper bound.
@@ -156,6 +156,54 @@ struct TracerOptions {
// serialized tags allowed. Trace-wide tags whose serialized length exceeds
// this limit are not propagated.
uint64_t tags_header_size = 512;
// Rule-based span sampling, which is distinct from rule-based trace
// sampling, is used to determine which spans to keep, if any, when trace
// sampling decides to drop the trace.
// When the trace is to be dropped, each span is matched against the
// `span_sampling_rules`. For each span, the first rule to match, if any,
// applies to the span and a span-specific sampling decision is made. If the
// decision for the span is to keep, then the span is sent to Datadog even
// though the enclosing trace is not.
// `span_sampling_rules` is a JSON array of objects, where each object has
// the following shape:
//
//     {
//       "service": <pattern>,
//       "name": <pattern>,
//       "sample_rate": <number between 0.0 and 1.0>,
//       "max_per_second": <positive number>
//     }
//
// The properties mean the following:
//
// - "service" is a glob pattern that must match a span's service name in
// order for the rule to match. If "service" is not specified, then its
// default value is "*". Glob patterns are described below.
// - "name" is a glob pattern that must match a span's operation name in
// order for the rule to match. If "name" is not specified, then its default
// value is "*". Glob patterns are described below.
// - "sample_rate" is the probability that a span matching the rule will be
// kept. If "sample_rate" is not specified, then its default value is 1.0.
// - "max_per_second" is the maximum number of spans that will be kept on
// account of this rule each second. Spans that would cause the limit to
// be exceeded are dropped. If "max_per_second" is not specified, then
// there is no limit.
//
// Glob patterns are a simplified form of regular expressions. Certain
// characters in a glob pattern have special meaning:
//
// - "*" matches any substring, including the empty string.
// - "?" matches exactly one instance of any character.
// - Other characters match exactly one instance of themselves.
//
// For example:
//
// - The glob pattern "foobar" is matched by "foobar" only.
// - The glob pattern "foo*" is matched by "foobar", "foo", and "fooop", but
// not by "fond".
// - The glob pattern "a?b*e*" is matched by "amble" and "albedo", but not by
// "albino".
std::string span_sampling_rules = "[]";
};

// TraceEncoder exposes the data required to encode and submit traces to the
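
The `TracerOptions` comment describes two per-rule controls: `sample_rate` (the probability of keeping a matching span) and `max_per_second` (a cap on spans kept by the rule each second). One way these might combine is sketched below. Everything here is an assumption made for illustration: `SpanRuleSampler` is a hypothetical class, the order of the two checks is a guess, and a fixed one-second window stands in for the tracer's own limiter (src/limiter.h), which this page does not show.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Hypothetical per-rule sampler; not the library's implementation.
class SpanRuleSampler {
 public:
  SpanRuleSampler(double sample_rate, double max_per_second)
      : sample_rate_(sample_rate), max_per_second_(max_per_second) {}

  // `draw` is a uniform random number in [0, 1), and `now_seconds` is the
  // current time in whole seconds. Both are passed in so that the sketch is
  // deterministic.
  bool keep(double draw, std::int64_t now_seconds) {
    if (draw >= sample_rate_) return false;        // lost the probabilistic draw
    if (std::isnan(max_per_second_)) return true;  // no limit configured
    if (now_seconds != window_) {                  // a new one-second window
      window_ = now_seconds;
      kept_in_window_ = 0;
    }
    if (kept_in_window_ + 1 > max_per_second_) return false;  // over the limit
    ++kept_in_window_;
    return true;
  }

 private:
  double sample_rate_;
  double max_per_second_;  // NaN means "no limit"
  std::int64_t window_ = -1;
  std::int64_t kept_in_window_ = 0;
};

void check_limit_and_rate() {
  SpanRuleSampler limited(1.0, 2.0);  // keep everything, at most 2 per second
  assert(limited.keep(0.3, 100));
  assert(limited.keep(0.3, 100));
  assert(!limited.keep(0.3, 100));  // third span in the same second is dropped
  assert(limited.keep(0.3, 101));   // the next second starts a fresh window
}
```

Using NaN for "no limit" mirrors the `std::isnan(limit)` check in src/pending_trace.cpp, which only emits the `_dd.span_sampling.max_per_second` metric when a limit is configured.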
59 changes: 59 additions & 0 deletions src/glob.cpp
@@ -0,0 +1,59 @@
#include "glob.h"

#include <cstddef>  // std::size_t

namespace datadog {
namespace opentracing {

bool glob_match(ot::string_view pattern, ot::string_view subject) {
  // This is a backtracking implementation of the glob matching algorithm.
  // The glob pattern language supports `*` and `?`, but no escape sequences.
  //
  // Based off of a Go example in <https://research.swtch.com/glob> accessed
  // February 3, 2022.

  using Index = std::size_t;
  Index p = 0;       // [p]attern index
  Index s = 0;       // [s]ubject index
  Index next_p = 0;  // next [p]attern index
  Index next_s = 0;  // next [s]ubject index

  while (p < pattern.size() || s < subject.size()) {
    if (p < pattern.size()) {
      const char pattern_char = pattern[p];
      switch (pattern_char) {
        case '*':
          // Try to match at `s`. If that doesn't work out, restart at
          // `s + 1` next.
          next_p = p;
          next_s = s + 1;
          ++p;
          continue;
        case '?':
          if (s < subject.size()) {
            ++p;
            ++s;
            continue;
          }
          break;
        default:
          if (s < subject.size() && subject[s] == pattern_char) {
            ++p;
            ++s;
            continue;
          }
      }
    }
    // Mismatch. Maybe restart.
    if (0 < next_s && next_s <= subject.size()) {
      p = next_p;
      s = next_s;
      continue;
    }
    return false;
  }
  return true;
}

} // namespace opentracing
} // namespace datadog
30 changes: 30 additions & 0 deletions src/glob.h
@@ -0,0 +1,30 @@
#ifndef DD_OPENTRACING_GLOB_H
#define DD_OPENTRACING_GLOB_H

// This component provides a string matching function, `glob_match`, that
// returns whether a specified string matches a specified pattern, where the
// pattern language is the following:
//
// - "*" matches any contiguous substring, including the empty string.
// - "?" matches exactly one instance of any character.
// - Other characters match exactly one instance of themselves.
//
// The patterns are here called "glob patterns," though they are different from
// the patterns used in Unix shells.

#include <opentracing/string_view.h>

namespace ot = opentracing;

namespace datadog {
namespace opentracing {

// Return whether the specified `subject` matches the specified glob `pattern`,
// i.e. whether `subject` is a member of the set of strings represented by the
// glob `pattern`.
bool glob_match(ot::string_view pattern, ot::string_view subject);

} // namespace opentracing
} // namespace datadog

#endif
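
To try the matcher outside the tracer, the algorithm from src/glob.cpp can be ported to `std::string_view`. The sketch below assumes C++17 and replaces `ot::string_view` so that it compiles without the OpenTracing headers; `check_documented_examples` is a hypothetical helper that replays the examples given in the `TracerOptions` comment in include/datadog/opentracing.h.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Same backtracking algorithm as glob.cpp, on std::string_view.
bool glob_match(std::string_view pattern, std::string_view subject) {
  std::size_t p = 0, s = 0;            // current pattern / subject index
  std::size_t next_p = 0, next_s = 0;  // where to resume after a mismatch
  while (p < pattern.size() || s < subject.size()) {
    if (p < pattern.size()) {
      const char c = pattern[p];
      if (c == '*') {
        // Let `*` match nothing for now. On a later mismatch, retry with it
        // absorbing one more subject character.
        next_p = p;
        next_s = s + 1;
        ++p;
        continue;
      }
      if (s < subject.size() && (c == '?' || c == subject[s])) {
        ++p;
        ++s;
        continue;
      }
    }
    // Mismatch: backtrack to the most recent `*`, if there is one.
    if (0 < next_s && next_s <= subject.size()) {
      p = next_p;
      s = next_s;
      continue;
    }
    return false;
  }
  return true;
}

// The examples from the TracerOptions comment.
void check_documented_examples() {
  assert(glob_match("foobar", "foobar"));
  assert(!glob_match("foobar", "foobaz"));
  assert(glob_match("foo*", "foobar"));
  assert(glob_match("foo*", "foo"));
  assert(glob_match("foo*", "fooop"));
  assert(!glob_match("foo*", "fond"));
  assert(glob_match("a?b*e*", "amble"));
  assert(glob_match("a?b*e*", "albedo"));
  assert(!glob_match("a?b*e*", "albino"));
}
```

Only the most recent `*` is remembered for backtracking, which is what keeps the loop free of recursion.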
33 changes: 32 additions & 1 deletion src/pending_trace.cpp
@@ -17,6 +17,9 @@ const std::string event_sample_rate_metric = "_dd1.sr.eausr";
const std::string rules_sampler_applied_rate = "_dd.rule_psr";
const std::string rules_sampler_limiter_rate = "_dd.limit_psr";
const std::string priority_sampler_applied_rate = "_dd.agent_psr";
const std::string span_sampling_mechanism = "_dd.span_sampling.mechanism";
const std::string span_sampling_rule_rate = "_dd.span_sampling.rule_rate";
const std::string span_sampling_limit = "_dd.span_sampling.max_per_second";

// Return whether the specified `span` is without a parent among the specified
// `all_spans_in_trace`.
@@ -71,6 +74,25 @@ void finish_root_span(PendingTrace& trace, SpanData& span) {
finish_span(trace, span);
}

// Determine whether the specified `span` matches a rule in the specified
// `span_sampler` and the sampling decision of that rule is to keep the `span`.
// If so, then add appropriate tags to `span`.
void apply_span_sampling(SpanSampler& span_sampler, SpanData& span) {
  SpanSampler::Rule* const rule = span_sampler.match(span);
  if (!rule || !rule->sample(span)) {
    return;
  }

  // The span matched a span rule, and the rule decided to keep the span.
  // Add span-sampling-specific tags to the span.
  span.metrics[span_sampling_mechanism] = int(SamplingMechanism::SpanRule);
  span.metrics[span_sampling_rule_rate] = rule->config().sample_rate;
  const double limit = rule->config().max_per_second;
  if (!std::isnan(limit)) {
    span.metrics[span_sampling_limit] = limit;
  }
}

} // namespace

PendingTrace::PendingTrace(std::shared_ptr<const Logger> logger, uint64_t trace_id)
@@ -87,7 +109,7 @@ PendingTrace::PendingTrace(std::shared_ptr<const Logger> logger, uint64_t trace_
all_spans(),
sampling_priority(std::move(sampling_priority)) {}

void PendingTrace::finish() {
void PendingTrace::finish(SpanSampler* span_sampler) {
// Apply changes to spans, in particular treating the root / local-root
// span as special.
for (const auto& span : *finished_spans) {
@@ -97,6 +119,15 @@ void PendingTrace::finish() {
finish_span(*this, *span);
}
}

// If we have span sampling rules and are dropping the trace, see if any
// span sampling tags need to be added.
if (span_sampler && !span_sampler->rules().empty() && sampling_priority &&
int(*sampling_priority) <= 0) {
for (const auto& span : *finished_spans) {
apply_span_sampling(*span_sampler, *span);
}
}
}

void PendingTrace::applySamplingDecisionToTraceTags() {
7 changes: 6 additions & 1 deletion src/pending_trace.h
@@ -25,7 +25,12 @@ struct PendingTrace {
PendingTrace(std::shared_ptr<const Logger> logger, uint64_t trace_id,
std::unique_ptr<SamplingPriority> sampling_priority);

void finish();
// Modify span tags in order to prepare this trace for serialization. Use
// the optionally specified `span_sampler` to identify spans to keep should
// this trace be dropped. If `span_sampler` is `nullptr`, then no span
// sampling is performed.
void finish(SpanSampler *span_sampler = nullptr);

// If this tracer did not inherit a sampling decision from an upstream
// service, but instead made a sampling decision, then record that decision
// in the "_dd.p.dm" member of `trace_tags`.