---
title: Metrics
sidebar_order: 13
---

Sentry supports the emission of pre-aggregated metrics via the `statsd`
envelope item. This emission bypasses relay-side sampling and assumes a
client-side aggregator. In the product these are referred to as "custom metrics"
and internally the system sometimes carries the name "ddm" ([delightful developer
metrics](/delightful-developer-metrics/)). The functionality sits on top of the
generic metrics platform which is also used for span- and transaction-level
metrics.

These metrics are considered "trace seeking", which means that they should attempt
to establish a relationship to the current span or trace.

## Introduction

Custom metrics are sent into the `custom` namespace of the generic metrics platform.
The features exposed from the generic metrics platform to the SDKs are all
metric types (**counters**, **distributions**, **gauges** and **sets**). SDKs
are required to perform local aggregation unless they perform one-shot emissions.

## Aggregator Behavior

Metrics are aggregated into a process-wide aggregator which is required to flush
at intervals that are reasonable for the SDK. These aggregations are lossless
(unless the metric type is already lossy, which for instance applies to sets
and gauges) and are required to support tags. The following general guidelines
exist:

* **10 Second Bucketing:** SDKs are required to bucket into 10 second intervals
  (**rollup in seconds**) which is the current lower bound of metric accuracy.
* **Flush Shift:** SDKs are required to shift the flush interval by `random() * rollup_in_seconds`.
  That shift is determined once per startup to create jittering.
* **Force flush:** an SDK is required to perform force flushing ahead of the scheduled
  time if the memory pressure is too high. There is no rule for this other than
  that SDKs should be tracking abstract aggregation complexity (e.g. a counter
  only carries a single float, whereas a distribution is a float per emission).
* **Background Aggregation:** SDKs are encouraged to defer flushing and aggregation
  to a background thread (a sketch of such a flush loop follows the aggregator
  example below). For SDKs where `fork()` is to be expected, the background
  thread needs to ensure it restarts after fork (e.g. Python).
* **Tagging:** metrics are tagged by key-value pairs, where each key can have more
  than one value. Both keys and values are limited to strings, but the SDK can
  expose other types if it has a reasonable way to stringify them.

In abstract terms the aggregator is recommended to look a bit like this:

```python
import time

class Aggregator:
    def __init__(self):
        self.buckets = {}
    def get_bucket(self, ts):
        # buckets are keyed by the 10 second rollup interval
        return self.buckets.setdefault(int(ts // 10), {})
    def add(self, ty, key, value, unit, tags):
        bucket = self.get_bucket(time.time())
        # metrics within a bucket are keyed by name, unit and sorted tags
        bucket_key = ("%s:%s@%s" % (ty, key, unit), tuple(sorted(tags)))
        if bucket_key in bucket:
            bucket[bucket_key].add(value)
        else:
            # METRIC_TYPES maps "c", "d", "g" and "s" to the classes below
            bucket[bucket_key] = METRIC_TYPES[ty](value)
```
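
For illustration, the flush shift and background flushing described above could be
combined into a flush loop roughly like the following sketch. The `emit` callback is
hypothetical and stands in for building and sending the `statsd` envelope item; a real
implementation also needs locking around the bucket map and handling of `fork()`.

```python
import random
import threading
import time

ROLLUP_IN_SECONDS = 10.0
# Determined once per startup so this process always flushes at the same
# random offset within the 10 second interval.
FLUSH_SHIFT = random.random() * ROLLUP_IN_SECONDS

def flusher_loop(aggregator, emit):
    # Sleep the random shift once, then flush on a steady 10 second cadence.
    time.sleep(FLUSH_SHIFT)
    while True:
        time.sleep(ROLLUP_IN_SECONDS)
        now = time.time()
        # A bucket is finished once its 10 second window has fully elapsed.
        finished = {
            ts: bucket
            for ts, bucket in aggregator.buckets.items()
            if (ts + 1) * ROLLUP_IN_SECONDS <= now
        }
        for ts in finished:
            del aggregator.buckets[ts]
        if finished:
            emit(finished)

# threading.Thread(target=flusher_loop, args=(aggregator, emit), daemon=True).start()
```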

## Normalization

Normalization shall be done with Unicode support if possible, as sketched below.

* **Keys and Tag Keys:** regex `[^a-zA-Z0-9_/.-]+` with replacement character `_`.
* **Tag Values:** regex `[^\w\d_:/@\.{}\[\]$-]+` with no replacement character.
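
A minimal sketch of both rules in Python (the function names are illustrative):

```python
import re

_KEY_RE = re.compile(r"[^a-zA-Z0-9_/.-]+")
_TAG_VALUE_RE = re.compile(r"[^\w\d_:/@\.{}\[\]$-]+")

def normalize_key(key):
    # invalid characters collapse into a single underscore
    return _KEY_RE.sub("_", key)

def normalize_tag_value(value):
    # invalid characters are dropped entirely; \w keeps Unicode word characters
    return _TAG_VALUE_RE.sub("", value)
```

For example, `normalize_key("requests per second")` yields `requests_per_second`.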

## Units

SDKs should permit any string as a unit, but some units are known to the system and
will permit recalculation in the future. They are documented as part of
relay: [`MetricUnit`](https://getsentry.github.io/relay/relay_metrics/enum.MetricUnit.html).
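
For illustration, an emission using a known unit might look like the following,
assuming the Python reference implementation's module layout and the `distribution`
API described below (the metric name and value are made up):

```python
from sentry_sdk import metrics

# `byte` is one of the units known to the system; unknown strings pass through as-is.
metrics.distribution("http.response.body_size", 2048, unit="byte")
```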

## Trace Seeking

When a metric is emitted it needs to find the current trace (trace seeking),
more specifically the current span. From there two things shall happen
unless they are disabled (see the sketch after this list):

* **Common tag attaching:** the tags `transaction`, `release` and `environment`
  shall be automatically added to the metric.
* **Span local aggregation:** the metric shall be attached to the current span
  as a local summary (a form of gauge). For more information see below.
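
A minimal sketch of the common tag attaching, assuming a hypothetical `scope` object
that exposes the current transaction, release and environment (the attribute names
are illustrative, and letting user-supplied tags win is an implementation choice):

```python
def apply_common_tags(tags, scope):
    updated = dict(tags or {})
    for tag in ("transaction", "release", "environment"):
        value = getattr(scope, tag, None)
        # user supplied tags take precedence over the automatic ones
        if value is not None and tag not in updated:
            updated[tag] = value
    return updated
```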

## API

API-wise the SDKs are required to expose static metric functions which are to be
defined in a `metrics` module. This section also documents the aggregation state for
each metric type. Aggregation states are constructed with an initial value, have an `add`
method to add another value and a `serialize` method that returns the values for
the final emission as an array.

### Common Attributes

The following attributes are common to all metrics. They should be passed
via keyword arguments, a configuration struct or whatever is reasonable
in the target language (a usage sketch follows the list):

* `key`: the name of the metric. SDKs are required to normalize this name.
* `value`: the value to be emitted; its meaning is specific to the metric type.
* `unit`: the unit for the metric, see above.
* `tags`: a dictionary or list of tuples of tags and their values.
* `stacklevel`: a utility attribute for SDKs where `n` stack levels shall be
  ignored for code location recording. In languages that have better
  facilities this can be removed.
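
For example, an emission in the style of the Python reference implementation might
look like this (the metric name, tags and `stacklevel` value are made up):

```python
from sentry_sdk import metrics

metrics.incr(
    key="button_click",
    value=1,
    unit="none",
    tags={"browser": "firefox", "region": "eu"},
    # skip one wrapper frame when recording the code location
    stacklevel=1,
)
```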

### Span Aggregation

On a span every metric is aggregated as a gauge. For each metric the following
attributes are retained per span:

* `min`: the minimum value observed.
* `max`: the maximum value observed.
* `count`: the number of emissions.
* `sum`: the sum of all values.
* `tags`: the metric tags and their values.

The local aggregator behaves as such:

```python
class LocalAggregator:
    def __init__(self):
        self._measurements = {}
    def add(self, ty, key, value, unit, tags):
        # Assumes sorted tags
        bucket_key = ("%s:%s@%s" % (ty, key, unit), tags)

        old = self._measurements.get(bucket_key)
        if old is not None:
            v_min, v_max, v_count, v_sum = old
            v_min = min(v_min, value)
            v_max = max(v_max, value)
            v_count += 1
            v_sum += value
        else:
            v_min = v_max = v_sum = value
            v_count = 1
        self._measurements[bucket_key] = (v_min, v_max, v_count, v_sum)
```

### Counters

Counters (`c`) have a single method for emission:

* `incr(key, value: int | float = 1.0, unit = "none", ...)`: increments the
  given key by `value` (one by default) in the bucket.

**Aggregation state:**

```python
class Counter:
    def __init__(self, initial):
        self.value = initial
    def add(self, value):
        self.value += value
    def serialize(self):
        return [self.value]
```

**Span aggregation:** the value is added to the span local aggregator.

### Distributions

A distribution (`d`) holds all values observed. There are multiple methods that
shall exist on an SDK to support common use cases. The basic way to emit a
distribution is the low-level `distribution` function; higher-level methods
depend on the facilities of the language (a combined usage sketch follows the
variants below).

* `distribution(key, value: int | float, unit = "none", ...)`: emits a single
  value into a distribution.

For languages with context managers (e.g. `with` in Python):

* `timing(key, ...)`: this shall measure the time of a code block and emit
  the measured time as a distribution with a time unit (`second`, `millisecond`,
  etc.). The default unit shall be `second` in current SDKs as we are not yet
  normalizing values on the server.

For languages with lambda callback patterns (e.g. `timed(() => { ... })`):

* `timing(key, ..., callback)`: this shall measure the time of the invoked lambda and
  emit the measured time as a distribution with a time unit (`second`, `millisecond`,
  etc.). The default unit shall be `second` in current SDKs as we are not yet
  normalizing values on the server.

For languages with decorators:

* `@timing(key, ...)` or `@timed(key, ...)`: can be used to annotate a function
  so that every time it is invoked, a timing is emitted.
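
For example, the context manager and decorator variants might look like this in a
Python-style SDK (`load_dashboard` and `cleanup_sessions` are placeholder functions):

```python
from sentry_sdk import metrics

def load_dashboard():
    ...  # placeholder for the code being measured

# Context manager form: measures the block and emits a `second` distribution.
with metrics.timing("dashboard.load"):
    load_dashboard()

# Decorator form: every invocation of the function emits one timing value.
@metrics.timing("sessions.cleanup")
def cleanup_sessions():
    ...
```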

**Aggregation state:**

```python
class Distribution:
    def __init__(self, initial):
        self.values = [initial]
    def add(self, value):
        self.values.append(value)
    def serialize(self):
        return self.values
```

**Span aggregation:** the value is added to the span local aggregator.

### Gauges

Gauges (`g`) are generally discouraged. They are often also called "summaries" as they
are similar in nature to distributions, but somewhat lossy. They are however emitted
per span when metric summaries are enabled. As they are hard to aggregate across
different emitters they can be tricky to use properly in practice.

They are emitted by a single function called `gauge`:

* `gauge(key, value: int | float, unit = "none", ...)`: emits a single value into the
  gauge.

**Aggregation state:**

```python
class Gauge:
    def __init__(self, initial):
        self.last = initial
        self.min = initial
        self.max = initial
        self.sum = initial
        self.count = 1
    def add(self, value):
        self.last = value
        self.min = min(self.min, value)
        self.max = max(self.max, value)
        self.sum += value
        self.count += 1
    def serialize(self):
        return (
            self.last,
            self.min,
            self.max,
            self.sum,
            self.count,
        )
```

**Span aggregation:** the value is added to the span local aggregator.

### Sets

Sets (`s`) are used to count unique members. SDKs are supposed to accept at least
integers and strings, and they are encouraged to use CRC32 to fold strings into a
32-bit integer. As such the aggregation state is partially lossy.

They are emitted with a single method:

* `set(key, value: int | string, unit = "none", ...)`: marks a single value as being in the set.

**Aggregation state:**

```python
from zlib import crc32

class Set:
    def __init__(self, initial):
        self.value = {initial}
    def add(self, value):
        self.value.add(value)
    def serialize(self):
        def _hash(x):
            if isinstance(x, str):
                return crc32(x.encode("utf-8")) & 0xFFFFFFFF
            return int(x)
        return [_hash(value) for value in self.value]
```

**Span aggregation:** the value `1` is added to the span local aggregator for new members.

## Serialization

Metrics are emitted in regular flush intervals (every 10 seconds, with a random offset that
is determined once per startup) as envelope items. The item type is `statsd` and the serialization
format is statsd-compatible. Unlike normal statsd, more than one value can be emitted, separated
by colons. These values are the results of calling `serialize` or similar on the aggregation
state.

The format is documented as part of relay: [relay_metrics](https://getsentry.github.io/relay/relay_metrics/index.html).
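
For illustration, a distribution bucket with two values and two tags might serialize
roughly like the following line (the metric name and tags are made up; the relay
documentation above is authoritative for the exact grammar):

```
page.load@second:0.7:1.2|d|#browser:firefox,environment:prod|T1615889440
```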

Additionally, the metric summaries retained on spans shall be emitted as `_metric_summaries`
as documented in [RFC-0123](https://github.com/getsentry/rfcs/blob/main/text/0123-metrics-correlation.md).

## Meta Data

Additionally, SDKs are encouraged to capture location information once per metric. At metric
emission time the SDK shall record which code location emitted the metric and retain it.
For that, the same format as for the [`frame`](/sdk/event-payloads/stacktrace/) of a stack trace
shall be used. `stacklevel` frames shall be skipped when determining that location.
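
A minimal sketch of how `stacklevel` might be applied when resolving the code location
(the helper name is illustrative, and a production SDK would cache the result rather than
walking the stack on every emission):

```python
import inspect
import os

def get_code_location(stacklevel):
    # Skip this helper itself plus `stacklevel` additional frames of
    # SDK internals before reading the caller's frame.
    frame = inspect.stack()[stacklevel + 1]
    return {
        "abs_path": frame.filename,
        "filename": os.path.basename(frame.filename),
        "function": frame.function or "<unknown>",
        "lineno": frame.lineno,
    }
```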

## Reference Implementation

The Python SDK holds a reference implementation of the full feature set in its `metrics.py`
module: [metrics.py](https://github.com/getsentry/sentry-python/blob/master/sentry_sdk/metrics.py).