[docs] Clean up internal observability docs #10454

Merged
merged 12 commits into from
Jun 28, 2024
2 changes: 1 addition & 1 deletion README.md
@@ -33,7 +33,7 @@
  •  
<a href="https://opentelemetry.io/docs/collector/configuration/">Configuration</a>
&nbsp;&nbsp;&bull;&nbsp;&nbsp;
<a href="docs/monitoring.md">Monitoring</a>
<a href="https://opentelemetry.io/docs/collector/internal-telemetry/#use-internal-telemetry-to-monitor-the-collector">Monitoring</a>
&nbsp;&nbsp;&bull;&nbsp;&nbsp;
<a href="docs/security-best-practices.md">Security</a>
&nbsp;&nbsp;&bull;&nbsp;&nbsp;
71 changes: 4 additions & 67 deletions docs/monitoring.md
@@ -1,70 +1,7 @@
# Monitoring

The Collector provides many metrics for monitoring itself. Key recommendations
for alerting and monitoring are listed below.
To learn how to monitor the Collector using its own telemetry, see the [Internal
telemetry] page.

## Critical Monitoring

### Data Loss

Use the rate of `otelcol_processor_dropped_spans > 0` and
`otelcol_processor_dropped_metric_points > 0` to detect data loss. Depending on
your requirements, set up a minimal time window before alerting to avoid
notifications for small losses that are not considered outages or that fall
within the desired reliability level.
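
As a rough illustration, the data-loss check could be expressed as Prometheus alerting rules. This is a minimal sketch, assuming the Collector's internal metrics are scraped by Prometheus. The group name, alert names, the 5-minute rate window, and the 10-minute `for` duration are hypothetical values, not recommendations. Depending on the Collector version, counter names may carry a `_total` suffix.

```yaml
# Hypothetical Prometheus rule file; names and time windows are assumptions.
groups:
  - name: otelcol-data-loss
    rules:
      - alert: OtelcolDroppedSpans
        # Per-second rate of dropped spans, averaged over 5 minutes.
        expr: rate(otelcol_processor_dropped_spans[5m]) > 0
        # The condition must hold for the whole window before the alert fires,
        # which filters out small, short-lived losses.
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "Collector is dropping spans"
      - alert: OtelcolDroppedMetricPoints
        expr: rate(otelcol_processor_dropped_metric_points[5m]) > 0
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "Collector is dropping metric points"
```

Lengthening the `for` duration trades slower detection for fewer notifications about losses that stay within the desired reliability level.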

### Low on CPU Resources

This depends on the CPU metrics available in the deployment, e.g.,
`kube_pod_container_resource_limits{resource="cpu", unit="core"}` for Kubernetes. Let's call it
`available_cores` below. The idea is to have an upper bound on the number
of available cores and a maximum expected ingestion rate per core that is
considered safe, which we call `safe_rate`. This should trigger an increase of
resources/instances (or raise an alert, as appropriate) whenever the per-core
ingestion rate exceeds the safe rate, that is, whenever
`(actual_rate/available_cores) > safe_rate`.
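
A minimal sketch of this check as a Prometheus alerting rule follows, assuming the intent is to react when the per-core ingestion rate exceeds `safe_rate`. The `container="otel-collector"` label value, the use of accepted spans as `actual_rate`, and the threshold `10000` standing in for `safe_rate` are all hypothetical placeholders.

```yaml
# Hypothetical rule; the label value and threshold are illustrative only.
groups:
  - name: otelcol-cpu
    rules:
      - alert: OtelcolIngestionRatePerCoreHigh
        # (actual_rate / available_cores) > safe_rate
        expr: |
          sum(rate(otelcol_receiver_accepted_spans[5m]))
            /
          sum(kube_pod_container_resource_limits{resource="cpu", unit="core", container="otel-collector"})
            > 10000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Span ingestion rate per CPU core exceeds the assumed safe rate"
```

The same expression could feed an autoscaler instead of an alert. The appropriate threshold depends on the specific Collector configuration.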

The `safe_rate` depends on the specific configuration being used.
// TODO: Provide reference `safe_rate` for a few selected configurations.

## Secondary Monitoring

### Queue Length

Most exporters offer a [queue/retry mechanism](../exporter/exporterhelper/README.md)
that is recommended as the retry mechanism for the Collector and as such should
be used in any production deployment.

The `otelcol_exporter_queue_capacity` metric indicates the capacity of the retry queue (in batches). The `otelcol_exporter_queue_size` metric indicates the current size of the retry queue. You can use these two metrics to check whether the queue capacity is sufficient for your workload.

The `otelcol_exporter_enqueue_failed_spans`, `otelcol_exporter_enqueue_failed_metric_points` and `otelcol_exporter_enqueue_failed_log_records` metrics indicate the number of spans, metric points, and log records that failed to be added to the sending queue. This may be caused by a queue full of unsettled elements, so you may need to decrease your sending rate or horizontally scale Collectors.

The queue/retry mechanism also supports logging for monitoring. Check
the logs for messages like `"Dropping data because sending_queue is full"`.
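
One way to watch the queue is to compare its size against its capacity and to alert on enqueue failures. The rules below are a sketch under assumptions: the 80% utilization threshold, the time windows, and the alert names are hypothetical.

```yaml
# Hypothetical rules; thresholds and names are illustrative assumptions.
groups:
  - name: otelcol-exporter-queue
    rules:
      - alert: OtelcolExporterQueueNearlyFull
        # Fraction of the queue in use, evaluated per exporter series.
        expr: otelcol_exporter_queue_size / otelcol_exporter_queue_capacity > 0.8
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Exporter sending queue is above 80% of its capacity"
      - alert: OtelcolExporterEnqueueFailedSpans
        # Any sustained enqueue failure means spans did not enter the queue.
        expr: rate(otelcol_exporter_enqueue_failed_spans[5m]) > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Spans are failing to be added to the sending queue"
```

Equivalent rules can be written for `otelcol_exporter_enqueue_failed_metric_points` and `otelcol_exporter_enqueue_failed_log_records`.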

### Receive Failures

Sustained rates of `otelcol_receiver_refused_spans` and
`otelcol_receiver_refused_metric_points` indicate too many errors returned to
clients. Depending on the deployment and the client’s resilience this may
indicate data loss at the clients.

Sustained rates of `otelcol_exporter_send_failed_spans` and
`otelcol_exporter_send_failed_metric_points` indicate that the Collector is not
able to export data as expected.
This doesn't imply data loss per se, since there could be retries, but a high
rate of failures could indicate issues with the network or with the backend
receiving the data.
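
Sustained failure rates could be detected with rules like the following. This is only a sketch: the 15-minute `for` duration used to approximate "sustained", the thresholds, and the alert names are assumptions.

```yaml
# Hypothetical rules; "sustained" is approximated by the `for` duration.
groups:
  - name: otelcol-failures
    rules:
      - alert: OtelcolReceiverRefusedSpans
        # Refused spans are returned to clients as errors.
        expr: rate(otelcol_receiver_refused_spans[5m]) > 0
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Receiver is refusing spans; clients may be losing data"
      - alert: OtelcolExporterSendFailedSpans
        # Failed sends may still be retried, so treat this as a warning.
        expr: rate(otelcol_exporter_send_failed_spans[5m]) > 0
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Exporter is failing to send spans; check network and backend"
```

The metric-point variants, `otelcol_receiver_refused_metric_points` and `otelcol_exporter_send_failed_metric_points`, can be covered the same way.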

## Data Flow

### Data Ingress

The `otelcol_receiver_accepted_spans` and
`otelcol_receiver_accepted_metric_points` metrics provide information about
the data ingested by the Collector.

### Data Egress

The `otelcol_exporter_sent_spans` and
`otelcol_exporter_sent_metric_points` metrics provide information about
the data exported by the Collector.
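
To compare ingress and egress over time, the span counters could be turned into rates with Prometheus recording rules. This is a sketch only; the recorded series names and the 5-minute window are hypothetical.

```yaml
# Hypothetical recording rules; the recorded series names are assumptions.
groups:
  - name: otelcol-data-flow
    rules:
      # Spans accepted per second across all receivers.
      - record: otelcol:receiver_accepted_spans:rate5m
        expr: sum(rate(otelcol_receiver_accepted_spans[5m]))
      # Spans sent per second across all exporters.
      - record: otelcol:exporter_sent_spans:rate5m
        expr: sum(rate(otelcol_exporter_sent_spans[5m]))
```

Plotting both series on one dashboard panel makes a persistent gap between accepted and sent data easy to spot.
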
[Internal telemetry]:
https://opentelemetry.io/docs/collector/internal-telemetry/#use-internal-telemetry-to-monitor-the-collector
218 changes: 106 additions & 112 deletions docs/observability.md
@@ -1,140 +1,134 @@
# OpenTelemetry Collector Observability
# OpenTelemetry Collector internal observability

## Goal
The [Internal telemetry] page on OpenTelemetry's website contains the
documentation for the Collector's internal observability, including:

The goal of this document is to have a comprehensive description of observability of the Collector and changes needed to achieve observability part of our [vision](vision.md).
- Which types of observability are emitted by the Collector.
- How to enable and configure these signals.
- How to use this telemetry to monitor your Collector instance.

## What Needs Observation
If you need to troubleshoot the Collector, see [Troubleshooting].

The following elements of the Collector need to be observable.
Read on to learn about experimental features and the project's overall vision
for internal telemetry.

### Current Values
## Experimental trace telemetry

- Resource consumption: CPU, RAM (in the future also IO - if we implement persistent queues) and any other metrics that may be available to Go apps (e.g. garbage size, etc).
The Collector does not expose traces by default, but an effort is underway to
[change this][issue7532]. The work includes supporting configuration of the
OpenTelemetry SDK used to produce the Collector's internal telemetry. This
feature is behind the following feature gate:

- Receiving data rate, broken down by receivers and by data type (traces/metrics).

- Exporting data rate, broken down by exporters and by data type (traces/metrics).

- Data drop rate due to throttling, broken down by data type.

- Data drop rate due to invalid data received, broken down by data type.

- Current throttling state: Not Throttled/Throttled by Downstream/Internally Saturated.

- Incoming connection count, broken down by receiver.

- Incoming connection rate (new connections per second), broken down by receiver.

- In-memory queue size (in bytes and in units). Note: measurements in bytes may be difficult / expensive to obtain and should be used cautiously.

- Persistent queue size (when supported).

- End-to-end latency (from receiver input to exporter output). Note that with multiple receivers/exporters we potentially have NxM data paths, each with different latency (plus different pipelines in the future), so realistically we should likely expose the average of all data paths (perhaps broken down by pipeline).

- Latency broken down by pipeline elements (including exporter network roundtrip latency for request/response protocols).

“Rate” values must reflect the average rate of the last 10 seconds. Rates must be exposed in bytes/sec and units/sec (e.g. spans/sec).

Note: some of the current values and rates may be calculated as derivatives of cumulative values in the backend, so it is an open question whether we want to expose them separately or not.

### Cumulative Values

- Total received data, broken down by receivers and by data type (traces/metrics).

- Total exported data, broken down by exporters and by data type (traces/metrics).

- Total dropped data due to throttling, broken down by data type.

- Total dropped data due to invalid data received, broken down by data type.

- Total incoming connection count, broken down by receiver.

- Uptime since start.

### Trace or Log on Events

We want to generate the following events (log and/or send as a trace with additional data):

- Collector started/stopped.

- Collector reconfigured (if we support on-the-fly reconfiguration).

- Begin dropping due to throttling (include throttling reason, e.g. local saturation, downstream saturation, downstream unavailable, etc).

- Stop dropping due to throttling.

- Begin dropping due to invalid data (include sample/first invalid data).

- Stop dropping due to invalid data.

- Crash detected (differentiate clean stopping and crash, possibly include crash data if available).

For begin/stop events we need to define an appropriate hysteresis to avoid generating too many events. Note that begin/stop events cannot be detected in the backend simply as derivatives of current rates, the events include additional data that is not present in the current value.
```bash
--feature-gates=telemetry.useOtelWithSDKConfigurationForInternalTelemetry
```

### Host Metrics
The gate `useOtelWithSDKConfigurationForInternalTelemetry` enables the Collector
to parse any configuration that aligns with the [OpenTelemetry Configuration]
schema. Support for this schema is experimental, but it does allow telemetry to
be exported using OTLP.

The service should collect host resource metrics in addition to service's own process metrics. This may help to understand that the problem that we observe in the service is induced by a different process on the same host.
The following configuration can be used in combination with the aforementioned
feature gate to emit internal metrics and traces from the Collector to an OTLP
backend:

## How We Expose Telemetry
```yaml
service:
  telemetry:
    metrics:
      readers:
        - periodic:
            interval: 5000
            exporter:
              otlp:
                protocol: grpc/protobuf
                endpoint: https://backend:4317
    traces:
      processors:
        - batch:
            exporter:
              otlp:
                protocol: grpc/protobuf
                endpoint: https://backend2:4317
```

By default, the Collector exposes service telemetry in two ways currently:
See the [example configuration][kitchen-sink] for additional options.

- internal metrics are exposed via a Prometheus interface which defaults to port `8888`
- logs are emitted to stdout
> This configuration does not support emitting logs as there is no support for
> [logs] in the OpenTelemetry Go SDK at this time.

Traces are not exposed by default. There is an effort underway to [change this][issue7532]. The work includes supporting
configuration of the OpenTelemetry SDK used to produce the Collector's internal telemetry. This feature is
currently behind two feature gates:
You can also configure the Collector to send its own traces using the OTLP
exporter. Send the traces to an OTLP server running on the same Collector, so
that they go through the configured pipelines. For example:

```bash
--feature-gates=telemetry.useOtelWithSDKConfigurationForInternalTelemetry
```yaml
service:
  telemetry:
    traces:
      processors:
        batch:
          exporter:
            otlp:
              protocol: grpc/protobuf
              endpoint: ${MY_POD_IP}:4317
```
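
For the traces to pass through a pipeline, the same Collector also needs an OTLP receiver listening on that endpoint. The fragment below is a minimal sketch of the receiving side, assuming the gRPC protocol and the `debug` exporter as a stand-in destination. The pipeline layout is illustrative and not taken from the original document.

```yaml
# Illustrative receiving side; the exporter choice is an assumption.
receivers:
  otlp:
    protocols:
      grpc:
        # Must match the endpoint configured for the internal trace exporter.
        endpoint: ${MY_POD_IP}:4317

exporters:
  # Prints received telemetry to the Collector's own output.
  debug:

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [debug]
```

In a real deployment the `debug` exporter would typically be replaced by the exporter for your tracing backend.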

The gate `useOtelWithSDKConfigurationForInternalTelemetry` enables the Collector to parse configuration
that aligns with the [OpenTelemetry Configuration] schema. The support for this schema is still
experimental, but it does allow telemetry to be exported via OTLP.
## Goals of internal telemetry

The following configuration can be used in combination with the feature gates aforementioned
to emit internal metrics and traces from the Collector to an OTLP backend:
The Collector's internal telemetry is an important part of fulfilling
OpenTelemetry's [project vision](vision.md). The following section explains the
priorities for making the Collector an observable service.

```yaml
service:
  telemetry:
    metrics:
      readers:
        - periodic:
            interval: 5000
            exporter:
              otlp:
                protocol: grpc/protobuf
                endpoint: https://backend:4317
    traces:
      processors:
        - batch:
            exporter:
              otlp:
                protocol: grpc/protobuf
                endpoint: https://backend2:4317
```
### Observable elements

See the configuration's [example][kitchen-sink] for additional configuration options.
The following aspects of the Collector need to be observable.

Note that this configuration does not support emitting logs as there is no support for [logs] in
OpenTelemetry Go SDK at this time.
- [Current values]
  - Some of the current values and rates might be calculated as derivatives of
    cumulative values in the backend, so it's an open question whether to expose
    them separately or not.
- [Cumulative values]
- [Trace or log events]
  - For start or stop events, an appropriate hysteresis must be defined to avoid
    generating too many events. Note that start and stop events can't be
    detected in the backend simply as derivatives of current rates. The events
    include additional data that is not present in the current value.
- [Host metrics]
  - Host metrics can help users determine if the observed problem in a service
    is caused by a different process on the same host.

### Impact

We need to be able to assess the impact of these observability improvements on the core performance of the Collector.
The impact of these observability improvements on the core performance of the
Collector must be assessed.

### Configurable Level of Observability
### Configurable level of observability

Some of the metrics/traces can be high volume and may not be desirable to always observe. We should consider adding an observability verboseness “level” that allows configuring the Collector to send more or less observability data (or even finer granularity to allow turning on/off specific metrics).
Some metrics and traces can be high volume, and users might not always want to
observe them. An observability verbosity “level” allows the Collector to be
configured to send more or less observability data or, at an even finer
granularity, to turn specific metrics on or off.

The default level of observability must be defined in a way that has insignificant performance impact on the service.
The default level of observability must be defined in a way that has
insignificant performance impact on the service.
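
As one existing example of such a level, the Collector's service telemetry configuration accepts a `level` for internal metrics. The sketch below assumes the commonly documented values `none`, `basic`, `normal`, and `detailed`; check the documentation for your Collector version to confirm which values it accepts.

```yaml
service:
  telemetry:
    metrics:
      # Assumed values: none, basic, normal, detailed.
      # Higher levels emit more internal metrics at a higher cost.
      level: detailed
```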

[issue7532]: https://github.com/open-telemetry/opentelemetry-collector/issues/7532
[issue7454]: https://github.com/open-telemetry/opentelemetry-collector/issues/7454
[Internal telemetry]:
https://opentelemetry.io/docs/collector/internal-telemetry/
[Troubleshooting]: https://opentelemetry.io/docs/collector/troubleshooting/
[issue7532]:
https://github.com/open-telemetry/opentelemetry-collector/issues/7532
[issue7454]:
https://github.com/open-telemetry/opentelemetry-collector/issues/7454
[logs]: https://github.com/open-telemetry/opentelemetry-go/issues/3827
[OpenTelemetry Configuration]: https://github.com/open-telemetry/opentelemetry-configuration
[kitchen-sink]: https://github.com/open-telemetry/opentelemetry-configuration/blob/main/examples/kitchen-sink.yaml
[OpenTelemetry Configuration]:
https://github.com/open-telemetry/opentelemetry-configuration
[kitchen-sink]:
https://github.com/open-telemetry/opentelemetry-configuration/blob/main/examples/kitchen-sink.yaml
[Current values]:
https://opentelemetry.io/docs/collector/internal-telemetry/#values-observable-with-internal-metrics
[Cumulative values]:
https://opentelemetry.io/docs/collector/internal-telemetry/#values-observable-with-internal-metrics
[Trace or log events]:
https://opentelemetry.io/docs/collector/internal-telemetry/#events-observable-with-internal-logs
[Host metrics]:
https://opentelemetry.io/docs/collector/internal-telemetry/#lists-of-internal-metrics