Unify internal observability documentation - 3 of 3 #4529

Merged · 27 commits · Jun 14, 2024
Changes from 13 commits
Commits
27 commits
e73a88f
Copy text from troubleshooting.md
tiffany76 May 23, 2024
eb1ad6a
Copy text from monitoring.md
tiffany76 May 23, 2024
89b473a
Make copy edits to internal-telemetry.md
tiffany76 May 23, 2024
fae47d4
Make small word fixes
tiffany76 May 23, 2024
667e0ca
Make linter fixes
tiffany76 May 23, 2024
9ec03f2
Add cSpell ignore words
tiffany76 May 23, 2024
def14d6
Make one more prettier fix
tiffany76 May 23, 2024
c5cfe25
Merge branch 'main' into internal-obs-3
tiffany76 May 23, 2024
89980ed
Merge branch 'main' into internal-obs-3
tiffany76 May 28, 2024
5336e53
Merge branch 'main' into internal-obs-3
tiffany76 May 29, 2024
42d9ab4
Revert CPU resources section
tiffany76 May 29, 2024
9a9bbcd
Copyedit the troubleshooting page
tiffany76 May 30, 2024
ee3aa1e
Make more text edits to internal telemetry page
tiffany76 May 30, 2024
3b0cc4d
Merge branch 'main' into internal-obs-3
tiffany76 May 30, 2024
dff6963
Merge branch 'main' into internal-obs-3
tiffany76 May 31, 2024
5904d21
Apply suggestions from Fabrizio's review
tiffany76 Jun 3, 2024
d6e227a
Merge branch 'main' into internal-obs-3
tiffany76 Jun 3, 2024
9c244d9
Results from /fix:format
opentelemetrybot Jun 3, 2024
97dec69
Merge branch 'main' into internal-obs-3
tiffany76 Jun 4, 2024
aa21f31
Make small wording and link fixes
tiffany76 Jun 4, 2024
9cf1347
Merge branch 'main' into internal-obs-3
tiffany76 Jun 10, 2024
7377638
Merge branch 'main' into internal-obs-3
tiffany76 Jun 11, 2024
27bd47b
Remove Health Check extension section
tiffany76 Jun 11, 2024
85f919c
Merge branch 'main' into internal-obs-3
tiffany76 Jun 12, 2024
87f6995
Merge branch 'main' into internal-obs-3
tiffany76 Jun 13, 2024
878c850
Remove CPU monitoring section
tiffany76 Jun 13, 2024
f1c9dfe
Merge branch 'main' into internal-obs-3
theletterf Jun 14, 2024
content/en/docs/collector/internal-telemetry.md: 75 changes (74 additions, 1 deletion)
@@ -133,7 +133,7 @@ journalctl | grep otelcol | grep Error

{{% /tab %}} {{< /tabpane >}}

## Types of internal observability
## Types of internal telemetry

The OpenTelemetry Collector aims to be a model of an observable service by
clearly exposing its own operational metrics. Additionally, it collects host resource
@@ -272,3 +272,76 @@ The Collector logs the following internal events:
- Data dropping due to invalid data stops.
- A crash is detected, differentiated from a clean stop. Crash data is included
if available.

## Use internal telemetry to monitor the Collector

This section recommends best practices for monitoring the Collector using its
own telemetry.

### Critical monitoring

#### Data loss

Use the rate of `otelcol_processor_dropped_spans > 0` and
`otelcol_processor_dropped_metric_points > 0` to detect data loss. Depending on
your project's requirements, set a minimum time window before alerting begins
so that notifications are not triggered for small losses that fall within the
desired reliability range and are not considered outages.
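
The following Prometheus alerting rule is a minimal sketch of how such an alert
could look. The `5m` rate window, the `for:` duration, and the rule group name
are placeholders, and the rule assumes these Collector metrics are scraped by
Prometheus under the names shown (depending on your Collector version and
Prometheus setup, the exposed names may carry a `_total` suffix).

```yaml
groups:
  - name: otel-collector-data-loss # hypothetical group name
    rules:
      - alert: CollectorDroppingSpans
        # Any sustained, nonzero drop rate means the processor pipeline is losing spans.
        expr: rate(otelcol_processor_dropped_spans[5m]) > 0
        for: 5m # tune this window to your reliability requirements
        labels:
          severity: critical
      - alert: CollectorDroppingMetricPoints
        # Same check for metric data points.
        expr: rate(otelcol_processor_dropped_metric_points[5m]) > 0
        for: 5m
        labels:
          severity: critical
```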

#### Low on CPU resources

This depends on the CPU metrics available for the deployment, for example
`kube_pod_container_resource_limits{resource="cpu", unit="core"}` on
Kubernetes. Call this value `available_cores` below. The idea is to establish
an upper bound on the number of available cores and on the maximum ingestion
rate per core that is considered safe, called `safe_rate` below. This should
trigger a scale-up of resources or instances (or raise an alert, as
appropriate) whenever `(actual_rate/available_cores) > safe_rate`.

The `safe_rate` depends on the specific configuration being used. // TODO:
Provide reference `safe_rate` for a few selected configurations.
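
As an illustration only, such a check could be written as a Prometheus alerting
rule along the following lines. Here `rate(otelcol_receiver_accepted_spans[5m])`
stands in for `actual_rate`, `15000` is a made-up placeholder for `safe_rate`
rather than a reference value, and the `kube_pod_container_resource_limits`
selector would need additional label matchers (omitted here) to scope it to the
Collector's own pods.

```yaml
groups:
  - name: otel-collector-cpu-headroom # hypothetical group name
    rules:
      - alert: CollectorIngestionExceedsSafeRatePerCore
        # actual_rate / available_cores > safe_rate, using accepted spans as the
        # ingestion rate and the Kubernetes CPU limit as available_cores.
        expr: |
          sum(rate(otelcol_receiver_accepted_spans[5m]))
            / sum(kube_pod_container_resource_limits{resource="cpu", unit="core"})
            > 15000
        for: 10m
        labels:
          severity: warning
```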

### Secondary monitoring

#### Queue length

Most exporters offer a
[queue/retry mechanism](https://github.com/open-telemetry/opentelemetry-collector/blob/main/exporter/exporterhelper/README.md)
that is recommended for use in any production deployment of the Collector.

The `otelcol_exporter_queue_capacity` metric indicates the capacity, in batches,
of the retry queue. The `otelcol_exporter_queue_size` metric indicates the
current size of the retry queue. Use these two metrics to check if the queue
capacity can support your workload.
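
As a sketch, the ratio of these two metrics can be alerted on directly. The
`0.8` threshold, the `for:` duration, and the group name below are arbitrary
placeholders rather than recommended values.

```yaml
groups:
  - name: otel-collector-queue # hypothetical group name
    rules:
      - alert: CollectorExporterQueueNearCapacity
        # Fires when an exporter's retry queue stays more than 80% full.
        expr: otelcol_exporter_queue_size / otelcol_exporter_queue_capacity > 0.8
        for: 10m
        labels:
          severity: warning
```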

Using the following three metrics, you can identify the number of spans, metric
points, and log records that failed to reach the sending queue:

- `otelcol_exporter_enqueue_failed_spans`
- `otelcol_exporter_enqueue_failed_metric_points`
- `otelcol_exporter_enqueue_failed_log_records`

These failures could be caused by a queue filled with unsettled elements. You
might need to decrease your sending rate or horizontally scale Collectors.
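
For instance, a rule along these lines sums enqueue failures across all three
signals; the metric names come from the list above, while the aggregation, the
`5m` window, and the group name are assumptions.

```yaml
groups:
  - name: otel-collector-enqueue-failures # hypothetical group name
    rules:
      - alert: CollectorEnqueueFailures
        # Fires when telemetry repeatedly fails to enter the sending queue.
        expr: |
          sum(rate(otelcol_exporter_enqueue_failed_spans[5m]))
            + sum(rate(otelcol_exporter_enqueue_failed_metric_points[5m]))
            + sum(rate(otelcol_exporter_enqueue_failed_log_records[5m]))
            > 0
        for: 5m
        labels:
          severity: warning
```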

The queue/retry mechanism also supports logging for monitoring. Check the logs
for messages such as `Dropping data because sending_queue is full`.

#### Receive failures

Sustained rates of `otelcol_receiver_refused_spans` and
`otelcol_receiver_refused_metric_points` indicate that too many errors were
returned to clients. Depending on the deployment and the clients' resilience,
this might indicate data loss on the client side.

Sustained rates of `otelcol_exporter_send_failed_spans` and
`otelcol_exporter_send_failed_metric_points` indicate that the Collector is not
able to export data as expected. These metrics do not inherently imply data
loss, since the export might be retried. However, a high rate of failures could
indicate issues with the network or with the backend receiving the data.
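
As a sketch, recording rules such as the following track both failure rates so
they can be graphed or alerted on. The rule names, the `5m` window, and the
span-only focus are placeholders; equivalent rules could be written for metric
points.

```yaml
groups:
  - name: otel-collector-failure-rates # hypothetical group name
    rules:
      # Rate of spans refused back to clients by receivers.
      - record: otelcol:receiver_refused_spans:rate5m
        expr: sum(rate(otelcol_receiver_refused_spans[5m]))
      # Rate of spans that exporters failed to send (these may still be retried).
      - record: otelcol:exporter_send_failed_spans:rate5m
        expr: sum(rate(otelcol_exporter_send_failed_spans[5m]))
```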

#### Data flow

You can monitor data ingress with the `otelcol_receiver_accepted_spans` and
`otelcol_receiver_accepted_metric_points` metrics and data egress with the
`otelcol_exporter_sent_spans` and `otelcol_exporter_sent_metric_points` metrics.
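
For example, comparing egress with ingress can reveal whether data is piling up
or being dropped inside the Collector. This sketch assumes a single pipeline
with no sampling, connectors, or exporter fan-out; the `0.9` ratio, `5m` window,
and `15m` duration are arbitrary placeholders.

```yaml
groups:
  - name: otel-collector-data-flow # hypothetical group name
    rules:
      - alert: CollectorSpanEgressLagsIngress
        # Fires when exported spans fall well below accepted spans, suggesting a
        # growing backlog or dropped data in the pipeline.
        expr: |
          sum(rate(otelcol_exporter_sent_spans[5m]))
            / sum(rate(otelcol_receiver_accepted_spans[5m]))
            < 0.9
        for: 15m
        labels:
          severity: warning
```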