From e73a88f5d44f272e42363a45e9e94550fdd08a5d Mon Sep 17 00:00:00 2001 From: tiffany76 <30397949+tiffany76@users.noreply.github.com> Date: Thu, 23 May 2024 11:08:13 -0700 Subject: [PATCH 01/15] Copy text from troubleshooting.md --- content/en/docs/collector/troubleshooting.md | 239 ++++++++++++++++++- 1 file changed, 228 insertions(+), 11 deletions(-) diff --git a/content/en/docs/collector/troubleshooting.md b/content/en/docs/collector/troubleshooting.md index 8278d00b678b..5b4bee04bc87 100644 --- a/content/en/docs/collector/troubleshooting.md +++ b/content/en/docs/collector/troubleshooting.md @@ -14,13 +14,6 @@ You can configure and use the Collector's own [internal telemetry](/docs/collector/internal-telemetry/) to monitor its performance. -## Sending test data - -For certain types of issues, particularly verifying configuration and debugging -network issues, it can be helpful to send a small amount of data to a collector -configured to output to local logs. For details, see -[Local exporters](https://github.com/open-telemetry/opentelemetry-collector/blob/main/docs/troubleshooting.md#local-exporters). - ## Check available components in the Collector Use the following sub-command to list the available components in a Collector @@ -120,6 +113,160 @@ extensions: extension: Beta ``` +## Sending test data + +For certain types of issues, particularly verifying configuration and debugging +network issues, it can be helpful to send a small amount of data to a collector +configured to output to local logs. + +### Local exporters + +[Local exporters](https://github.com/open-telemetry/opentelemetry-collector/tree/main/exporter#general-information) +can be configured to inspect the data being processed by the Collector. + +For live troubleshooting purposes consider leveraging the `debug` exporter, +which can be used to confirm that data is being received, processed and exported +by the Collector. + +```yaml +receivers: + zipkin: +exporters: + debug: +service: + pipelines: + traces: + receivers: [zipkin] + processors: [] + exporters: [debug] +``` + +Get a Zipkin payload to test. For example create a file called `trace.json` that +contains: + +```json +[ + { + "traceId": "5982fe77008310cc80f1da5e10147519", + "parentId": "90394f6bcffb5d13", + "id": "67fae42571535f60", + "kind": "SERVER", + "name": "/m/n/2.6.1", + "timestamp": 1516781775726000, + "duration": 26000, + "localEndpoint": { + "serviceName": "api" + }, + "remoteEndpoint": { + "serviceName": "apip" + }, + "tags": { + "data.http_response_code": "201" + } + } +] +``` + +With the Collector running, send this payload to the Collector. 
For example: + +```console +$ curl -X POST localhost:9411/api/v2/spans -H'Content-Type: application/json' -d @trace.json +``` + +You should see a log entry like the following from the Collector: + +``` +2023-09-07T09:57:43.468-0700 info TracesExporter {"kind": "exporter", "data_type": "traces", "name": "debug", "resource spans": 1, "spans": 2} +``` + +You can also configure the `debug` exporter so the entire payload is printed: + +```yaml +exporters: + debug: + verbosity: detailed +``` + +With the modified configuration if you re-run the test above the log output +should look like: + +``` +2023-09-07T09:57:12.820-0700 info TracesExporter {"kind": "exporter", "data_type": "traces", "name": "debug", "resource spans": 1, "spans": 2} +2023-09-07T09:57:12.821-0700 info ResourceSpans #0 +Resource SchemaURL: https://opentelemetry.io/schemas/1.4.0 +Resource attributes: + -> service.name: Str(telemetrygen) +ScopeSpans #0 +ScopeSpans SchemaURL: +InstrumentationScope telemetrygen +Span #0 + Trace ID : 0c636f29e29816ea76e6a5b8cd6601cf + Parent ID : 1a08eba9395c5243 + ID : 10cebe4b63d47cae + Name : okey-dokey + Kind : Internal + Start time : 2023-09-07 16:57:12.045933 +0000 UTC + End time : 2023-09-07 16:57:12.046058 +0000 UTC + Status code : Unset + Status message : +Attributes: + -> span.kind: Str(server) + -> net.peer.ip: Str(1.2.3.4) + -> peer.service: Str(telemetrygen) +``` + +## Extensions useful for troubleshooting + +### Health Check + +The +[health_check](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/extension/healthcheckextension/README.md) +extension, which by default is available on all interfaces on port `13133`, can +be used to ensure the Collector is functioning properly. + +```yaml +extensions: + health_check: +service: + extensions: [health_check] +``` + +It returns a response like the following: + +```json +{ + "status": "Server available", + "upSince": "2020-11-11T04:12:31.6847174Z", + "uptime": "49.0132518s" +} +``` + +### pprof + +The +[pprof](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/extension/pprofextension/README.md) +extension, which by default is available locally on port `1777`, allows you to +profile the Collector as it runs. This is an advanced use-case that should not +be needed in most circumstances. + +### zPages + +The +[zpages](https://github.com/open-telemetry/opentelemetry-collector/tree/main/extension/zpagesextension/README.md) +extension, which if enabled is exposed locally on port `55679`, can be used to +check receivers and exporters trace operations via `/debug/tracez`. `zpages` may +contain error logs that the Collector does not emit. + +For containerized environments it may be desirable to expose this port on a +public interface instead of just locally. This can be configured via the +extensions configuration section. For example: + +```yaml +extensions: + zpages: + endpoint: 0.0.0.0:55679 +``` + ## Checklist for debugging complex pipelines It can be difficult to isolate problems when telemetry flows through multiple @@ -136,8 +283,78 @@ following: - How is the next hop configured? - Are there any network policies that prevent data from getting in or out? -### More +## Common Issues + + + +### Collector exit/restart + +The Collector may exit/restart because: + +- Memory pressure due to missing or misconfigured + [memory_limiter](https://github.com/open-telemetry/opentelemetry-collector/blob/main/processor/memorylimiterprocessor/README.md) + processor. +- Improperly sized for load. 
+- Improperly configured (for example, a queue size configured higher than + available memory). +- Infrastructure resource limits (for example Kubernetes). + +### Data being dropped + +Data may be dropped for a variety of reasons, but most commonly because of an: + +- Improperly sized Collector resulting in Collector being unable to process and + export the data as fast as it is received. +- Exporter destination unavailable or accepting the data too slowly. + +To mitigate drops, it is highly recommended to configure the +[batch](https://github.com/open-telemetry/opentelemetry-collector/blob/main/processor/batchprocessor/README.md) +processor. In addition, it may be necessary to configure the +[queued retry options](https://github.com/open-telemetry/opentelemetry-collector/tree/main/exporter/exporterhelper#configuration) +on enabled exporters. + +### Receiving data not working + +If you are unable to receive data then this is likely because either: + +- There is a network configuration issue +- The receiver configuration is incorrect +- The receiver is defined in the `receivers` section, but not enabled in any + `pipelines` +- The client configuration is incorrect + +Check the Collector logs as well as `zpages` for potential issues. + +### Processing data not working + +Most processing issues are a result of either a misunderstanding of how the +processor works or a misconfiguration of the processor. + +Examples of misunderstanding include: + +- The attributes processors only work for "tags" on spans. Span name is handled + by the span processor. +- Processors for trace data (except tail sampling) work on individual spans. + +### Exporting data not working + +If you are unable to export to a destination then this is likely because either: + +- There is a network configuration issue +- The exporter configuration is incorrect +- The destination is unavailable + +Check the collector logs as well as `zpages` for potential issues. + +More often than not, exporting data does not work because of a network +configuration issue. This could be due to a firewall, DNS, or proxy issue. Note +that the Collector does have +[proxy support](https://github.com/open-telemetry/opentelemetry-collector/tree/main/exporter#proxy-support). + +### Startup failing in Windows Docker containers (v0.90.1 and earlier) -For detailed recommendations, including common problems, see -[Troubleshooting](https://github.com/open-telemetry/opentelemetry-collector/blob/main/docs/troubleshooting.md) -from the Collector repository. +The process may fail to start in a Windows Docker container with the following +error: `The service process could not connect to the service controller`. In +this case the `NO_WINDOWS_SERVICE=1` environment variable should be set to force +the collector to be started as if it were running in an interactive terminal, +without attempting to run as a Windows service. 
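
For illustration only (the image name, tag, and flags below are placeholders and are not part of this patch), the environment variable is typically passed when the container is started, for example with `docker run -e`:

```shell
# Hypothetical example; the image name and tag are placeholders.
docker run -e NO_WINDOWS_SERVICE=1 otel/opentelemetry-collector:<tag>
```
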
From eb1ad6a40392236b2cf6227fbe1b3cd9158bbe2c Mon Sep 17 00:00:00 2001 From: tiffany76 <30397949+tiffany76@users.noreply.github.com> Date: Thu, 23 May 2024 11:34:08 -0700 Subject: [PATCH 02/15] Copy text from monitoring.md --- .../en/docs/collector/internal-telemetry.md | 73 ++++++++++++++++++- 1 file changed, 72 insertions(+), 1 deletion(-) diff --git a/content/en/docs/collector/internal-telemetry.md b/content/en/docs/collector/internal-telemetry.md index 3dfdad230978..37a77404d94e 100644 --- a/content/en/docs/collector/internal-telemetry.md +++ b/content/en/docs/collector/internal-telemetry.md @@ -133,7 +133,7 @@ journalctl | grep otelcol | grep Error {{% /tab %}} {{< /tabpane >}} -## Types of internal observability +## Types of internal telemetry The OpenTelemetry Collector aims to be a model of observable service by clearly exposing its own operational metrics. Additionally, it collects host resource @@ -272,3 +272,74 @@ The Collector logs the following internal events: - Data dropping due to invalid data stops. - A crash is detected, differentiated from a clean stop. Crash data is included if available. + +## Use internal telemetry to monitor the Collector + +This section recommends best practices for alerting and monitoring the Collector +using its own telemetry. + +### Critical monitoring + +#### Data loss + +Use rate of `otelcol_processor_dropped_spans > 0` and +`otelcol_processor_dropped_metric_points > 0` to detect data loss, depending on +the requirements set up a minimal time window before alerting, avoiding +notifications for small losses that are not considered outages or within the +desired reliability level. + +#### Low on CPU resources + +This depends on the CPU metrics available on the deployment, eg.: +`kube_pod_container_resource_limits{resource="cpu", unit="core"}` for +Kubernetes. Let's call it `available_cores` below. The idea here is to have an +upper bound of the number of available cores, and the maximum expected ingestion +rate considered safe, let's call it `safe_rate`, per core. This should trigger +increase of resources/ instances (or raise an alert as appropriate) whenever +`(actual_rate/available_cores) < safe_rate`. + +The `safe_rate` depends on the specific configuration being used. // TODO: +Provide reference `safe_rate` for a few selected configurations. + +### Secondary monitoring + +#### Queue length + +Most exporters offer a +[queue/retry mechanism](../exporter/exporterhelper/README.md) that is +recommended as the retry mechanism for the Collector and as such should be used +in any production deployment. + +The `otelcol_exporter_queue_capacity` indicates the capacity of the retry queue +(in batches). The `otelcol_exporter_queue_size` indicates the current size of +retry queue. So you can use these two metrics to check if the queue capacity is +enough for your workload. + +The `otelcol_exporter_enqueue_failed_spans`, +`otelcol_exporter_enqueue_failed_metric_points` and +`otelcol_exporter_enqueue_failed_log_records` indicate the number of span/metric +points/log records failed to be added to the sending queue. This may be cause by +a queue full of unsettled elements, so you may need to decrease your sending +rate or horizontally scale collectors. + +The queue/retry mechanism also supports logging for monitoring. Check the logs +for messages like `"Dropping data because sending_queue is full"`. 
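
As a rough sketch of how these pieces fit together (the exporter name, endpoint, and values below are placeholders rather than recommendations, and defaults vary by exporter and Collector version), the queue size and retry behavior referenced above are tuned per exporter:

```yaml
exporters:
  otlp:
    endpoint: backend.example.com:4317 # placeholder endpoint
    sending_queue:
      enabled: true
      num_consumers: 10 # workers draining the queue
      queue_size: 5000 # batches held before new data is rejected
    retry_on_failure:
      enabled: true
      initial_interval: 5s
      max_elapsed_time: 300s
```

If `otelcol_exporter_queue_size` regularly approaches `otelcol_exporter_queue_capacity`, increasing `queue_size` or scaling out Collectors are the usual levers.
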
+ +#### Receive failures + +Sustained rates of `otelcol_receiver_refused_spans` and +`otelcol_receiver_refused_metric_points` indicate too many errors returned to +clients. Depending on the deployment and the client’s resilience this may +indicate data loss at the clients. + +Sustained rates of `otelcol_exporter_send_failed_spans` and +`otelcol_exporter_send_failed_metric_points` indicate that the Collector is not +able to export data as expected. It doesn't imply data loss per se since there +could be retries but a high rate of failures could indicate issues with the +network or backend receiving the data. + +### Data flow + +You can monitor data ingress with the `otelcol_receiver_accepted_spans` and +`otelcol_receiver_accepted_metric_points` metrics and data egress with the +`otecol_exporter_sent_spans` and `otelcol_exporter_sent_metric_points` metrics. From 89b473a6f30168369099b93ac97cd7b386bcfc41 Mon Sep 17 00:00:00 2001 From: tiffany76 <30397949+tiffany76@users.noreply.github.com> Date: Thu, 23 May 2024 12:49:39 -0700 Subject: [PATCH 03/15] Make copy edits to internal-telemetry.md --- .../en/docs/collector/internal-telemetry.md | 80 ++++++++++--------- 1 file changed, 42 insertions(+), 38 deletions(-) diff --git a/content/en/docs/collector/internal-telemetry.md b/content/en/docs/collector/internal-telemetry.md index 37a77404d94e..ae44c5f12039 100644 --- a/content/en/docs/collector/internal-telemetry.md +++ b/content/en/docs/collector/internal-telemetry.md @@ -275,52 +275,56 @@ The Collector logs the following internal events: ## Use internal telemetry to monitor the Collector -This section recommends best practices for alerting and monitoring the Collector -using its own telemetry. +This section recommends best practices for monitoring the Collector using its +own telemetry. ### Critical monitoring #### Data loss -Use rate of `otelcol_processor_dropped_spans > 0` and -`otelcol_processor_dropped_metric_points > 0` to detect data loss, depending on -the requirements set up a minimal time window before alerting, avoiding -notifications for small losses that are not considered outages or within the -desired reliability level. +Use the rate of `otelcol_processor_dropped_spans > 0` and +`otelcol_processor_dropped_metric_points > 0` to detect data loss. Depending on +your project's requirements, set up a minimal time window before alerting begins +to avoid notifications for small losses that are within the desired reliability +range and not considered outages. -#### Low on CPU resources +#### Low CPU resources -This depends on the CPU metrics available on the deployment, eg.: -`kube_pod_container_resource_limits{resource="cpu", unit="core"}` for -Kubernetes. Let's call it `available_cores` below. The idea here is to have an -upper bound of the number of available cores, and the maximum expected ingestion -rate considered safe, let's call it `safe_rate`, per core. This should trigger -increase of resources/ instances (or raise an alert as appropriate) whenever -`(actual_rate/available_cores) < safe_rate`. +To make sure your Collector is using CPU resources safely during data ingestion, +you need to set: -The `safe_rate` depends on the specific configuration being used. // TODO: -Provide reference `safe_rate` for a few selected configurations. +- An upper bound on the number of `available_cores`. The metric that tracks + `available_cores` is dependent on your deployment. 
For example, a Kubernetes + deployment offers the + `kube_pod_container_resource_limits{resource="cpu", unit="core"}` metric. +- The maximum ingestion rate per core that is considered safe (`safe_rate`). The + `safe_rate` depends on the specific configuration you use. + +When `(actual_rate/available_cores) < safe_rate`, an alert should be raised and +an increase in resources or instances should be triggered, as appropriate. ### Secondary monitoring #### Queue length Most exporters offer a -[queue/retry mechanism](../exporter/exporterhelper/README.md) that is -recommended as the retry mechanism for the Collector and as such should be used -in any production deployment. - -The `otelcol_exporter_queue_capacity` indicates the capacity of the retry queue -(in batches). The `otelcol_exporter_queue_size` indicates the current size of -retry queue. So you can use these two metrics to check if the queue capacity is -enough for your workload. - -The `otelcol_exporter_enqueue_failed_spans`, -`otelcol_exporter_enqueue_failed_metric_points` and -`otelcol_exporter_enqueue_failed_log_records` indicate the number of span/metric -points/log records failed to be added to the sending queue. This may be cause by -a queue full of unsettled elements, so you may need to decrease your sending -rate or horizontally scale collectors. +[queue/retry mechanism](https://github.com/open-telemetry/opentelemetry-collector/blob/main/exporter/exporterhelper/README.md) +that is recommended for use in any production deployment of the Collector. + +The `otelcol_exporter_queue_capacity` metric indicates the capacity, in batches, +of the retry queue. The `otelcol_exporter_queue_size` metric indicates the +current size of the retry queue. Use these two metrics to check if the queue +capacity can support your workload. + +Using the following three metrics, you can identify the number of spans/metric +points/log records that failed to reach the sending queue: + +- `otelcol_exporter_enqueue_failed_spans` +- `otelcol_exporter_enqueue_failed_metric_points` +- `otelcol_exporter_enqueue_failed_log_records` + +These failures could be caused by a queue filled with unsettled elements. You +might need to decrease your sending rate or horizontally scale Collectors. The queue/retry mechanism also supports logging for monitoring. Check the logs for messages like `"Dropping data because sending_queue is full"`. @@ -328,15 +332,15 @@ for messages like `"Dropping data because sending_queue is full"`. #### Receive failures Sustained rates of `otelcol_receiver_refused_spans` and -`otelcol_receiver_refused_metric_points` indicate too many errors returned to -clients. Depending on the deployment and the client’s resilience this may -indicate data loss at the clients. +`otelcol_receiver_refused_metric_points` indicate that too many errors were +returned to clients. Depending on the deployment and the clients' resilience, +this might indicate clients' data loss. Sustained rates of `otelcol_exporter_send_failed_spans` and `otelcol_exporter_send_failed_metric_points` indicate that the Collector is not -able to export data as expected. It doesn't imply data loss per se since there -could be retries but a high rate of failures could indicate issues with the -network or backend receiving the data. +able to export data as expected. These metrics do not inherently imply data loss +since there could be retries. But a high rate of failures could indicate issues +with the network or backend receiving the data. 
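
As an illustrative sketch only, assuming these internal metrics are scraped by Prometheus (metric names can carry a `_total` suffix depending on how they are exposed, and the thresholds and windows here are placeholders to tune against your reliability targets), sustained failures can be turned into alerts:

```yaml
groups:
  - name: otel-collector-health
    rules:
      - alert: CollectorRefusingSpans
        # Receivers have been refusing spans for 10 minutes.
        expr: rate(otelcol_receiver_refused_spans[5m]) > 0
        for: 10m
      - alert: CollectorExportFailures
        # Exports have been failing for 10 minutes.
        expr: rate(otelcol_exporter_send_failed_spans[5m]) > 0
        for: 10m
```
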
### Data flow From fae47d4e477908a57c3a5faa809286a35997f2bc Mon Sep 17 00:00:00 2001 From: tiffany76 <30397949+tiffany76@users.noreply.github.com> Date: Thu, 23 May 2024 12:57:38 -0700 Subject: [PATCH 04/15] Make small word fixes --- content/en/docs/collector/internal-telemetry.md | 4 ++-- content/en/docs/collector/troubleshooting.md | 2 +- 2 files changed, 3 insertions(+), 3 deletions(-) diff --git a/content/en/docs/collector/internal-telemetry.md b/content/en/docs/collector/internal-telemetry.md index ae44c5f12039..e9e4ba65edec 100644 --- a/content/en/docs/collector/internal-telemetry.md +++ b/content/en/docs/collector/internal-telemetry.md @@ -284,7 +284,7 @@ own telemetry. Use the rate of `otelcol_processor_dropped_spans > 0` and `otelcol_processor_dropped_metric_points > 0` to detect data loss. Depending on -your project's requirements, set up a minimal time window before alerting begins +your project's requirements, select a minimal time window before alerting begins to avoid notifications for small losses that are within the desired reliability range and not considered outages. @@ -327,7 +327,7 @@ These failures could be caused by a queue filled with unsettled elements. You might need to decrease your sending rate or horizontally scale Collectors. The queue/retry mechanism also supports logging for monitoring. Check the logs -for messages like `"Dropping data because sending_queue is full"`. +for messages such as `"Dropping data because sending_queue is full"`. #### Receive failures diff --git a/content/en/docs/collector/troubleshooting.md b/content/en/docs/collector/troubleshooting.md index 5b4bee04bc87..456b07cd0166 100644 --- a/content/en/docs/collector/troubleshooting.md +++ b/content/en/docs/collector/troubleshooting.md @@ -285,7 +285,7 @@ following: ## Common Issues - +This section covers how to identify and resolve common Collector issues. ### Collector exit/restart From 667e0caf1d27bb82611a585417b405157c4ee338 Mon Sep 17 00:00:00 2001 From: tiffany76 <30397949+tiffany76@users.noreply.github.com> Date: Thu, 23 May 2024 13:13:05 -0700 Subject: [PATCH 05/15] Make linter fixes --- content/en/docs/collector/troubleshooting.md | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/content/en/docs/collector/troubleshooting.md b/content/en/docs/collector/troubleshooting.md index 456b07cd0166..65243c449c8c 100644 --- a/content/en/docs/collector/troubleshooting.md +++ b/content/en/docs/collector/troubleshooting.md @@ -117,7 +117,7 @@ extensions: For certain types of issues, particularly verifying configuration and debugging network issues, it can be helpful to send a small amount of data to a collector -configured to output to local logs. +configured to output to local logs. ### Local exporters @@ -169,13 +169,13 @@ contains: With the Collector running, send this payload to the Collector. 
For example: -```console -$ curl -X POST localhost:9411/api/v2/spans -H'Content-Type: application/json' -d @trace.json +```shell +curl -X POST localhost:9411/api/v2/spans -H'Content-Type: application/json' -d @trace.json ``` You should see a log entry like the following from the Collector: -``` +```shell 2023-09-07T09:57:43.468-0700 info TracesExporter {"kind": "exporter", "data_type": "traces", "name": "debug", "resource spans": 1, "spans": 2} ``` @@ -190,7 +190,7 @@ exporters: With the modified configuration if you re-run the test above the log output should look like: -``` +```shell 2023-09-07T09:57:12.820-0700 info TracesExporter {"kind": "exporter", "data_type": "traces", "name": "debug", "resource spans": 1, "spans": 2} 2023-09-07T09:57:12.821-0700 info ResourceSpans #0 Resource SchemaURL: https://opentelemetry.io/schemas/1.4.0 From 9ec03f204e02e42bae5360a7817523b09aa2d6b4 Mon Sep 17 00:00:00 2001 From: tiffany76 <30397949+tiffany76@users.noreply.github.com> Date: Thu, 23 May 2024 13:15:47 -0700 Subject: [PATCH 06/15] Add cSpell ignore words --- content/en/docs/collector/troubleshooting.md | 2 ++ 1 file changed, 2 insertions(+) diff --git a/content/en/docs/collector/troubleshooting.md b/content/en/docs/collector/troubleshooting.md index 65243c449c8c..6d7baf303bc2 100644 --- a/content/en/docs/collector/troubleshooting.md +++ b/content/en/docs/collector/troubleshooting.md @@ -2,6 +2,8 @@ title: Troubleshooting description: Recommendations for troubleshooting the collector weight: 25 +# prettier-ignore +cSpell:ignore: pprof tracez zpages --- This page describes some options when troubleshooting the health or performance From def14d69db83f19fb1b5020d38012b1b95816f23 Mon Sep 17 00:00:00 2001 From: tiffany76 <30397949+tiffany76@users.noreply.github.com> Date: Thu, 23 May 2024 13:18:46 -0700 Subject: [PATCH 07/15] Make one more prettier fix --- content/en/docs/collector/troubleshooting.md | 1 - 1 file changed, 1 deletion(-) diff --git a/content/en/docs/collector/troubleshooting.md b/content/en/docs/collector/troubleshooting.md index 6d7baf303bc2..2e4b86a5d7e2 100644 --- a/content/en/docs/collector/troubleshooting.md +++ b/content/en/docs/collector/troubleshooting.md @@ -2,7 +2,6 @@ title: Troubleshooting description: Recommendations for troubleshooting the collector weight: 25 -# prettier-ignore cSpell:ignore: pprof tracez zpages --- From 42d9ab46fd6a73f8d8d248a7a21599ba992e2142 Mon Sep 17 00:00:00 2001 From: tiffany76 <30397949+tiffany76@users.noreply.github.com> Date: Wed, 29 May 2024 13:50:55 -0700 Subject: [PATCH 08/15] Revert CPU resources section --- .../en/docs/collector/internal-telemetry.md | 26 +++++++++---------- 1 file changed, 12 insertions(+), 14 deletions(-) diff --git a/content/en/docs/collector/internal-telemetry.md b/content/en/docs/collector/internal-telemetry.md index e9e4ba65edec..1616b9c9ea5a 100644 --- a/content/en/docs/collector/internal-telemetry.md +++ b/content/en/docs/collector/internal-telemetry.md @@ -288,20 +288,18 @@ your project's requirements, select a minimal time window before alerting begins to avoid notifications for small losses that are within the desired reliability range and not considered outages. -#### Low CPU resources - -To make sure your Collector is using CPU resources safely during data ingestion, -you need to set: - -- An upper bound on the number of `available_cores`. The metric that tracks - `available_cores` is dependent on your deployment. 
For example, a Kubernetes - deployment offers the - `kube_pod_container_resource_limits{resource="cpu", unit="core"}` metric. -- The maximum ingestion rate per core that is considered safe (`safe_rate`). The - `safe_rate` depends on the specific configuration you use. - -When `(actual_rate/available_cores) < safe_rate`, an alert should be raised and -an increase in resources or instances should be triggered, as appropriate. +#### Low on CPU resources + +This depends on the CPU metrics available on the deployment, eg.: +`kube_pod_container_resource_limits{resource="cpu", unit="core"}` for +Kubernetes. Let's call it `available_cores` below. The idea here is to have an +upper bound of the number of available cores, and the maximum expected ingestion +rate considered safe, let's call it `safe_rate`, per core. This should trigger +increase of resources/ instances (or raise an alert as appropriate) whenever +`(actual_rate/available_cores) < safe_rate`. + +The `safe_rate` depends on the specific configuration being used. // TODO: +Provide reference `safe_rate` for a few selected configurations. ### Secondary monitoring From 9a9bbcd48a32732859c3f28fd475fe027447bbc9 Mon Sep 17 00:00:00 2001 From: tiffany76 <30397949+tiffany76@users.noreply.github.com> Date: Wed, 29 May 2024 18:29:57 -0700 Subject: [PATCH 09/15] Copyedit the troubleshooting page --- content/en/docs/collector/troubleshooting.md | 402 ++++++++++--------- 1 file changed, 218 insertions(+), 184 deletions(-) diff --git a/content/en/docs/collector/troubleshooting.md b/content/en/docs/collector/troubleshooting.md index 2e4b86a5d7e2..8f243e14c1b8 100644 --- a/content/en/docs/collector/troubleshooting.md +++ b/content/en/docs/collector/troubleshooting.md @@ -1,31 +1,135 @@ --- title: Troubleshooting -description: Recommendations for troubleshooting the collector +description: Recommendations for troubleshooting the Collector weight: 25 cSpell:ignore: pprof tracez zpages --- -This page describes some options when troubleshooting the health or performance -of the OpenTelemetry Collector. The Collector provides a variety of metrics, -logs, and extensions for debugging issues. +On this page, you can learn how to troubleshoot the health and performance of +the OpenTelemetry Collector. -## Internal telemetry +## Troubleshooting tools + +The Collector provides a variety of metrics, logs, and extensions for debugging +issues. + +### Internal telemetry You can configure and use the Collector's own [internal telemetry](/docs/collector/internal-telemetry/) to monitor its performance. -## Check available components in the Collector +### Local exporters + +For certain types of issues, such as configuration verification and network +debugging, you can send a small amount of test data to a Collector configured to +output to local logs. Using a +[local exporter](https://github.com/open-telemetry/opentelemetry-collector/tree/main/exporter#general-information), +you can inspect the data being processed by the Collector. + +For live troubleshooting, consider using the +[`debug` exporter](https://github.com/open-telemetry/opentelemetry-collector/blob/main/exporter/debugexporter/README.md), +which can confirm that the Collector is receiving, processing, and exporting +data. For example: + +```yaml +receivers: + zipkin: +exporters: + debug: +service: + pipelines: + traces: + receivers: [zipkin] + processors: [] + exporters: [debug] +``` + +To begin testing, generate a Zipkin payload. 
For example, you can create a file +called `trace.json` that contains: + +```json +[ + { + "traceId": "5982fe77008310cc80f1da5e10147519", + "parentId": "90394f6bcffb5d13", + "id": "67fae42571535f60", + "kind": "SERVER", + "name": "/m/n/2.6.1", + "timestamp": 1516781775726000, + "duration": 26000, + "localEndpoint": { + "serviceName": "api" + }, + "remoteEndpoint": { + "serviceName": "apip" + }, + "tags": { + "data.http_response_code": "201" + } + } +] +``` + +With the Collector running, send this payload to the Collector: + +```shell +curl -X POST localhost:9411/api/v2/spans -H'Content-Type: application/json' -d @trace.json +``` + +You should see a log entry like the following: + +```shell +2023-09-07T09:57:43.468-0700 info TracesExporter {"kind": "exporter", "data_type": "traces", "name": "debug", "resource spans": 1, "spans": 2} +``` + +You can also configure the `debug` exporter so the entire payload is printed: + +```yaml +exporters: + debug: + verbosity: detailed +``` + +If you re-run the previous test with the modified configuration, the log output +looks like this: + +```shell +2023-09-07T09:57:12.820-0700 info TracesExporter {"kind": "exporter", "data_type": "traces", "name": "debug", "resource spans": 1, "spans": 2} +2023-09-07T09:57:12.821-0700 info ResourceSpans #0 +Resource SchemaURL: https://opentelemetry.io/schemas/1.4.0 +Resource attributes: + -> service.name: Str(telemetrygen) +ScopeSpans #0 +ScopeSpans SchemaURL: +InstrumentationScope telemetrygen +Span #0 + Trace ID : 0c636f29e29816ea76e6a5b8cd6601cf + Parent ID : 1a08eba9395c5243 + ID : 10cebe4b63d47cae + Name : okey-dokey + Kind : Internal + Start time : 2023-09-07 16:57:12.045933 +0000 UTC + End time : 2023-09-07 16:57:12.046058 +0000 UTC + Status code : Unset + Status message : +Attributes: + -> span.kind: Str(server) + -> net.peer.ip: Str(1.2.3.4) + -> peer.service: Str(telemetrygen) +``` + +### Check Collector components Use the following sub-command to list the available components in a Collector distribution, including their stability levels. Please note that the output -format may change across versions. +format might change across versions. -```sh +```shell otelcol components ``` -Sample output +Sample output: ```yaml buildinfo: @@ -114,116 +218,16 @@ extensions: extension: Beta ``` -## Sending test data +### Extensions -For certain types of issues, particularly verifying configuration and debugging -network issues, it can be helpful to send a small amount of data to a collector -configured to output to local logs. +Here is a list of extensions you can enable for debugging the Collector. -### Local exporters - -[Local exporters](https://github.com/open-telemetry/opentelemetry-collector/tree/main/exporter#general-information) -can be configured to inspect the data being processed by the Collector. - -For live troubleshooting purposes consider leveraging the `debug` exporter, -which can be used to confirm that data is being received, processed and exported -by the Collector. - -```yaml -receivers: - zipkin: -exporters: - debug: -service: - pipelines: - traces: - receivers: [zipkin] - processors: [] - exporters: [debug] -``` - -Get a Zipkin payload to test. 
For example create a file called `trace.json` that -contains: - -```json -[ - { - "traceId": "5982fe77008310cc80f1da5e10147519", - "parentId": "90394f6bcffb5d13", - "id": "67fae42571535f60", - "kind": "SERVER", - "name": "/m/n/2.6.1", - "timestamp": 1516781775726000, - "duration": 26000, - "localEndpoint": { - "serviceName": "api" - }, - "remoteEndpoint": { - "serviceName": "apip" - }, - "tags": { - "data.http_response_code": "201" - } - } -] -``` - -With the Collector running, send this payload to the Collector. For example: - -```shell -curl -X POST localhost:9411/api/v2/spans -H'Content-Type: application/json' -d @trace.json -``` - -You should see a log entry like the following from the Collector: - -```shell -2023-09-07T09:57:43.468-0700 info TracesExporter {"kind": "exporter", "data_type": "traces", "name": "debug", "resource spans": 1, "spans": 2} -``` - -You can also configure the `debug` exporter so the entire payload is printed: - -```yaml -exporters: - debug: - verbosity: detailed -``` - -With the modified configuration if you re-run the test above the log output -should look like: - -```shell -2023-09-07T09:57:12.820-0700 info TracesExporter {"kind": "exporter", "data_type": "traces", "name": "debug", "resource spans": 1, "spans": 2} -2023-09-07T09:57:12.821-0700 info ResourceSpans #0 -Resource SchemaURL: https://opentelemetry.io/schemas/1.4.0 -Resource attributes: - -> service.name: Str(telemetrygen) -ScopeSpans #0 -ScopeSpans SchemaURL: -InstrumentationScope telemetrygen -Span #0 - Trace ID : 0c636f29e29816ea76e6a5b8cd6601cf - Parent ID : 1a08eba9395c5243 - ID : 10cebe4b63d47cae - Name : okey-dokey - Kind : Internal - Start time : 2023-09-07 16:57:12.045933 +0000 UTC - End time : 2023-09-07 16:57:12.046058 +0000 UTC - Status code : Unset - Status message : -Attributes: - -> span.kind: Str(server) - -> net.peer.ip: Str(1.2.3.4) - -> peer.service: Str(telemetrygen) -``` - -## Extensions useful for troubleshooting - -### Health Check +#### Health Check The -[health_check](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/extension/healthcheckextension/README.md) -extension, which by default is available on all interfaces on port `13133`, can -be used to ensure the Collector is functioning properly. +[Health Check extension](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/extension/healthcheckextension/README.md), +which by default is available on all interfaces on port `13133`, can be used to +ensure the Collector is functioning properly. For example: ```yaml extensions: @@ -242,25 +246,44 @@ It returns a response like the following: } ``` -### pprof +{{% alert title="Caution" color="warning" %}} + +The optional `health_check` configuration setting, `check_collector_pipeline`, +is not working as expected. Avoid using this feature. Efforts are underway to +create a new version of the Health Check extension that relies on individual +component statuses. The extension's configuration remains unchanged until this +replacement is available. + +{{% /alert %}} + +#### Performance Profiler (pprof) The -[pprof](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/extension/pprofextension/README.md) -extension, which by default is available locally on port `1777`, allows you to -profile the Collector as it runs. This is an advanced use-case that should not -be needed in most circumstances. 
+[pprof extension](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/extension/pprofextension/README.md), +which is available locally on port `1777`, allows you to profile the Collector +as it runs. This is an advanced use-case that should not be needed in most +circumstances. -### zPages +#### zPages The -[zpages](https://github.com/open-telemetry/opentelemetry-collector/tree/main/extension/zpagesextension/README.md) -extension, which if enabled is exposed locally on port `55679`, can be used to -check receivers and exporters trace operations via `/debug/tracez`. `zpages` may -contain error logs that the Collector does not emit. +[zPages extension](https://github.com/open-telemetry/opentelemetry-collector/tree/main/extension/zpagesextension/README.md), +which is exposed locally on port `55679`, can be used to inspect live data from +the Collector's receivers and exporters. + +The TraceZ page, exposed at `/debug/tracez`, is useful for debugging trace +operations, such as: + +- Latency issues. Find the slow parts of an application. +- Deadlocks and instrumentation problems. Identify running spans that don't end. +- Errors. Determine what types of errors are occurring and where they happen. -For containerized environments it may be desirable to expose this port on a -public interface instead of just locally. This can be configured via the -extensions configuration section. For example: +Note that `zpages` might contain error logs that the Collector does not emit +itself. + +For containerized environments, you might want to expose this port on a public +interface instead of just locally. The `endpoint` can be configured using the +`extensions` configuration section: ```yaml extensions: @@ -271,91 +294,102 @@ extensions: ## Checklist for debugging complex pipelines It can be difficult to isolate problems when telemetry flows through multiple -collectors and networks. For each "hop" of telemetry data through a collector or -other component in your telemetry pipeline, it’s important to verify the -following: +Collectors and networks. For each "hop" of telemetry through a Collector or +other component in your pipeline, it’s important to verify the following: -- Are there error messages in the logs of the collector? +- Are there error messages in the logs of the Collector? - How is the telemetry being ingested into this component? -- How is the telemetry being modified (i.e. sampling, redacting) by this - component? +- How is the telemetry being modified (for example, sampling or redacting) by + this component? - How is the telemetry being exported from this component? - What format is the telemetry in? - How is the next hop configured? - Are there any network policies that prevent data from getting in or out? -## Common Issues - -This section covers how to identify and resolve common Collector issues. +## Common Collector issues -### Collector exit/restart +This section covers how to resolve common Collector issues. -The Collector may exit/restart because: +### Collector is experiencing data issues -- Memory pressure due to missing or misconfigured - [memory_limiter](https://github.com/open-telemetry/opentelemetry-collector/blob/main/processor/memorylimiterprocessor/README.md) - processor. -- Improperly sized for load. -- Improperly configured (for example, a queue size configured higher than - available memory). -- Infrastructure resource limits (for example Kubernetes). +The Collector and its components might experience data issues. 
-### Data being dropped +#### Collector is dropping data -Data may be dropped for a variety of reasons, but most commonly because of an: +The Collector might drop data for a variety of reasons, but the most common are: -- Improperly sized Collector resulting in Collector being unable to process and +- The Collector is improperly sized, resulting in an inability to process and export the data as fast as it is received. -- Exporter destination unavailable or accepting the data too slowly. +- The exporter destination is unavailable or accepting the data too slowly. -To mitigate drops, it is highly recommended to configure the -[batch](https://github.com/open-telemetry/opentelemetry-collector/blob/main/processor/batchprocessor/README.md) -processor. In addition, it may be necessary to configure the +To mitigate drops, configure the +[`batch` processor](https://github.com/open-telemetry/opentelemetry-collector/blob/main/processor/batchprocessor/README.md). +In addition, it might be necessary to configure the [queued retry options](https://github.com/open-telemetry/opentelemetry-collector/tree/main/exporter/exporterhelper#configuration) on enabled exporters. -### Receiving data not working - -If you are unable to receive data then this is likely because either: +#### Collector is not receiving data -- There is a network configuration issue -- The receiver configuration is incorrect -- The receiver is defined in the `receivers` section, but not enabled in any - `pipelines` -- The client configuration is incorrect +The Collector might not receive data for the following reasons: -Check the Collector logs as well as `zpages` for potential issues. +- A network configuration issue. +- An incorrect receiver configuration. +- An incorrect client configuration. +- The receiver is defined in the `receivers` section but not enabled in any + `pipelines`. -### Processing data not working +Check the Collector's +[logs](/docs/collector/internal-telemetry/#configure-internal-logs) as well as +[zPages](https://github.com/open-telemetry/opentelemetry-collector/blob/main/extension/zpagesextension/README.md) +for potential issues. -Most processing issues are a result of either a misunderstanding of how the -processor works or a misconfiguration of the processor. +#### Collector is not processing data -Examples of misunderstanding include: +Most processing issues result from of a misunderstanding of how the processor +works or a misconfiguration of the processor. For example: -- The attributes processors only work for "tags" on spans. Span name is handled - by the span processor. -- Processors for trace data (except tail sampling) work on individual spans. +- The attributes processor works only for "tags" on spans. The span name is + handled by the span processor. +- Processors for trace data (except tail sampling) work only on individual + spans. -### Exporting data not working +#### Collector is not exporting data -If you are unable to export to a destination then this is likely because either: +The Collector might not export data for the following reasons: -- There is a network configuration issue -- The exporter configuration is incorrect -- The destination is unavailable +- A network configuration issue. +- An incorrect exporter configuration. +- The destination is unavailable. -Check the collector logs as well as `zpages` for potential issues. 
+Check the Collector's +[logs](/docs/collector/internal-telemetry/#configure-internal-logs) as well as +[zPages](https://github.com/open-telemetry/opentelemetry-collector/blob/main/extension/zpagesextension/README.md) +for potential issues. -More often than not, exporting data does not work because of a network -configuration issue. This could be due to a firewall, DNS, or proxy issue. Note -that the Collector does have +Exporting data often does not work because of a network configuration issue, +such as a firewall, DNS, or proxy issue. Note that the Collector does have [proxy support](https://github.com/open-telemetry/opentelemetry-collector/tree/main/exporter#proxy-support). -### Startup failing in Windows Docker containers (v0.90.1 and earlier) +### Collector is experiencing control issues + +The Collector might experience failed startups or unexpected exits or restarts. + +#### Collector exits or restarts + +The Collector might exit or restart due to: + +- Memory pressure from a missing or misconfigured + [`memory_limiter` processor](https://github.com/open-telemetry/opentelemetry-collector/blob/main/processor/memorylimiterprocessor/README.md). +- Improper sizing for load. +- Improper configuration. For example, a queue size configured higher than + available memory. +- Infrastructure resource limits. For example, Kubernetes. + +#### Collector fails to start in Windows Docker containers -The process may fail to start in a Windows Docker container with the following -error: `The service process could not connect to the service controller`. In -this case the `NO_WINDOWS_SERVICE=1` environment variable should be set to force -the collector to be started as if it were running in an interactive terminal, -without attempting to run as a Windows service. +With v0.90.1 and earlier, the Collector might fail to start in a Windows Docker +container, producing the error message +`The service process could not connect to the service controller`. In this case, +the `NO_WINDOWS_SERVICE=1` environment variable must be set to force the +Collector to start as if it were running in an interactive terminal, without +attempting to run as a Windows service. From ee3aa1eac6f4af451860bddca124da89eb461531 Mon Sep 17 00:00:00 2001 From: tiffany76 <30397949+tiffany76@users.noreply.github.com> Date: Wed, 29 May 2024 18:36:43 -0700 Subject: [PATCH 10/15] Make more text edits to internal telemetry page --- content/en/docs/collector/internal-telemetry.md | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/content/en/docs/collector/internal-telemetry.md b/content/en/docs/collector/internal-telemetry.md index 1616b9c9ea5a..070dc0515223 100644 --- a/content/en/docs/collector/internal-telemetry.md +++ b/content/en/docs/collector/internal-telemetry.md @@ -284,7 +284,7 @@ own telemetry. Use the rate of `otelcol_processor_dropped_spans > 0` and `otelcol_processor_dropped_metric_points > 0` to detect data loss. Depending on -your project's requirements, select a minimal time window before alerting begins +your project's requirements, select a narrow time window before alerting begins to avoid notifications for small losses that are within the desired reliability range and not considered outages. @@ -314,8 +314,8 @@ of the retry queue. The `otelcol_exporter_queue_size` metric indicates the current size of the retry queue. Use these two metrics to check if the queue capacity can support your workload. 
-Using the following three metrics, you can identify the number of spans/metric -points/log records that failed to reach the sending queue: +Using the following three metrics, you can identify the number of spans, metric +points, and log records that failed to reach the sending queue: - `otelcol_exporter_enqueue_failed_spans` - `otelcol_exporter_enqueue_failed_metric_points` @@ -325,7 +325,7 @@ These failures could be caused by a queue filled with unsettled elements. You might need to decrease your sending rate or horizontally scale Collectors. The queue/retry mechanism also supports logging for monitoring. Check the logs -for messages such as `"Dropping data because sending_queue is full"`. +for messages such as `Dropping data because sending_queue is full`. #### Receive failures @@ -340,7 +340,7 @@ able to export data as expected. These metrics do not inherently imply data loss since there could be retries. But a high rate of failures could indicate issues with the network or backend receiving the data. -### Data flow +#### Data flow You can monitor data ingress with the `otelcol_receiver_accepted_spans` and `otelcol_receiver_accepted_metric_points` metrics and data egress with the From 5904d2160e5f4f6588929df44a7b94fa62bf62a6 Mon Sep 17 00:00:00 2001 From: Tiffany Hrabusa <30397949+tiffany76@users.noreply.github.com> Date: Mon, 3 Jun 2024 14:51:22 -0700 Subject: [PATCH 11/15] Apply suggestions from Fabrizio's review Co-authored-by: Fabrizio Ferri-Benedetti --- content/en/docs/collector/internal-telemetry.md | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/content/en/docs/collector/internal-telemetry.md b/content/en/docs/collector/internal-telemetry.md index 070dc0515223..b1a95067ba94 100644 --- a/content/en/docs/collector/internal-telemetry.md +++ b/content/en/docs/collector/internal-telemetry.md @@ -292,7 +292,7 @@ range and not considered outages. This depends on the CPU metrics available on the deployment, eg.: `kube_pod_container_resource_limits{resource="cpu", unit="core"}` for -Kubernetes. Let's call it `available_cores` below. The idea here is to have an +Kubernetes. Let's call it `available_cores`. The idea here is to have an upper bound of the number of available cores, and the maximum expected ingestion rate considered safe, let's call it `safe_rate`, per core. This should trigger increase of resources/ instances (or raise an alert as appropriate) whenever @@ -305,8 +305,8 @@ Provide reference `safe_rate` for a few selected configurations. #### Queue length -Most exporters offer a -[queue/retry mechanism](https://github.com/open-telemetry/opentelemetry-collector/blob/main/exporter/exporterhelper/README.md) +Most exporters provide a +[queue or retry mechanism](https://github.com/open-telemetry/opentelemetry-collector/blob/main/exporter/exporterhelper/README.md) that is recommended for use in any production deployment of the Collector. The `otelcol_exporter_queue_capacity` metric indicates the capacity, in batches, @@ -324,8 +324,8 @@ points, and log records that failed to reach the sending queue: These failures could be caused by a queue filled with unsettled elements. You might need to decrease your sending rate or horizontally scale Collectors. -The queue/retry mechanism also supports logging for monitoring. Check the logs -for messages such as `Dropping data because sending_queue is full`. +The queue or retry mechanism also supports logging for monitoring. Check the +logs for messages such as `Dropping data because sending_queue is full`. 
#### Receive failures From 9c244d938e4a095a3b156103771d5e7af75a8367 Mon Sep 17 00:00:00 2001 From: opentelemetrybot <107717825+opentelemetrybot@users.noreply.github.com> Date: Mon, 3 Jun 2024 21:56:37 +0000 Subject: [PATCH 12/15] Results from /fix:format --- content/en/docs/collector/internal-telemetry.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/content/en/docs/collector/internal-telemetry.md b/content/en/docs/collector/internal-telemetry.md index ad33cc5d02e1..76d31157b105 100644 --- a/content/en/docs/collector/internal-telemetry.md +++ b/content/en/docs/collector/internal-telemetry.md @@ -292,9 +292,9 @@ range and not considered outages. This depends on the CPU metrics available on the deployment, eg.: `kube_pod_container_resource_limits{resource="cpu", unit="core"}` for -Kubernetes. Let's call it `available_cores`. The idea here is to have an -upper bound of the number of available cores, and the maximum expected ingestion -rate considered safe, let's call it `safe_rate`, per core. This should trigger +Kubernetes. Let's call it `available_cores`. The idea here is to have an upper +bound of the number of available cores, and the maximum expected ingestion rate +considered safe, let's call it `safe_rate`, per core. This should trigger increase of resources/ instances (or raise an alert as appropriate) whenever `(actual_rate/available_cores) < safe_rate`. From aa21f3195338c9fa61a5964ac0ffab1c1edced84 Mon Sep 17 00:00:00 2001 From: tiffany76 <30397949+tiffany76@users.noreply.github.com> Date: Tue, 4 Jun 2024 15:40:54 -0700 Subject: [PATCH 13/15] Make small wording and link fixes --- content/en/docs/collector/internal-telemetry.md | 13 +++++++------ 1 file changed, 7 insertions(+), 6 deletions(-) diff --git a/content/en/docs/collector/internal-telemetry.md b/content/en/docs/collector/internal-telemetry.md index 76d31157b105..6ce1dee0a61b 100644 --- a/content/en/docs/collector/internal-telemetry.md +++ b/content/en/docs/collector/internal-telemetry.md @@ -5,10 +5,11 @@ weight: 25 cSpell:ignore: alloc journalctl kube otecol pprof tracez underperforming zpages --- -You can monitor the health of any OpenTelemetry Collector instance by checking +You can inspect the health of any OpenTelemetry Collector instance by checking its own internal telemetry. Read on to learn about this telemetry and how to -configure it to help you [troubleshoot](/docs/collector/troubleshooting/) -Collector issues. +configure it to help you +[monitor](#use-internal-telemetry-to-monitor-the-collector) and +[troubleshoot](/docs/collector/troubleshooting/) the Collector. ## Activate internal telemetry in the Collector @@ -97,9 +98,9 @@ critical analysis. ### Configure internal logs Log output is found in `stderr`. You can configure logs in the config -`service::telemetry::logs`. The [configuration -options](https://github.com/open-telemetry/opentelemetry-collector/blob/v{{% param -vers %}}/service/telemetry/config.go) are: +`service::telemetry::logs`. 
The +[configuration options](https://github.com/open-telemetry/opentelemetry-collector/blob/main/service/telemetry/config.go) +are: | Field name | Default value | Description | | ---------------------- | ------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | From 27bd47b581fd17b37caf908f0636ce4eb09439ca Mon Sep 17 00:00:00 2001 From: tiffany76 <30397949+tiffany76@users.noreply.github.com> Date: Tue, 11 Jun 2024 12:07:48 -0700 Subject: [PATCH 14/15] Remove Health Check extension section --- content/en/docs/collector/troubleshooting.md | 34 -------------------- 1 file changed, 34 deletions(-) diff --git a/content/en/docs/collector/troubleshooting.md b/content/en/docs/collector/troubleshooting.md index 8f243e14c1b8..e48030b648fb 100644 --- a/content/en/docs/collector/troubleshooting.md +++ b/content/en/docs/collector/troubleshooting.md @@ -222,40 +222,6 @@ extensions: Here is a list of extensions you can enable for debugging the Collector. -#### Health Check - -The -[Health Check extension](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/extension/healthcheckextension/README.md), -which by default is available on all interfaces on port `13133`, can be used to -ensure the Collector is functioning properly. For example: - -```yaml -extensions: - health_check: -service: - extensions: [health_check] -``` - -It returns a response like the following: - -```json -{ - "status": "Server available", - "upSince": "2020-11-11T04:12:31.6847174Z", - "uptime": "49.0132518s" -} -``` - -{{% alert title="Caution" color="warning" %}} - -The optional `health_check` configuration setting, `check_collector_pipeline`, -is not working as expected. Avoid using this feature. Efforts are underway to -create a new version of the Health Check extension that relies on individual -component statuses. The extension's configuration remains unchanged until this -replacement is available. - -{{% /alert %}} - #### Performance Profiler (pprof) The From 878c850a3ca8e845ae3f06f6cc1ccbf29e9cd709 Mon Sep 17 00:00:00 2001 From: tiffany76 <30397949+tiffany76@users.noreply.github.com> Date: Thu, 13 Jun 2024 15:39:48 -0700 Subject: [PATCH 15/15] Remove CPU monitoring section --- content/en/docs/collector/internal-telemetry.md | 13 ------------- 1 file changed, 13 deletions(-) diff --git a/content/en/docs/collector/internal-telemetry.md b/content/en/docs/collector/internal-telemetry.md index 6ce1dee0a61b..b54a555eca02 100644 --- a/content/en/docs/collector/internal-telemetry.md +++ b/content/en/docs/collector/internal-telemetry.md @@ -289,19 +289,6 @@ your project's requirements, select a narrow time window before alerting begins to avoid notifications for small losses that are within the desired reliability range and not considered outages. -#### Low on CPU resources - -This depends on the CPU metrics available on the deployment, eg.: -`kube_pod_container_resource_limits{resource="cpu", unit="core"}` for -Kubernetes. Let's call it `available_cores`. The idea here is to have an upper -bound of the number of available cores, and the maximum expected ingestion rate -considered safe, let's call it `safe_rate`, per core. This should trigger -increase of resources/ instances (or raise an alert as appropriate) whenever -`(actual_rate/available_cores) < safe_rate`. 
- -The `safe_rate` depends on the specific configuration being used. // TODO: -Provide reference `safe_rate` for a few selected configurations. - ### Secondary monitoring #### Queue length