Fix rejection ratio alert (#508)
OpenSearch does not have a thread pool named "bulk". However, there is
one named "write", which is used both for bulk requests and for writing
a single document.

This PR adds the logic needed to use the OpenSearch data coming from
the exporter. Because the expression is large, it is split into
smaller expressions using recording rules (`record`).

The original rule that inspired this alert can be found in this
[repo](https://github.com/lukas-vlcek/prometheus-elasticsearch-rules/blob/master/logging_elasticsearch.rules.yaml).

Fix: #503
gabrielcocenza authored Nov 29, 2024
1 parent e225961 commit 85c7868
Showing 1 changed file with 17 additions and 4 deletions.
21 changes: 17 additions & 4 deletions src/alert_rules/prometheus/prometheus_alerts.yaml
@@ -2,6 +2,19 @@
 - "name": "opensearch.alerts"
   "rules":

+    # Write requests rates
+    # =====================
+    - record: write:rejected_requests:rate2m
+      expr: sum by (cluster, instance, node) (rate(opensearch_threadpool_threads_count{name="write", type="rejected"}[2m]))
+
+    - record: write:total_requests:rate2m
+      expr: sum by (cluster, instance, node) (rate(opensearch_threadpool_threads_count{name="write"}[2m]))
+
+    # If there are no write rejections then we can get 0/0, which is NaN. This does not affect the
+    # OpenSearchWriteRequestsRejectionJumps alert
+    - record: write:reject_ratio:rate2m
+      expr: write:rejected_requests:rate2m / write:total_requests:rate2m
+
     - "alert": "OpenSearchScrapeFailed"
       "annotations":
         "message": "Scrape on {{ $labels.juju_unit }} failed. Ensure that the OpenSearch systemd service is healthy and that the unit is part of the cluster."
@@ -32,12 +45,12 @@
       "labels":
         "severity": "warning"

-    - "alert": "OpenSearchBulkRequestsRejectionJumps"
+    - "alert": "OpenSearchWriteRequestsRejectionJumps"
       "annotations":
-        "message": "High Bulk Rejection Ratio at {{ $labels.node }} node in {{ $labels.cluster }} cluster. This node may not be keeping up with the indexing speed."
-        "summary": "High Bulk Rejection Ratio - {{ $value }}%"
+        "message": "High Write Rejection Ratio at {{ $labels.node }} node in {{ $labels.cluster }} cluster. This node may not be keeping up with the indexing speed."
+        "summary": "High Write Rejection Ratio - {{ $value }}%"
       "expr": |
-        round( bulk:reject_ratio:rate2m * 100, 0.001 ) > 5
+        round( write:reject_ratio:rate2m * 100, 0.001 ) > 5
       "for": "10m"
       "labels":
         "severity": "warning"
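
The comment in the diff says a NaN ratio does not affect the alert. A minimal Python sketch of why that likely holds is shown here: it mirrors the arithmetic of `write:reject_ratio:rate2m` and the alert expression, and it is not part of the commit. The function names `reject_ratio` and `alert_condition` are hypothetical, and Python's `round(x, 3)` is only an approximation of PromQL's `round(v, 0.001)`.

```python
import math


def reject_ratio(rejected_rate: float, total_rate: float) -> float:
    """Mimic PromQL float division, where 0/0 yields NaN instead of an error."""
    if total_rate == 0:
        if rejected_rate == 0:
            return math.nan
        return math.copysign(math.inf, rejected_rate)
    return rejected_rate / total_rate


def alert_condition(ratio: float, threshold_percent: float = 5.0) -> bool:
    """Approximate `round(ratio * 100, 0.001) > 5` from the alert expression."""
    percent = round(ratio * 100, 3)
    # Any ordering comparison with NaN is False, so a NaN ratio never matches.
    return percent > threshold_percent


if __name__ == "__main__":
    print(alert_condition(reject_ratio(0.0, 0.0)))   # both rates zero: NaN ratio
    print(alert_condition(reject_ratio(0.2, 10.0)))  # 2% of requests rejected
    print(alert_condition(reject_ratio(1.0, 10.0)))  # 10% of requests rejected
```

Under these assumptions only the last case satisfies the condition. In Prometheus itself, the condition would additionally have to hold for the rule's `for: 10m` window before the alert fires.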
