OpenTelemetry Collector Codebundle (#406)
* implementation

* script cleanup and genrule

* Update nextsteps
jon-funk authored Jul 25, 2024
1 parent 7180b6d commit c34ec19
Showing 8 changed files with 299 additions and 0 deletions.
@@ -0,0 +1,28 @@
apiVersion: runwhen.com/v1
kind: GenerationRules
spec:
  generationRules:
    - resourceTypes:
        - deployment
        - daemonset
        - statefulset
      matchRules:
        - type: and
          matches:
            - type: pattern
              pattern: "opentelemetry-collector"
              properties: [label-values]
              mode: substring
            - type: pattern
              pattern: "col"
              properties: [name]
              mode: substring
      slxs:
        - baseName: k8s-otelcollector
          levelOfDetail: detailed
          qualifiers: ["resource", "namespace", "cluster"]
          baseTemplateName: k8s-otelcollector
          outputItems:
            - type: slx
            - type: runbook
              templateName: k8s-otelcollector-taskset.yaml
@@ -0,0 +1,38 @@
apiVersion: runwhen.com/v1
kind: Runbook
metadata:
  name: {{slx_name}}
  labels:
    {% include "common-labels.yaml" %}
  annotations:
    {% include "common-annotations.yaml" %}
spec:
  location: {{default_location}}
  codeBundle:
    {% if repo_url %}
    repoUrl: {{repo_url}}
    {% else %}
    repoUrl: https://github.com/runwhen-contrib/rw-cli-codecollection.git
    {% endif %}
    {% if ref %}
    ref: {{ref}}
    {% else %}
    ref: main
    {% endif %}
    pathToRobot: codebundles/k8s-otelcollector/runbook.robot
  configProvided:
    - name: KUBERNETES_DISTRIBUTION_BINARY
      value: {{custom.kubernetes_distribution_binary}}
    - name: NAMESPACE
      value: {{match_resource.resource.metadata.namespace}}
    - name: CONTEXT
      value: {{context}}
    - name: WORKLOAD_NAME
      value: {{match_resource.resource.kind}}/{{match_resource.resource.metadata.name}}
    - name: WORKLOAD_SERVICE
      value: {{match_resource.resource.metadata.name}}
    - name: METRICS_PORT
      value: 8888
  secretsProvided:
    - name: kubeconfig
      workspaceKey: {{custom.kubeconfig_secret_name}}
@@ -0,0 +1,23 @@
apiVersion: runwhen.com/v1
kind: ServiceLevelX
metadata:
  name: {{slx_name}}
  labels:
    {% include "common-labels.yaml" %}
  annotations:
    {% include "common-annotations.yaml" %}
spec:
  imageURL: https://storage.googleapis.com/runwhen-nonprod-shared-images/icons/jaeger_tracing.svg
  alias: OTEL Collector Health for Namespace {{match_resource.resource.metadata.namespace}}
  asMeasuredBy: None
  configProvided:
    - name: OBJECT_NAME
      value: {{match_resource.resource.metadata.name}}
  owners:
    - {{workspace.owner_email}}
  statement: OTEL Collector {{match_resource.resource.metadata.name}} should not have large queues or error logs.
  additionalContext:
    namespace: "{{match_resource.resource.metadata.namespace}}"
    labelMap: "{{match_resource.resource.metadata.labels}}"
    cluster: "{{ cluster.name }}"
    context: "{{ cluster.context }}"
31 changes: 31 additions & 0 deletions codebundles/k8s-otelcollector/README.md
@@ -0,0 +1,31 @@
# Kubernetes OpenTelemetry Health Check
Checks the OpenTelemetry (OTEL) collector's logs and metrics for signs of poor health, such as large export queues, dropped spans, or errors.

Note: if you have trouble reaching the collector's metrics endpoint, set `WORKLOAD_NAME` to another workload in the namespace that can reach the collector service.

## Tasks
`Scan OpenTelemetry Logs For Dropped Spans In Namespace`

`Check OpenTelemetry Collector Logs For Errors In Namespace`

`Query Collector Queued Spans in Namespace`
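
The `Query Collector Queued Spans in Namespace` task compares each `otelcol_exporter_queue_size` sample against a threshold of 500. A minimal offline sketch of that comparison follows. The helper name `check_queue_sizes` and the sample metrics text are illustrative assumptions, not part of the codebundle; real input would come from the collector's `/metrics` endpoint.

```shell
# Hypothetical helper: reads Prometheus-format text on stdin and flags any
# otelcol_exporter_queue_size sample above the given threshold.
check_queue_sizes() {
  local threshold="$1" rv=0 line value
  while IFS= read -r line; do
    # Only consider queue-size samples; skip comments and other metrics.
    case "$line" in
      'otelcol_exporter_queue_size{'*) ;;
      *) continue ;;
    esac
    # The sample value is the last whitespace-separated field.
    value=$(printf '%s\n' "$line" | awk '{print $NF}')
    # awk compares numerically, so float or exponent values also work.
    if awk -v v="$value" -v t="$threshold" 'BEGIN { exit !(v + 0 > t + 0) }'; then
      echo "queued spans ($value) exceeds threshold ($threshold)"
      rv=1
    fi
  done
  return "$rv"
}

# Made-up sample input for illustration only.
sample='# TYPE otelcol_exporter_queue_size gauge
otelcol_exporter_queue_size{exporter="otlp"} 12
otelcol_exporter_queue_size{exporter="debug"} 750'

if printf '%s\n' "$sample" | check_queue_sizes 500; then
  echo "queues OK"
else
  echo "queue threshold exceeded"
fi
```

Doing the comparison in `awk` rather than with `[ -gt ]` is a design choice for this sketch: it tolerates non-integer sample values and simply ignores input that contains no queue-size lines.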

## Configuration
The TaskSet requires initialization to import necessary secrets, services, and user variables. The following variables should be set:

- `kubeconfig`: The kubeconfig secret containing access info for the cluster.
- `KUBERNETES_DISTRIBUTION_BINARY`: Which binary to use for Kubernetes CLI commands. Default value is `kubectl`.
- `CONTEXT`: The Kubernetes context to operate within.
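- `NAMESPACE`: The namespace that contains the collector. The scripts pass it to `-n`, so it must be set.
- `WORKLOAD_SERVICE`: The service name to curl for metrics; it is also used to fetch logs.
- `WORKLOAD_NAME`: The workload to exec into when requesting metrics, for example `deployment/otel-demo-otelcol`.
- `METRICS_PORT`: The port on which the collector serves its metrics. Default value is `8888`.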
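
When running the check scripts by hand, outside the platform, it can help to confirm that these variables are exported first. The sketch below is one way to do that. The `require_env` helper is hypothetical, and the values are the examples from the runbook's variable descriptions, not real cluster details.

```shell
# Hypothetical pre-flight helper: prints the name of each required variable
# that is missing from the environment (or empty) and returns non-zero if any are.
require_env() {
  local missing=0 name
  for name in "$@"; do
    # printenv only sees exported variables, which is what the scripts receive.
    if [ -z "$(printenv "$name")" ]; then
      echo "missing: $name"
      missing=1
    fi
  done
  return "$missing"
}

# Example values only; substitute your own cluster details.
export CONTEXT="my-main-cluster"
export NAMESPACE="my-namespace"
export WORKLOAD_SERVICE="otel-demo-otelcol"
export WORKLOAD_NAME="deployment/otel-demo-otelcol"
export METRICS_PORT="8888"

if require_env CONTEXT NAMESPACE WORKLOAD_SERVICE WORKLOAD_NAME METRICS_PORT; then
  echo "environment ready; a check such as ./otel_metrics_check.sh could run now"
fi
```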


## Requirements
- A kubeconfig with RBAC permissions to read logs and exec into workloads in the target namespace.

## TODO
- [ ] Consider additional tasks

16 changes: 16 additions & 0 deletions codebundles/k8s-otelcollector/otel_dropped_check.sh
@@ -0,0 +1,16 @@
#!/bin/bash

# ENV:
# CONTEXT
# NAMESPACE
# METRICS_PORT
# WORKLOAD_NAME
# WORKLOAD_SERVICE
since=60m
# Quote variables so empty or unusual values do not split into extra arguments.
output=$(kubectl --context "$CONTEXT" -n "$NAMESPACE" logs "service/$WORKLOAD_SERVICE" --since="$since" --all-containers=true | grep -A 20 dropped)
if [ -n "$output" ]; then
    echo -E "Dropped Spans Found:"
    echo -E "$output"
    exit 1
fi
exit 0
16 changes: 16 additions & 0 deletions codebundles/k8s-otelcollector/otel_error_check.sh
@@ -0,0 +1,16 @@
#!/bin/bash

# ENV:
# CONTEXT
# NAMESPACE
# METRICS_PORT
# WORKLOAD_NAME
# WORKLOAD_SERVICE
since=60m
# Quote variables so empty or unusual values do not split into extra arguments.
output=$(kubectl --context "$CONTEXT" -n "$NAMESPACE" logs "service/$WORKLOAD_SERVICE" --since="$since" --all-containers=true | grep error)
if [ -n "$output" ]; then
    echo -E "Error(s) Found:"
    echo -E "$output"
    exit 1
fi
exit 0
23 changes: 23 additions & 0 deletions codebundles/k8s-otelcollector/otel_metrics_check.sh
@@ -0,0 +1,23 @@
#!/bin/bash

# ENV:
# CONTEXT
# NAMESPACE
# METRICS_PORT
# WORKLOAD_NAME
# WORKLOAD_SERVICE

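THRESHOLD=500
rv=0
# Fetch the collector's metrics by running curl inside the chosen workload.
metrics=$(kubectl --context "$CONTEXT" -n "$NAMESPACE" exec "$WORKLOAD_NAME" -- curl -s "$WORKLOAD_SERVICE:$METRICS_PORT/metrics")
queued_spans=$(echo -E "$metrics" | grep "otelcol_exporter_queue_size{")
while IFS= read -r line; do
    # Skip the empty line produced when no queue-size metric was found.
    [ -z "$line" ] && continue
    echo "$line"
    # The sample value is the last field; convert it to an integer for the -gt test.
    value=$(echo "$line" | awk '{printf "%d", $NF}')
    if [ "$value" -gt "$THRESHOLD" ]; then
        echo "Error: queued spans ($value) exceeds threshold ($THRESHOLD)"
        rv=1
    fi
done <<< "$queued_spans"
exit $rv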
124 changes: 124 additions & 0 deletions codebundles/k8s-otelcollector/runbook.robot
@@ -0,0 +1,124 @@
*** Settings ***
Documentation This taskset performs diagnostic checks on an OpenTelemetry Collector to ensure it is exporting telemetry without large queues, errors, or dropped spans.
Metadata Author jon-funk
Metadata Display Name K8s OpenTelemetry Collector Health
Metadata Supports GKE EKS AKS Kubernetes OpenTelemetry otel collector

Library BuiltIn
Library RW.Core
Library RW.CLI
Library RW.platform

Suite Setup Suite Initialization


*** Tasks ***
Query Collector Queued Spans in Namespace `${NAMESPACE}`
[Documentation] Query the collector metrics endpoint and inspect queue size
[Tags] otel collector metrics queued back pressure
${process}= RW.CLI.Run Bash File
... bash_file=otel_metrics_check.sh
... env=${env}
... secret_file__kubeconfig=${kubeconfig}
... timeout_seconds=180
... include_in_history=false
IF ${process.returncode} > 0
RW.Core.Add Issue title=OpenTelemetry Span Queue Growing
... severity=3
... next_steps=Check that the OpenTelemetry backend is available in `${NAMESPACE}`, that the collector has enough resources, and that the collector's configmap is up to date.
... expected=Queue size for spans should not exceed the threshold of 500
... actual=Queue size larger than 500 found
... reproduce_hint=Run otel_metrics_check.sh
... details=${process.stdout}
END
RW.Core.Add Pre To Report ${process.stdout}\n

Check OpenTelemetry Collector Logs For Errors In Namespace `${NAMESPACE}`
[Documentation] Fetch logs and check for errors
[Tags] otel collector metrics errors logs
${process}= RW.CLI.Run Bash File
... bash_file=otel_error_check.sh
... env=${env}
... secret_file__kubeconfig=${kubeconfig}
... timeout_seconds=180
... include_in_history=false
IF ${process.returncode} > 0
RW.Core.Add Issue title=OpenTelemetry Collector Has Error Logs
... severity=3
... next_steps=Tail OpenTelemetry Collector Logs In Namespace `${NAMESPACE}` For Stacktraces
... expected=Logs do not contain errors
... actual=Found error logs
... reproduce_hint=Run otel_error_check.sh
... details=${process.stdout}
END
RW.Core.Add Pre To Report ${process.stdout}\n

Scan OpenTelemetry Logs For Dropped Spans In Namespace `${NAMESPACE}`
[Documentation] Query the collector logs for dropped spans from errors
[Tags] otel collector metrics errors logs dropped rejected
${process}= RW.CLI.Run Bash File
... bash_file=otel_dropped_check.sh
... env=${env}
... secret_file__kubeconfig=${kubeconfig}
... timeout_seconds=180
... include_in_history=false
IF ${process.returncode} > 0
RW.Core.Add Issue title=OpenTelemetry Collector Logs Have Dropped Spans
... severity=3
... next_steps=Tail OpenTelemetry Collector Logs In Namespace `${NAMESPACE}` For Stacktraces
... expected=Logs do not contain dropped span entries
... actual=Found dropped span entries
... reproduce_hint=Run otel_dropped_check.sh
... details=${process.stdout}
END
RW.Core.Add Pre To Report ${process.stdout}\n

*** Keywords ***
Suite Initialization
${kubeconfig}= RW.Core.Import Secret
... kubeconfig
... type=string
... description=The kubernetes kubeconfig yaml containing connection configuration used to connect to cluster(s).
... pattern=\w*
... example=For examples, start here https://kubernetes.io/docs/concepts/configuration/organize-cluster-access-kubeconfig/
${NAMESPACE}= RW.Core.Import User Variable NAMESPACE
... type=string
... description=The name of the Kubernetes namespace to scope actions and searching to.
... pattern=\w*
... example=my-namespace
${CONTEXT}= RW.Core.Import User Variable CONTEXT
... type=string
... description=Which Kubernetes context to operate within.
... pattern=\w*
... example=my-main-cluster
${KUBERNETES_DISTRIBUTION_BINARY}= RW.Core.Import User Variable KUBERNETES_DISTRIBUTION_BINARY
... type=string
... description=Which binary to use for Kubernetes CLI commands.
... enum=[kubectl,oc]
... example=kubectl
... default=kubectl
${WORKLOAD_SERVICE}= RW.Core.Import User Variable WORKLOAD_SERVICE
... type=string
... description=The service name used to curl the otel collector metrics endpoint.
... example=otel-demo-otelcol
... default=otel-demo-otelcol
${WORKLOAD_NAME}= RW.Core.Import User Variable WORKLOAD_NAME
... type=string
... description=The workload to exec into when requesting metrics. This can be the collector itself or a separate bastion workload, depending on networking requirements.
... example=deployment/otel-demo-otelcol
... default=deployment/otel-demo-otelcol
${METRICS_PORT}= RW.Core.Import User Variable METRICS_PORT
... type=string
... description=The port on which the collector serves its metrics. The metrics check scrapes this port.
... example=8888
... default=8888
Set Suite Variable ${kubeconfig} ${kubeconfig}
Set Suite Variable ${CONTEXT} ${CONTEXT}
Set Suite Variable ${KUBERNETES_DISTRIBUTION_BINARY} ${KUBERNETES_DISTRIBUTION_BINARY}
Set Suite Variable ${NAMESPACE} ${NAMESPACE}
Set Suite Variable ${WORKLOAD_SERVICE} ${WORKLOAD_SERVICE}
Set Suite Variable ${WORKLOAD_NAME} ${WORKLOAD_NAME}
Set Suite Variable ${METRICS_PORT} ${METRICS_PORT}
Set Suite Variable
... ${env}
... {"KUBECONFIG":"./${kubeconfig.key}", "KUBERNETES_DISTRIBUTION_BINARY":"${KUBERNETES_DISTRIBUTION_BINARY}", "CONTEXT":"${CONTEXT}", "NAMESPACE":"${NAMESPACE}", "METRICS_PORT":"${METRICS_PORT}", "WORKLOAD_NAME":"${WORKLOAD_NAME}", "WORKLOAD_SERVICE":"${WORKLOAD_SERVICE}"}
