otel-collector otlp log receiver and performance under load #10409
Unanswered · bilsch-nice asked this question in Q&A
Replies: 1 comment
-
So, an interesting find that may add some value: increasing the batch timeout from 200 ms to 800 ms results in fewer otel-collector pods for the same load. It increases the batch size, which reduces the CPU overhead. I forgot to include that this is with version 0.92.0. I'm curious whether anyone else has seen similar load profiling. We have done additional load testing with metrics and found that we need approximately 77-85 1-CPU pods to process roughly 500k metric events/sec with the batch size of 800.
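Concretely, the tuning described above amounts to something like the following in the collector's batch processor; only the 200 ms to 800 ms timeout change comes from the test, while the send_batch_size shown is an illustrative assumption:

```yaml
processors:
  batch:
    # Raised from 200ms: batches fill further before being flushed, so
    # fewer, larger batches reach the exporters and per-batch CPU overhead drops.
    timeout: 800ms
    # Illustrative value only, not one reported in the test above.
    send_batch_size: 8192
```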
-
I've been working on some load and scalability testing and came across something interesting. I'm hoping someone can point me in the right direction, configuration-wise, for getting higher throughput out of the otel-collector in the mode we are trying to run.

The setup uses OpenTelemetry in a relay/collector mode for logs, metrics, and traces, with OTLP on ingest. The data is split out to a Grafana stack: metrics go to Mimir, logs go to Loki, and traces go to Tempo. We are using the community Helm chart for setup within our Kubernetes cluster, and we have generated the following configuration (please note it is slightly modified in the endpoint values but otherwise untouched).
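(The generated configuration is not reproduced here. As a rough illustration of the topology described above, a collector config of this shape generally looks like the sketch below; the endpoints, exporter choices, and batch values are placeholders, not the actual settings in use.)

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    timeout: 200ms

exporters:
  # Placeholder exporters/endpoints; the real config may use different
  # exporters (e.g. a dedicated Loki exporter) and different endpoints.
  otlphttp/mimir:
    endpoint: https://mimir.example.internal/otlp
  otlphttp/loki:
    endpoint: https://loki.example.internal/otlp
  otlp/tempo:
    endpoint: tempo.example.internal:4317
    tls:
      insecure: true

service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlphttp/mimir]
    logs:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlphttp/loki]
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/tempo]
```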
I have found a few interesting details:

Am I missing something in the configuration? Are there settings we can or need to tweak to achieve better throughput? Forty instances for 100k log events per second seems very high; that is a lot of CPU usage for not a lot of work. We are not doing anything with the messages; it is a pure store-and-forward to Loki. Loki ingestion latency averages about 5 ms and peaks around 50 ms (ingester pod), so there should not be much, if any, back-pressure. It is also surprising that scaling up the CPU per pod does not seem to help. Is there some kind of default request limit we are hitting that keeps us from taking advantage of the additional CPU per pod?
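The settings most often discussed for export throughput are the batch processor sizing and the exporter's sending queue; a sketch with placeholder values (to show the shape, not a recommendation) follows:

```yaml
processors:
  batch:
    timeout: 200ms
    send_batch_size: 8192
    send_batch_max_size: 16384   # cap on a single outgoing batch

exporters:
  otlphttp/loki:
    endpoint: https://loki.example.internal/otlp   # placeholder endpoint
    sending_queue:
      enabled: true
      num_consumers: 10   # parallel export workers per pod
      queue_size: 5000    # batches buffered before back-pressure or drops
    retry_on_failure:
      enabled: true
```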
I have tried the load test with both a fixed pod count and an autoscaling configuration. The autoscaler still scales out to the 40-pod maximum. What may be interesting is that during the same 100k events/sec test, a fixed collector pod count of 40 yields no errors while the autoscaled configuration does yield errors. This could be explained by the ramp-up rate, I suppose, though each JMeter "user" starts as a new TCP connection, so the load balancer should spread traffic across every available pod.
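For reference, the fixed and autoscaled runs differ only in the Helm values. A sketch of the autoscaled variant, assuming the community opentelemetry-collector chart's standard autoscaling block (the field values and thresholds shown are placeholders, not the values actually used):

```yaml
# values.yaml sketch for the opentelemetry-collector Helm chart
mode: deployment

resources:
  requests:
    cpu: 1
  limits:
    cpu: 1

# Fixed-count variant: set replicaCount and leave autoscaling disabled.
# replicaCount: 40

autoscaling:
  enabled: true
  minReplicas: 5
  maxReplicas: 40
  targetCPUUtilizationPercentage: 75
```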