otel-collector otlp log receiver and performance under load #10409
Unanswered · bilsch-nice asked this question in Q&A
Replies: 1 comment
-
So, an interesting find that may add some value: increasing the batch timeout from 200 ms to 800 ms results in fewer otel-collector pods for the same load. It increases the batch size, which reduces the CPU overhead. I forgot to include that this is with version 0.92.0. I'm curious whether anyone else has seen similar load profiling. We have done additional load testing with metrics and found that we need approximately 77-85 1-CPU pods to process roughly 500k metric events/sec with the batch size of 800.
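Concretely, the tuning described above amounts to something like the following in the collector's batch processor; only the 200 ms to 800 ms timeout change comes from the test, while the send_batch_size shown is an illustrative assumption:

```yaml
processors:
  batch:
    # Raised from 200ms: batches fill further before being flushed, so
    # fewer, larger batches reach the exporters and per-batch CPU overhead drops.
    timeout: 800ms
    # Illustrative value only, not one reported in the test above.
    send_batch_size: 8192
```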
-
I've been working on some load and scalability testing and came across something interesting. I'm hoping someone can point me in the right direction, configuration-wise, for getting higher throughput out of the otel-collector in the mode we are trying to run.

The setup uses OpenTelemetry in a relay/collector mode for logs, metrics, and traces, with OTLP on ingest. The data is split out to a Grafana stack: metrics go to Mimir, logs go to Loki, and traces go to Tempo. We are using the community Helm chart for setup within our Kubernetes cluster, and we have generated the following configuration (please note it is slightly modified in the endpoint values but otherwise untouched).
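(The generated configuration is not reproduced here. As a rough illustration of the topology described above, a collector config of this shape generally looks like the sketch below; the endpoints, exporter choices, and batch values are placeholders, not the actual settings in use.)

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    timeout: 200ms

exporters:
  # Placeholder exporters/endpoints; the real config may use different
  # exporters (e.g. a dedicated Loki exporter) and different endpoints.
  otlphttp/mimir:
    endpoint: https://mimir.example.internal/otlp
  otlphttp/loki:
    endpoint: https://loki.example.internal/otlp
  otlp/tempo:
    endpoint: tempo.example.internal:4317
    tls:
      insecure: true

service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlphttp/mimir]
    logs:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlphttp/loki]
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/tempo]
```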
I have found a few interesting details:

Am I missing something in the configuration? Are there settings we can or need to tweak to achieve better throughput? Forty instances for 100k log events per second seems very high; that is a lot of CPU usage for not a lot of work. We are not doing anything with the messages; it is a pure store-and-forward to Loki. Loki ingestion latency averages about 5 ms and peaks around 50 ms (ingester pod), so there should not be much, if any, back-pressure. It is also surprising that scaling up the CPU per pod does not seem to help. Is there some kind of default request limit we are hitting that keeps us from taking advantage of the additional CPU per pod?
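The settings most often discussed for export throughput are the batch processor sizing and the exporter's sending queue; a sketch with placeholder values (to show the shape, not a recommendation) follows:

```yaml
processors:
  batch:
    timeout: 200ms
    send_batch_size: 8192
    send_batch_max_size: 16384   # cap on a single outgoing batch

exporters:
  otlphttp/loki:
    endpoint: https://loki.example.internal/otlp   # placeholder endpoint
    sending_queue:
      enabled: true
      num_consumers: 10   # parallel export workers per pod
      queue_size: 5000    # batches buffered before back-pressure or drops
    retry_on_failure:
      enabled: true
```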
I have tried the load test with both a fixed pod count and an autoscaling configuration. The autoscaler still scales out to the 40-pod maximum. What may be interesting is that during the same 100k events/sec test, a fixed collector pod count of 40 yields no errors while the autoscaled configuration does yield errors. This could be explained by the ramp-up rate, I suppose, though each JMeter "user" starts as a new TCP connection, so the load balancer should spread traffic across every available pod.
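For reference, the fixed and autoscaled runs differ only in the Helm values. A sketch of the autoscaled variant, assuming the community opentelemetry-collector chart's standard autoscaling block (the field values and thresholds shown are placeholders, not the values actually used):

```yaml
# values.yaml sketch for the opentelemetry-collector Helm chart
mode: deployment

resources:
  requests:
    cpu: 1
  limits:
    cpu: 1

# Fixed-count variant: set replicaCount and leave autoscaling disabled.
# replicaCount: 40

autoscaling:
  enabled: true
  minReplicas: 5
  maxReplicas: 40
  targetCPUUtilizationPercentage: 75
```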