Replies: 1 comment 4 replies
I'd first recommend using a rate function on your counter. something like:
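A sketch of such a query (assuming the metric name from the question and a 5-minute window; adjust the grouping label to whatever identifies your hosts):

```promql
# Per-second rate of dropped UDP packets over the last 5 minutes,
# grouped per agent instance so spikes on individual hosts stand out.
sum by (instance) (
  rate(jaeger_agent_thrift_udp_server_packets_dropped_total[5m])
)
```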
This will help us better understand the severity of your issue and diagnose the problem.
Hi everyone,
Currently my company is running Jaeger with Elasticsearch in production. Recently we've found that
prometheus.metrics.jaeger_agent_thrift_udp_server_packets_dropped_total
is regularly reaching astronomical levels, seemingly out of nowhere. We have 22 servers and they all look similar, with a large spike starting at a random point in the day. In a second graph, red represents dropped UDP packets (same metric as above) and blue is queue size. Only two servers are shown there, but the pattern is the same everywhere: the queue quickly fills up, followed by a spike in dropped UDP packets.
What we've tried
jaeger_agent_client_stats_spans_dropped_total
and its associated
cause
labels are all showing 0, even during downtime. We've been through the Performance Tuning Guide and "Where did all my spans go?".
Infrastructure
We have a collector on each data node and an agent on each application server. Application -> Agent is UDP, Agent -> Collector is gRPC, and Collector -> Elasticsearch is plain HTTP. All of these services run within the same datacentre over 10-gigabit connections.
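Given that the queue fills up just before packets start dropping, one avenue is to raise the agent's UDP processor queue size and worker count. A sketch using the standard jaeger-agent processor flags (the values below are illustrative, not recommendations; verify the flag names and defaults against your agent version):

```shell
# Defaults in recent jaeger-agent versions are
# server-queue-size=1000 and workers=10.
jaeger-agent \
  --processor.jaeger-compact.server-queue-size=5000 \
  --processor.jaeger-compact.workers=50
```

Raising the OS-level UDP receive buffer (e.g. the net.core.rmem_max sysctl on Linux) alongside these flags is also worth checking, since the kernel drops packets before the agent ever sees them when that buffer overflows.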
Question
Is there anything further we can try to debug why something in our stack is failing?
Thank you very much!