Replies: 1 comment 4 replies
I'd first recommend using a rate function on your counter. something like:
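A sketch of such a query (assuming the metric name from the question and a 5-minute window; adjust the grouping label to whatever identifies your hosts):

```promql
# Per-second rate of dropped UDP packets over the last 5 minutes,
# grouped per agent instance so spikes on individual hosts stand out.
sum by (instance) (
  rate(jaeger_agent_thrift_udp_server_packets_dropped_total[5m])
)
```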
This will help us better understand the severity of your issue and diagnose the problem.
Hi everyone,
Currently my company is running Jaeger with Elasticsearch in production. Recently we've found that
prometheus.metrics.jaeger_agent_thrift_udp_server_packets_dropped_total
is regularly reaching astronomical levels, seemingly out of nowhere. We have 22 servers and they all look similar, with a large spike starting at a random point in the day. In a second graph, red represents dropped UDP packets (same metric as above) and blue is queue size. Only two servers are shown there, but the pattern is the same everywhere: the queue quickly fills up, followed by a spike in dropped UDP packets.
What we've tried
jaeger_agent_client_stats_spans_dropped_total
and its associated
cause
labels are all showing 0, even during downtime. We've been through the Performance Tuning Guide and "Where did all my spans go?".
Infrastructure
We have a collector on each data node and an agent on each application server. Application -> Agent is UDP, Agent -> Collector is gRPC, and Collector -> Elasticsearch is plain HTTP. All of these services run within the same datacentre over 10-gigabit connections.
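Given that the queue fills up just before packets start dropping, one avenue is to raise the agent's UDP processor queue size and worker count. A sketch using the standard jaeger-agent processor flags (the values below are illustrative, not recommendations; verify the flag names and defaults against your agent version):

```shell
# Defaults in recent jaeger-agent versions are
# server-queue-size=1000 and workers=10.
jaeger-agent \
  --processor.jaeger-compact.server-queue-size=5000 \
  --processor.jaeger-compact.workers=50
```

Raising the OS-level UDP receive buffer (e.g. the net.core.rmem_max sysctl on Linux) alongside these flags is also worth checking, since the kernel drops packets before the agent ever sees them when that buffer overflows.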
Question
Is there anything further we can try to debug why something in our stack is failing?
Thank you very much!