Prom-Client aggregation causing master to choke in cluster mode #628
To add some background on why we tried to verify this: our production app is a proxy service that uses prom-client for tracking all its metrics. The service runs at very high scale and serves huge traffic. During one activity we had to shut down Prometheus, and we suddenly observed the p99 latency drop from 450ms to a mere 50ms. Prometheus scrapes every 10 seconds in production.
Thanks for the detailed write-up. I did most of the cluster code in this module, but have had no time to do any work on it recently. If you see places for improvement, a PR would be welcome.
@zbjornson @siimon we were able to identify the choking points in the code and have devised a solution around them. There were two major choking points.
Proposed solution -
Here are the results after making the above-mentioned changes. The load and code used are the same as described at the beginning of this thread.
I implemented a small cluster-module-based setup with one master and eight workers. I am using the AggregatorRegistry class like this:
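Roughly as follows (a minimal sketch assuming the standard prom-client API; the actual code has more around it):

```js
const cluster = require('cluster');
const { AggregatorRegistry } = require('prom-client');

// One aggregator registry; on the master it asks every worker for its
// metrics over the IPC channel and merges them into a single payload.
const aggregatorRegistry = new AggregatorRegistry();
```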
Each worker has one endpoint that responds with a 30ms latency, plus a middleware that inflates the number of metrics to simulate production-like behaviour.
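A sketch of the worker (the metric name, buckets, labels and the /test route are illustrative, not the exact production code):

```js
// worker.js - sketch of a single worker process
const express = require('express');
const promClient = require('prom-client');

// Histogram with several labels so each worker ships a sizeable
// metrics payload to the master, similar to production cardinality.
const requestDuration = new promClient.Histogram({
  name: 'http_request_duration_ms',
  help: 'Request duration in ms',
  labelNames: ['route', 'status_code'],
  buckets: [10, 25, 50, 100, 250, 500, 1000],
});

const app = express();

// Middleware that times every request and records it with labels.
app.use((req, res, next) => {
  const end = requestDuration.startTimer({ route: req.path });
  res.on('finish', () => end({ status_code: res.statusCode }));
  next();
});

// Single endpoint that responds after ~30ms.
app.get('/test', (req, res) => {
  setTimeout(() => res.json({ ok: true }), 30);
});

app.listen(3000);
```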
The master has the following code:
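A sketch of the master (the /cluster_metrics route, port, and worker filename are assumptions; the aggregation endpoint follows the usual AggregatorRegistry pattern):

```js
// master.js - sketch: forks 8 workers and serves aggregated metrics
const cluster = require('cluster');
const express = require('express');
const { AggregatorRegistry } = require('prom-client');

const aggregatorRegistry = new AggregatorRegistry();

if (cluster.isMaster) {
  for (let i = 0; i < 8; i++) cluster.fork();

  const metricsApp = express();
  metricsApp.get('/cluster_metrics', async (req, res) => {
    try {
      // Requests metrics from every worker over IPC, then aggregates
      // them in the master process before responding.
      const metrics = await aggregatorRegistry.clusterMetrics();
      res.set('Content-Type', aggregatorRegistry.contentType);
      res.send(metrics);
    } catch (err) {
      res.status(500).send(err.message);
    }
  });
  metricsApp.listen(3001);
} else {
  require('./worker'); // the worker app sketched above (filename is an assumption)
}
```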
I run this application and apply load using Apache Benchmark with the configs below. I also run a watcher that hits the cluster metrics endpoint on the master every 1 second to simulate Prometheus scraping. The command is given below -
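The watcher is roughly like the sketch below (the URL, port, and 1-second interval are illustrative assumptions matching the master sketch above; the exact ab flags are omitted here):

```js
// watcher.js - sketch: scrape the master's cluster metrics endpoint
// once per second, mimicking a 1s Prometheus scrape interval.
const http = require('http');

setInterval(() => {
  http
    .get('http://localhost:3001/cluster_metrics', (res) => {
      res.resume(); // drain the response; we only care about triggering aggregation
    })
    .on('error', () => {});
}, 1000);
```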
Under this load, my 95th percentile is about 190ms-200ms, and the 98th and 99th percentiles are even higher. When I stop the watcher and don't hit the metrics endpoint during the load, latencies are in the expected range of about 30-40ms at the 95th percentile.
I tried commenting out the aggregation code in prom-client and ran the same load; it ran fine again. So my understanding at this point is that aggregating metrics from the different workers blocks the master when the metrics payload is large. As the master gets choked, it stops routing requests to the workers, so the workers don't receive requests, which causes the high latencies reported by the benchmark tool. Another data point supporting this hypothesis: in the Prometheus metrics exposed by the endpoint, the latencies reported by the workers always fall in the le=50ms bucket, meaning that once a request reaches a worker it is always served within expectations.
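One way to observe the blocking directly on the master (a sketch using Node's perf_hooks, not from our production code): if the event-loop delay spikes whenever the cluster metrics endpoint is scraped, the master cannot accept and distribute new connections during that window.

```js
// Sketch: run inside the master process. A growing mean/max event-loop
// delay during scrapes supports the "aggregation blocks routing"
// hypothesis, since a blocked master event loop cannot hand
// connections to the workers.
const { monitorEventLoopDelay } = require('perf_hooks');

const delay = monitorEventLoopDelay({ resolution: 10 });
delay.enable();

setInterval(() => {
  console.log(
    'event loop delay ms  mean=%d max=%d',
    Math.round(delay.mean / 1e6),
    Math.round(delay.max / 1e6)
  );
  delay.reset();
}, 5000);
```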
Kindly suggest a way to resolve this. I am also happy to raise a PR if required.