
Prometheus Java client 1.x consumes more memory than 0.x #5229

Open
michaldo opened this issue Jun 24, 2024 · 13 comments

@michaldo

I observed that the Micrometer registry consumes more memory when I switch Spring Boot from 3.2.6 to 3.3.1.

Environment

  • Micrometer version 1.13.1
  • Micrometer registry prometheus
  • Java version: Java 21

I attached a profiler report collected by async-profiler.

Green is the memory consumption for micrometer-registry-prometheus: scrape() accounts for 446 MB.
Red is the memory consumption for micrometer-registry-prometheus-simpleclient (I keep Spring Boot 3.3.1 but switch to the simple client): 121 MB.

[flame graph screenshot]

@jonatan-ivanov
Member

Could you please verify that the two outputs are the same and that you don't have four times more time series in the case of Prometheus 1.x?

How many time series (or lines) do you have in your Prometheus output, approximately?

Also, do the second and subsequent scrapes consume an equal amount of memory, or are they more lightweight?

What tool did you use for the flame graph?

Will you get similar results if you look at a heap dump?

Also, Micrometer's overhead on the flame graphs seems pretty minimal to me; depending on the answers to the questions above, maybe this should be a Prometheus Client question in the end?

@jonatan-ivanov added the waiting for feedback label and removed the waiting-for-triage label on Jun 24, 2024
@michaldo
Author

Hi Jonatan,
I checked memory usage on a test cloud cluster where only micrometer-registry-prometheus was changed.
So the answer to questions 1 and 3 is that the environment is stable and memory consumption is stable.

It may be important that my application uses Kafka heavily, and the Prometheus output contains 4703 Kafka rows out of 4850 total.
The flame graph was generated with IntelliJ.

After inspecting the profiler output, the suspected fragments of code are io.micrometer.core.instrument.config.NamingConvention#toSnakeCase and io.prometheus.metrics.model.snapshots.PrometheusNaming#replaceIllegalCharsInMetricName. They could be refactored to need less memory.
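For illustration, a minimal sketch (hypothetical code, not the actual Micrometer implementation) of the kind of per-call allocation in question, plus a fast path that returns the input unchanged when the name is already in snake case:

final class SnakeCaseSketch {

    // Rebuilds the name on every call, even when nothing needs to change.
    static String toSnakeCaseNaive(String name) {
        StringBuilder sb = new StringBuilder(name.length() + 4);
        for (int i = 0; i < name.length(); i++) {
            char c = name.charAt(i);
            if (Character.isUpperCase(c)) {
                sb.append('_').append(Character.toLowerCase(c));
            }
            else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    // Possible fast path: scan first and return the same instance when the
    // name is already lower case, so stable names cost no allocation per call.
    static String toSnakeCase(String name) {
        for (int i = 0; i < name.length(); i++) {
            if (Character.isUpperCase(name.charAt(i))) {
                return toSnakeCaseNaive(name);
            }
        }
        return name;
    }
}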

@jonatan-ivanov
Member

After inspecting the profiler output, the suspected fragments of code are io.micrometer.core.instrument.config.NamingConvention#toSnakeCase and io.prometheus.metrics.model.snapshots.PrometheusNaming#replaceIllegalCharsInMetricName. They could be refactored to need less memory.

This conflicts with your flame graphs; I don't see any of these classes on them.
NamingConvention has not been changed for about two years. PrometheusNamingConvention did change in 1.13 (Boot 3.3), but now it is doing less work and delegating to io.prometheus.metrics.model.snapshots.PrometheusNaming.

If you really suspect that io.prometheus.metrics.model.snapshots.PrometheusNaming is the root cause, you might want to create an issue for the Prometheus Client.

@michaldo
Author

Indeed, the suspected code I found has a small impact on the flame graph.

It is hard to determine the root cause of the memory consumption. I see that high-level code was changed and the memory consumption shows up in low-level code: is the root cause the recently changed high-level code, or the low-level code that has not changed for years?

Anyway, I stand by my opinion that the methods I pointed out are inefficient and that micrometer-registry-prometheus 1.13.0 uses more memory than micrometer-registry-prometheus-simpleclient 1.13.0.

@jonatan-ivanov
Member

It is hard to determine the root cause of the memory consumption.

Analyzing a heap dump might help.

I see that high-level code was changed and the memory consumption shows up in low-level code: is the root cause the recently changed high-level code, or the low-level code that has not changed for years?

I don't understand this question.

Anyway, I stand by my opinion that the methods I pointed out are inefficient and that micrometer-registry-prometheus 1.13.0 uses more memory than micrometer-registry-prometheus-simpleclient 1.13.0.

I never stated the opposite; I was trying to point out that you might have opened the issue in the wrong repo. Micrometer is not the same as the Prometheus Client (which is where io.prometheus.metrics.model.snapshots.PrometheusNaming comes from).


github-actions bot commented Jul 2, 2024

If you would like us to look at this issue, please provide the requested information. If the information is not provided within the next 7 days this issue will be closed.

@michaldo
Author

michaldo commented Jul 3, 2024

I reported that "micrometer-registry-prometheus" allocates more memory than "micrometer-registry-prometheus-simpleclient". The impact is visible in an application with high Kafka usage, because the Kafka client uses thousands of metrics.

I reported a real-load flame graph because I expected the problem to be more or less known/suspected/predicted. I wanted to give a hint for tracking the problem under real load.

Because the problem is not known, I was asked to provide more details. That was hard to do in the real environment, because the real environment has been switched back to the simple client. To provide details, I ran experiments on my laptop with the plain client.
On the laptop I have a different load, and the laptop-load details differ from the real-load details.
But the overconsumption is still visible.

Continuing with the laptop load, I found that io.micrometer.core.instrument.config.NamingConvention#toSnakeCase is inefficiently implemented. I made #5271 for that.

We also discussed which repo should be fixed. In my opinion the target repo is Micrometer. We see that "prometheus-metrics-model" provides the method PrometheusNaming#sanitizeLabelName, which calls replaceIllegalCharsInMetricName(), which allocates a new String for every given String.

Does that mean the "prometheus-metrics-model" library holds a gun to the Micrometer team's head and demands that the method be called on every scrape and for each of the 7K Kafka metrics? No.

micrometer-registry-prometheus is responsible for a well-defined and well-bounded task: return a bunch of metrics on each request. Basically, it must print out something like

http_client_requests_seconds_sum{client_name="localhost",error="none",exception="none",method="GET",outcome="SUCCESS",status="200",uri="/productDetails?productCode={productCode}"} 0.4315684

If that task works inefficiently, it must be solved within micrometer-registry-prometheus, not by putting the blame on "prometheus-metrics-model".
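For example, here is a minimal sketch (my illustration, not code from either project) of one way to keep that responsibility inside micrometer-registry-prometheus: memoize the sanitized names so the per-String allocation happens once per distinct name rather than on every scrape.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.UnaryOperator;

// Hypothetical sketch: cache sanitized names so the underlying sanitizer
// (for example PrometheusNaming#sanitizeLabelName) runs once per distinct name.
final class CachingNameSanitizer {

    private final Map<String, String> cache = new ConcurrentHashMap<>();
    private final UnaryOperator<String> delegate;

    CachingNameSanitizer(UnaryOperator<String> delegate) {
        this.delegate = delegate;
    }

    String sanitize(String name) {
        // The first scrape that sees a name pays the sanitization cost;
        // later scrapes return the cached result without allocating.
        return cache.computeIfAbsent(name, delegate);
    }
}

// Usage sketch, assuming the static method mentioned above:
// CachingNameSanitizer labels =
//         new CachingNameSanitizer(PrometheusNaming::sanitizeLabelName);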

@shakuzen added the feedback-provided label and removed the waiting for feedback label on Jul 5, 2024
@shakuzen changed the title from "Java client 1.x consumes more memory than 0.x" to "Prometheus Java client 1.x consumes more memory than 0.x" on Jul 5, 2024
@shakuzen
Member

shakuzen commented Jul 8, 2024

See #5288 which should bring things back in alignment in terms of NamingConvention not being called during PrometheusMeterRegistry#scrape.

@jonatan-ivanov
Member

jonatan-ivanov commented Jul 8, 2024

Continuing with the laptop load, I found that io.micrometer.core.instrument.config.NamingConvention#toSnakeCase is inefficiently implemented. I made #5271 for that.

It might be the case that it is inefficient, but the section you fixed (it's hard to tell without a JMH benchmark) has been there for 7 years, there were no changes there between 1.12 and 1.13 where you are experiencing the issue, and I'm not sure it is significant on the flame graph.

Does that mean the "prometheus-metrics-model" library holds a gun to the Micrometer team's head and demands that the method be called on every scrape and for each of the 7K Kafka metrics? No.

On the other hand, we would like Micrometer to stay as close to the "official" behavior as possible, hence the calls to the Prometheus Client.

micrometer-registry-prometheus is responsible for a well-defined and well-bounded task: return a bunch of metrics on each request. Basically, it must print out something like
...
If that task works inefficiently, it must be solved within micrometer-registry-prometheus, not by putting the blame on "prometheus-metrics-model".

I don't think I blamed prometheus-metrics-model by saying that the issue might be in the Prometheus Client, because Micrometer's overhead on the flame graphs (the data you provided) seemed pretty minimal to me. I think there is a big difference between the two. :)

I think that IF there is a performance issue in a downstream dependency, it should be fixed there, not upstream. If I follow your logic, a perf issue in any library your app uses should be fixed in your app, since your app is using it? I don't think so; the issue should be fixed closest to its cause. Based on #5288, it seems Micrometer is calling the convention many times on each scrape. Would you be able to check whether that PR fixes the issue you are seeing?

shakuzen added a commit that referenced this issue Jul 9, 2024
The naming convention was being called on each scrape for each metric to create the MetricMetadata. This is unnecessary because we already have the computed convention name specifically to avoid needing to call the convention again.

See gh-5229
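As a general illustration of the pattern the commit message describes (hypothetical names, not the actual Micrometer classes): apply the naming convention once at registration time and only read the precomputed result on each scrape.

import java.util.function.UnaryOperator;

// Hypothetical sketch: the convention is applied when the meter is registered,
// and scrapes reuse the precomputed name instead of calling the convention again.
final class RegisteredMeterSketch {

    private final String conventionName;

    RegisteredMeterSketch(String rawName, UnaryOperator<String> namingConvention) {
        // Runs once per meter, not once per scrape.
        this.conventionName = namingConvention.apply(rawName);
    }

    String conventionName() {
        // Called on every scrape; no call back into the naming convention here.
        return conventionName;
    }
}
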
@shakuzen
Member

shakuzen commented Jul 9, 2024

The changes from #5288 are available in 1.13.2

@michaldo
Author

michaldo commented Jul 9, 2024

Here are the results of my experiments under real usage:
[memory usage screenshot]

I checked profiling on the laptop load, but I realized that the profiler output is not precise. The profiler output is more a hint of where to search than a proof of inefficiency.

Nevertheless, my findings:

  1. Avoiding calling the naming convention on each scrape is a noticeable improvement.
  2. It seems the bottleneck is now io.prometheus.metrics.expositionformats.TextFormatUtil#writeEscapedLabelValue

[flame graph screenshot]
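For context, a minimal sketch (my illustration, not the Prometheus Client implementation) of label-value escaping with a fast path for values that contain nothing to escape; the text exposition format only requires backslash, double quote, and newline to be escaped:

import java.io.IOException;
import java.io.Writer;

final class LabelValueEscapingSketch {

    // Writes the value straight through when no escaping is needed; otherwise
    // escapes backslash, double quote, and newline character by character.
    static void writeEscaped(Writer out, String value) throws IOException {
        int firstSpecial = -1;
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            if (c == '\\' || c == '"' || c == '\n') {
                firstSpecial = i;
                break;
            }
        }
        if (firstSpecial < 0) {
            out.write(value); // fast path: no per-character writes, no extra allocation
            return;
        }
        out.write(value, 0, firstSpecial);
        for (int i = firstSpecial; i < value.length(); i++) {
            char c = value.charAt(i);
            switch (c) {
                case '\\': out.write("\\\\"); break;
                case '"':  out.write("\\\""); break;
                case '\n': out.write("\\n");  break;
                default:   out.write(c);
            }
        }
    }
}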

@shakuzen
Member

@michaldo thank you for trying out the changes and reporting back. It seems things have improved but are still not as good as before. Do you think it makes sense to report your latest finding to the Prometheus Java client and close this issue, or do you think there's more we can do in Micrometer for this?

@michaldo
Author

The answer requires understanding Micrometer's design. Simply put, it depends on the Prometheus Client's role. If the role is a black box, performance is a Prometheus Client issue. If the role is a building block, performance is a Micrometer issue or a Prometheus Client issue.

No doubt metrics should have as low an impact as possible. In my case several Kafka topics are used, which makes the number of metrics really high: 7K. Some performance bottlenecks that are not observable with several dozen metrics are observable with a few thousand. That is important context.

Nevertheless, all Kafka metrics work out of the box; I didn't add a single line of code. Three parties are involved in metric transmission: Spring Boot, Micrometer, and the Prometheus Client. In the current case the Prometheus Client always modifies a given string to escape special characters. Considering that the input is very often stable, escaping is a waste of resources. When all parties involved in the transmission are standardized libraries, they can agree on a transmission format and avoid runtime conversion.
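A minimal sketch of that idea (my illustration, assuming the usual text-format escaping rules): since the tags of a given time series do not change between scrapes, the label block can be escaped and joined once and then written verbatim on every scrape.

import java.util.Map;

// Hypothetical sketch: pre-render the label block of a time series so no
// escaping or string building happens at scrape time.
final class PrerenderedLabelsSketch {

    private final String rendered; // e.g. {client_name="localhost",method="GET"}

    PrerenderedLabelsSketch(Map<String, String> tags) {
        StringBuilder sb = new StringBuilder("{");
        boolean first = true;
        for (Map.Entry<String, String> tag : tags.entrySet()) {
            if (!first) {
                sb.append(',');
            }
            first = false;
            sb.append(tag.getKey()).append("=\"").append(escape(tag.getValue())).append('"');
        }
        this.rendered = sb.append('}').toString();
    }

    String rendered() {
        return rendered; // reused on every scrape
    }

    // Assumes the text-format rules: escape backslash, double quote, and newline.
    private static String escape(String value) {
        return value.replace("\\", "\\\\").replace("\"", "\\\"").replace("\n", "\\n");
    }
}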

There is a difference in computer science between an API and a protocol. Here the API is the Prometheus Client and the protocol is

http_client_requests_seconds_sum{client_name="localhost",error="none",exception="none",method="GET",outcome="SUCCESS",status="200",uri="/productDetails?productCode={productCode}"} 0.4315684

An API is friendly, but not necessarily extremely efficient.

To sum up, I think it is worth considering the concept that Micrometer is responsible for efficient metrics transmission. Micrometer should be aware of how the transmission looks from source to target and arrange the process to avoid redundant, repeatable, resource-consuming actions. In particular, unnecessary String conversions should be avoided. When all of the above is known, Micrometer should check whether the Prometheus Client meets the low-impact requirements and decide to:
a) use the Prometheus Client efficiently
b) request a change in the Prometheus Client
c) use the Prometheus protocol without the client
