
[jmx-metrics] Metrics Gatherer reports old data if target becomes unavailable #910

Closed
dehaansa opened this issue Jun 8, 2023 · 5 comments



dehaansa commented Jun 8, 2023

Component(s)

jmx-metrics

What happened?

Description

When monitoring a JMX target with the JMX Metrics Gatherer, if that target becomes unresponsive or unavailable, the most recently collected metrics continue to be reported via the configured exporter as if they were newly collected.

Steps to Reproduce

Start a simple Java program like the following:

cat <<EOF > hello.java
class Sleepy {
  public static void main(String[] args) throws InterruptedException {
    while (true) {
      Thread.sleep(10000);
    }
  }
}
EOF
javac hello.java

java -ea -Dcom.sun.management.jmxremote.port=9010 -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false -Djava.rmi.server.hostname=127.0.0.1 Sleepy > /dev/null 2>&1 &

Configure the JMX Metrics Gatherer to monitor that target:

otel.jmx.service.url = service:jmx:rmi:///jndi/rmi://localhost:9010/jmxrmi
otel.jmx.target.system = jvm
otel.jmx.interval.milliseconds = 5000
otel.jmx.remote.registry.ssl = false
otel.metrics.exporter = prometheus
otel.exporter.prometheus.host = localhost
otel.exporter.prometheus.port = 4317

After seeing metrics collected at the Prometheus endpoint, stop the Sleepy Java process. Observe that the metrics are still reported (with fresh timestamps) at the Prometheus endpoint.

While it makes some sense for the Prometheus endpoint to continue showing old data points (with their original timestamps), the real use case where this issue was encountered was the jmx receiver in opentelemetry-collector-contrib, where we would not expect old, duplicate metrics to be exported via OTLP and sent to the receiver. To observe that behavior, configure the jmx receiver like the following (note: the receiver can only be used with a released version of the JMX Metrics Gatherer, or the collector must be built with special build flags to allow use of a non-released version):

exporters:
  file:
    path: ./output.json
receivers:
  jmx:
    collection_interval: 5s
    endpoint: localhost:9010
    jar_path: /path/to/opentelemetry-java-contrib-jmx-metrics.jar
    target_system: jvm
service:
  pipelines:
    metrics:
      exporters:
        - file
      receivers:
        - jmx

Expected Result

Metrics would, at minimum, not be reported with new timestamps; preferably, metrics would not be reported at all for periods when the monitoring target was unavailable.

Actual Result

The most recently collected value continues to be reported for any instruments, with a fresh timestamp attached.

Component version

First observed with 1.14.0, but it has also been seen with the latest version.

Log output

No response

Additional context

No response


dehaansa commented Jun 8, 2023

@jack-berg as discussed in the SIG, here's an issue. In regards to your specific questions about how the instruments are created, they appear to be made synchronously (if builder.build() is synchronous && builder.buildWithCallback is asynchronous). Primary interaction with instruments is in GroovyMetricEnvironment starting at line 142, and in InstrumentHelper starting on line 71.
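To make the sync/async distinction concrete, here is a minimal plain-Java sketch of the failure mode (not the gatherer's actual code; `StaleCallbackDemo`, `fetchFromJmx`, and `targetUp` are hypothetical names): an asynchronous callback that falls back to a cached value keeps re-exporting stale data under fresh timestamps.

```java
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

// Hypothetical sketch: an async "instrument" polls a callback at export time.
// If the callback keeps returning the last value it saw from JMX, the
// exporter cannot tell a live reading from a stale one.
class StaleCallbackDemo {
    // Simulates the gatherer's cached last JMX reading.
    static final AtomicReference<Long> lastJmxValue = new AtomicReference<>(42L);
    static volatile boolean targetUp = true;

    static long fetchFromJmx() { return 42L; }

    // The callback registered with the (simulated) async instrument.
    static final Supplier<Long> callback = () -> {
        if (targetUp) {
            lastJmxValue.set(fetchFromJmx());
        }
        // Bug pattern: returns the cached value even when the target is down.
        return lastJmxValue.get();
    };

    // Simulated export cycle: observes the callback and stamps it "now".
    static String export() {
        return "value=" + callback.get() + " ts=" + System.nanoTime();
    }

    public static void main(String[] args) {
        System.out.println(export()); // live reading
        targetUp = false;             // JMX target goes away
        System.out.println(export()); // same value, new timestamp: stale data
    }
}
```

The key point is that the export path has no way to distinguish the two calls: both produce a data point with a current timestamp.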

jack-berg (Member) commented

Ok so not an expert on this code by any means, but I took a look and here's what I got:

  • Groovy scripts such as the ones prebundled in target-systems define how to translate JMX metrics to otel metrics.
  • An instance of this OtelHelper script is exposed to these scripts, and they use it to register instruments and dictate what type of otel instrument to map to. For example, here the groovy script calls the instrument method of OtelHelper and dictates that the instrument is an async long up down counter.
  • OtelHelper does expose the ability to map to synchronous instruments, like these, but they don't appear to be used anywhere in the prebundled scripts. I'm also not sure how you would use one of the synchronous instruments in practice, because you are fundamentally observing the JMX metrics asynchronously - you don't have access to the raw measurements to record synchronously.
  • I think the issue lies in some part of InstrumentHelper remembering and continuing to record measurements after the MBeans are gone, though I can't be sure because that code is pretty confusing to me.
  • Here's a little test I wrote to demonstrate that async instruments do not export metrics after they stop reporting measurements. This shows that it's within the control of the JMX Metrics Gatherer to stop metrics from being exported - it's just a bug in the code that needs to be found and fixed.
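The behavior that test demonstrates can be sketched in plain Java (hypothetical names again, not the SDK's API): if the callback records nothing while the target is down, the export cycle has nothing to emit.

```java
import java.util.Optional;
import java.util.function.Supplier;

// Sketch of the fix direction: an async callback that records nothing when
// the JMX target is gone, so the exporter emits no data point at all.
class SkipWhenDownDemo {
    static volatile boolean targetUp = true;

    // Callback produces a value only while the JMX target is reachable.
    static final Supplier<Optional<Long>> callback = () ->
        targetUp ? Optional.of(42L) : Optional.empty();

    // Export cycle: emit a point only when the callback recorded one.
    static Optional<String> export() {
        return callback.get().map(v -> "value=" + v);
    }

    public static void main(String[] args) {
        System.out.println(export()); // prints Optional[value=42]
        targetUp = false;
        System.out.println(export()); // prints Optional.empty - nothing exported
    }
}
```

In OpenTelemetry terms, an async instrument's callback that simply skips calling record() for an unavailable target results in no measurement for that collection cycle, which matches the expected result described above.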

dehaansa (Contributor, Author) commented

@jack-berg if you have any capacity to review the proposed fix for this issue, I'd appreciate your expertise on the SDK instrumentation changes.

breedx-splk (Contributor) commented

Hey @dehaansa does #949 resolve this issue? Thanks.

dehaansa (Contributor, Author) commented

> Hey @dehaansa does #949 resolve this issue? Thanks.

Yup, this is resolved

@trask trask closed this as completed Sep 28, 2023