Jvm/Cassandra metrics stopped flowing after some time #861

Open
junhuangli opened this issue May 5, 2023 · 11 comments

junhuangli commented May 5, 2023

Description

  • Everything works fine at the beginning, but after some time (10 hours to 10 days, depending on the "collection_interval") the JVM/Cassandra metrics (for example "cassandra.client.request.range_slice.latency.99p") stop, while all the otel internal metrics keep working (for example "otelcol_process_uptime")
  • Running a second collector manually while the first one is in the "error" state works (in other words, we can see JVM/Cassandra metrics from the second collector but not from the first one, even though they run in the same docker container)

Steps to reproduce
Deploy and then wait

Expectation
Jvm/Cassandra metrics continue flowing

What applicable config did you use?

---
receivers:
  jmx:
    jar_path: "/refinery/opentelemetry-jmx-metrics.jar"
    endpoint: localhost:7199
    target_system: cassandra,jvm
    collection_interval: 3s
    log_level: debug

  prometheus/internal:
    config:
      scrape_configs:
        - job_name: 'refinery-internal-metrics'
          scrape_interval: 10s
          static_configs:
            - targets: [ 'localhost:8888' ]
          metric_relabel_configs:
            - source_labels: [ __name__ ]
              regex: '.*grpc_io.*'
              action: drop

exporters:
  myexporter:
    retry_on_failure:
      enabled: true
      initial_interval: 5s
      max_interval: 30s
      max_elapsed_time: 300s
    myexporter:
      host: "myexporter.net"
      port: "9443"
      enable_mtls: true
      root_path: /etc/identity
      repo_dir_path: /etc/identity/client
      service_name: client
      gzip: true
 
processors:
  netmetadata:
    metrics:
      scopes: 
        service: refinery_tested
        subservice: "cassandra"
      
      tags:
        version: "1"
        k8s_pod_name: "test-cass-alrt-eap-c02-0"
        k8s_namespace: "dva-system"
        k8s_cluster: "collection-monitoring"
        device: "ip-10-11-11-11.us-west-2.compute.internal"
        substrate: "aws"
        account: "00000"
        region: "unknown"
        zone: "us-west-2b"
        falcon_instance: "dev1-uswest2"
        functional_domain: "monitoring"
        functional_domain_instance: "monitoring"
        environment: "dev1"
        environment_type: "dev"
        cell: "c02"
        service_name: "test-cass-alrt-eap"
        service_group: "test-shared"
        service_instance: "test-cass-alrt-eap-c02"
  memory_limiter/with-settings:
    check_interval: 1s
    limit_mib: 2000
    spike_limit_mib: 400
    limit_percentage: 0
    spike_limit_percentage: 0

  batch:
    timeout: 5s
    send_batch_size: 8192
    send_batch_max_size: 0

service:
  extensions: []
  telemetry:
    logs:
      development: false
      level: debug
    metrics:
      level: detailed
      address: localhost:8888
  pipelines:
    metrics:
      receivers: ["jmx"]
      processors: [memory_limiter/with-settings, batch, netmetadata]
      exporters: [myexporter]
    metrics/internal:
      receivers: ["prometheus/internal"]
      processors: [memory_limiter/with-settings, batch, netmetadata]
      exporters: [myexporter]

Relevant Environment Information
NAME="CentOS Linux" VERSION="7 (Core)" ID="centos"

Additional context

junhuangli added the type: bug label on May 5, 2023
@junhuangli (Author)

The other behavior I see is that the following metric is stuck at 454211. Has anyone seen this before?

otelcol_receiver_accepted_metric_points{receiver="jmx",service_instance_id="***",service_name="refinery",service_version="v0.3.0",transport="grpc"} 454211
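For reference, a quick way to watch this counter (assuming the internal telemetry address localhost:8888 from the config above) is to poll the collector's Prometheus endpoint and check whether the value is still increasing:

curl -s localhost:8888/metrics | grep otelcol_receiver_accepted_metric_points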

@junhuangli (Author)

Found the following error:

2023-05-09T19:19:58.474Z debug subprocess/subprocess.go:287 java.lang.OutOfMemoryError: Java heap space {"kind": "receiver", "name": "jmx", "pipeline": "metrics"}

trask (Member) commented May 23, 2023

@junhuangli were you able to increase the heap space and resolve the issue?

trask added the needs author feedback label on May 23, 2023
@junhuangli (Author)

Thanks for taking a look at this @trask. This OutOfMemoryError only shows up when I set the collection_interval to 1s. Since the waiting time is long (from 10 hours to 10 days), I am not sure yet whether I would eventually see the same error with a longer collection_interval. The current workaround is to set the collection_interval to 5 minutes.

I suspect there might be a leak somewhere.

The other tricky part is that I am using https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/receiver/jmxreceiver, which calls the OpenTelemetry JMX Metric Gatherer to collect the JMX/Cassandra metrics, so I am not sure how I can control the resource usage here.
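For reference, the workaround amounts to this change in the jmx receiver block from the config above (a sketch; 5m is simply the interval currently in use as a workaround, not a tuned recommendation):

receivers:
  jmx:
    jar_path: "/refinery/opentelemetry-jmx-metrics.jar"
    endpoint: localhost:7199
    target_system: cassandra,jvm
    collection_interval: 5m   # was 3s; the OOM showed up fastest at 1s
    log_level: debug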

github-actions bot removed the needs author feedback label on May 23, 2023
trask (Member) commented May 23, 2023

@breedx-splk @dehaansa @Mrod1598 @rmfitzpatrick do you know if it's possible to configure -Xmx via the collector configuration? (and what the default, if any, is?)

@dehaansa (Contributor)

From what I can recall, and from a quick parse of the source, the collector does not support setting -Xmx or the various other memory flags for the JMX receiver. In the interest of minimizing the potential attack surface while still allowing the JMX receiver to be one of the last points of subprocess execution in the collector, there is very limited support for configuring how the java process runs. The default will be the JVM's default -Xmx for your particular collector host, which can be found in the JVM documentation.
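For what it's worth, one way to check that default on a given host (assuming the java binary the collector launches is the first one on the PATH) is:

java -XX:+PrintFlagsFinal -version 2>/dev/null | grep -i MaxHeapSize

which prints the effective MaxHeapSize in bytes; on most modern JVMs it defaults to roughly a quarter of the host's physical memory unless overridden.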

The most vulnerable part of the JMX receiver, as far as code execution goes, is that it runs the java command without specifying an absolute path. You could feasibly take advantage of this to add a script named java to the PATH of the user running the collector that runs the real java with the desired parameters. As this is technically exploiting a vulnerability in the collector, I'm not going to provide an example.

@junhuangli (Author)

I can try that, but it is still kind of a workaround.

One more piece of information: I am running the receiver in AWS Kubernetes as a sidecar container, and this situation happens consistently.

@smamidala1

This issue might be related to the JMXGatherer memory leak issue we are experiencing: #926

@junhuangli (Author)

Thanks @smamidala1, I will follow #926.

@dehaansa (Contributor)

#949 was merged and should be available in 1.29.0. Among other issues addressed in that PR, memory leaks were resolved that may affect this behavior. Let us know if this issue persists after that release has been made available.

@junhuangli (Author)

Thanks @dehaansa!
