
node-agent daemonset not listening on correct metrics port #508

Closed
jwitko opened this issue Oct 10, 2023 · 5 comments
Labels: bug, velero

Comments

jwitko commented Oct 10, 2023

What steps did you take and what happened:
Deployed Velero with the node-agent in the standard fashion, including the PodMonitor. Scrape errors started appearing because connections to port 8085 were being refused.

What did you expect to happen:
The PodMonitor would work and Prometheus would be able to scrape the metrics endpoint.

Logs:

2023-10-10T19:08:10.268Z	warn	VictoriaMetrics/lib/promscrape/scrapework.go:387	cannot scrape target "http://10.0.43.244:8085/metrics" ({container="node-agent",endpoint="http-monitoring",instance="10.0.43.244:8085",job="infra/node-agent",namespace="infra",pod="node-agent-x4fff"}) 1 out of 1 times during -promscrape.suppressScrapeErrorsDelay=0s; the last error: cannot read data: cannot scrape "http://10.0.43.244:8085/metrics": Get "http://10.0.43.244:8085/metrics": context canceled

Anything else you would like to add:
Looking into the issue, I could see that the Velero logs contained:

INFO	Starting server	{"path": "/metrics", "kind": "metrics", "addr": "[::]:8080"}

So I switched the daemonset config to listen on port 8080 instead of 8085 and everything worked.
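
A quick way to confirm which port actually serves metrics is to port-forward both and curl them (the namespace and pod name here are taken from the scrape error above; adjust for your cluster):

kubectl -n infra port-forward pod/node-agent-x4fff 8080 8085 &
curl -s http://localhost:8080/metrics | head -n 3
curl -s http://localhost:8085/metrics | head -n 3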

  • Helm chart version and app version (use helm list -n <YOUR NAMESPACE>): 5.1.0, default image tag.
  • Kubernetes version (use kubectl version): 1.26.8
  • Cloud provider or hardware configuration: hardware.
jenting added the bug and velero labels Oct 11, 2023
PRBonham commented Oct 13, 2023

I can confirm this; it was working before the upgrade to 5.1.0:

Velero version:

velero version
Client:
        Version: v1.12.0
        Git commit: 7112c62e493b0f7570f0e7cd2088f8cad968db99
Server:
        Version: v1.12.0

Helm history:

REVISION  UPDATED                   STATUS       CHART         APP VERSION  DESCRIPTION
37        Thu Oct  5 14:16:57 2023  superseded   velero-5.0.2  1.11.1       Upgrade complete
38        Fri Oct 13 11:54:48 2023  deployed     velero-5.1.0  1.12.0       Upgrade complete

Helm chart metrics section:

# Settings for Velero's prometheus metrics. Enabled by default.
metrics:
  enabled: true
  scrapeInterval: 30s
  scrapeTimeout: 10s

  # ...

  nodeAgentPodMonitor:
    autodetect: true
    enabled: true
    annotations: {}
    additionalLabels: {}
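
With nodeAgentPodMonitor.enabled set, the chart renders a PodMonitor roughly like the following (a sketch for illustration; the exact names and labels the chart uses may differ), pointing at the daemonset's http-monitoring port, i.e. 8085:

apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: node-agent
  namespace: infra
spec:
  selector:
    matchLabels:
      name: node-agent          # illustrative label; the chart's selector may differ
  podMetricsEndpoints:
    - port: http-monitoring     # named container port, resolves to 8085
      interval: 30s
      scrapeTimeout: 10s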

al-cheb commented Oct 20, 2023

Looks like it has been fixed in PR vmware-tanzu/velero#6784:

$ ./velero node-agent server
INFO[0000] Setting log-level to INFO                    
INFO[0000] Starting Velero node-agent server  (-)        logSource="/velero/pkg/cmd/cli/nodeagent/server.go:103"
1.6977906245496273e+09	INFO	controller-runtime.metrics	Metrics server is starting to listen	{"addr": ":8080"}
INFO[0000] Starting metric server for node agent at address [:8085]  logSource="/velero/pkg/cmd/cli/nodeagent/server.go:231"

jwitko commented Oct 20, 2023

Yes, it looks like the informational message is still incorrect, but the server does now appear to be listening on port 8085 as expected/documented.

Closing this issue, as vmware-tanzu/velero#6784 has been merged.

jwitko closed this as completed Oct 20, 2023
al-cheb commented Oct 20, 2023

The node-agent server exposes two metrics endpoints:

  • controller-runtime
  • node-agent
$ ss -ntulp | grep -e "8080" -e "8085"
tcp   LISTEN 0      4096                                  *:8080             *:*    users:(("velero",pid=175416,fd=7))  
tcp   LISTEN 0      4096                                  *:8085             *:*    users:(("velero",pid=175416,fd=8))

controller-runtime:

$ curl -s service:8080/metrics | grep controller_runtime
# HELP controller_runtime_active_workers Number of currently used workers per controller
# TYPE controller_runtime_active_workers gauge
controller_runtime_active_workers{controller="datadownload"} 0
controller_runtime_active_workers{controller="dataupload"} 0

node-agent:

$ curl -s service:8085/metrics | grep data_upload
# HELP podVolume_data_upload_cancel_total Total number of canceled uploaded snapshots
# TYPE podVolume_data_upload_cancel_total counter
podVolume_data_upload_cancel_total{node=""} 0
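
If you want Prometheus to collect both sets of metrics, the PodMonitor needs one endpoint per port. A rough sketch (assuming both ports are exposed as named container ports on the daemonset; the name used below for the 8080 controller-runtime port is hypothetical):

apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: node-agent
spec:
  selector:
    matchLabels:
      name: node-agent            # illustrative selector
  podMetricsEndpoints:
    - port: http-monitoring       # node-agent metrics on 8085
    - port: controller-metrics    # hypothetical name for the controller-runtime port on 8080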

bethmage commented Nov 6, 2023

Current Helm chart version: 5.1.3 (https://github.com/vmware-tanzu/velero)
It uses the same metrics.podAnnotations for the deployment and the node-agent daemonset:

https://github.com/vmware-tanzu/helm-charts/blob/main/charts/velero/templates/node-agent-daemonset.yaml#L42
https://github.com/vmware-tanzu/helm-charts/blob/main/charts/velero/templates/deployment.yaml#L45

https://github.com/vmware-tanzu/helm-charts/blob/main/charts/velero/values.yaml#L219

metrics:
  podAnnotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "8085"
    prometheus.io/path: "/metrics"

This means that for the node-agent daemonset the Prometheus metrics endpoint is :8085/metrics, which is unreachable via HTTP. For the node-agent daemonset pods the metrics endpoint should be :8080/metrics, which is reachable via HTTP.
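
For context, annotation-driven scraping typically follows the usual kubernetes-pods pattern (a generic sketch, not the config shipped by this chart), where prometheus.io/port decides the target port, so a shared metrics.podAnnotations value pins both the deployment and the daemonset to the same port:

- job_name: kubernetes-pods
  kubernetes_sd_configs:
    - role: pod
  relabel_configs:
    # only scrape pods annotated prometheus.io/scrape: "true"
    - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
      action: keep
      regex: "true"
    # honour prometheus.io/path if set
    - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
      action: replace
      target_label: __metrics_path__
      regex: (.+)
    # rewrite the target address to use prometheus.io/port
    - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
      action: replace
      regex: '([^:]+)(?::\d+)?;(\d+)'
      replacement: '$1:$2'
      target_label: __address__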
