
node-agent daemonset not listening on correct metrics port #508

Closed
jwitko opened this issue Oct 10, 2023 · 5 comments
Labels: bug, velero

Comments

jwitko commented Oct 10, 2023

What steps did you take and what happened:
Deployed Velero with the node-agent in the standard fashion, including the PodMonitor. Scrape errors started appearing because connections to port 8085 were being refused.

What did you expect to happen:
The PodMonitor would work and Prometheus would be able to scrape the metrics endpoint.

Logs:

2023-10-10T19:08:10.268Z	warn	VictoriaMetrics/lib/promscrape/scrapework.go:387	cannot scrape target "http://10.0.43.244:8085/metrics" ({container="node-agent",endpoint="http-monitoring",instance="10.0.43.244:8085",job="infra/node-agent",namespace="infra",pod="node-agent-x4fff"}) 1 out of 1 times during -promscrape.suppressScrapeErrorsDelay=0s; the last error: cannot read data: cannot scrape "http://10.0.43.244:8085/metrics": Get "http://10.0.43.244:8085/metrics": context canceled

Anything else you would like to add:
Looking into the issue, I could see that the Velero logs contained:

INFO	Starting server	{"path": "/metrics", "kind": "metrics", "addr": "[::]:8080"}

So I switched the daemonset config to listen on port 8080 instead of 8085 and everything worked.
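
A quick way to confirm which port actually serves metrics is to port-forward both and curl them (the namespace and pod name here are taken from the scrape error above; adjust for your cluster):

kubectl -n infra port-forward pod/node-agent-x4fff 8080 8085 &
curl -s http://localhost:8080/metrics | head -n 3
curl -s http://localhost:8085/metrics | head -n 3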

  • Helm chart version and app version (use helm list -n <YOUR NAMESPACE>): 5.1.0, default image tag.
  • Kubernetes version (use kubectl version): 1.26.8
  • Cloud provider or hardware configuration: hardware.
jenting added the bug and velero labels Oct 11, 2023
PRBonham commented Oct 13, 2023

I can confirm this; it was working before the upgrade to 5.1.0:

Velero version:

velero version
Client:
        Version: v1.12.0
        Git commit: 7112c62e493b0f7570f0e7cd2088f8cad968db99
Server:
        Version: v1.12.0

Helm history:

REVISION  UPDATED                   STATUS       CHART         APP VERSION  DESCRIPTION
37        Thu Oct  5 14:16:57 2023  superseded   velero-5.0.2  1.11.1       Upgrade complete
38        Fri Oct 13 11:54:48 2023  deployed     velero-5.1.0  1.12.0       Upgrade complete

Helm chart metrics section:

# Settings for Velero's prometheus metrics. Enabled by default.
metrics:
  enabled: true
  scrapeInterval: 30s
  scrapeTimeout: 10s

  # ...

  nodeAgentPodMonitor:
    autodetect: true
    enabled: true
    annotations: {}
    additionalLabels: {}
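
With nodeAgentPodMonitor.enabled set, the chart renders a PodMonitor roughly like the following (a sketch for illustration; the exact names and labels the chart uses may differ), pointing at the daemonset's http-monitoring port, i.e. 8085:

apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: node-agent
  namespace: infra
spec:
  selector:
    matchLabels:
      name: node-agent          # illustrative label; the chart's selector may differ
  podMetricsEndpoints:
    - port: http-monitoring     # named container port, resolves to 8085
      interval: 30s
      scrapeTimeout: 10s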

al-cheb commented Oct 20, 2023

Looks like it has been fixed in PR vmware-tanzu/velero#6784:

$ ./velero node-agent server
INFO[0000] Setting log-level to INFO                    
INFO[0000] Starting Velero node-agent server  (-)        logSource="/velero/pkg/cmd/cli/nodeagent/server.go:103"
1.6977906245496273e+09	INFO	controller-runtime.metrics	Metrics server is starting to listen	{"addr": ":8080"}
INFO[0000] Starting metric server for node agent at address [:8085]  logSource="/velero/pkg/cmd/cli/nodeagent/server.go:231"

jwitko commented Oct 20, 2023

Yes, it looks like the informational message is still incorrect, but the server does now appear to be listening on port 8085 as expected/documented.

Closing this issue, as vmware-tanzu/velero#6784 has been merged.

jwitko closed this as completed Oct 20, 2023
al-cheb commented Oct 20, 2023

The node-agent server exposes two metrics endpoints:

  • controller-runtime
  • node-agent
$ ss -ntulp | grep -e "8080" -e "8085"
tcp   LISTEN 0      4096                                  *:8080             *:*    users:(("velero",pid=175416,fd=7))  
tcp   LISTEN 0      4096                                  *:8085             *:*    users:(("velero",pid=175416,fd=8))

controller-runtime:

$ curl -s service:8080/metrics | grep controller_runtime
# HELP controller_runtime_active_workers Number of currently used workers per controller
# TYPE controller_runtime_active_workers gauge
controller_runtime_active_workers{controller="datadownload"} 0
controller_runtime_active_workers{controller="dataupload"} 0

node-agent:

$ curl -s service:8085/metrics | grep data_upload
# HELP podVolume_data_upload_cancel_total Total number of canceled uploaded snapshots
# TYPE podVolume_data_upload_cancel_total counter
podVolume_data_upload_cancel_total{node=""} 0
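
If you want Prometheus to collect both sets of metrics, the PodMonitor needs one endpoint per port. A rough sketch (assuming both ports are exposed as named container ports on the daemonset; the name used below for the 8080 controller-runtime port is hypothetical):

apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: node-agent
spec:
  selector:
    matchLabels:
      name: node-agent            # illustrative selector
  podMetricsEndpoints:
    - port: http-monitoring       # node-agent metrics on 8085
    - port: controller-metrics    # hypothetical name for the controller-runtime port on 8080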

bethmage commented Nov 6, 2023

Current Helm chart version: 5.1.3 (https://github.com/vmware-tanzu/velero)
It uses the same metrics.podAnnotations for the deployment and the node-agent daemonset:

https://github.com/vmware-tanzu/helm-charts/blob/main/charts/velero/templates/node-agent-daemonset.yaml#L42
https://github.com/vmware-tanzu/helm-charts/blob/main/charts/velero/templates/deployment.yaml#L45

https://github.com/vmware-tanzu/helm-charts/blob/main/charts/velero/values.yaml#L219

metrics:
  podAnnotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "8085"
    prometheus.io/path: "/metrics"

This means that for the node-agent daemonset the Prometheus metrics endpoint is :8085/metrics, which is unreachable via HTTP. For the node-agent daemonset pods the metrics endpoint should be :8080/metrics, which is reachable via HTTP.
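
For context, annotation-driven scraping typically follows the usual kubernetes-pods pattern (a generic sketch, not the config shipped by this chart), where prometheus.io/port decides the target port, so a shared metrics.podAnnotations value pins both the deployment and the daemonset to the same port:

- job_name: kubernetes-pods
  kubernetes_sd_configs:
    - role: pod
  relabel_configs:
    # only scrape pods annotated prometheus.io/scrape: "true"
    - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
      action: keep
      regex: "true"
    # honour prometheus.io/path if set
    - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
      action: replace
      target_label: __metrics_path__
      regex: (.+)
    # rewrite the target address to use prometheus.io/port
    - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
      action: replace
      regex: '([^:]+)(?::\d+)?;(\d+)'
      replacement: '$1:$2'
      target_label: __address__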
