[Helm] Enable custom metrics, mount ConfigMap by default #351

chipzoller · 2024-07-02T22:47:26Z

Closes #170

This PR allows users to define their own custom metrics CSV in the Helm chart and mounts and consumes the ConfigMap (which was previously deployed but unused) by default. It incorporates changes originally made in PR #350.

After this PR, users may define custom metrics in the values file as follows:

customMetrics: |-
  # My custom metrics list
  DCGM_FI_PROF_SM_ACTIVE,          gauge, The ratio of cycles an SM has at least 1 warp assigned.
  DCGM_FI_PROF_SM_OCCUPANCY,       gauge, The ratio of number of warps resident on an SM.

This customMetrics value is commented out by default to allow for backwards compatibility.
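For illustration, the rendered ConfigMap would presumably look something like the sketch below once customMetrics is set. The resource name and the `metrics` data key are assumptions made for this example, not necessarily what the chart emits.

```yaml
# Sketch of a rendered ConfigMap (name and data key are hypothetical).
apiVersion: v1
kind: ConfigMap
metadata:
  name: exporter-metrics-config-map
data:
  # The multi-line string from .Values.customMetrics lands here unchanged.
  metrics: |
    # My custom metrics list
    DCGM_FI_PROF_SM_ACTIVE,          gauge, The ratio of cycles an SM has at least 1 warp assigned.
    DCGM_FI_PROF_SM_OCCUPANCY,       gauge, The ratio of number of warps resident on an SM.
```

Lines starting with `#` inside the block scalar are part of the CSV content, not YAML comments.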

Signed-off-by: Chip Zoller <chipzoller@gmail.com>

chipzoller · 2024-07-02T22:49:39Z

cc @nvvfedorov and @glowkey for your suggestions/input.

chipzoller · 2024-07-11T12:47:01Z

Any thoughts on this?

chipzoller · 2024-07-22T13:43:59Z

Checking back up on this one as well. Any feedback or suggestions?

nvvfedorov · 2024-07-22T15:29:50Z

@chipzoller, why is this solution better than the existing one, where you can define metrics in a ConfigMap? See: https://github.com/NVIDIA/dcgm-exporter/blob/main/deployment/templates/metrics-configmap.yaml

chipzoller · 2024-07-22T15:34:25Z

The current solution requires users to modify the ConfigMap AFTER it has been deployed, whereas this solution allows them to define their customized list of metrics up front, at installation time.

frittentheke · 2024-07-23T08:21:37Z

I was just about to raise a new feature request to include certain metrics by default, namely DCGM_FI_DEV_FB_TOTAL, DCGM_FI_DEV_FB_RESERVED, DCGM_FI_DEV_FB_USED_PERCENT.
While I still believe the provided defaults should absolutely be sensible and cover 90% of what people usually want from their metrics, opening up the Helm chart's values.yaml to allow configuration of the metrics is the right way forward.

The list of metrics is clearly configuration, and that should be dynamic and configurable, not a baked-in static file that can opportunistically be overwritten or edited via an out-of-band (non-Helm) ConfigMap edit. Everything that can sensibly be changed and configured should be made available solely via the configuration mechanism embraced by Helm: the values ™️

glowkey · 2024-07-25T21:01:51Z

I am inclined towards integrating this PR over #350. This PR seems to allow for the greatest flexibility going forward while not changing the default behavior.

chipzoller · 2024-07-25T21:42:04Z

Would certainly respect whatever decision is made, but I still question the validity of needlessly creating resources but not using them. #350 is fully backwards compatible from a user perspective and obviates the need for them to manually define additional values to make use of the resources already present.

glowkey · 2024-07-26T14:22:29Z

I liked the ability to create and deploy a custom list of metrics all at once and from within a single values file. How would you envision users doing that with #350?

chipzoller · 2024-07-26T14:28:39Z

> I liked the ability to create and deploy a custom list of metrics all at once and from within a single values file. How would you envision users doing that with #350?

#350 intentionally doesn't focus on that, which is why I opened #351 (this one). Ideally, you take both of them. The combined effect is that users define a custom list of metrics in the values file (provided by #351), and that list is automatically picked up and used without users needing to specify that dcgm-exporter should actually mount it, which reduces the total values required (provided by #350). So the net effect after both is that the values file is just this simple:

customMetrics: |-
  # My custom metrics list
  DCGM_FI_PROF_SM_ACTIVE,          gauge, The ratio of cycles an SM has at least 1 warp assigned.
  DCGM_FI_PROF_SM_OCCUPANCY,       gauge, The ratio of number of warps resident on an SM.
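To make the "automatically picked up" part concrete, here is a rough sketch of the kind of Pod spec wiring the mount implies. The volume name, ConfigMap name, data key, and file path are all hypothetical, and it assumes the exporter reads its metrics list from the file passed with `-f`; the chart's actual templates may differ.

```yaml
# Pod spec fragment (sketch only; every name and path here is an assumption).
spec:
  containers:
    - name: dcgm-exporter
      args:
        - "-f"                                   # assumed flag for the metrics CSV
        - /etc/dcgm-exporter/custom-metrics.csv
      volumeMounts:
        - name: metrics-config
          mountPath: /etc/dcgm-exporter/custom-metrics.csv
          subPath: custom-metrics.csv            # mount a single file, not a directory
          readOnly: true
  volumes:
    - name: metrics-config
      configMap:
        name: exporter-metrics-config-map        # hypothetical ConfigMap name
        items:
          - key: metrics                         # hypothetical data key
            path: custom-metrics.csv
```

With this wiring in the chart by default, the user-facing values stay limited to the customMetrics block.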

glowkey · 2024-07-26T15:45:03Z

Thanks for the additional clarification, makes sense and seems like a valid approach to solving this problem.

frittentheke · 2024-07-26T18:47:09Z

While I like the approach of having a "baseline" that just works and then some "custom metrics" on top - what happens if I want to exclude metrics or make changes to the ones "built-in"?

I absolutely do not want to block the way forward with more lengthy discussions and alternatives.
But what about simplifying this by externalizing the whole default metrics config file to the (default) chart values.yaml, which can then be completely and flexibly overridden via one's own values.yaml?

This provides full flexibility and aligns nicely with how other configuration files are configurable, e.g. in the gpu-operator's values:

glowkey · 2024-07-26T20:33:40Z

> But what about simplifying this by externalizing the whole default metrics config file

+1 I think that's a good extension to this approach.

chipzoller · 2024-07-27T13:01:18Z

> what happens if I want to exclude metrics or make changes to the ones "built-in"?

You'd copy the contents of the ConfigMap which contains the default metrics and supply those contents, minus any additions/removals you want, to the customMetrics value.
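As a sketch of that workflow, a values file could look like the following. It assumes the listed DCGM_FI_DEV_* entries are among the defaults in the ConfigMap (only a subset is shown), and the help text on the added entries is illustrative wording rather than the official descriptions.

```yaml
# values.yaml (sketch): customMetrics replaces the whole list, so any default
# entries you still want have to be repeated here.
customMetrics: |-
  # Copied from the default list (subset; assumed to be present there)
  DCGM_FI_DEV_GPU_TEMP,        gauge, GPU temperature (in C).
  DCGM_FI_DEV_POWER_USAGE,     gauge, Power draw (in W).
  DCGM_FI_DEV_FB_FREE,         gauge, Framebuffer memory free (in MiB).
  DCGM_FI_DEV_FB_USED,         gauge, Framebuffer memory used (in MiB).
  # Additions (help text is illustrative)
  DCGM_FI_DEV_FB_TOTAL,        gauge, Total framebuffer memory.
  DCGM_FI_DEV_FB_RESERVED,     gauge, Reserved framebuffer memory.
  DCGM_FI_DEV_FB_USED_PERCENT, gauge, Framebuffer memory usage relative to total.
```

Removing a built-in metric is then just a matter of leaving its line out of the copied list.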

glowkey · 2024-07-29T21:42:57Z

@chipzoller would you be willing to combine this PR with #350 and include the full dcp-metrics-included.csv contents in the customMetrics section (commented out) in a new PR that we'll look at incorporating into the next major version of DCGM-Exporter?

Signed-off-by: Chip Zoller <chipzoller@gmail.com>

chipzoller · 2024-07-29T23:17:51Z

Is it OK that I just incorporated them here? If so, I'll rename/update this PR and close #350.

glowkey · 2024-07-30T14:51:57Z

Absolutely, thank you for the contribution!

chipzoller · 2024-07-30T15:24:28Z

Ok, done. Ready for review, @glowkey.

chipzoller · 2024-08-05T12:17:29Z

Checking back here to see if everything looks good.

glowkey · 2024-08-05T14:21:34Z

Yes, thanks for following up. We plan on running it through tests and merging this week.

chipzoller · 2024-08-15T19:37:02Z

Should I also submit a PR to the operator to be able to specify these custom metrics in its values?

glowkey · 2024-08-15T20:33:56Z

Feel free, though their deployment is slightly different. Also, can you sign your commit?

frittentheke · 2024-08-15T21:04:59Z

> Should I also submit a PR to the operator to be able to specify these custom metrics in its values?

Please kindly do, @chipzoller!
Thanks a bunch for your work on metrics. Highly appreciated.

chipzoller · 2024-08-16T09:38:06Z

> Feel free, though their deployment is slightly different. Also, can you sign your commit?

My commits are signed. There is no DCO check in this repo as far as I can see.

When can we expect this to be merged so a follow-on PR for the operator can be created?

chipzoller · 2024-08-16T10:19:47Z

Since it looks like everything in the operator is driven through the ClusterPolicy CR, updating all the relevant code to add this one field is probably going to be beyond my abilities (and available time). I can, however, log an enhancement issue in the operator repo so it can be tracked and possibly picked up by someone else, possibly even yourself, @frittentheke.

glowkey · 2024-08-16T13:51:44Z

The commits are not showing up as verified. I can see they are signed-off but not signed. The workflow prevents merging unsigned commits.

chipzoller · 2024-08-16T14:17:23Z

@glowkey, see now.

frittentheke · 2024-08-16T14:18:28Z

> Since it looks like everything in the operator is driven through the ClusterPolicy CR, updating all the relevant code to add this one field is probably going to be beyond my abilities (and available time). I can, however, log an enhancement issue in the operator repo so it can be tracked and possibly picked up by someone else, possibly even yourself, @frittentheke.

That would be awesome, @chipzoller, if you could write up an issue so that it can be tracked as a missing capability / integration of the operator with the exporter!

chipzoller · 2024-08-16T14:46:43Z

See NVIDIA/gpu-operator#934

chipzoller · 2024-08-23T13:06:39Z

Corresponding PR sent in NVIDIA/gpu-operator#949

gpgn · 2024-09-17T09:26:26Z

Hi @chipzoller do you perhaps have an expected time for this to land in a release? It would be great for us to start configuring additional metrics directly in the Helm chart. Thanks!

chipzoller · 2024-09-17T10:51:19Z

I am not a maintainer of this project, only a contributor.

allow custom metrics

a0ff114

Signed-off-by: Chip Zoller <chipzoller@gmail.com>

chipzoller mentioned this pull request Jul 2, 2024

[Helm] Enable ConfigMap mount by default #350

Closed

chipzoller changed the title ~~[Helm] All user-defined custom metrics~~ [Helm] Allow user-defined custom metrics Jul 2, 2024

chipzoller added 3 commits July 29, 2024 18:08

indentation normalization

d6c10fa

Signed-off-by: Chip Zoller <chipzoller@gmail.com>

port over full metrics ConfigMap contents

07de51d

Signed-off-by: Chip Zoller <chipzoller@gmail.com>

pull in extra mounts from PR NVIDIA#350

21989d3

Signed-off-by: Chip Zoller <chipzoller@gmail.com>

chipzoller changed the title ~~[Helm] Allow user-defined custom metrics~~ [Helm] Enable custom metrics, mount ConfigMap by default Jul 30, 2024

glowkey previously approved these changes Aug 12, 2024

signed

e3cd0e0

chipzoller dismissed glowkey’s stale review via e3cd0e0 August 16, 2024 14:16

glowkey approved these changes Aug 16, 2024

glowkey merged commit 178d22f into NVIDIA:main Aug 16, 2024
1 check passed

chipzoller deleted the cz-cm-template-helm branch August 16, 2024 14:36

chipzoller mentioned this pull request Aug 16, 2024

[Feature] Support for new customMetrics value in DCGM Exporter NVIDIA/gpu-operator#934

Closed

This was referenced Aug 16, 2024

Update contribution doc to require signing #376

Open

Allow custom metrics for DCGM Exporter NVIDIA/gpu-operator#949

Merged

glowkey added a commit that referenced this pull request Jan 7, 2025

[Helm] Enable custom metrics, mount ConfigMap by default (#351)

c2d3dfa
