Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Helm] Enable custom metrics, mount ConfigMap by default #351

Merged
merged 5 commits into from
Aug 16, 2024

Conversation

chipzoller
Copy link
Contributor

@chipzoller chipzoller commented Jul 2, 2024

Closes #170

This PR allows users to define their own custom metrics CSV in the Helm chart and mounts and consumes the ConfigMap (which was previously deployed but unused) by default. It incorporates changes originally made in PR #350.

After this PR, users may define custom metrics in the values file as the following:

customMetrics: |-
  # My custom metrics list
  DCGM_FI_PROF_SM_ACTIVE,          gauge, The ratio of cycles an SM has at least 1 warp assigned.
  DCGM_FI_PROF_SM_OCCUPANCY,       gauge, The ratio of number of warps resident on an SM.

This customMetrics value is commented out by default to allow for backwards compatibility.

Signed-off-by: Chip Zoller <chipzoller@gmail.com>
@chipzoller chipzoller changed the title [Helm] All user-defined custom metrics [Helm] Allow user-defined custom metrics Jul 2, 2024
@chipzoller
Copy link
Contributor Author

cc @nvvfedorov and @glowkey for your suggestions/input.

@chipzoller
Copy link
Contributor Author

Any thoughts on this?

@chipzoller
Copy link
Contributor Author

Checking back up on this one as well. Any feedback or suggestions?

@nvvfedorov
Copy link
Collaborator

@chipzoller, Why is this solution better than the existing one, where you can define metrics in a config map? See: https://github.com/NVIDIA/dcgms-exporter/blob/main/deployment/templates/metrics-configmap.yaml

@chipzoller
Copy link
Contributor Author

The current solution requires users modify the ConfigMap AFTER it has been deployed whereas this solution allows them to define their customized list of metrics up front, at installation time.

@frittentheke
Copy link

frittentheke commented Jul 23, 2024

I was just about to raise a new feature request to include certain metrics per default, namely DCGM_FI_DEV_FB_TOTAL, DCGM_FI_DEV_FB_RESERVED, DCGM_FI_DEV_FB_USED_PERCENT.
While I still believe the defaults provided should absolutely be sensible and the 90% coverage of what people usually would want from their metrics, opening up and and using the Helm chart values.yaml to allow configuration of the metrics is the right way forward.

The list of metrics apparently is config and that should be something dynamic / configurable, not some baked in static file that can opportunistically be overwritten / edited via and out of band (non-Helm) config map edit. Everything than can sensibly be changed and configured should be made available solely via the configuration mechanism embraced by Helm: the values ™️

@glowkey
Copy link
Collaborator

glowkey commented Jul 25, 2024

I am inclined towards integrating this PR over #350. This PR seems to allow for the greatest flexibility going forward while not changing the default behavior.

@chipzoller
Copy link
Contributor Author

Would certainly respect whatever decision is made, but I still question the validity of needlessly creating resources but not using them. #350 is fully backwards compatible from a user perspective and obviates the need for them to manually define additional values to make use of the resources already present.

@glowkey
Copy link
Collaborator

glowkey commented Jul 26, 2024

I liked the ability to create and deploy a custom list of metrics all at once and from within a single values file. How would you envision users doing that with #350?

@chipzoller
Copy link
Contributor Author

I liked the ability to create and deploy a custom list of metrics all at once and from within a single values file. How would you envision users doing that with #350?

#350 doesn't focus on that intentionally which is why I opened #351 (this one). Ideally, you take both of them so the combined effect is users define a custom list of metrics in the values file (provided by #351) and that values file is automatically picked up and used without users needing to specify dcgm-exporter should actually mount it, which simplifies the total values required (provided by #350). So, net effect after both, is the values file is just this simple:

customMetrics: |-
  # My custom metrics list
  DCGM_FI_PROF_SM_ACTIVE,          gauge, The ratio of cycles an SM has at least 1 warp assigned.
  DCGM_FI_PROF_SM_OCCUPANCY,       gauge, The ratio of number of warps resident on an SM.

@glowkey
Copy link
Collaborator

glowkey commented Jul 26, 2024

Thanks for the additional clarification, makes sense and seems like a valid approach to solving this problem.

@frittentheke
Copy link

While I like the approach of having a "baseline" that just works and then some "custom metrics" on top - what happens if I want to exclude metrics or make changes to the ones "built-in"?

I absolutely do not want to block the way forward with more lengthy discussions and alternatives.
But what about simplifying this by externalizing the whole default metrics config file to the (default) Chart values.yaml which can then be completely and flexibly overridden via one's own values.yaml.

This provides full flexibility and aligns nicely with how other configuration files are configurable, e.g. in the gpu-operator's values:

@glowkey
Copy link
Collaborator

glowkey commented Jul 26, 2024

But what about simplifying this by externalizing the whole default metrics config file

+1 I think that's a good extension to this approach.

@chipzoller
Copy link
Contributor Author

what happens if I want to exclude metrics or make changes to the ones "built-in"?

You'd copy the contents of the ConfigMap which contains the default metrics and supply those contents, minus any additions/removals you want, to the customMetrics value.

@glowkey
Copy link
Collaborator

glowkey commented Jul 29, 2024

@chipzoller would you be willing to combine this PR with #350 and include the full dcp-metrics-included.csv contents in the customMetrics section (commented out) in a new PR that we'll look at incorporating into the next major version of DCGM-Exporter?

Signed-off-by: Chip Zoller <chipzoller@gmail.com>
Signed-off-by: Chip Zoller <chipzoller@gmail.com>
Signed-off-by: Chip Zoller <chipzoller@gmail.com>
@chipzoller
Copy link
Contributor Author

Is it OK that I just incorporated them here? If so, I'll rename/update this PR and close #350.

@glowkey
Copy link
Collaborator

glowkey commented Jul 30, 2024

Absolutely, thank you for the contribution!

@chipzoller chipzoller changed the title [Helm] Allow user-defined custom metrics [Helm] Enable custom metrics, mount ConfigMap by default Jul 30, 2024
@chipzoller
Copy link
Contributor Author

Ok, done. Ready for review, @glowkey.

@chipzoller
Copy link
Contributor Author

Checking back here to see if everything looks good.

@glowkey
Copy link
Collaborator

glowkey commented Aug 5, 2024

Yes, thanks for following up. We plan on running it through tests and merging this week.

glowkey
glowkey previously approved these changes Aug 12, 2024
@chipzoller
Copy link
Contributor Author

Should I also submit a PR to the operator to be able to specify these custom metrics in its values?

@glowkey
Copy link
Collaborator

glowkey commented Aug 15, 2024

Feel free, though their deployment is slightly different. Also, can you sign your commit?

@frittentheke
Copy link

Should I also submit a PR to the operator to be able to specify these custom metrics in its values?

please kindly do @chipzoller!
Thank a bunch for your work in regards to metrics. Highly appreciated.

@chipzoller
Copy link
Contributor Author

Feel free, though their deployment is slightly different. Also, can you sign your commit?

My commits are signed. There is no DCO check in this repo as I see.

When can we expect this to be merged so a follow-on PR for the operator can be created?

@chipzoller
Copy link
Contributor Author

Since it looks like everything in the operator is driven through the ClusterPolicy CR, updating all the relevant code to add this one field is probably going to be beyond my abilities (and available time). I can, however, log an enhancement issue in the operator repo so it can be tracked and possibly picked up by someone else, possibly even yourself, @frittentheke.

@glowkey
Copy link
Collaborator

glowkey commented Aug 16, 2024

The commits are not showing up as verified. I can see they are signed-off but not signed. The workflow prevents merging unsigned commits.

@chipzoller
Copy link
Contributor Author

@glowkey, see now.

@frittentheke
Copy link

Since it looks like everything in the operator is driven through the ClusterPolicy CR, updating all the relevant code to add this one field is probably going to be beyond my abilities (and available time). I can, however, log an enhancement issue in the operator repo so it can be tracked and possibly picked up by someone else, possibly even yourself, @frittentheke.

That would be awesome @chipzoller if you could write up an issue so that is can be tracked as somewhat of a missing capability / integration of the operator with the exporter!

@glowkey glowkey merged commit 178d22f into NVIDIA:main Aug 16, 2024
1 check passed
@chipzoller chipzoller deleted the cz-cm-template-helm branch August 16, 2024 14:36
@chipzoller
Copy link
Contributor Author

See NVIDIA/gpu-operator#934

@chipzoller
Copy link
Contributor Author

Corresponding PR sent in NVIDIA/gpu-operator#949

@gpgn
Copy link

gpgn commented Sep 17, 2024

Hi @chipzoller do you perhaps have an expected time for this to land in a release? It would be great for us to start configuring additional metrics directly in the Helm chart. Thanks!

@chipzoller
Copy link
Contributor Author

I am not a maintainer of this project, only a contributor.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Helm: exporter-metrics-config-map should not be applied by default
5 participants