-
Notifications
You must be signed in to change notification settings - Fork 171
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Helm] Enable custom metrics, mount ConfigMap by default #351
Conversation
Signed-off-by: Chip Zoller <chipzoller@gmail.com>
cc @nvvfedorov and @glowkey for your suggestions/input. |
Any thoughts on this? |
Checking back up on this one as well. Any feedback or suggestions? |
@chipzoller, Why is this solution better than the existing one, where you can define metrics in a config map? See: https://github.com/NVIDIA/dcgms-exporter/blob/main/deployment/templates/metrics-configmap.yaml |
The current solution requires users modify the ConfigMap AFTER it has been deployed whereas this solution allows them to define their customized list of metrics up front, at installation time. |
I was just about to raise a new feature request to include certain metrics per default, namely The list of metrics apparently is config and that should be something dynamic / configurable, not some baked in static file that can opportunistically be overwritten / edited via and out of band (non-Helm) config map edit. Everything than can sensibly be changed and configured should be made available solely via the configuration mechanism embraced by Helm: the values ™️ |
I am inclined towards integrating this PR over #350. This PR seems to allow for the greatest flexibility going forward while not changing the default behavior. |
Would certainly respect whatever decision is made, but I still question the validity of needlessly creating resources but not using them. #350 is fully backwards compatible from a user perspective and obviates the need for them to manually define additional values to make use of the resources already present. |
I liked the ability to create and deploy a custom list of metrics all at once and from within a single values file. How would you envision users doing that with #350? |
#350 doesn't focus on that intentionally which is why I opened #351 (this one). Ideally, you take both of them so the combined effect is users define a custom list of metrics in the values file (provided by #351) and that values file is automatically picked up and used without users needing to specify dcgm-exporter should actually mount it, which simplifies the total values required (provided by #350). So, net effect after both, is the values file is just this simple: customMetrics: |-
# My custom metrics list
DCGM_FI_PROF_SM_ACTIVE, gauge, The ratio of cycles an SM has at least 1 warp assigned.
DCGM_FI_PROF_SM_OCCUPANCY, gauge, The ratio of number of warps resident on an SM. |
Thanks for the additional clarification, makes sense and seems like a valid approach to solving this problem. |
While I like the approach of having a "baseline" that just works and then some "custom metrics" on top - what happens if I want to exclude metrics or make changes to the ones "built-in"? I absolutely do not want to block the way forward with more lengthy discussions and alternatives. This provides full flexibility and aligns nicely with how other configuration files are configurable, e.g. in the gpu-operator's values: |
+1 I think that's a good extension to this approach. |
You'd copy the contents of the ConfigMap which contains the default metrics and supply those contents, minus any additions/removals you want, to the |
@chipzoller would you be willing to combine this PR with #350 and include the full |
Signed-off-by: Chip Zoller <chipzoller@gmail.com>
Signed-off-by: Chip Zoller <chipzoller@gmail.com>
Signed-off-by: Chip Zoller <chipzoller@gmail.com>
Is it OK that I just incorporated them here? If so, I'll rename/update this PR and close #350. |
Absolutely, thank you for the contribution! |
Ok, done. Ready for review, @glowkey. |
Checking back here to see if everything looks good. |
Yes, thanks for following up. We plan on running it through tests and merging this week. |
Should I also submit a PR to the operator to be able to specify these custom metrics in its values? |
Feel free, though their deployment is slightly different. Also, can you sign your commit? |
please kindly do @chipzoller! |
My commits are signed. There is no DCO check in this repo as I see. When can we expect this to be merged so a follow-on PR for the operator can be created? |
Since it looks like everything in the operator is driven through the ClusterPolicy CR, updating all the relevant code to add this one field is probably going to be beyond my abilities (and available time). I can, however, log an enhancement issue in the operator repo so it can be tracked and possibly picked up by someone else, possibly even yourself, @frittentheke. |
The commits are not showing up as verified. I can see they are signed-off but not signed. The workflow prevents merging unsigned commits. |
@glowkey, see now. |
That would be awesome @chipzoller if you could write up an issue so that is can be tracked as somewhat of a missing capability / integration of the operator with the exporter! |
Corresponding PR sent in NVIDIA/gpu-operator#949 |
Hi @chipzoller do you perhaps have an expected time for this to land in a release? It would be great for us to start configuring additional metrics directly in the Helm chart. Thanks! |
I am not a maintainer of this project, only a contributor. |
Closes #170
This PR allows users to define their own custom metrics CSV in the Helm chart and mounts and consumes the ConfigMap (which was previously deployed but unused) by default. It incorporates changes originally made in PR #350.
After this PR, users may define custom metrics in the values file as the following:
This
customMetrics
value is commented out by default to allow for backwards compatibility.