feat: add support for gpu sharing metrics in k8s #432
+422
−55
We add support for capturing separate metrics when running dcgm-exporter in K8s clusters that have GPU sharing enabled, including time-sharing and MPS. This should now support GPU sharing on MIG devices as well.
We ensure this is supported for both the NVIDIA and GKE device plugins.
First, we make a small fix to the Kubernetes PodMapper transform processor. Specifically, we update the regular expression used in building the device mapping to properly capture pod attributes for both MIG and MIG-with-sharing GPUs in GKE.
The bulk of the change is guarded by a new configuration parameter, which can be passed as a flag (`--kubernetes-virtual-gpus`) or as an environment variable (`KUBERNETES_VIRTUAL_GPUS`). If set, the Kubernetes PodMapper transform processor uses a different mechanism to build the device mapping, which creates a copy of the metric for every shared (i.e. virtual) GPU exposed by the device plugin. To disambiguate the generated timeseries, it adds a new label `vgpu` set to the detected shared GPU replica.

This also fixes an issue where pod attributes are not guaranteed to be consistently associated with the same metric: if the podresources API does not consistently return the device IDs in the same order between calls, the device-to-pod association in the map can change between scrapes due to an overwrite that happens in the Process loop.
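For illustration, consider a node where the device plugin exposes a time-shared GPU as multiple replicas. With the flag enabled, each replica gets its own copy of the metric, disambiguated by the new `vgpu` label. The sketch below is illustrative only; the metric name, label set, and values are examples, not the exact exporter output:

```
# Without --kubernetes-virtual-gpus: one timeseries per physical GPU,
# and pod attribution may flap between scrapes if the podresources API
# returns device IDs in a different order.
DCGM_FI_DEV_GPU_UTIL{gpu="0",UUID="GPU-...",pod="pod-a",namespace="default"} 85

# With --kubernetes-virtual-gpus: one copy per shared replica, each
# stably associated with the pod using that replica.
DCGM_FI_DEV_GPU_UTIL{gpu="0",UUID="GPU-...",vgpu="0",pod="pod-a",namespace="default"} 85
DCGM_FI_DEV_GPU_UTIL{gpu="0",UUID="GPU-...",vgpu="1",pod="pod-b",namespace="default"} 85
```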
Ultimately, we may wish to make this the default behavior. However, the new label, unless accounted for in existing queries (e.g. with `without` or `ignore` clauses), may break existing dashboards and alerts. Allowing users to opt in via a flag ensures backwards compatibility in these scenarios.

Finally, we update the unit tests to ensure thorough coverage for the changes.