feat: add support for gpu sharing metrics in k8s #432

Open
wants to merge 1 commit into
base: main
pintohutch
Contributor

We add support for capturing separate metrics when running dcgm-exporter
in K8s clusters that have GPU sharing enabled, including time-sharing
and MPS. This should now support GPU sharing on MIG devices as well.

We ensure this is supported for both the NVIDIA and GKE device plugins,
respectively at:
* https://github.com/NVIDIA/k8s-device-plugin
* https://github.com/GoogleCloudPlatform/container-engine-accelerators

First, we make a small fix to the Kubernetes PodMapper transform
processor. Specifically, we update the regular expression used in
building the device mapping to properly capture pod attributes in both
MIG and MIG-with-sharing GPUs in GKE.

The bulk of the change is guarded by a new configuration parameter,
which can be passed in as a flag `--kubernetes-virtual-gpus` or as an
environment variable `KUBERNETES_VIRTUAL_GPUS`. If set, the Kubernetes
PodMapper transform processor uses a different mechanism to build the
device mapping, which creates a copy of the metric for every shared
(i.e. virtual) GPU exposed by the device plugin. To disambiguate the
generated time series, it adds a new label `vgpu` set to the detected
shared GPU replica.
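
As a rough illustration of the fan-out described above (not the code in
this PR), the sketch below copies one metric per shared replica and adds
a `vgpu` label to each copy. The `Metric` type, the function names, and
the device ID formats (`<id>::<n>` for the NVIDIA plugin, `<id>/vgpu<n>`
for the GKE plugin) are assumptions made for the example.

```go
package main

import (
	"sort"
	"strings"
)

// Metric is a hypothetical, simplified stand-in for an exporter sample.
type Metric struct {
	Name   string
	GPU    string // physical device ID the sample was collected for
	Value  string
	Labels map[string]string
}

// splitReplica separates a device-plugin device ID into its physical
// part and its shared-replica part.
// ASSUMPTION: replicas are encoded as "<id>::<n>" or "<id>/vgpu<n>".
func splitReplica(deviceID string) (physical, replica string) {
	if i := strings.LastIndex(deviceID, "::"); i >= 0 {
		return deviceID[:i], deviceID[i+2:]
	}
	if i := strings.LastIndex(deviceID, "/vgpu"); i >= 0 {
		return deviceID[:i], deviceID[i+len("/vgpu"):]
	}
	return deviceID, "" // not a shared device
}

// expandPerVGPU returns one copy of m for every replica of m.GPU found
// in deviceIDs, each carrying a "vgpu" label. If the GPU is not shared,
// the metric is returned unchanged.
func expandPerVGPU(m Metric, deviceIDs []string) []Metric {
	var replicas []string
	for _, id := range deviceIDs {
		physical, replica := splitReplica(id)
		if physical == m.GPU && replica != "" {
			replicas = append(replicas, replica)
		}
	}
	if len(replicas) == 0 {
		return []Metric{m}
	}
	// Lexicographic sort: only meant to make the output order stable.
	sort.Strings(replicas)

	out := make([]Metric, 0, len(replicas))
	for _, r := range replicas {
		// Copy the label map so the copies do not share state.
		labels := make(map[string]string, len(m.Labels)+1)
		for k, v := range m.Labels {
			labels[k] = v
		}
		labels["vgpu"] = r
		c := m
		c.Labels = labels
		out = append(out, c)
	}
	return out
}
```

In the real processor each copy would also receive the pod attributes of
the workload holding that replica; the sketch leaves that step out.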

This also fixes an issue where pod attributes are not guaranteed to be
consistently associated with the same metric. If the podresources API
does not consistently return the device-ids in the same order between
calls, the device-to-pod association in the map can change between
scrapes due to an overwrite that happens in the Process loop.
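
To make the overwrite concrete, here is a minimal sketch of this class
of bug, not the actual Process loop. The types, function names, and the
`<id>::<n>` replica format are assumptions for illustration. When two
pods share one physical GPU and the map is keyed by the physical ID, the
last entry wins, so the result depends on listing order. Keying by the
full replica ID removes the collision.

```go
package main

import "strings"

// PodInfo is a hypothetical record of the workload holding a device.
type PodInfo struct {
	Name, Namespace, Container string
}

// Allocation pairs a pod with the device IDs assigned to it, roughly
// what a podresources-style listing provides.
type Allocation struct {
	Pod       PodInfo
	DeviceIDs []string
}

// physicalID strips a shared-replica suffix.
// ASSUMPTION: replicas are encoded as "<id>::<n>".
func physicalID(deviceID string) string {
	if i := strings.LastIndex(deviceID, "::"); i >= 0 {
		return deviceID[:i]
	}
	return deviceID
}

// mapByPhysical keys the map by physical GPU. Pods sharing a GPU
// overwrite each other, so the outcome depends on input order.
func mapByPhysical(allocs []Allocation) map[string]PodInfo {
	m := make(map[string]PodInfo)
	for _, a := range allocs {
		for _, id := range a.DeviceIDs {
			m[physicalID(id)] = a.Pod // last write wins
		}
	}
	return m
}

// mapByReplica keys the map by the full device ID, so every shared
// replica keeps its own pod regardless of input order.
func mapByReplica(allocs []Allocation) map[string]PodInfo {
	m := make(map[string]PodInfo)
	for _, a := range allocs {
		for _, id := range a.DeviceIDs {
			m[id] = a.Pod
		}
	}
	return m
}
```

With `pod-a` on `GPU-abc::0` and `pod-b` on `GPU-abc::1`,
`mapByPhysical` reports whichever pod was listed last for `GPU-abc`,
while `mapByReplica` returns the same two entries in either order.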

Ultimately, we may wish to make this the default behavior. However,
guarding it behind a flag:
1. Mitigates the risk of the change in case of bugs.
2. Preserves backwards compatibility. Because the feature adds a new
   label, PromQL queries that use the `without` aggregation modifier
   or the `ignoring` vector-matching modifier may return different
   results and break existing dashboards and alerts. Making the
   feature opt-in via a flag avoids this.

Finally, we update the unit tests to ensure thorough coverage for the
changes.
@pintohutch
Contributor Author

pintohutch commented Dec 12, 2024

Fixes #307

@glowkey
Collaborator

glowkey commented Dec 12, 2024

Thank you for the contribution! We will test and review the PR in the coming weeks.

Please be aware that we are on the verge of releasing a new 4.0 version of DCGM-Exporter so this PR will likely need to be updated after that happens.

@pintohutch
Contributor Author

Thanks @glowkey - I actually may factor out the smaller bugfix in a separate PR:

First, we make a small fix to the Kubernetes PodMapper transform processor. Specifically, we update the regular expression used in building the device mapping to properly capture pod attributes in both MIG and MIG-with-sharing GPUs in GKE.

That way, this change just focuses on the new KUBERNETES_VIRTUAL_GPUS addition.

@pintohutch
Contributor Author

Moved the bugfix over to #433

2 participants