feat:Update k8s-device-plugin to v0.14.5 to Resolve nanoGPT Runtime Issue #391

Draft
wants to merge 1 commit into base: master

haitwang-cloud
Contributor

What type of PR is this?

During an offline debugging session with @archlitchi, we found that the current NVIDIA device plugin (v1.4.0) has compatibility issues with nanoGPT that prevent it from running. The issue persists even after setting CUDA_DISABLE_CONTROL to true and removing ld.so.preload from the GPU node.
We confirmed that the problem also occurs with version 0.14.0 of the k8s-device-plugin. To resolve it, we need to update the k8s-device-plugin to at least version 0.14.5.
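
The "at least version 0.14.5" requirement can be expressed as a simple check. This is a minimal Python sketch, assuming plain `major.minor.patch` tags with an optional leading `v`; the helper names `parse_version` and `meets_minimum` are hypothetical and are not part of HAMi or the k8s-device-plugin:

```python
def parse_version(tag):
    """Turn a tag such as 'v0.14.5' into a comparable tuple like (0, 14, 5)."""
    # Assumption: tags are plain major.minor.patch, optionally prefixed with 'v'.
    if tag.startswith("v"):
        tag = tag[1:]
    return tuple(int(part) for part in tag.split("."))


def meets_minimum(tag, minimum="0.14.5"):
    """Return True when `tag` is at or above the minimum version."""
    # Tuples compare element by element, so (0, 14, 5) > (0, 14, 0).
    return parse_version(tag) >= parse_version(minimum)


if __name__ == "__main__":
    for candidate in ("v0.14.0", "v0.14.5", "v0.15.0"):
        print(candidate, meets_minimum(candidate))
```

Comparing integer tuples rather than raw strings avoids ordering mistakes such as treating `0.14.10` as older than `0.14.5`.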

/kind bug
What this PR does / why we need it:

This update lets applications such as nanoGPT run on HAMi-managed GPU nodes without hitting the compatibility issue described above.

Which issue(s) this PR fixes:
Fixes #347

Special notes for your reviewer:
I cannot build HAMi in my local environment, so I am not able to run the E2E tests. Could you please help build a test image with this PR?

Does this PR introduce a user-facing change?:

No

@hami-robott hami-robott bot added the kind/bug (Something isn't working) and dco-signoff: no labels Jul 18, 2024
@hami-robott hami-robott bot requested review from archlitchi and wawa0210 July 18, 2024 06:36
Contributor

hami-robott bot commented Jul 18, 2024

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: haitwang-cloud
Once this PR has been reviewed and has the lgtm label, please assign archlitchi for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

…sues with Hami and improve nanoGPT execution

Signed-off-by: haitwang-cloud <haitao_wht@outlook.com>
@haitwang-cloud haitwang-cloud force-pushed the upgrade-k8s-device-version branch from bef4d51 to 499bb7b Compare July 18, 2024 06:49
@haitwang-cloud haitwang-cloud changed the title from "feat:arrow_up:Update k8s-device-plugin to v0.14.5 to Resolve nanoGPT Runtime Issue" to "feat:Update k8s-device-plugin to v0.14.5 to Resolve nanoGPT Runtime Issue" Jul 19, 2024
@haitwang-cloud haitwang-cloud marked this pull request as draft July 22, 2024 07:11