GPU-operator not applying driver version changes on EKS #542
Comments
@sjkoelle yes, this is due to the driver already pre-installed on the system; we cannot overwrite it from the driver container. You would need to use a stock Ubuntu AMI with a GPU node pool for this to work with the GPU Operator.
This limitation has been documented here.
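For illustration only (cluster name, node group name, and instance type below are placeholders, not taken from this issue), a stock-Ubuntu GPU node group can be created with eksctl along these lines:
eksctl create nodegroup --cluster <cluster-name> --name gpu-ubuntu --node-ami-family Ubuntu2004 --node-type g4dn.xlarge --nodes 1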
Thanks for the help. I moved to a supported AMI and am now running
It took a while to roll out (not much happened for about 6 minutes after the node refresh, which I assume was the driver installation), but it seems to be working now!
It would be great to be able to specify the operating system explicitly alongside the driver version. We would like to deploy with minimal downtime, and the old nodes are running Amazon Linux. The gpu-operator pods don't seem to hurt these old nodes (they just fail innocuously), but they also seem to detect that the operating system is Amazon Linux and pass that as an argument to the daemonset. For example, even if I manually edit the daemonset to use Ubuntu drivers, it eventually gets reset to Amazon Linux and fails. If it stayed as Ubuntu, I think it might work on the new nodes?
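From what I understand, the operator derives the OS for the driver image from what Node Feature Discovery reports per node; one way to see what a node is advertising (node name is a placeholder) would be:
kubectl get node <node-name> --show-labels | tr ',' '\n' | grep system-os_release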
@sjkoelle we don't support mixed nodes; to get this working you can follow the steps below.
@shivamerla Thank you for your amazing help on this. It looks like the true source of our error lies somewhere other than the driver version, but your suggestion totally worked and we are now able to test our update in prod without downtime.
Also linking this related thread for posterity: awslabs/amazon-eks-ami#1060
We are having some issues with nodes being tagged with a Capacity and Allocatable nvidia.com/gpu of 0 instead of 1.
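One quick way to see what each node is advertising, for example:
kubectl get nodes "-o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu"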
@shivamerla I have 3 questions:
1. Quick Debug Checklist
- Are i2c_core and ipmi_msghandler loaded on the nodes? (see the quick check below)
- Did you apply the CRD (kubectl describe clusterpolicies --all-namespaces)?
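For the kernel-module item above, a quick check on the node itself would be:
lsmod | grep -E 'i2c_core|ipmi_msghandler'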
1. Issue or feature description
I am trying to update my NVIDIA drivers from 470 to 510 on Amazon EKS. I think that somehow the gpu-operator is not "seeing" the target node, since the operator is neither labelling the node as upgrade-required nor detecting the node label as upgrade-required when I manually label it. Any help would be appreciated. Is it even possible to update driver versions for prebuilt AMIs?
2. Steps to reproduce the issue
In a suitable kube context, I ran
noglob helm install --kube-context <CONTEXT> -n gpu-operator --create-namespace nvidia/gpu-operator --wait --generate-name --set driver.repository=docker.io/nvidia --set driver.version="510.85.02" --set validator.plugin.env[0].name=WITH_WORKLOAD --set-string validator.plugin.env[0].value=false --set psp.enabled=true
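One way to double-check that the chart recorded this version (assuming the default ClusterPolicy name, cluster-policy) would be:
kubectl get clusterpolicies.nvidia.com cluster-policy -o jsonpath='{.spec.driver.version}'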
All the pods seem to deploy properly, and the cluster-policy reflects the right version. However, the gpu-operator does not seem to see the node that I am trying to update, as I get this message in the operator logs.
Monitoring the update status with
shows either nothing or upgrade-done. However, my target node is still running driver version 470 even after restart. I also tried manually labelling as upgrade-required
and now it seems stuck in that state, and the operator logs still show zero nodes with upgrade required. I have also noticed an error in the gpu-operator pods:
E0627 00:02:58.978808 1 reflector.go:140] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:262: Failed to watch *v1beta1.PodSecurityPolicy: unknown (get podsecuritypolicies.policy)
even though psp is enabled.
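For reference, one way to inspect (and, if needed, force) the per-node upgrade state, assuming nvidia.com/gpu-driver-upgrade-state is the label the upgrade controller uses, would be:
kubectl get nodes -L nvidia.com/gpu-driver-upgrade-state
kubectl label node <node-name> nvidia.com/gpu-driver-upgrade-state=upgrade-required --overwrite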
I also tried the above steps with migManager=false (since I am running on a T4). After checking out #525, I have also tried
I also tried upgrading to Kubernetes 1.24 and needed to remove the nvidia-runtime flag to make the GPU allocatable. In all cases, the problem remained.
3. Information to attach (optional if deemed irrelevant)
kubectl get pods -n gpu-operator
kubectl get ds --all-namespaces
The gpu-node in question is running
kubectl describe pod -n NAMESPACE POD_NAME
kubectl logs -n NAMESPACE POD_NAME
Thanks