GPU-operator not applying driver version changes on EKS #542

Closed
sjkoelle opened this issue Jun 26, 2023 · 9 comments
sjkoelle commented Jun 26, 2023

1. Quick Debug Checklist

  • [n] Are you running on an Ubuntu 18.04 node? 22.04
  • [y] Are you running Kubernetes v1.13+? 1.23 (now tried upgrading to 1.24 as well)
  • [y] Are you running Docker (>= 18.06) or CRIO (>= 1.13+)? docker 20.10
  • [?] Do you have i2c_core and ipmi_msghandler loaded on the nodes? (see the check below)
  • [I think so (cluster policy reflects the helm call).] Did you apply the CRD (kubectl describe clusterpolicies --all-namespaces)?
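
For the i2c_core/ipmi_msghandler item, a quick check on each GPU node (a sketch, assuming shell access to the node):

lsmod | grep -E 'i2c_core|ipmi_msghandler'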

1. Issue or feature description

I am trying to update my NVIDIA drivers from 470 to 510 on Amazon EKS. I think that somehow the gpu-operator is not "seeing" the target node, since the operator is neither labelling the node as upgrade-required nor detecting the label when I apply it manually. Any help would be appreciated. Is it even possible to update driver versions for prebuilt AMIs?

2. Steps to reproduce the issue

In a suitable kube context, I ran

noglob helm install --kube-context <CONTEXT> -n gpu-operator --create-namespace nvidia/gpu-operator --wait --generate-name --set driver.repository=docker.io/nvidia --set driver.version="510.85.02" --set validator.plugin.env[0].name=WITH_WORKLOAD --set-string validator.plugin.env[0].value=false --set psp.enabled=true

All the pods seem to deploy properly, and the cluster policy reflects the right version. However, the gpu-operator does not seem to see the node that I am trying to update, as I get this message in the operator logs:

{"level":"info","ts":1687803309.8484328,"logger":"controllers.Upgrade","msg":"Node states:","Unknown":0,"upgrade-done":0,"upgrade-required":0,"cordon-required":0,"wait-for-jobs-required":0,"pod-deletion-required":0,"upgrade-failed":0,"drain-required":0,"pod-restart-required":0,"validation-required":0,"uncordon-required":0}
{"level":"info","ts":1687803309.8484778,"logger":"controllers.Upgrade","msg":"Upgrades in progress","currently in progress":0,"max parallel upgrades":1,"upgrade slots available":0,"currently unavailable nodes":0,"total number of nodes":0,"maximum nodes that can be unavailable":0}

Monitoring the update status with

kubectl get node -l nvidia.com/gpu.present -ojsonpath='{range .items[*]}{.metadata.name}{"\t"}{.metadata.labels.nvidia\.com/gpu-driver-upgrade-state}{"\n"}{end}'

shows either nothing or upgrade-done. However, my target node is still running driver version 470 even after a restart. I also tried manually labelling the node as upgrade-required with

kubectl label node ip-10-0-1-222.us-west-2.compute.internal  nvidia.com/gpu-driver-upgrade-state=upgrade-required --overwrite

and now it seems stuck in that state, while the operator logs still show zero nodes in the upgrade-required state. I have also noticed an error in the gpu-operator pods:

E0627 00:02:58.978808 1 reflector.go:140] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:262: Failed to watch *v1beta1.PodSecurityPolicy: unknown (get podsecuritypolicies.policy)

even though psp is enabled.
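
A quick way to check whether the PodSecurityPolicy API and RBAC are actually available to the operator (a diagnostic sketch; the service account name gpu-operator is my assumption and may differ per release):

kubectl api-resources --api-group=policy
kubectl auth can-i watch podsecuritypolicies.policy --as=system:serviceaccount:gpu-operator:gpu-operator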

I also tried the above steps with migManager.enabled=false (since I am running on a T4). After checking out #525, I have also tried

noglob helm install --kube-context <CONTEXT> -n gpu-operator --create-namespace nvidia/gpu-operator --wait --generate-name --set driver.repository=docker.io/nvidia --set driver.version="510.108.03" --set validator.plugin.env[0].name=WITH_WORKLOAD --set-string validator.plugin.env[0].value=false --set psp.enabled=true --set migManager.enabled=false --set driver.enabled=false --set toolkit.enabled=false --set operator.runtimeClass=nvidia-container-runtime

I also tried upgrading to Kubernetes 1.24 and needed to remove the operator.runtimeClass=nvidia-container-runtime flag to make the GPU allocatable. In all cases, the problem remained.

3. Information to attach (optional if deemed irrelevant)

  • kubernetes pods status: kubectl get pods -n gpu-operator
$ kubectl get pods -n gpu-operator
NAME                                                              READY   STATUS      RESTARTS   AGE
gpu-feature-discovery-ckqkm                                       1/1     Running     0          31m
gpu-operator-1687801930-node-feature-discovery-master-799dpn9jq   1/1     Running     0          52m
gpu-operator-1687801930-node-feature-discovery-worker-dhs68       1/1     Running     0          31m
gpu-operator-1687801930-node-feature-discovery-worker-dl6z7       1/1     Running     0          52m
gpu-operator-1687801930-node-feature-discovery-worker-q6j7k       1/1     Running     0          52m
gpu-operator-78fc47fd7c-kc9rw                                     1/1     Running     0          41m
nvidia-container-toolkit-daemonset-dfb67                          1/1     Running     0          31m
nvidia-cuda-validator-btc48                                       0/1     Completed   0          31m
nvidia-dcgm-exporter-962r8                                        1/1     Running     0          31m
nvidia-device-plugin-daemonset-js2hz                              1/1     Running     0          31m
nvidia-operator-validator-62bzc                                   1/1     Running     0          31m
  • kubernetes daemonset status: kubectl get ds --all-namespaces
$ kubectl get ds --all-namespaces
NAMESPACE      NAME                                                    DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                                      AGE
gpu-operator   gpu-feature-discovery                                   1         1         1       1            1           nvidia.com/gpu.deploy.gpu-feature-discovery=true   52m
gpu-operator   gpu-operator-1687801930-node-feature-discovery-worker   3         3         3       3            3           <none>                                             53m
gpu-operator   nvidia-container-toolkit-daemonset                      1         1         1       1            1           nvidia.com/gpu.deploy.container-toolkit=true       52m
gpu-operator   nvidia-dcgm-exporter                                    1         1         1       1            1           nvidia.com/gpu.deploy.dcgm-exporter=true           52m
gpu-operator   nvidia-device-plugin-daemonset                          1         1         1       1            1           nvidia.com/gpu.deploy.device-plugin=true           52m
gpu-operator   nvidia-driver-daemonset                                 0         0         0       0            0           nvidia.com/gpu.deploy.driver=true                  52m
gpu-operator   nvidia-mig-manager                                      0         0         0       0            0           nvidia.com/gpu.deploy.mig-manager=true             52m
gpu-operator   nvidia-operator-validator                               1         1         1       1            1           nvidia.com/gpu.deploy.operator-validator=true      52m

The GPU node in question is running the following pods:

  Namespace                   Name                                                           CPU Requests  CPU Limits   Memory Requests  Memory Limits  Age
  ---------                   ----                                                           ------------  ----------   ---------------  -------------  ---
  default                     celery-gpu-rap-generator-cf7878d89-bhbnb                       3500m (89%)   3500m (89%)  14Gi (95%)       14Gi (95%)     11h
  gpu-operator                gpu-feature-discovery-6d2qz                                    0 (0%)        0 (0%)       0 (0%)           0 (0%)         11h
  gpu-operator                gpu-operator-1687834555-node-feature-discovery-worker-qf85z    0 (0%)        0 (0%)       0 (0%)           0 (0%)         11h
  gpu-operator                nvidia-dcgm-exporter-wx4vt                                     0 (0%)        0 (0%)       0 (0%)           0 (0%)         11h
  gpu-operator                nvidia-device-plugin-daemonset-q58w9                           0 (0%)        0 (0%)       0 (0%)           0 (0%)         11h
  gpu-operator                nvidia-operator-validator-6qq2v                                0 (0%)        0 (0%)       0 (0%)           0 (0%)         11h
  kube-system                 aws-node-ljpx5                                                 25m (0%)      0 (0%)       0 (0%)           0 (0%)         11h
  kube-system                 kube-proxy-ht4f9                                               100m (2%)     0 (0%)       0 (0%)           0 (0%)         11h
  • [n] If a pod/ds is in an error state or pending state kubectl describe pod -n NAMESPACE POD_NAME
  • [n] If a pod/ds is in an error state or pending state kubectl logs -n NAMESPACE POD_NAME

Thanks

@sjkoelle sjkoelle changed the title gpu-operator not applying version changes GPU-operator not applying driver version changes Jun 27, 2023
@sjkoelle sjkoelle changed the title GPU-operator not applying driver version changes GPU-operator not applying driver version changes on EKS Jun 27, 2023
@shivamerla (Contributor)

@sjkoelle Yes, this is because the driver is already pre-installed on the system; we cannot overwrite it from the driver container. You would need to use a stock Ubuntu AMI with a GPU node pool for this to be supported with the GPU operator.
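
For reference, one way to see how the operator has labelled each node for the driver daemonset (label names taken from the ds output and node-label commands in the issue above; just a sketch):

kubectl get nodes -L nvidia.com/gpu.present,nvidia.com/gpu.deploy.driver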

@shivamerla (Contributor)

This limitation has been documented here

sjkoelle commented Jun 27, 2023

Thanks for the help. I moved on to a supported AMI and am now running

noglob helm install --kube-context <CONTEXT> -n gpu-operator --create-namespace nvidia/gpu-operator --wait --generate-name --set driver.repository=nvcr.io/nvidia --set driver.version='510.108.03'  --set psp.enabled=true --set migManager.enabled=false --set validator.plugin.env[0].name=WITH_WORKLOAD --set-string validator.plugin.env[0].value=false  --set cdi.enabled=true --set cdi.default=true

It took a while to roll out (not much happened for about 6 minutes after the node refresh; I assume this was the driver installation), but it seems to be working now!
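
For anyone verifying: once the driver daemonset is running, one way to confirm the deployed driver version (a sketch, assuming the default nvidia-driver-daemonset name shown in the ds listing above) is:

kubectl exec -n gpu-operator ds/nvidia-driver-daemonset -- nvidia-smi --query-gpu=driver_version --format=csv,noheader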

@sjkoelle (Author)

It would be great to be able to specify the operating system explicitly as an argument alongside the driver version. We would like to deploy with minimal downtime, and the old nodes are running Amazon Linux. The gpu-operator pods don't seem to hurt these old nodes (they just fail innocuously), but the operator also seems to detect that the operating system is Amazon Linux and pass that as an argument to the daemonset. For example, even if I manually edit the daemonset to use the Ubuntu drivers, it eventually gets reset to Amazon Linux and fails. If it stayed as Ubuntu, I think it might work on the new nodes?

shivamerla commented Jun 28, 2023

@sjkoelle we don't support mixed nodes; to get this working, you can follow the steps below (sketched after the list).

  1. Label the Amazon Linux nodes with nvidia.com/gpu.deploy.operands=false
  2. Provide the driver version as an image digest, i.e. driver.version=sha256:<digest> instead of a version tag. The operator will not apply the OS version suffix in that case.
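
Roughly, those two steps look like the following (a sketch; the node name and image digest are placeholders, and the remaining --set flags are carried over from the earlier install commands):

kubectl label node <amazon-linux-node> nvidia.com/gpu.deploy.operands=false --overwrite

noglob helm install --kube-context <CONTEXT> -n gpu-operator --create-namespace nvidia/gpu-operator --wait --generate-name --set driver.repository=nvcr.io/nvidia --set driver.version=sha256:<digest> --set psp.enabled=true --set migManager.enabled=false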

@sjkoelle (Author)

@shivamerla Thank you for your amazing help on this. It looks like the true source of our error lies somewhere other than the driver version, but your suggestion totally worked and we are now able to test our update in prod without downtime.

sjkoelle commented Jul 7, 2023

Also linking this related thread for posterity: awslabs/amazon-eks-ami#1060

@sjkoelle sjkoelle closed this as completed Jul 7, 2023
@sjkoelle (Author)

We are having some issues with nodes reporting Capacity and Allocatable nvidia.com/gpu as 0 instead of 1.
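
A quick way to check what a node is reporting (a sketch; <node-name> is a placeholder, jsonpath escaping as in the earlier monitoring command):

kubectl get node <node-name> -o jsonpath='{.status.capacity.nvidia\.com/gpu}{"\t"}{.status.allocatable.nvidia\.com/gpu}{"\n"}'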

xyfleet commented Jan 30, 2024

@shivamerla
I am using AWS EKS with GPU instances (Amazon Linux 2). All of these instances have the NVIDIA driver pre-installed. Based on what you mentioned:
"yes, this is due to the pre-installed driver on the system already, we cannot overwrite it from the driver container. You would need to use stock ubuntu AMI with GPU node pool to support with the GPU operator."

I have 3 questions:
1. Even if my GPU operator ships a newer NVIDIA driver version, the driver on my host will not be upgraded, right?
2. If the answer to question 1 is yes, how can I use the GPU operator to manage the NVIDIA driver on the instances in my EKS cluster? (I want to keep the pre-installed NVIDIA driver up to date on the instances by upgrading the GPU operator.)
3. For Ubuntu instances, I have only found this link: https://cloud-images.ubuntu.com/docs/aws/eks/
I think these AMIs support CPU. Do they also support GPU? Where can I find the AMIs that support GPU?
