GPU-operator not applying driver version changes on EKS #542

Closed
sjkoelle opened this issue Jun 26, 2023 · 9 comments
sjkoelle commented Jun 26, 2023

1. Quick Debug Checklist

  • [n] Are you running on an Ubuntu 18.04 node? 22.04
  • [y] Are you running Kubernetes v1.13+? 1.23 (now tried upgrading to 1.24 as well)
  • [y] Are you running Docker (>= 18.06) or CRIO (>= 1.13+)? docker 20.10
  • [?] Do you have i2c_core and ipmi_msghandler loaded on the nodes? (see the check below)
  • [I think so (cluster policy reflects the helm call).] Did you apply the CRD (kubectl describe clusterpolicies --all-namespaces)?
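
For the i2c_core/ipmi_msghandler item, a quick check on each GPU node (a sketch, assuming shell access to the node):

lsmod | grep -E 'i2c_core|ipmi_msghandler'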

1. Issue or feature description

I am trying to update my NVIDIA drivers from 470 to 510 on Amazon EKS. I think that somehow the gpu-operator is not "seeing" the target node, since the operator is neither labelling the node as upgrade-required nor detecting the label when I apply it manually. Any help would be appreciated. Is it even possible to update driver versions for prebuilt AMIs?

2. Steps to reproduce the issue

In a suitable kube context, I ran

noglob helm install --kube-context <CONTEXT> -n gpu-operator --create-namespace nvidia/gpu-operator --wait --generate-name --set driver.repository=docker.io/nvidia --set driver.version="510.85.02" --set validator.plugin.env[0].name=WITH_WORKLOAD --set-string validator.plugin.env[0].value=false --set psp.enabled=true

All the pods seem to deploy properly, and the cluster policy reflects the right version. However, the gpu-operator does not seem to see the node that I am trying to update, as I get this message in the operator logs:

{"level":"info","ts":1687803309.8484328,"logger":"controllers.Upgrade","msg":"Node states:","Unknown":0,"upgrade-done":0,"upgrade-required":0,"cordon-required":0,"wait-for-jobs-required":0,"pod-deletion-required":0,"upgrade-failed":0,"drain-required":0,"pod-restart-required":0,"validation-required":0,"uncordon-required":0}
{"level":"info","ts":1687803309.8484778,"logger":"controllers.Upgrade","msg":"Upgrades in progress","currently in progress":0,"max parallel upgrades":1,"upgrade slots available":0,"currently unavailable nodes":0,"total number of nodes":0,"maximum nodes that can be unavailable":0}

Monitoring the update status with

kubectl get node -l nvidia.com/gpu.present -ojsonpath='{range .items[*]}{.metadata.name}{"\t"}{.metadata.labels.nvidia\.com/gpu-driver-upgrade-state}{"\n"}{end}'

shows either nothing or upgrade-done. However, my target node is still running driver version 470 even after a restart. I also tried manually labelling the node as upgrade-required with

kubectl label node ip-10-0-1-222.us-west-2.compute.internal  nvidia.com/gpu-driver-upgrade-state=upgrade-required --overwrite

and now it seems stuck in that state, while the operator logs still show zero nodes in the upgrade-required state. I have also noticed an error in the gpu-operator pods:

E0627 00:02:58.978808 1 reflector.go:140] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:262: Failed to watch *v1beta1.PodSecurityPolicy: unknown (get podsecuritypolicies.policy)

even though psp is enabled.
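
A quick way to check whether the PodSecurityPolicy API and RBAC are actually available to the operator (a diagnostic sketch; the service account name gpu-operator is my assumption and may differ per release):

kubectl api-resources --api-group=policy
kubectl auth can-i watch podsecuritypolicies.policy --as=system:serviceaccount:gpu-operator:gpu-operator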

I also tried the above steps with migManager.enabled=false (since I am running on a T4). After checking out #525, I have also tried

noglob helm install --kube-context <CONTEXT> -n gpu-operator --create-namespace nvidia/gpu-operator --wait --generate-name --set driver.repository=docker.io/nvidia --set driver.version="510.108.03" --set validator.plugin.env[0].name=WITH_WORKLOAD --set-string validator.plugin.env[0].value=false --set psp.enabled=true --set migManager.enabled=false --set driver.enabled=false --set toolkit.enabled=false --set operator.runtimeClass=nvidia-container-runtime

I also tried upgrading to Kubernetes 1.24 and needed to remove the operator.runtimeClass=nvidia-container-runtime flag to make the GPU allocatable. In all cases, the problem remained.

3. Information to attach (optional if deemed irrelevant)

  • kubernetes pods status: kubectl get pods -n gpu-operator
$ kubectl get pods -n gpu-operator
NAME                                                              READY   STATUS      RESTARTS   AGE
gpu-feature-discovery-ckqkm                                       1/1     Running     0          31m
gpu-operator-1687801930-node-feature-discovery-master-799dpn9jq   1/1     Running     0          52m
gpu-operator-1687801930-node-feature-discovery-worker-dhs68       1/1     Running     0          31m
gpu-operator-1687801930-node-feature-discovery-worker-dl6z7       1/1     Running     0          52m
gpu-operator-1687801930-node-feature-discovery-worker-q6j7k       1/1     Running     0          52m
gpu-operator-78fc47fd7c-kc9rw                                     1/1     Running     0          41m
nvidia-container-toolkit-daemonset-dfb67                          1/1     Running     0          31m
nvidia-cuda-validator-btc48                                       0/1     Completed   0          31m
nvidia-dcgm-exporter-962r8                                        1/1     Running     0          31m
nvidia-device-plugin-daemonset-js2hz                              1/1     Running     0          31m
nvidia-operator-validator-62bzc                                   1/1     Running     0          31m
  • kubernetes daemonset status: kubectl get ds --all-namespaces
$ kubectl get ds --all-namespaces
NAMESPACE      NAME                                                    DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                                      AGE
gpu-operator   gpu-feature-discovery                                   1         1         1       1            1           nvidia.com/gpu.deploy.gpu-feature-discovery=true   52m
gpu-operator   gpu-operator-1687801930-node-feature-discovery-worker   3         3         3       3            3           <none>                                             53m
gpu-operator   nvidia-container-toolkit-daemonset                      1         1         1       1            1           nvidia.com/gpu.deploy.container-toolkit=true       52m
gpu-operator   nvidia-dcgm-exporter                                    1         1         1       1            1           nvidia.com/gpu.deploy.dcgm-exporter=true           52m
gpu-operator   nvidia-device-plugin-daemonset                          1         1         1       1            1           nvidia.com/gpu.deploy.device-plugin=true           52m
gpu-operator   nvidia-driver-daemonset                                 0         0         0       0            0           nvidia.com/gpu.deploy.driver=true                  52m
gpu-operator   nvidia-mig-manager                                      0         0         0       0            0           nvidia.com/gpu.deploy.mig-manager=true             52m
gpu-operator   nvidia-operator-validator                               1         1         1       1            1           nvidia.com/gpu.deploy.operator-validator=true      52m

The GPU node in question is running the following pods:

  Namespace                   Name                                                           CPU Requests  CPU Limits   Memory Requests  Memory Limits  Age
  ---------                   ----                                                           ------------  ----------   ---------------  -------------  ---
  default                     celery-gpu-rap-generator-cf7878d89-bhbnb                       3500m (89%)   3500m (89%)  14Gi (95%)       14Gi (95%)     11h
  gpu-operator                gpu-feature-discovery-6d2qz                                    0 (0%)        0 (0%)       0 (0%)           0 (0%)         11h
  gpu-operator                gpu-operator-1687834555-node-feature-discovery-worker-qf85z    0 (0%)        0 (0%)       0 (0%)           0 (0%)         11h
  gpu-operator                nvidia-dcgm-exporter-wx4vt                                     0 (0%)        0 (0%)       0 (0%)           0 (0%)         11h
  gpu-operator                nvidia-device-plugin-daemonset-q58w9                           0 (0%)        0 (0%)       0 (0%)           0 (0%)         11h
  gpu-operator                nvidia-operator-validator-6qq2v                                0 (0%)        0 (0%)       0 (0%)           0 (0%)         11h
  kube-system                 aws-node-ljpx5                                                 25m (0%)      0 (0%)       0 (0%)           0 (0%)         11h
  kube-system                 kube-proxy-ht4f9                                               100m (2%)     0 (0%)       0 (0%)           0 (0%)         11h
  • [n] If a pod/ds is in an error state or pending state kubectl describe pod -n NAMESPACE POD_NAME
  • [n] If a pod/ds is in an error state or pending state kubectl logs -n NAMESPACE POD_NAME

Thanks

@sjkoelle sjkoelle changed the title gpu-operator not applying version changes GPU-operator not applying driver version changes Jun 27, 2023
@sjkoelle sjkoelle changed the title GPU-operator not applying driver version changes GPU-operator not applying driver version changes on EKS Jun 27, 2023
@shivamerla (Contributor)

@sjkoelle Yes, this is because the driver is already pre-installed on the system; we cannot overwrite it from the driver container. You would need to use a stock Ubuntu AMI with a GPU node pool for this to be supported with the GPU operator.
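
For reference, one way to see how the operator has labelled each node for the driver daemonset (label names taken from the ds output and node-label commands in the issue above; just a sketch):

kubectl get nodes -L nvidia.com/gpu.present,nvidia.com/gpu.deploy.driver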

@shivamerla (Contributor)

This limitation has been documented here

sjkoelle commented Jun 27, 2023

Thanks for the help. I moved on to a supported AMI and am now running

noglob helm install --kube-context <CONTEXT> -n gpu-operator --create-namespace nvidia/gpu-operator --wait --generate-name --set driver.repository=nvcr.io/nvidia --set driver.version='510.108.03'  --set psp.enabled=true --set migManager.enabled=false --set validator.plugin.env[0].name=WITH_WORKLOAD --set-string validator.plugin.env[0].value=false  --set cdi.enabled=true --set cdi.default=true

It took a while to roll out (not much happened for about 6 minutes after the node refresh; I assume this was the driver installation), but it seems to be working now!
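
For anyone verifying: once the driver daemonset is running, one way to confirm the deployed driver version (a sketch, assuming the default nvidia-driver-daemonset name shown in the ds listing above) is:

kubectl exec -n gpu-operator ds/nvidia-driver-daemonset -- nvidia-smi --query-gpu=driver_version --format=csv,noheader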

@sjkoelle (Author)

It would be great to be able to specify the operating system explicitly as an argument alongside the driver version. We would like to deploy with minimal downtime, and the old nodes are running Amazon Linux. The gpu-operator pods don't seem to hurt these old nodes (they just fail innocuously), but the operator also seems to detect that the operating system is Amazon Linux and pass that as an argument to the daemonset. For example, even if I manually edit the daemonset to use the Ubuntu drivers, it eventually gets reset to Amazon Linux and fails. If it stayed as Ubuntu, I think it might work on the new nodes?

shivamerla commented Jun 28, 2023

@sjkoelle we don't support mixed nodes; to get this working, you can follow the steps below (sketched after the list).

  1. Label the Amazon Linux nodes with nvidia.com/gpu.deploy.operands=false
  2. Provide the driver version as an image digest, i.e. driver.version=sha256:<digest> instead of a version tag. The operator will not apply the OS version suffix in that case.
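
Roughly, those two steps look like the following (a sketch; the node name and image digest are placeholders, and the remaining --set flags are carried over from the earlier install commands):

kubectl label node <amazon-linux-node> nvidia.com/gpu.deploy.operands=false --overwrite

noglob helm install --kube-context <CONTEXT> -n gpu-operator --create-namespace nvidia/gpu-operator --wait --generate-name --set driver.repository=nvcr.io/nvidia --set driver.version=sha256:<digest> --set psp.enabled=true --set migManager.enabled=false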

@sjkoelle (Author)

@shivamerla Thank you for your amazing help on this. It looks like the true source of our error lies somewhere other than the driver version, but your suggestion totally worked and we are now able to test our update in prod without downtime.

sjkoelle commented Jul 7, 2023

Also linking this related thread for posterity: awslabs/amazon-eks-ami#1060

@sjkoelle sjkoelle closed this as completed Jul 7, 2023
@sjkoelle (Author)

We are having some issues with nodes reporting Capacity and Allocatable nvidia.com/gpu as 0 instead of 1.
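
A quick way to check what a node is reporting (a sketch; <node-name> is a placeholder, jsonpath escaping as in the earlier monitoring command):

kubectl get node <node-name> -o jsonpath='{.status.capacity.nvidia\.com/gpu}{"\t"}{.status.allocatable.nvidia\.com/gpu}{"\n"}'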

xyfleet commented Jan 30, 2024

@shivamerla
I am using AWS EKS with GPU instances (Amazon Linux 2). All of these instances have the NVIDIA driver pre-installed. Based on what you mentioned:
"yes, this is due to the pre-installed driver on the system already, we cannot overwrite it from the driver container. You would need to use stock ubuntu AMI with GPU node pool to support with the GPU operator."

I have 3 questions:
1. Even if my GPU operator ships a newer NVIDIA driver version, the driver on my host will not be upgraded, right?
2. If the answer to question 1 is yes, how can I use the GPU operator to manage the NVIDIA driver on the instances in my EKS cluster? (I want to keep the pre-installed NVIDIA driver up to date on the instances by upgrading the GPU operator.)
3. For Ubuntu instances, I have only found this link: https://cloud-images.ubuntu.com/docs/aws/eks/
I think these AMIs support CPU. Do they also support GPU? Where can I find the AMIs that support GPU?
