GPU operator 22.9.2 installation is failing #541

Closed
3 tasks
likku123 opened this issue Jun 19, 2023 · 5 comments

Comments

@likku123

likku123 commented Jun 19, 2023

1. Quick Debug Checklist

  • Are you running on an Ubuntu 18.04 node? No
  • Are you running Kubernetes v1.13+? No
  • Are you running Docker (>= 18.06) or CRIO (>= 1.13+)? Docker version 20.10.21, build 20.10.21-0ubuntu1~22.04.3

1. Issue or feature description

I am trying to install a specific version of the GPU operator (22.9.2) via the Helm chart using Ansible. Previously I did not specify a version number and installed the latest release. To be on the safe side, I have now pinned the version to deploy.
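
For readers reproducing this setup, pinning the chart version with Helm might look like the sketch below. It only assembles the command line and does not run Helm. The release name `gpu-operator`, the namespace `gpu-operator`, the repo alias `nvidia`, and the `v`-prefixed version string are assumptions for illustration and are not taken from the playbook in this issue.

```python
import shlex


def helm_install_cmd(chart_version,
                     release="gpu-operator",        # hypothetical release name
                     namespace="gpu-operator",      # hypothetical namespace
                     chart="nvidia/gpu-operator"):  # assumes repo alias "nvidia"
    """Build a `helm upgrade --install` argument list that pins the chart version."""
    return [
        "helm", "upgrade", "--install", release, chart,
        "--namespace", namespace, "--create-namespace",
        "--version", chart_version,  # pin instead of taking the latest chart
        "--wait",
    ]


# Assumed version string; adjust to whatever `helm search repo` lists.
cmd = helm_install_cmd("v22.9.2")
print(shlex.join(cmd))
```

Pinning the chart version fixes the operator and its operand images, but it does not pin the node kernel, so the driver container can still depend on packages that match whatever kernel the node is running.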

image

I have collected logs based on the below instructions.

-->curl -o must-gather.sh -L https://raw.githubusercontent.com/NVIDIA/gpu-operator/master/hack/must-gather.sh
-->chmod +x must-gather.sh
-->./must-gather.sh

gpu_operand_pod_nvidia-container-toolkit-daemonset-tx9zk.zip
[gpu_operand_pod_gpu-feature-discovery-57ds2.log](https://github.com/NVIDIA/gpu-operator/files/11784844/gpu_operand_pod_gpu-feature-discovery-57ds2.log)
gpu_operand_pod_nvidia-operator-validator-5zwwb.zip
gpu_operand_pod_nvidia-dcgm-exporter-5fl25.log

Please let me know if any more logs are required from my side.

@likku123
Author

Adding a few more details.

nvidia-installer.log

/run/nvidia/driver# ls -lhrt
total 0

ls -la /usr/local/nvidia/toolkit
total 12920
drwxr-xr-x 3 root root 4096 Apr 11 12:36 .
drwxr-xr-x 3 root root 4096 Jun 18 23:33 ..
drwxr-xr-x 3 root root 4096 Apr 11 12:36 .config
lrwxrwxrwx 1 root root 32 Apr 11 12:36 libnvidia-container-go.so.1 -> libnvidia-container-go.so.1.11.0
-rw-r--r-- 1 root root 2959384 Apr 11 12:36 libnvidia-container-go.so.1.11.0
lrwxrwxrwx 1 root root 29 Apr 11 12:36 libnvidia-container.so.1 -> libnvidia-container.so.1.11.0
-rwxr-xr-x 1 root root 195856 Apr 11 12:36 libnvidia-container.so.1.11.0
-rwxr-xr-x 1 root root 154 Apr 11 12:36 nvidia-container-cli
-rwxr-xr-x 1 root root 47472 Apr 11 12:36 nvidia-container-cli.real
-rwxr-xr-x 1 root root 342 Apr 11 12:36 nvidia-container-runtime
-rwxr-xr-x 1 root root 429 Apr 11 12:36 nvidia-container-runtime-experimental
-rwxr-xr-x 1 root root 3771792 Apr 11 12:36 nvidia-container-runtime.experimental
-rwxr-xr-x 1 root root 203 Apr 11 12:36 nvidia-container-runtime-hook
-rwxr-xr-x 1 root root 2142088 Apr 11 12:36 nvidia-container-runtime-hook.real
-rwxr-xr-x 1 root root 4079040 Apr 11 12:36 nvidia-container-runtime.real
lrwxrwxrwx 1 root root 29 Apr 11 12:36 nvidia-container-toolkit -> nvidia-container-runtime-hook

ls -la /run/nvidia
total 0
drwxr-xr-x 4 root root 80 Jun 19 01:19 .
drwxr-xr-x 43 root root 1340 Jun 19 01:18 ..
drwxr-xr-x 2 root root 40 Jun 15 02:25 driver
drwxr-xr-x 2 root root 40 Jun 15 02:25 validations

@shivamerla
Contributor

@likku123 can you attach the logs from kubectl logs <nvidia-driver-daemonset-pod> -n <namespace> --all-containers > driver.log?

@likku123
Author

driver.log
driver1.log
driver2.log

We have three nodes; these are the driver daemonset logs from each node.

@shivamerla
Contributor

It looks like linux-headers for kernel 5.15.0-71-generic are not available from Canonical. Can you upgrade these nodes to a later kernel (5.15.0-73-generic)?
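
One way to check this on a node before deploying is sketched below. It assumes the headers package follows Ubuntu's `linux-headers-<uname -r>` naming and that `apt-cache policy` prints English output with a `Candidate:` line. The helper names are hypothetical. The node's apt sources may also differ from those inside the driver container, so treat the result as a hint rather than a guarantee.

```python
import re
import subprocess


def headers_package(kernel_release):
    """Ubuntu names the headers package after the `uname -r` string."""
    return f"linux-headers-{kernel_release}"


def kernel_key(kernel_release):
    """Turn e.g. '5.15.0-71-generic' into (5, 15, 0, 71) for comparison."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)-(\d+)", kernel_release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {kernel_release!r}")
    return tuple(int(part) for part in match.groups())


def has_candidate(policy_output):
    """Return True if `apt-cache policy` output names an installable version."""
    for line in policy_output.splitlines():
        line = line.strip()
        if line.startswith("Candidate:"):
            return line.split(":", 1)[1].strip() != "(none)"
    # No Candidate line at all: apt does not know the package.
    return False


def headers_available(kernel_release):
    """Ask apt on this machine whether the headers package can be installed."""
    result = subprocess.run(
        ["apt-cache", "policy", headers_package(kernel_release)],
        capture_output=True, text=True, check=False,
    )
    return has_candidate(result.stdout)
```

On a node, `headers_available("5.15.0-71-generic")` would be the call to make; `kernel_key` can then confirm that a proposed replacement such as 5.15.0-73-generic is in fact newer.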

@likku123
Author

Yes, that works. Thanks for your help.
