Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox runtime: no runtime for "nvidia" is configured #549
Comments
Additional info: apart from switching the container runtime from docker to containerd, I have also tried different gpu-operator settings (values), with CDI enabled/disabled, RDMA enabled/disabled, and others, to no avail.
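For reference, a hedged sketch of the kind of value toggles described above, using value names from the public gpu-operator chart (the exact values files used in this report are not shown):

# Hypothetical example only: toggle CDI and GPUDirect RDMA via chart values.
helm upgrade --install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator --create-namespace \
  --set cdi.enabled=false \
  --set driver.rdma.enabled=false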
Did you ever figure this out @BartoszZawadzki? I'm dealing with the same issue on EKS and Ubuntu.
No, but since I'm using kops I tried this instead: https://kops.sigs.k8s.io/gpu/ and it worked out of the box.
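A hedged sketch of what that kops-managed route looks like (field name per the kops GPU docs; the cluster name is a placeholder):

# Enable kops-managed NVIDIA support by adding the following to the cluster (or instance group) spec:
#   spec:
#     containerd:
#       nvidiaGPU:
#         enabled: true
kops edit cluster --name "${CLUSTER_NAME}"
kops update cluster --name "${CLUSTER_NAME}" --yes
# Roll the nodes so containerd is reconfigured with the "nvidia" runtime.
kops rolling-update cluster --name "${CLUSTER_NAME}" --yes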
I'm also hitting this problem. How can it be solved?
No, Rocky Linux is not supported currently.
I have attached logs from all containers deployed via the gpu-operator helm chart in the initial issue.
We're running into the same problem; see the logs from the nvidia-gpu-operator-node-feature-discovery-worker and nvidia-container-toolkit-daemonset pods. EDIT: Our problem is this issue in containerd, which makes it impossible to additively use imports to configure containerd plugins. In our case we're configuring registry mirrors, which in turn completely overrides NVIDIA's runtime configuration. We're probably going to have to go the same route as NVIDIA, meaning we'd have to somehow parse the …
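A hedged way to confirm that kind of clobbering on a node, assuming the toolkit has written its runtime entry into the containerd config:

# The nvidia-container-toolkit normally adds an "nvidia" runtime entry to the containerd
# configuration; `containerd config dump` prints the final merged config, so if the entry
# is missing there, the CRI plugin will report: no runtime for "nvidia" is configured.
grep -n 'imports' /etc/containerd/config.toml
containerd config dump | grep -n -A 4 'runtimes.*nvidia'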
Hi, I once encountered the same error. I'll share my example for your reference.
This problem may be caused by a failed symlink creation. First, check whether that is what is happening in your case; if so, you will see an error message like the one below, and you can simply follow what it says. To summarize:

The result is:
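A hedged sketch of where that symlink error would typically show up (daemonset name and label assumed from the default gpu-operator chart; adjust if yours differ):

# Search the toolkit and validator pods for the symlink-creation failure described above.
kubectl -n gpu-operator logs ds/nvidia-container-toolkit-daemonset --all-containers | grep -i symlink
kubectl -n gpu-operator describe pod -l app=nvidia-operator-validator | grep -i -A 2 symlink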
Hey guys, I have the exact same error as mentioned by @ordinaryparksee. What's going on with these symlinks? I don't understand :/
The template below is mostly useful for bug reports and support questions. Feel free to remove anything which doesn't apply to you and add more information where it makes sense.
1. Quick Debug Checklist
Do you have the i2c_core and ipmi_msghandler modules loaded on the nodes?
Did you apply the CRD (kubectl describe clusterpolicies --all-namespaces)?

1. Issue or feature description
I'm deploying the gpu-operator from its Helm chart using ArgoCD in my Kubernetes cluster (1.23.17), which is built with kops on AWS infrastructure (not EKS).
I've been struggling with this for a while now; I've used both docker and containerd as the container runtime in my Kubernetes cluster. I'm currently running containerd v1.6.21.
After deploying the gpu-operator, this is what is happening in the gpu-operator namespace:

Getting into more details on the pods that are stuck in the Init state:
kubectl -n gpu-operator describe po gpu-feature-discovery-jtgll
kubectl -n gpu-operator describe po nvidia-dcgm-exporter-bpvks
kubectl -n gpu-operator describe po nvidia-device-plugin-daemonset-fwwgr
kubectl -n gpu-operator describe po nvidia-operator-validator-qjgsb
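As a hedged aside on debugging this event: the FailedCreatePodSandbox error comes from containerd's CRI plugin, so it is worth checking which runtime handlers that plugin actually exposes on an affected node, and whether the "nvidia" RuntimeClass implied by the error exists in the cluster (commands assume crictl and jq are available on the node):

# On an affected GPU node: list the runtime handlers containerd's CRI plugin knows about.
# If "nvidia" is missing, the 'no runtime for "nvidia" is configured' event is expected.
crictl info | jq '.config.containerd.runtimes | keys'
# In the cluster: the RuntimeClass the stuck pods request should exist and point at handler "nvidia".
kubectl get runtimeclass nvidia -o yaml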
And finally my ClusterPolicy:
2. Steps to reproduce the issue
Deploy gpu-operator using Helm chart (23.3.2)
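A hedged sketch of that step, assuming the chart is installed from NVIDIA's public Helm repository with default values (the actual values used in this report are not shown):

helm repo add nvidia https://helm.ngc.nvidia.com/nvidia && helm repo update
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator --create-namespace \
  --version v23.3.2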
3. Information to attach (optional if deemed irrelevant)
kubernetes pods status:
kubectl get pods --all-namespaces
kubernetes daemonset status:
kubectl get ds --all-namespaces
If a pod/ds is in an error state or pending state
kubectl describe pod -n NAMESPACE POD_NAME
If a pod/ds is in an error state or pending state
kubectl logs -n NAMESPACE POD_NAME
Output of running a container on the GPU machine:
docker run -it alpine echo foo
Docker configuration file:
cat /etc/docker/daemon.json
Docker runtime configuration:
docker info | grep runtime
NVIDIA shared directory:
ls -la /run/nvidia
NVIDIA packages directory:
ls -la /usr/local/nvidia/toolkit
NVIDIA driver directory:
ls -la /run/nvidia/driver
kubelet logs
journalctl -u kubelet > kubelet.logs