
Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox runtime: no runtime for "nvidia" is configured #549

BartoszZawadzki opened this issue Jul 12, 2023 · 12 comments

@BartoszZawadzki

BartoszZawadzki commented Jul 12, 2023

The template below is mostly useful for bug reports and support questions. Feel free to remove anything which doesn't apply to you and add more information where it makes sense.

1. Quick Debug Checklist

  • Are you running on an Ubuntu 18.04 node?
  • Are you running Kubernetes v1.13+?
  • Are you running Docker (>= 18.06) or CRIO (>= 1.13+)?
  • Do you have i2c_core and ipmi_msghandler loaded on the nodes?
  • Did you apply the CRD (kubectl describe clusterpolicies --all-namespaces)?

2. Issue or feature description

I'm deploying gpu-operator from the Helm chart using ArgoCD in my Kubernetes cluster (1.23.17), which is built with kops on AWS infrastructure (not EKS).

I've been struggling with this for a while now; I've used both Docker and containerd as the container runtime in my Kubernetes cluster. I'm currently running containerd v1.6.21.

After deploying the gpu-operator this is what is happening in the gpu-operator namespace:

NAME                                                         READY   STATUS     RESTARTS   AGE
gpu-feature-discovery-jtgll                                  0/1     Init:0/1   0          11m
gpu-feature-discovery-m82hx                                  0/1     Init:0/1   0          11m
gpu-feature-discovery-rzkzj                                  0/1     Init:0/1   0          11m
gpu-operator-6489b6d9-d5smv                                  1/1     Running    0          11m
gpu-operator-node-feature-discovery-master-86dd7c646-6jvns   1/1     Running    0          11m
gpu-operator-node-feature-discovery-worker-5r7g6             1/1     Running    0          11m
gpu-operator-node-feature-discovery-worker-5v7bn             1/1     Running    0          11m
gpu-operator-node-feature-discovery-worker-6lzkk             1/1     Running    0          11m
gpu-operator-node-feature-discovery-worker-7z6zw             1/1     Running    0          11m
gpu-operator-node-feature-discovery-worker-8t9hk             1/1     Running    0          11m
gpu-operator-node-feature-discovery-worker-b7k2t             1/1     Running    0          11m
gpu-operator-node-feature-discovery-worker-fz7f2             1/1     Running    0          11m
gpu-operator-node-feature-discovery-worker-hdp28             1/1     Running    0          11m
gpu-operator-node-feature-discovery-worker-j9f45             1/1     Running    0          11m
gpu-operator-node-feature-discovery-worker-rqx4l             1/1     Running    0          11m
gpu-operator-node-feature-discovery-worker-svk5h             1/1     Running    0          11m
gpu-operator-node-feature-discovery-worker-v6rx9             1/1     Running    0          11m
gpu-operator-node-feature-discovery-worker-wd7h7             1/1     Running    0          11m
gpu-operator-node-feature-discovery-worker-wqsp5             1/1     Running    0          11m
gpu-operator-node-feature-discovery-worker-xf7m6             1/1     Running    0          11m
nvidia-container-toolkit-daemonset-26djz                     1/1     Running    0          11m
nvidia-container-toolkit-daemonset-72mvg                     1/1     Running    0          11m
nvidia-container-toolkit-daemonset-trk6f                     1/1     Running    0          11m
nvidia-dcgm-exporter-bpvks                                   0/1     Init:0/1   0          11m
nvidia-dcgm-exporter-cchvm                                   0/1     Init:0/1   0          11m
nvidia-dcgm-exporter-fd98x                                   0/1     Init:0/1   0          11m
nvidia-device-plugin-daemonset-fwwgr                         0/1     Init:0/1   0          11m
nvidia-device-plugin-daemonset-kblb6                         0/1     Init:0/1   0          11m
nvidia-device-plugin-daemonset-zlgdm                         0/1     Init:0/1   0          11m
nvidia-driver-daemonset-mg5g8                                1/1     Running    0          11m
nvidia-driver-daemonset-tschz                                1/1     Running    0          11m
nvidia-driver-daemonset-x285r                                1/1     Running    0          11m
nvidia-operator-validator-qjgsb                              0/1     Init:0/4   0          11m
nvidia-operator-validator-trlfn                              0/1     Init:0/4   0          11m
nvidia-operator-validator-vtkdz                              0/1     Init:0/4   0          11m

Getting into more details on the pods that are stuck in the init state:
kubectl -n gpu-operator describe po gpu-feature-discovery-jtgll

Events:
  Type     Reason                  Age                   From               Message
  ----     ------                  ----                  ----               -------
  Normal   Scheduled               23m                   default-scheduler  Successfully assigned gpu-operator/gpu-feature-discovery-jtgll to ip-172-20-99-192.eu-west-1.compute.internal
  Warning  FailedCreatePodSandBox  3m46s (x93 over 23m)  kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox runtime: no runtime for "nvidia" is configured

kubectl -n gpu-operator describe po nvidia-dcgm-exporter-bpvks

Events:
  Type     Reason                  Age                   From               Message
  ----     ------                  ----                  ----               -------
  Normal   Scheduled               24m                   default-scheduler  Successfully assigned gpu-operator/nvidia-dcgm-exporter-bpvks to ip-172-20-45-35.eu-west-1.compute.internal
  Warning  FailedCreatePodSandBox  22m                   kubelet            Failed to create pod sandbox: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial unix /run/containerd/containerd.sock: connect: connection refused"
  Warning  FailedCreatePodSandBox  4m43s (x93 over 24m)  kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox runtime: no runtime for "nvidia" is configured

kubectl -n gpu-operator describe po nvidia-device-plugin-daemonset-fwwgr

Events:
  Type     Reason                  Age                  From               Message
  ----     ------                  ----                 ----               -------
  Normal   Scheduled               25m                  default-scheduler  Successfully assigned gpu-operator/nvidia-device-plugin-daemonset-fwwgr to ip-172-20-99-192.eu-west-1.compute.internal
  Warning  FailedCreatePodSandBox  23m                  kubelet            Failed to create pod sandbox: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial unix /run/containerd/containerd.sock: connect: connection refused"
  Warning  FailedCreatePodSandBox  31s (x117 over 25m)  kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox runtime: no runtime for "nvidia" is configured

kubectl -n gpu-operator describe po nvidia-operator-validator-qjgsb

Events:
  Type     Reason                  Age                  From               Message
  ----     ------                  ----                 ----               -------
  Normal   Scheduled               26m                  default-scheduler  Successfully assigned gpu-operator/nvidia-operator-validator-qjgsb to ip-172-20-99-192.eu-west-1.compute.internal
  Warning  FailedCreatePodSandBox  80s (x117 over 26m)  kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox runtime: no runtime for "nvidia" is configured

And finally my ClusterPolicy:

apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  annotations:
    helm.sh/resource-policy: keep
  creationTimestamp: "2023-07-12T14:42:17Z"
  generation: 1
  labels:
    app.kubernetes.io/component: gpu-operator
    app.kubernetes.io/instance: gpu-operator
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: gpu-operator
    app.kubernetes.io/version: v23.3.2
    argocd.argoproj.io/instance: gpu-operator
    helm.sh/chart: gpu-operator-v23.3.2
  name: cluster-policy
  resourceVersion: "223035606"
  uid: 961e3b87-a5ff-47d9-944d-f9cca9e72fa9
spec:
  cdi:
    default: false
    enabled: false
  daemonsets:
    labels:
      app.kubernetes.io/managed-by: gpu-operator
      helm.sh/chart: gpu-operator-v23.3.2
    priorityClassName: system-node-critical
    rollingUpdate:
      maxUnavailable: "1"
    tolerations:
    - effect: NoSchedule
      key: nvidia.com/gpu
      operator: Exists
    updateStrategy: RollingUpdate
  dcgm:
    enabled: false
    hostPort: 5555
    image: dcgm
    imagePullPolicy: IfNotPresent
    repository: nvcr.io/nvidia/cloud-native
    version: 3.1.7-1-ubuntu20.04
  dcgmExporter:
    enabled: true
    env:
    - name: DCGM_EXPORTER_LISTEN
      value: :9400
    - name: DCGM_EXPORTER_KUBERNETES
      value: "true"
    - name: DCGM_EXPORTER_COLLECTORS
      value: /etc/dcgm-exporter/dcp-metrics-included.csv
    image: dcgm-exporter
    imagePullPolicy: IfNotPresent
    repository: nvcr.io/nvidia/k8s
    serviceMonitor:
      additionalLabels: {}
      enabled: false
      honorLabels: false
      interval: 15s
    version: 3.1.7-3.1.4-ubuntu20.04
  devicePlugin:
    enabled: true
    env:
    - name: PASS_DEVICE_SPECS
      value: "true"
    - name: FAIL_ON_INIT_ERROR
      value: "true"
    - name: DEVICE_LIST_STRATEGY
      value: envvar
    - name: DEVICE_ID_STRATEGY
      value: uuid
    - name: NVIDIA_VISIBLE_DEVICES
      value: all
    - name: NVIDIA_DRIVER_CAPABILITIES
      value: all
    image: k8s-device-plugin
    imagePullPolicy: IfNotPresent
    repository: nvcr.io/nvidia
    version: v0.14.0-ubi8
  driver:
    certConfig:
      name: ""
    enabled: true
    image: driver
    imagePullPolicy: IfNotPresent
    kernelModuleConfig:
      name: ""
    licensingConfig:
      configMapName: ""
      nlsEnabled: false
    manager:
      env:
      - name: ENABLE_GPU_POD_EVICTION
        value: "true"
      - name: ENABLE_AUTO_DRAIN
        value: "false"
      - name: DRAIN_USE_FORCE
        value: "false"
      - name: DRAIN_POD_SELECTOR_LABEL
        value: ""
      - name: DRAIN_TIMEOUT_SECONDS
        value: 0s
      - name: DRAIN_DELETE_EMPTYDIR_DATA
        value: "false"
      image: k8s-driver-manager
      imagePullPolicy: IfNotPresent
      repository: nvcr.io/nvidia/cloud-native
      version: v0.6.1
    rdma:
      enabled: false
      useHostMofed: false
    repoConfig:
      configMapName: ""
    repository: nvcr.io/nvidia
    startupProbe:
      failureThreshold: 120
      initialDelaySeconds: 60
      periodSeconds: 10
      timeoutSeconds: 60
    upgradePolicy:
      autoUpgrade: true
      drain:
        deleteEmptyDir: false
        enable: false
        force: false
        timeoutSeconds: 300
      maxParallelUpgrades: 1
      maxUnavailable: 25%
      podDeletion:
        deleteEmptyDir: false
        force: false
        timeoutSeconds: 300
      waitForCompletion:
        timeoutSeconds: 0
    usePrecompiled: false
    version: 525.105.17
    virtualTopology:
      config: ""
  gfd:
    enabled: true
    env:
    - name: GFD_SLEEP_INTERVAL
      value: 60s
    - name: GFD_FAIL_ON_INIT_ERROR
      value: "true"
    image: gpu-feature-discovery
    imagePullPolicy: IfNotPresent
    repository: nvcr.io/nvidia
    version: v0.8.0-ubi8
  mig:
    strategy: single
  migManager:
    config:
      default: all-disabled
      name: default-mig-parted-config
    enabled: true
    env:
    - name: WITH_REBOOT
      value: "false"
    gpuClientsConfig:
      name: ""
    image: k8s-mig-manager
    imagePullPolicy: IfNotPresent
    repository: nvcr.io/nvidia/cloud-native
    version: v0.5.2-ubuntu20.04
  nodeStatusExporter:
    enabled: false
    image: gpu-operator-validator
    imagePullPolicy: IfNotPresent
    repository: nvcr.io/nvidia/cloud-native
    version: v23.3.2
  operator:
    defaultRuntime: containerd
    initContainer:
      image: cuda
      imagePullPolicy: IfNotPresent
      repository: nvcr.io/nvidia
      version: 12.1.1-base-ubi8
    runtimeClass: nvidia
  psp:
    enabled: false
  sandboxDevicePlugin:
    enabled: true
    image: kubevirt-gpu-device-plugin
    imagePullPolicy: IfNotPresent
    repository: nvcr.io/nvidia
    version: v1.2.1
  sandboxWorkloads:
    defaultWorkload: container
    enabled: false
  toolkit:
    enabled: true
    image: container-toolkit
    imagePullPolicy: IfNotPresent
    installDir: /usr/local/nvidia
    repository: nvcr.io/nvidia/k8s
    version: v1.13.0-ubuntu20.04
  validator:
    image: gpu-operator-validator
    imagePullPolicy: IfNotPresent
    plugin:
      env:
      - name: WITH_WORKLOAD
        value: "true"
    repository: nvcr.io/nvidia/cloud-native
    version: v23.3.2
  vfioManager:
    driverManager:
      env:
      - name: ENABLE_GPU_POD_EVICTION
        value: "false"
      - name: ENABLE_AUTO_DRAIN
        value: "false"
      image: k8s-driver-manager
      imagePullPolicy: IfNotPresent
      repository: nvcr.io/nvidia/cloud-native
      version: v0.6.1
    enabled: true
    image: cuda
    imagePullPolicy: IfNotPresent
    repository: nvcr.io/nvidia
    version: 12.1.1-base-ubi8
  vgpuDeviceManager:
    config:
      default: default
      name: ""
    enabled: true
    image: vgpu-device-manager
    imagePullPolicy: IfNotPresent
    repository: nvcr.io/nvidia/cloud-native
    version: v0.2.1
  vgpuManager:
    driverManager:
      env:
      - name: ENABLE_GPU_POD_EVICTION
        value: "false"
      - name: ENABLE_AUTO_DRAIN
        value: "false"
      image: k8s-driver-manager
      imagePullPolicy: IfNotPresent
      repository: nvcr.io/nvidia/cloud-native
      version: v0.6.1
    enabled: false
    image: vgpu-manager
    imagePullPolicy: IfNotPresent
status:
  namespace: gpu-operator
  state: notReady

3. Steps to reproduce the issue

Deploy the gpu-operator using the Helm chart (v23.3.2).

4. Information to attach (optional if deemed irrelevant)

  • kubernetes pods status: kubectl get pods --all-namespaces

  • kubernetes daemonset status: kubectl get ds --all-namespaces

  • If a pod/ds is in an error state or pending state kubectl describe pod -n NAMESPACE POD_NAME

  • If a pod/ds is in an error state or pending state kubectl logs -n NAMESPACE POD_NAME

  • Output of running a container on the GPU machine: docker run -it alpine echo foo

  • Docker configuration file: cat /etc/docker/daemon.json

  • Docker runtime configuration: docker info | grep runtime (containerd equivalents are sketched right after this list)

  • NVIDIA shared directory: ls -la /run/nvidia

  • NVIDIA packages directory: ls -la /usr/local/nvidia/toolkit

  • NVIDIA driver directory: ls -la /run/nvidia/driver

  • kubelet logs journalctl -u kubelet > kubelet.logs
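
Since this cluster runs containerd rather than Docker, the Docker-specific checks above don't apply directly. A rough sketch of containerd-side equivalents (standard crictl/ctr usage; the exact output format varies by version):

# Inspect the CRI runtime configuration containerd actually loaded
sudo crictl info | grep -iA5 nvidia

# Dump the merged containerd config (including imported files) and look for the runtime entry
sudo containerd config dump | grep -A5 'runtimes.nvidia'

# Run a throwaway container directly against containerd, analogous to `docker run -it alpine echo foo`
sudo ctr --namespace k8s.io image pull docker.io/library/alpine:latest
sudo ctr --namespace k8s.io run --rm docker.io/library/alpine:latest sandbox-test echo foo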

@BartoszZawadzki
Author

Additional info: apart from switching the container runtime from Docker to containerd, I have also tried different gpu-operator settings (values), with CDI enabled/disabled, RDMA enabled/disabled, and others, to no avail.
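
For reference, this is roughly how those toggles were passed (a sketch using plain Helm; in practice the same values went through the ArgoCD Application, and the release, namespace, and repo names here are just examples; cdi.enabled, driver.rdma.enabled, and toolkit.enabled are gpu-operator chart values):

helm upgrade --install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator --create-namespace \
  --set cdi.enabled=false \
  --set driver.rdma.enabled=false \
  --set toolkit.enabled=true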

@acesir

acesir commented Jul 30, 2023

Did you ever figure this out, @BartoszZawadzki? I'm dealing with the same issue on EKS and Ubuntu.

@BartoszZawadzki
Author

No, but since I'm using kops I tried https://kops.sigs.k8s.io/gpu/ instead, and it worked out of the box.

@sunhailin-Leo

I'm also hitting this problem. How can it be solved?

@shivamerla
Contributor

failed to get sandbox runtime: no runtime for "nvidia" is a very generic error that occurs when the container-toolkit is not able to apply the runtime config successfully or the driver install is not working. Please look at the status/logs of the nvidia-driver-daemonset and nvidia-container-toolkit pods to figure out the actual error.
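
A sketch of those checks (the container names below are the ones the operator usually gives the daemonset containers; adjust if yours differ), plus a look at the containerd config on the node itself:

# Toolkit pod: did it manage to write the runtime config?
kubectl -n gpu-operator logs daemonset/nvidia-container-toolkit-daemonset -c nvidia-container-toolkit-ctr

# Driver pod: did the driver install finish?
kubectl -n gpu-operator logs daemonset/nvidia-driver-daemonset -c nvidia-driver-ctr

# On the GPU node: does containerd actually know about an "nvidia" runtime?
sudo grep -A5 'runtimes.nvidia' /etc/containerd/config.toml
sudo systemctl status containerd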

@sunhailin-Leo

@shivamerla

  • Does gpu-operator support Rocky Linux 9.1 (Blue Onyx)?

@shivamerla
Contributor

No, Rocky Linux is not supported currently.

@BartoszZawadzki
Author

failed to get sandbox runtime: no runtime for "nvidia" is a very generic error that occurs when the container-toolkit is not able to apply the runtime config successfully or the driver install is not working. Please look at the status/logs of the nvidia-driver-daemonset and nvidia-container-toolkit pods to figure out the actual error.

I have attached logs from all containers deployed via the gpu-operator Helm chart in the initial issue.

@cwrau

cwrau commented Aug 17, 2023

We're running into the same problem: the gpu-feature-discovery, nvidia-operator-validator, nvidia-dcgm-exporter, and nvidia-device-plugin-daemonset pods all fail to start with Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox runtime: no runtime for "nvidia" is configured

nvidia-gpu-operator-node-feature-discovery-worker log

nvidia-driver log

nvidia-container-toolkit-daemonset log


EDIT: Our problem is this issue in containerd, which makes it impossible to use imports to additively configure containerd plugins. In our case we're configuring registry mirrors, which in turn completely overrides NVIDIA's runtime configuration. We'll probably have to go the same route as NVIDIA, meaning we'd have to parse config.toml ourselves, add our config, and write it back.
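
For anyone comparing configs: the entry the container-toolkit normally writes into the containerd config looks roughly like the snippet below (a sketch; the BinaryName path follows the installDir: /usr/local/nvidia shown in the ClusterPolicy above, and exact keys vary between containerd/toolkit versions). If an imports-based override rewrites the CRI plugin section, this is exactly the block that disappears:

# Check what containerd ends up with after merging all imports
sudo containerd config dump | grep -B1 -A5 'runtimes.nvidia'

# Expected, roughly:
#   [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
#     runtime_type = "io.containerd.runc.v2"
#     [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
#       BinaryName = "/usr/local/nvidia/toolkit/nvidia-container-runtime"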

@waterfeeds

Hi, I once encountered the same error; here is my case for reference.
A week ago I installed the NVIDIA driver, toolkit, and device plugin manually to test GPU workloads. I run containerd as the runtime for kubelet on Ubuntu 22.04, and the CUDA tests worked.
A few days ago I tried installing gpu-operator. Before that I uninstalled the NVIDIA driver, toolkit, and device plugin and reverted the /etc/containerd/config.toml config, and I got the same error as you. I had read many old issues about this error, and then found that a gpu-operator committer recommended the lsmod | grep nvidia command. It showed NVIDIA driver modules still loaded by the kernel, meaning the uninstall was incomplete, so I rebooted the host; after that, lsmod | grep nvidia returned nothing. Glad to say, everything was then OK and all the NVIDIA pods became Running.
Hope this is useful to you!
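
In short, the check that comment refers to (run on the node; if modules are still listed after removing the host-installed driver, a reboot clears them):

lsmod | grep nvidia    # should print nothing once the host-installed driver is fully removed
sudo reboot
# after the node comes back up:
lsmod | grep nvidia    # expect empty output before the operator's driver daemonset installs the driver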

@ordinaryparksee

ordinaryparksee commented Jun 19, 2024

This problem may be caused by failed symlink creation.
I don't think it's the cleanest approach, but you can avoid the issue by disabling symlink creation.

First, check whether your problem matches this situation:
kubectl logs -f nvidia-container-toolkit-daemonset-j8wcf -n gpu-operator-resources -c driver-validation

If it does, you will see an error message like the one below:
time="2024-06-19T07:21:42Z" level=info msg="Error: error validating driver installation: error creating symlink creator: failed to create NVIDIA device nodes: failed to create device node nvidiactl: failed to determine major: invalid device node\n\nFailed to create symlinks under /dev/char that point to all possible NVIDIA character devices.\nThe existence of these symlinks is required to address the following bug:\n\n https://github.com/NVIDIA/gpu-operator/issues/430\n\nThis bug impacts container runtimes configured with systemd cgroup management enabled.\nTo disable the symlink creation, set the following envvar in ClusterPolicy:\n\n validator:\n driver:\n env:\n - name: DISABLE_DEV_CHAR_SYMLINK_CREATION\n value: \"true\""

Now just follow that message.

To summarize:

  1. Open the ClusterPolicy with kubectl edit clusterpolicies.nvidia.com
  2. Find the validator: section and add the driver: part.

The result is:

  validator:
    driver:
      env:
      - name: DISABLE_DEV_CHAR_SYMLINK_CREATION
        value: "true"
    image: gpu-operator-validator
    imagePullPolicy: IfNotPresent
    plugin:
      env:
      - name: WITH_WORKLOAD
        value: "false"
    repository: nvcr.io/nvidia/cloud-native
    version: v23.9.1
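
If you prefer not to hand-edit the resource, the same env var can be set with a one-off patch (a sketch; it assumes the ClusterPolicy is named cluster-policy, as in the report above):

kubectl patch clusterpolicies.nvidia.com cluster-policy --type merge -p \
  '{"spec":{"validator":{"driver":{"env":[{"name":"DISABLE_DEV_CHAR_SYMLINK_CREATION","value":"true"}]}}}}'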

@choucavalier

Hey, I have the exact same error as mentioned by @ordinaryparksee.

What's going on with these symlinks? I don't understand :/
