
[BUG] kbcli addon enable nvidia-gpu-exporter failed on GKE #4949

Closed
ahjing99 opened this issue Sep 1, 2023 · 3 comments · Fixed by #6313

ahjing99 commented Sep 1, 2023

kbcli version
Kubernetes: v1.27.3-gke.100
KubeBlocks: 0.7.0-alpha.4
kbcli: 0.7.0-alpha.4

 kbcli addon enable nvidia-gpu-exporter
addon.extensions.kubeblocks.io/nvidia-gpu-exporter enabled

k get pod
NAME                                            READY   STATUS                 RESTARTS   AGE
csi-attacher-s3-0                               1/1     Running                0          4m7s
csi-provisioner-s3-0                            2/2     Running                0          4m7s
csi-s3-8nxsv                                    2/2     Running                0          4m7s
csi-s3-njdjz                                    2/2     Running                0          4m7s
csi-s3-v4fw9                                    2/2     Running                0          4m7s
install-neon-addon-44d69                        0/1     Error                  0          3m36s
install-neon-addon-5zvzx                        0/1     Error                  0          4m14s
install-neon-addon-nhv4s                        0/1     Error                  0          2m52s
install-neon-addon-vzftz                        0/1     Error                  0          3m59s
kb-addon-kubebench-fd7f9cd56-xn9r5              1/1     Running                0          13m
kb-addon-nvidia-gpu-exporter-82z5d              0/1     CreateContainerError   0          4s
kb-addon-nvidia-gpu-exporter-jjn8n              0/1     CreateContainerError   0          4s
kb-addon-nvidia-gpu-exporter-zfp9d              0/1     CreateContainerError   0          4s
kb-addon-snapshot-controller-65fcc74964-s57qd   1/1     Running                0          13m
kubeblocks-5bffff55b8-c8794                     1/1     Running                0          17m
kubeblocks-dataprotection-5d96f4b8cd-wc8xz      1/1     Running                0          17m

 k describe pod kb-addon-nvidia-gpu-exporter-82z5d
Name:         kb-addon-nvidia-gpu-exporter-82z5d
Namespace:    default
Priority:     0
Node:         gke-yjtest-default-pool-f59be211-2vqs/10.128.0.46
Start Time:   Fri, 01 Sep 2023 10:47:25 +0800
Labels:       app.kubernetes.io/instance=kb-addon-nvidia-gpu-exporter
              app.kubernetes.io/name=nvidia-gpu-exporter
              controller-revision-hash=74d969d6bd
              pod-template-generation=1
Annotations:  <none>
Status:       Pending
IP:           10.104.1.168
IPs:
  IP:           10.104.1.168
Controlled By:  DaemonSet/kb-addon-nvidia-gpu-exporter
Containers:
  nvidia-gpu-exporter:
    Container ID:
    Image:         docker.io/utkuozdemir/nvidia_gpu_exporter:0.3.0
    Image ID:
    Port:          9835/TCP
    Host Port:     0/TCP
    Args:
      --web.listen-address
      :9835
      --web.telemetry-path
      /metrics
      --nvidia-smi-command
      nvidia-smi
      --query-field-names
      AUTO
      --log.level
      info
      --log.format
      logfmt
    State:          Waiting
      Reason:       CreateContainerError
    Ready:          False
    Restart Count:  0
    Liveness:       http-get http://:http/ delay=0s timeout=1s period=10s #success=1 #failure=3
    Readiness:      http-get http://:http/ delay=0s timeout=1s period=10s #success=1 #failure=3
    Environment:    <none>
    Mounts:
      /dev/nvidia0 from nvidia0 (rw)
      /dev/nvidiactl from nvidiactl (rw)
      /usr/bin/nvidia-smi from nvidia-smi (rw)
      /usr/lib/x86_64-linux-gnu/libnvidia-ml.so from libnvidia-ml-so (rw)
      /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1 from libnvidia-ml-so-1 (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-xklqp (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  nvidiactl:
    Type:          HostPath (bare host directory volume)
    Path:          /dev/nvidiactl
    HostPathType:
  nvidia0:
    Type:          HostPath (bare host directory volume)
    Path:          /dev/nvidia0
    HostPathType:
  nvidia-smi:
    Type:          HostPath (bare host directory volume)
    Path:          /usr/bin/nvidia-smi
    HostPathType:
  libnvidia-ml-so:
    Type:          HostPath (bare host directory volume)
    Path:          /usr/lib64/libnvidia-ml.so
    HostPathType:
  libnvidia-ml-so-1:
    Type:          HostPath (bare host directory volume)
    Path:          /usr/lib64/libnvidia-ml.so.1
    HostPathType:
  kube-api-access-xklqp:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/disk-pressure:NoSchedule op=Exists
                             node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists
                             node.kubernetes.io/pid-pressure:NoSchedule op=Exists
                             node.kubernetes.io/unreachable:NoExecute op=Exists
                             node.kubernetes.io/unschedulable:NoSchedule op=Exists
Events:
  Type     Reason     Age               From               Message
  ----     ------     ----              ----               -------
  Normal   Scheduled  20s               default-scheduler  Successfully assigned default/kb-addon-nvidia-gpu-exporter-82z5d to gke-yjtest-default-pool-f59be211-2vqs
  Warning  Failed     20s               kubelet            Error: failed to generate container "0f4412cde8b12a9725755ac71e5b6688f99ffcc2809ea5e02d05dbae4962a819" spec: failed to generate spec: failed to mkdir "/usr/bin/nvidia-smi": mkdir /usr/bin/nvidia-smi: read-only file system
  Warning  Failed     20s               kubelet            Error: failed to generate container "e9013e1bcda9a694c59cc5227ff079a69080e01137a5fa8a6f73d83763818bf5" spec: failed to generate spec: failed to mkdir "/usr/bin/nvidia-smi": mkdir /usr/bin/nvidia-smi: read-only file system
  Normal   Pulled     5s (x3 over 20s)  kubelet            Container image "docker.io/utkuozdemir/nvidia_gpu_exporter:0.3.0" already present on machine
  Warning  Failed     5s                kubelet            Error: failed to generate container "a0ab1f4da8105f5c3407805dd3cd85b05731aa6a6fd95b1c7d66eae1279a1964" spec: failed to generate spec: failed to mkdir "/usr/bin/nvidia-smi": mkdir /usr/bin/nvidia-smi: read-only file system
@ahjing99 ahjing99 added the kind/bug label Sep 1, 2023
@ahjing99 ahjing99 added this to the Release 0.7.0 milestone Sep 1, 2023
@ahjing99 ahjing99 changed the title from "[BUG] kbcli addon enable nvidia-gpu-exporter failed" to "[BUG] kbcli addon enable nvidia-gpu-exporter failed on GKE" Sep 1, 2023

github-actions bot commented Oct 2, 2023

This issue has been marked as stale because it has been open for 30 days with no activity.

@github-actions github-actions bot added the Stale label Oct 2, 2023
@ahjing99 ahjing99 modified the milestones: Release 0.7.0, Release 0.8.0 Nov 6, 2023

iziang commented Jan 2, 2024

It's because nvidia-gpu-exporter relies on the nvidia-smi binary on the host, but the node OS on GKE (Container-Optimized OS) doesn't ship that binary at /usr/bin/nvidia-smi. Since the root filesystem there is read-only, the kubelet's attempt to create the missing hostPath mount point also fails, which is the "mkdir /usr/bin/nvidia-smi: read-only file system" error in the events above.
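The failing mounts are visible in the pod description above: the addon hard-codes stock Linux driver paths as hostPath volumes. A hedged sketch of what a GKE-compatible volume could look like instead (the /home/kubernetes/bin/nvidia location is an assumption based on where GKE's driver-installer DaemonSet typically places the NVIDIA userspace tools on Container-Optimized OS; verify it on your nodes before relying on it):

```yaml
# Sketch only, not the addon's actual fix. On GKE COS nodes the NVIDIA
# userspace tools are installed by the driver installer under
# /home/kubernetes/bin/nvidia (assumed path), while /usr/bin is part of
# the read-only root filesystem, so mounting /usr/bin/nvidia-smi fails.
volumes:
  - name: nvidia-smi
    hostPath:
      path: /home/kubernetes/bin/nvidia/bin/nvidia-smi  # assumed GKE location
      type: File  # fail fast if absent, instead of the kubelet trying mkdir on a read-only fs
```

Setting `type: File` also changes the failure mode: the pod is rejected with a clear "file not found" event rather than an attempt to create the path.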

@iziang iziang removed the Stale label Jan 2, 2024

iziang commented Jan 2, 2024

The nvidia-gpu-exporter addon should only be available on EKS.
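One way to express that restriction is through the KubeBlocks Addon CR's installable selectors. This is a hedged sketch, not necessarily what #6313 implements; the selector key and values below are assumptions:

```yaml
# Sketch: gate the addon so it only auto-installs on EKS clusters,
# matching on the provider suffix in the API server's git version string.
apiVersion: extensions.kubeblocks.io/v1alpha1
kind: Addon
metadata:
  name: nvidia-gpu-exporter
spec:
  installable:
    autoInstall: false
    selectors:
      - key: KubeGitVersion  # assumed selector key
        operator: Contains
        values:
          - eks
```

With a selector like this, `kbcli addon enable` on a non-matching cluster (such as the GKE cluster above, whose server version is v1.27.3-gke.100) would refuse to install rather than deploy pods that can never start.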
