
[BUG] kbcli addon enable nvidia-gpu-exporter failed on GKE #4949

Closed
ahjing99 opened this issue Sep 1, 2023 · 3 comments · Fixed by #6313

ahjing99 commented Sep 1, 2023

kbcli version
Kubernetes: v1.27.3-gke.100
KubeBlocks: 0.7.0-alpha.4
kbcli: 0.7.0-alpha.4

 kbcli addon enable nvidia-gpu-exporter
addon.extensions.kubeblocks.io/nvidia-gpu-exporter enabled

k get pod
NAME                                            READY   STATUS                 RESTARTS   AGE
csi-attacher-s3-0                               1/1     Running                0          4m7s
csi-provisioner-s3-0                            2/2     Running                0          4m7s
csi-s3-8nxsv                                    2/2     Running                0          4m7s
csi-s3-njdjz                                    2/2     Running                0          4m7s
csi-s3-v4fw9                                    2/2     Running                0          4m7s
install-neon-addon-44d69                        0/1     Error                  0          3m36s
install-neon-addon-5zvzx                        0/1     Error                  0          4m14s
install-neon-addon-nhv4s                        0/1     Error                  0          2m52s
install-neon-addon-vzftz                        0/1     Error                  0          3m59s
kb-addon-kubebench-fd7f9cd56-xn9r5              1/1     Running                0          13m
kb-addon-nvidia-gpu-exporter-82z5d              0/1     CreateContainerError   0          4s
kb-addon-nvidia-gpu-exporter-jjn8n              0/1     CreateContainerError   0          4s
kb-addon-nvidia-gpu-exporter-zfp9d              0/1     CreateContainerError   0          4s
kb-addon-snapshot-controller-65fcc74964-s57qd   1/1     Running                0          13m
kubeblocks-5bffff55b8-c8794                     1/1     Running                0          17m
kubeblocks-dataprotection-5d96f4b8cd-wc8xz      1/1     Running                0          17m

 k describe pod kb-addon-nvidia-gpu-exporter-82z5d
Name:         kb-addon-nvidia-gpu-exporter-82z5d
Namespace:    default
Priority:     0
Node:         gke-yjtest-default-pool-f59be211-2vqs/10.128.0.46
Start Time:   Fri, 01 Sep 2023 10:47:25 +0800
Labels:       app.kubernetes.io/instance=kb-addon-nvidia-gpu-exporter
              app.kubernetes.io/name=nvidia-gpu-exporter
              controller-revision-hash=74d969d6bd
              pod-template-generation=1
Annotations:  <none>
Status:       Pending
IP:           10.104.1.168
IPs:
  IP:           10.104.1.168
Controlled By:  DaemonSet/kb-addon-nvidia-gpu-exporter
Containers:
  nvidia-gpu-exporter:
    Container ID:
    Image:         docker.io/utkuozdemir/nvidia_gpu_exporter:0.3.0
    Image ID:
    Port:          9835/TCP
    Host Port:     0/TCP
    Args:
      --web.listen-address
      :9835
      --web.telemetry-path
      /metrics
      --nvidia-smi-command
      nvidia-smi
      --query-field-names
      AUTO
      --log.level
      info
      --log.format
      logfmt
    State:          Waiting
      Reason:       CreateContainerError
    Ready:          False
    Restart Count:  0
    Liveness:       http-get http://:http/ delay=0s timeout=1s period=10s #success=1 #failure=3
    Readiness:      http-get http://:http/ delay=0s timeout=1s period=10s #success=1 #failure=3
    Environment:    <none>
    Mounts:
      /dev/nvidia0 from nvidia0 (rw)
      /dev/nvidiactl from nvidiactl (rw)
      /usr/bin/nvidia-smi from nvidia-smi (rw)
      /usr/lib/x86_64-linux-gnu/libnvidia-ml.so from libnvidia-ml-so (rw)
      /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1 from libnvidia-ml-so-1 (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-xklqp (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  nvidiactl:
    Type:          HostPath (bare host directory volume)
    Path:          /dev/nvidiactl
    HostPathType:
  nvidia0:
    Type:          HostPath (bare host directory volume)
    Path:          /dev/nvidia0
    HostPathType:
  nvidia-smi:
    Type:          HostPath (bare host directory volume)
    Path:          /usr/bin/nvidia-smi
    HostPathType:
  libnvidia-ml-so:
    Type:          HostPath (bare host directory volume)
    Path:          /usr/lib64/libnvidia-ml.so
    HostPathType:
  libnvidia-ml-so-1:
    Type:          HostPath (bare host directory volume)
    Path:          /usr/lib64/libnvidia-ml.so.1
    HostPathType:
  kube-api-access-xklqp:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/disk-pressure:NoSchedule op=Exists
                             node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists
                             node.kubernetes.io/pid-pressure:NoSchedule op=Exists
                             node.kubernetes.io/unreachable:NoExecute op=Exists
                             node.kubernetes.io/unschedulable:NoSchedule op=Exists
Events:
  Type     Reason     Age               From               Message
  ----     ------     ----              ----               -------
  Normal   Scheduled  20s               default-scheduler  Successfully assigned default/kb-addon-nvidia-gpu-exporter-82z5d to gke-yjtest-default-pool-f59be211-2vqs
  Warning  Failed     20s               kubelet            Error: failed to generate container "0f4412cde8b12a9725755ac71e5b6688f99ffcc2809ea5e02d05dbae4962a819" spec: failed to generate spec: failed to mkdir "/usr/bin/nvidia-smi": mkdir /usr/bin/nvidia-smi: read-only file system
  Warning  Failed     20s               kubelet            Error: failed to generate container "e9013e1bcda9a694c59cc5227ff079a69080e01137a5fa8a6f73d83763818bf5" spec: failed to generate spec: failed to mkdir "/usr/bin/nvidia-smi": mkdir /usr/bin/nvidia-smi: read-only file system
  Normal   Pulled     5s (x3 over 20s)  kubelet            Container image "docker.io/utkuozdemir/nvidia_gpu_exporter:0.3.0" already present on machine
  Warning  Failed     5s                kubelet            Error: failed to generate container "a0ab1f4da8105f5c3407805dd3cd85b05731aa6a6fd95b1c7d66eae1279a1964" spec: failed to generate spec: failed to mkdir "/usr/bin/nvidia-smi": mkdir /usr/bin/nvidia-smi: read-only file system
@ahjing99 ahjing99 added the kind/bug label Sep 1, 2023
@ahjing99 ahjing99 added this to the Release 0.7.0 milestone Sep 1, 2023
@ahjing99 ahjing99 changed the title from "[BUG] kbcli addon enable nvidia-gpu-exporter failed" to "[BUG] kbcli addon enable nvidia-gpu-exporter failed on GKE" Sep 1, 2023

github-actions bot commented Oct 2, 2023

This issue has been marked as stale because it has been open for 30 days with no activity.

@github-actions github-actions bot added the Stale label Oct 2, 2023
@ahjing99 ahjing99 modified the milestones: Release 0.7.0, Release 0.8.0 Nov 6, 2023

iziang commented Jan 2, 2024

It's because nvidia-gpu-exporter relies on the nvidia-smi binary on the host, but the node OS on GKE (Container-Optimized OS) doesn't ship that binary at /usr/bin/nvidia-smi. Since the root filesystem there is read-only, the kubelet's attempt to create the missing hostPath mount point also fails, which is the "mkdir /usr/bin/nvidia-smi: read-only file system" error in the events above.
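The failing mounts are visible in the pod description above: the addon hard-codes stock Linux driver paths as hostPath volumes. A hedged sketch of what a GKE-compatible volume could look like instead (the /home/kubernetes/bin/nvidia location is an assumption based on where GKE's driver-installer DaemonSet typically places the NVIDIA userspace tools on Container-Optimized OS; verify it on your nodes before relying on it):

```yaml
# Sketch only, not the addon's actual fix. On GKE COS nodes the NVIDIA
# userspace tools are installed by the driver installer under
# /home/kubernetes/bin/nvidia (assumed path), while /usr/bin is part of
# the read-only root filesystem, so mounting /usr/bin/nvidia-smi fails.
volumes:
  - name: nvidia-smi
    hostPath:
      path: /home/kubernetes/bin/nvidia/bin/nvidia-smi  # assumed GKE location
      type: File  # fail fast if absent, instead of the kubelet trying mkdir on a read-only fs
```

Setting `type: File` also changes the failure mode: the pod is rejected with a clear "file not found" event rather than an attempt to create the path.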

@iziang iziang removed the Stale label Jan 2, 2024

iziang commented Jan 2, 2024

The nvidia-gpu-exporter addon should only be available on EKS.
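One way to express that restriction is through the KubeBlocks Addon CR's installable selectors. This is a hedged sketch, not necessarily what #6313 implements; the selector key and values below are assumptions:

```yaml
# Sketch: gate the addon so it only auto-installs on EKS clusters,
# matching on the provider suffix in the API server's git version string.
apiVersion: extensions.kubeblocks.io/v1alpha1
kind: Addon
metadata:
  name: nvidia-gpu-exporter
spec:
  installable:
    autoInstall: false
    selectors:
      - key: KubeGitVersion  # assumed selector key
        operator: Contains
        values:
          - eks
```

With a selector like this, `kbcli addon enable` on a non-matching cluster (such as the GKE cluster above, whose server version is v1.27.3-gke.100) would refuse to install rather than deploy pods that can never start.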
