Could not resolve Linux kernel version on GKE 1.25.* + GPU Operator version: 23.3.1 #526

Closed
5 tasks
xcheng85 opened this issue May 12, 2023 · 9 comments

@xcheng85

  1. Quick Debug Checklist
  • Are you running on an Ubuntu 18.04 node? No, Ubuntu 22.04
  • Are you running Kubernetes v1.13+? Yes, Kubernetes v1.25
  • Are you running Docker (>= 18.06) or CRIO (>= 1.13+)? No
  • Do you have i2c_core and ipmi_msghandler loaded on the nodes?
  • Did you apply the CRD (kubectl describe clusterpolicies --all-namespaces)? Yes
  2. Issue or feature description
    Could not resolve Linux kernel version in the daemonset pod.

  3. Steps to reproduce the issue
    https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/google-gke.html

  4. Information to attach (optional if deemed irrelevant)

Node kernel version: 5.15.0-1028-gke on an Ubuntu 22.04 node.
GPU Operator version: 23.3.1

Could you please help with a drop-in replacement file for Ubuntu? https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/google-gke.html only has instructions for CentOS.

Thank you very much.

@nojnhuh

nojnhuh commented May 15, 2023

Our tests for cluster-api-provider-azure are hitting a very similar issue with the 5.15.0-1035-azure kernel on Ubuntu 22.04 and GPU Operator v23.3.1. The same test last passed about a week ago, on 8 May.

@philroche

@xcheng85 Kernel headers, modules and modules-extra packages for GKE kernel versions 5.15.0.1027, 5.15.0.1028 and 5.15.0.1030 have now been restored to the archive.

An exception to archive pruning/deletion has also been made for future gke and gkeop (Anthos on VMware) kernel header and module packages, because GKE deployments are long-lived and often have older kernels installed.

See https://lists.ubuntu.com/archives/ubuntu-devel/2023-May/042571.html for context
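
To check whether the packages are back for a given node kernel, it helps to spell out the package names involved. A minimal sketch, assuming the usual Ubuntu naming scheme of `linux-<kind>-<kernel release>`; the helper name `kernel_packages` is hypothetical and not part of the GPU Operator:

```python
def kernel_packages(kernel_release: str) -> list[str]:
    """Return the Ubuntu package names tied to one kernel release.

    Assumes the standard `linux-<kind>-<release>` naming scheme, where
    <release> is what `uname -r` prints on the node.
    """
    kinds = ["headers", "image", "modules", "modules-extra"]
    return [f"linux-{kind}-{kernel_release}" for kind in kinds]


# Kernel release taken from the node described in this issue.
for name in kernel_packages("5.15.0-1028-gke"):
    print(name)
```

Each name can then be looked up with `apt-cache policy <name>` on an Ubuntu 22.04 host with the same apt sources; a candidate of `(none)` would suggest the package is still missing from the archive.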

@francisguillier
Contributor

francisguillier commented May 17, 2023

thanks @philroche for the info

I tried to deploy GPU Operator 23.3.1 on a GKE cluster (Ubuntu nodes with containerd) and ended up in this state:

```
$ kubectl get pod -n gpu-operator
NAME                                                              READY   STATUS             RESTARTS      AGE
gpu-feature-discovery-rnngc                                       0/1     Init:0/1           0             9m49s
gpu-operator-1684353094-node-feature-discovery-master-98cdxnpqp   1/1     Running            0             53m
gpu-operator-1684353094-node-feature-discovery-worker-dh8rc       1/1     Running            0             53m
gpu-operator-b59d9785-wbkft                                       1/1     Running            0             53m
nvidia-container-toolkit-daemonset-sjqt9                          0/1     Init:0/1           0             9m48s
nvidia-dcgm-exporter-wwdjd                                        0/1     Init:0/1           0             9m48s
nvidia-device-plugin-daemonset-q5ptf                              0/1     Init:0/1           0             9m48s
nvidia-driver-daemonset-6cpzg                                     0/1     CrashLoopBackOff   5 (36s ago)   9m53s
nvidia-operator-validator-s494x                                   0/1     Init:0/4           0             9m49s
```

Logs from the driver container:

```
========== NVIDIA Software Installer ==========

Starting installation of NVIDIA driver version 525.105.17 for Linux kernel version 5.15.0-1028-gke

Stopping NVIDIA persistence daemon...
Unloading NVIDIA driver kernel modules...
Unmounting NVIDIA driver rootfs...
Checking NVIDIA driver packages...
Updating the package cache...
Resolving Linux kernel version...
Proceeding with Linux kernel version 5.15.0-1028-gke
Installing Linux kernel headers...
Installing Linux kernel module files...
E: Can't select candidate version from package linux-image-5.15.0-1028-gke as it has no candidate
Generating Linux kernel version string...
ls: cannot access 'boot/vmlinuz-*': No such file or directory
Could not locate Linux kernel version string
Stopping NVIDIA persistence daemon...
Unloading NVIDIA driver kernel modules...
Unmounting NVIDIA driver rootfs...
```

Is it also planned to restore the kernel image package?
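
The step that fails in the log is the apt error for the image package; the later `vmlinuz` message appears to follow from it. A minimal sketch for pulling the missing package names out of a saved driver container log. The function name `missing_packages` is hypothetical, and the pattern is based only on the single error line shown in this thread:

```python
import re

# Matches apt's "no candidate" error as it appears in the driver log.
NO_CANDIDATE = re.compile(
    r"E: Can't select candidate version from package (\S+) as it has no candidate"
)


def missing_packages(log_text: str) -> list[str]:
    """Return package names that apt reported as having no candidate."""
    return NO_CANDIDATE.findall(log_text)


# Excerpt of the driver container log from this comment.
sample_log = """\
Installing Linux kernel headers...
Installing Linux kernel module files...
E: Can't select candidate version from package linux-image-5.15.0-1028-gke as it has no candidate
Generating Linux kernel version string...
"""

print(missing_packages(sample_log))
```

The log text itself could come from `kubectl logs` on the `nvidia-driver-daemonset` pod, saved to a file and read in as a string.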

@philroche

@francisguillier Re-publication of linux-image-* is now in progress too. See https://launchpad.net/ubuntu/+source/linux-signed-gke/+publishinghistory

@francisguillier
Contributor

@philroche
Do you plan to perform the same restoration for Azure kernels? (Per the second comment on this GitHub issue, @nojnhuh hit a similar issue with the 5.15.0-1035-azure kernel.)

@philroche

Yes. There are plans to restore the Azure packages too. I will update here once that is complete.

@francisguillier
Contributor

@philroche I can confirm everything works fine now on GKE (kernel is 5.15.0-1028-gke).

```
$ kubectl get nodes -o wide
NAME                                       STATUS   ROLES    AGE     VERSION           INTERNAL-IP     EXTERNAL-IP    OS-IMAGE             KERNEL-VERSION    CONTAINER-RUNTIME
gke-cluster-1-default-pool-1142eae3-kmmv   Ready    <none>   3h39m   v1.25.8-gke.500   10.138.15.202   35.230.96.50   Ubuntu 22.04.2 LTS   5.15.0-1028-gke   containerd://1.6.18

```
```
$ kubectl get pods -n gpu-operator
NAME                                                              READY   STATUS      RESTARTS       AGE
gpu-feature-discovery-rnngc                                       1/1     Running     0              164m
gpu-operator-1684353094-node-feature-discovery-master-98cdxnpqp   1/1     Running     0              3h28m
gpu-operator-1684353094-node-feature-discovery-worker-dh8rc       1/1     Running     0              3h28m
gpu-operator-b59d9785-wbkft                                       1/1     Running     0              3h28m
nvidia-container-toolkit-daemonset-sjqt9                          1/1     Running     0              164m
nvidia-cuda-validator-x8v6p                                       0/1     Completed   0              74m
nvidia-dcgm-exporter-wwdjd                                        1/1     Running     0              164m
nvidia-device-plugin-daemonset-q5ptf                              1/1     Running     0              164m
nvidia-device-plugin-validator-hls25                              0/1     Completed   0              74m
nvidia-driver-daemonset-6cpzg                                     1/1     Running     18 (84m ago)   164m
nvidia-operator-validator-s494x                                   1/1     Running     0              164m
```
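
The two pod listings in this thread can also be compared mechanically. A rough sketch, assuming the default `kubectl get pods` column layout shown here (NAME, READY, STATUS first) and treating `Completed` validator pods as healthy; `unhealthy_pods` is a hypothetical helper:

```python
def unhealthy_pods(table: str) -> list[str]:
    """Return names of pods that are neither fully ready nor Completed."""
    bad = []
    for line in table.strip().splitlines()[1:]:  # skip the header row
        name, ready, status = line.split()[:3]
        ready_now, ready_total = ready.split("/")
        if status == "Completed":
            continue  # one-shot validator pods finish with 0/1 ready
        if status != "Running" or ready_now != ready_total:
            bad.append(name)
    return bad


# Three rows from the earlier, failing listing in this thread.
before = """\
NAME                              READY   STATUS             RESTARTS      AGE
gpu-operator-b59d9785-wbkft       1/1     Running            0             53m
nvidia-driver-daemonset-6cpzg     0/1     CrashLoopBackOff   5 (36s ago)   9m53s
nvidia-operator-validator-s494x   0/1     Init:0/4           0             9m49s
"""

print(unhealthy_pods(before))
```

Applied to the listing in this comment, where every pod is either `Running` with all containers ready or `Completed`, the same function should return an empty list.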

@philroche

Re-publication of the Azure kernel packages noted here has now started. See https://launchpad.net/ubuntu/+source/linux-azure/+publishinghistory and https://launchpad.net/ubuntu/+source/linux-signed-azure/+publishinghistory

@shivamerla
Contributor

Closing this as packages were made available again.
