
BREAKS ON 1.25: Does not work on k8s 1.25 due to node API deprecation #458

sfxworks opened this issue Dec 7, 2022 · 16 comments

@sfxworks commented Dec 7, 2022

1. Quick Debug Checklist

  • Are you running on an Ubuntu 18.04 node?
  • Are you running Kubernetes v1.13+?
  • Are you running Docker (>= 18.06) or CRIO (>= 1.13+)?
  • Do you have i2c_core and ipmi_msghandler loaded on the nodes?
  • Did you apply the CRD (kubectl describe clusterpolicies --all-namespaces)

1. Issue or feature description

When deployed with Helm, the operator references an API version that was removed in 1.25, which prevents deployment from completing.

As noted in https://kubernetes.io/docs/reference/using-api/deprecation-guide/#runtimeclass-v125, the node.k8s.io/v1beta1 API is no longer served as of 1.25; everything is v1 now:

kubectl get node home-2cf05d8a44a0 -o yaml | head -2
apiVersion: v1
kind: Node

The operator cannot reconcile, and as a result the deployment of any pod requesting a GPU fails:

1.6704367753266153e+09  INFO    controllers.ClusterPolicy       Checking GPU state labels on the node   {"NodeName": "home-2cf05d8a44a0"}
1.6704367753266478e+09  INFO    controllers.ClusterPolicy        -      {"Label=": "nvidia.com/gpu.deploy.node-status-exporter", " value=": "true"}
1.670436775326656e+09   INFO    controllers.ClusterPolicy        -      {"Label=": "nvidia.com/gpu.deploy.operator-validator", " value=": "true"}
1.6704367753266625e+09  INFO    controllers.ClusterPolicy        -      {"Label=": "nvidia.com/gpu.deploy.driver", " value=": "true"}
1.6704367753266687e+09  INFO    controllers.ClusterPolicy        -      {"Label=": "nvidia.com/gpu.deploy.gpu-feature-discovery", " value=": "true"}
1.6704367753266747e+09  INFO    controllers.ClusterPolicy        -      {"Label=": "nvidia.com/gpu.deploy.container-toolkit", " value=": "true"}
1.670436775326681e+09   INFO    controllers.ClusterPolicy        -      {"Label=": "nvidia.com/gpu.deploy.device-plugin", " value=": "true"}
1.6704367753266864e+09  INFO    controllers.ClusterPolicy        -      {"Label=": "nvidia.com/gpu.deploy.dcgm", " value=": "true"}
1.6704367753266923e+09  INFO    controllers.ClusterPolicy        -      {"Label=": "nvidia.com/gpu.deploy.dcgm-exporter", " value=": "true"}
1.67043677532671e+09    INFO    controllers.ClusterPolicy       Number of nodes with GPU label  {"NodeCount": 1}
1.6704367753267498e+09  INFO    controllers.ClusterPolicy       Using container runtime: crio
1.6704367755844975e+09  ERROR   controller.clusterpolicy-controller     Reconciler error        {"name": "cluster-policy", "namespace": "", "error": "no matches for kind \"RuntimeClass\" in version \"node.k8s.io/v1beta1\""}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
        /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:266
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
        /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:227
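
For reference, a quick way to confirm which node.k8s.io API versions the cluster still serves (on 1.25 only v1 should be listed), using plain kubectl:

    kubectl api-versions | grep node.k8s.io
    kubectl get runtimeclasses.node.k8s.io -o name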

2. Steps to reproduce the issue

  1. Run Kubernetes 1.25
  2. Deploy the helm operator
@sfxworks (Author) commented Dec 7, 2022

According to #401 (comment), this change was already applied, but the Helm chart may not be referencing the latest image by default.
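
One way to check which operator image the chart would deploy, and to pin it explicitly, is sketched below (value names are taken from the chart's values.yaml; verify them with helm show values for your chart version):

    helm show values nvidia/gpu-operator | grep -A 3 '^operator:'
    helm upgrade --install gpu-operator nvidia/gpu-operator -n gpu-operator \
      --set operator.version=v22.9.0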

@cdesiniotis (Contributor)

@sfxworks what version of GPU Operator are you using? We migrated to node.k8s.io/v1 in v22.9.0.
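
If it helps, one way to see which chart release and operator image are actually installed (a sketch, assuming the release lives in the gpu-operator namespace; adjust names to your setup):

    helm list -n gpu-operator
    kubectl -n gpu-operator get deploy -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.template.spec.containers[0].image}{"\n"}{end}'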

@sfxworks (Author) commented Dec 7, 2022

nvidia-driver-daemonset-ttzrt 0/1 Init:0/1 0 22s 10.0.7.146 home-2cf05d8a44a0 <none> <none>

The tag you linked worked.

Though now other images are having issues with their default tags:

  Normal   Pulling    70s (x4 over 2m39s)  kubelet            Pulling image "nvcr.io/nvidia/driver:525.60.13-"
  Warning  Failed     68s (x4 over 2m37s)  kubelet            Failed to pull image "nvcr.io/nvidia/driver:525.60.13-": rpc error: code = Unknown desc = reading manifest 525.60.13- in nvcr.io/nvidia/driver: manifest unknown: manifest unknown

Is there a publicly viewable way to see your registry's tags, to resolve this more quickly? The listings just time out for me.
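
For reference, one way to list tags without a browser (a sketch; assumes skopeo is installed and that anonymous tag listing is allowed for this public repository):

    skopeo list-tags docker://nvcr.io/nvidia/driver | grep 525.60.13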

@sfxworks (Author) commented Dec 7, 2022

Changing the version of the driver to latest in the Helm chart still appends a trailing -, leading to an invalid image:
image: nvcr.io/nvidia/driver:latest-

      containers:
      - args:
        - init
        command:
        - nvidia-driver
        image: nvcr.io/nvidia/driver:latest-
        imagePullPolicy: IfNotPresent
        name: nvidia-driver-ctr
        resources: {}
        securityContext:
          privileged: true
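
The bare trailing - suggests the OS suffix could not be derived. A quick way to check whether the NFD OS-release labels mentioned later in this thread are present on the nodes (a sketch using plain kubectl):

    kubectl get nodes --show-labels | tr ',' '\n' | grep system-os_release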

@sfxworks (Author) commented Dec 7, 2022

It doesn't like my kernel anyway I guess :/

Defaulted container "nvidia-driver-ctr" out of: nvidia-driver-ctr, k8s-driver-manager (init)

========== NVIDIA Software Installer ==========

Starting installation of NVIDIA driver version 450.80.02 for Linux kernel version 6.0.11-hardened1-1-hardened

Stopping NVIDIA persistence daemon...
Unloading NVIDIA driver kernel modules...
Unmounting NVIDIA driver rootfs...
Checking NVIDIA driver packages...
Updating the package cache...
Resolving Linux kernel version...
Could not resolve Linux kernel version
Stopping NVIDIA persistence daemon...
Unloading NVIDIA driver kernel modules...
Unmounting NVIDIA driver rootfs...

@sfxworks (Author) commented Dec 8, 2022

Switching the machine over from linux-hardened to the standard linux kernel, with the above adjustments, seems to have been successful. Between then and now I did not have to adjust the DaemonSet either.

    nvidia.com/gpu.compute.major: "7"
    nvidia.com/gpu.compute.minor: "5"
    nvidia.com/gpu.count: "1"
    nvidia.com/gpu.deploy.container-toolkit: "true"
    nvidia.com/gpu.deploy.dcgm: "true"
    nvidia.com/gpu.deploy.dcgm-exporter: "true"
  Resource           Requests       Limits
  --------           --------       ------
  cpu                3300m (30%)    3500m (31%)
  memory             12488Mi (19%)  12638Mi (19%)
  ephemeral-storage  0 (0%)         0 (0%)
  hugepages-1Gi      0 (0%)         0 (0%)
  hugepages-2Mi      0 (0%)         0 (0%)
  nvidia.com/gpu     0              0

@cdesiniotis (Contributor)

@sfxworks for installing the latest helm charts, please refer to: https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/getting-started.html#install-nvidia-gpu-operator.

We append a -<os> suffix (e.g. -ubuntu20.04) to match the OS of your worker nodes. We depend on labels from NFD (feature.node.kubernetes.io/system-os_release.ID and feature.node.kubernetes.io/system-os_release.VERSION_ID) to get this information. If only - was appended, then it's possible these labels were missing.

Concerning the kernel version, the driver container requires several kernel packages (e.g. kernel-devel). From your logs, it appears it could not find these packages for 6.0.11-hardened1-1-hardened. A workaround is to pass a custom repository file to the driver pod so it can properly find packages for that particular kernel. The following page has some details on how to do this: https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/appendix.html#local-package-repository
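
For context, the workaround on that page boils down to creating a ConfigMap with the repository definition and pointing the driver at it via Helm values (a sketch; the driver.repoConfig value name and the expected file names should be verified against the linked page for your operator version):

    # ConfigMap holding the custom repository file, in the operator namespace
    kubectl create configmap repo-config -n gpu-operator --from-file=custom.repo
    # Point the driver container at it via the chart value
    helm upgrade --install gpu-operator nvidia/gpu-operator -n gpu-operator \
      --set driver.repoConfig.configMapName=repo-config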

@sfxworks (Author)

I have feature.node.kubernetes.io/system-os_release.ID: arch, though I do not have feature.node.kubernetes.io/system-os_release.VERSION_ID on any nodes (some Manjaro-based, some Arch-based). I cannot remember how I had this working before...

@DatCanCode

I just installed GPU Operator with Helm, chart version v23.3.1. This version uses the nvcr.io/nvidia/gpu-operator:devel-ubi8 image, which hits exactly this error:

1.6704367755844975e+09  ERROR   controller.clusterpolicy-controller     Reconciler error        {"name": "cluster-policy", "namespace": "", "error": "no matches for kind \"RuntimeClass\" in version \"node.k8s.io/v1beta1\""}

When I change GPU Operator to version v22.9.2, it uses the nvcr.io/nvidia/gpu-operator:v22.9.0 image and the error disappears. Can you please check it again, @cdesiniotis?
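
One way to confirm which operator image a release is actually running (a sketch; the deployment name follows the Helm release name, so adjust as needed):

    kubectl -n gpu-operator get deployment gpu-operator \
      -o jsonpath='{.spec.template.spec.containers[0].image}{"\n"}'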

@berlincount commented Jul 30, 2023

I'm also running into this issue ... on both release-23.03 and master branches. microk8s v1.27.2 on Ubuntu 22.04.2 LTS.

@acesir commented Aug 1, 2023

> I'm also running into this issue ... on both release-23.03 and master branches. microk8s v1.27.2 on Ubuntu 22.04.2 LTS.

Same error for us with EKS 1.27 and Ubuntu 22

@shnigam2

Same error with release 23.3.1. Any solution?

@robjcook commented Aug 13, 2024

Also running into this on Amazon Linux 2. Any known solution or workaround, or is something missing in the docs? Trying to override the API version or look at the DaemonSet values next.

release v24.6.1 - nvcr.io/nvidia/gpu-operator:devel-ubi8

ERROR   controller.clusterpolicy-controller     Reconciler error        {"name": "cluster-policy", "namespace": "", "error": "no matches for kind \"RuntimeClass\" in version \"node.k8s.io/v1beta1\""}

kubectl describe node GPU-NODE | grep system
feature.node.kubernetes.io/system-os_release.ID=amzn
feature.node.kubernetes.io/system-os_release.VERSION_ID=2
feature.node.kubernetes.io/system-os_release.VERSION_ID.major=2

node:
yum list installed | grep kernel
kernel.x86_64 5.10.220-209.869.amzn2 @amzn2extra-kernel-5.10
kernel-devel.x86_64 5.10.220-209.869.amzn2 @amzn2extra-kernel-5.10
kernel-headers.x86_64 5.10.220-209.869.amzn2 @amzn2extra-kernel-5.10

@tariq1890 (Contributor)

@robjcook it looks like you have deployed the local helm chart that is checked into the gpu-operator main branch. We don't recommend using that helm chart.

Please use the Helm chart from the official Helm repo, as instructed here.
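
For reference, the install flow from the official repo looks roughly like this (a sketch based on the getting-started page linked earlier in the thread; check that page for the current flags):

    helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
    helm repo update
    helm install --wait gpu-operator \
      -n gpu-operator --create-namespace \
      nvidia/gpu-operator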

@robjcook commented Oct 11, 2024

Worked past a few node toleration issues and switched to the Helm chart from the official Helm repo.

Now running into an issue where the operator seems to be looking for an image that does not exist and fails to pull:

ImagePullBackOff (Back-off pulling image "nvcr.io/nvidia/driver:550.90.07-amzn2")

Which image do you recommend for an Amazon Linux 2 node, and where do I specify it instead of letting the operator derive it dynamically from the node?

Edit: after digging through the documentation, I'm looking into this now:

https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/precompiled-drivers.html
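
On the "where to specify" question above: the driver image pieces live under the chart's driver values, and the operator appends the OS suffix it derives from the NFD labels (here -amzn2), so the composed tag must actually exist in the registry (a sketch; the value names are from the chart's values.yaml and the version is only an example):

    helm upgrade --install gpu-operator nvidia/gpu-operator -n gpu-operator \
      --set driver.repository=nvcr.io/nvidia \
      --set driver.image=driver \
      --set driver.version=550.90.07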
