
BREAKS ON 1.25: Does not work on k8s 1.25 due to node API deprecation #458

sfxworks opened this issue Dec 7, 2022 · 16 comments

@sfxworks commented Dec 7, 2022

1. Quick Debug Checklist

  • Are you running on an Ubuntu 18.04 node?
  • Are you running Kubernetes v1.13+?
  • Are you running Docker (>= 18.06) or CRIO (>= 1.13+)?
  • Do you have i2c_core and ipmi_msghandler loaded on the nodes?
  • Did you apply the CRD (kubectl describe clusterpolicies --all-namespaces)

1. Issue or feature description

When deployed with Helm, the operator references an API version that was removed in 1.25, which prevents deployment from completing.

As noted in https://kubernetes.io/docs/reference/using-api/deprecation-guide/#runtimeclass-v125, the node.k8s.io/v1beta1 API is no longer served as of 1.25; everything is v1 now:

kubectl get node home-2cf05d8a44a0 -o yaml | head -2
apiVersion: v1
kind: Node

The operator cannot reconcile, and as a result the deployment of any pod requesting a GPU fails:

1.6704367753266153e+09  INFO    controllers.ClusterPolicy       Checking GPU state labels on the node   {"NodeName": "home-2cf05d8a44a0"}
1.6704367753266478e+09  INFO    controllers.ClusterPolicy        -      {"Label=": "nvidia.com/gpu.deploy.node-status-exporter", " value=": "true"}
1.670436775326656e+09   INFO    controllers.ClusterPolicy        -      {"Label=": "nvidia.com/gpu.deploy.operator-validator", " value=": "true"}
1.6704367753266625e+09  INFO    controllers.ClusterPolicy        -      {"Label=": "nvidia.com/gpu.deploy.driver", " value=": "true"}
1.6704367753266687e+09  INFO    controllers.ClusterPolicy        -      {"Label=": "nvidia.com/gpu.deploy.gpu-feature-discovery", " value=": "true"}
1.6704367753266747e+09  INFO    controllers.ClusterPolicy        -      {"Label=": "nvidia.com/gpu.deploy.container-toolkit", " value=": "true"}
1.670436775326681e+09   INFO    controllers.ClusterPolicy        -      {"Label=": "nvidia.com/gpu.deploy.device-plugin", " value=": "true"}
1.6704367753266864e+09  INFO    controllers.ClusterPolicy        -      {"Label=": "nvidia.com/gpu.deploy.dcgm", " value=": "true"}
1.6704367753266923e+09  INFO    controllers.ClusterPolicy        -      {"Label=": "nvidia.com/gpu.deploy.dcgm-exporter", " value=": "true"}
1.67043677532671e+09    INFO    controllers.ClusterPolicy       Number of nodes with GPU label  {"NodeCount": 1}
1.6704367753267498e+09  INFO    controllers.ClusterPolicy       Using container runtime: crio
1.6704367755844975e+09  ERROR   controller.clusterpolicy-controller     Reconciler error        {"name": "cluster-policy", "namespace": "", "error": "no matches for kind \"RuntimeClass\" in version \"node.k8s.io/v1beta1\""}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
        /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:266
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
        /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:227
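
For reference, a quick way to confirm which node.k8s.io API versions the cluster still serves (on 1.25 only v1 should be listed), using plain kubectl:

    kubectl api-versions | grep node.k8s.io
    kubectl get runtimeclasses.node.k8s.io -o name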

2. Steps to reproduce the issue

  1. Run Kubernetes 1.25
  2. Deploy the helm operator
@sfxworks (Author) commented Dec 7, 2022

According to #401 (comment), this change was already applied, but the Helm chart may not be referencing the latest image by default.
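
One way to check which operator image the chart would deploy, and to pin it explicitly, is sketched below (value names are taken from the chart's values.yaml; verify them with helm show values for your chart version):

    helm show values nvidia/gpu-operator | grep -A 3 '^operator:'
    helm upgrade --install gpu-operator nvidia/gpu-operator -n gpu-operator \
      --set operator.version=v22.9.0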

@cdesiniotis (Contributor)

@sfxworks what version of GPU Operator are you using? We migrated to node.k8s.io/v1 in v22.9.0.
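
If it helps, one way to see which chart release and operator image are actually installed (a sketch, assuming the release lives in the gpu-operator namespace; adjust names to your setup):

    helm list -n gpu-operator
    kubectl -n gpu-operator get deploy -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.template.spec.containers[0].image}{"\n"}{end}'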

@sfxworks (Author) commented Dec 7, 2022

nvidia-driver-daemonset-ttzrt 0/1 Init:0/1 0 22s 10.0.7.146 home-2cf05d8a44a0 <none> <none>

The tag you linked worked.

Though now other images are having issues with their default tags:

  Normal   Pulling    70s (x4 over 2m39s)  kubelet            Pulling image "nvcr.io/nvidia/driver:525.60.13-"
  Warning  Failed     68s (x4 over 2m37s)  kubelet            Failed to pull image "nvcr.io/nvidia/driver:525.60.13-": rpc error: code = Unknown desc = reading manifest 525.60.13- in nvcr.io/nvidia/driver: manifest unknown: manifest unknown

Is there a publicly viewable way to see your registry's tags, to resolve this more quickly? The listings just time out for me.
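
For reference, one way to list tags without a browser (a sketch; assumes skopeo is installed and that anonymous tag listing is allowed for this public repository):

    skopeo list-tags docker://nvcr.io/nvidia/driver | grep 525.60.13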

@sfxworks (Author) commented Dec 7, 2022

Changing the version of the driver to latest in the Helm chart still appends a trailing -, leading to an invalid image:
image: nvcr.io/nvidia/driver:latest-

      containers:
      - args:
        - init
        command:
        - nvidia-driver
        image: nvcr.io/nvidia/driver:latest-
        imagePullPolicy: IfNotPresent
        name: nvidia-driver-ctr
        resources: {}
        securityContext:
          privileged: true
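
The bare trailing - suggests the OS suffix could not be derived. A quick way to check whether the NFD OS-release labels mentioned later in this thread are present on the nodes (a sketch using plain kubectl):

    kubectl get nodes --show-labels | tr ',' '\n' | grep system-os_release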

@sfxworks (Author) commented Dec 7, 2022

It doesn't like my kernel anyway I guess :/

Defaulted container "nvidia-driver-ctr" out of: nvidia-driver-ctr, k8s-driver-manager (init)

========== NVIDIA Software Installer ==========

Starting installation of NVIDIA driver version 450.80.02 for Linux kernel version 6.0.11-hardened1-1-hardened

Stopping NVIDIA persistence daemon...
Unloading NVIDIA driver kernel modules...
Unmounting NVIDIA driver rootfs...
Checking NVIDIA driver packages...
Updating the package cache...
Resolving Linux kernel version...
Could not resolve Linux kernel version
Stopping NVIDIA persistence daemon...
Unloading NVIDIA driver kernel modules...
Unmounting NVIDIA driver rootfs...

@sfxworks (Author) commented Dec 8, 2022

Switching the machine over from linux-hardened to the standard linux kernel, with the above adjustments, seems to have been successful. Between then and now I did not have to adjust the DaemonSet either.

    nvidia.com/gpu.compute.major: "7"
    nvidia.com/gpu.compute.minor: "5"
    nvidia.com/gpu.count: "1"
    nvidia.com/gpu.deploy.container-toolkit: "true"
    nvidia.com/gpu.deploy.dcgm: "true"
    nvidia.com/gpu.deploy.dcgm-exporter: "true"
  Resource           Requests       Limits
  --------           --------       ------
  cpu                3300m (30%)    3500m (31%)
  memory             12488Mi (19%)  12638Mi (19%)
  ephemeral-storage  0 (0%)         0 (0%)
  hugepages-1Gi      0 (0%)         0 (0%)
  hugepages-2Mi      0 (0%)         0 (0%)
  nvidia.com/gpu     0              0

@cdesiniotis (Contributor)

@sfxworks for installing the latest helm charts, please refer to: https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/getting-started.html#install-nvidia-gpu-operator.

We append a -<os> suffix (e.g. -ubuntu20.04) to match the OS of your worker nodes. We depend on labels from NFD (feature.node.kubernetes.io/system-os_release.ID and feature.node.kubernetes.io/system-os_release.VERSION_ID) to get this information. If only - was appended, then it's possible these labels were missing.

Concerning the kernel version, the driver container requires several kernel packages (e.g. kernel-devel). From your logs, it appears it could not find these packages for 6.0.11-hardened1-1-hardened. A workaround is to pass a custom repository file to the driver pod so it can properly find packages for that particular kernel. The following page has some details on how to do this: https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/appendix.html#local-package-repository
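
For context, the workaround on that page boils down to creating a ConfigMap with the repository definition and pointing the driver at it via Helm values (a sketch; the driver.repoConfig value name and the expected file names should be verified against the linked page for your operator version):

    # ConfigMap holding the custom repository file, in the operator namespace
    kubectl create configmap repo-config -n gpu-operator --from-file=custom.repo
    # Point the driver container at it via the chart value
    helm upgrade --install gpu-operator nvidia/gpu-operator -n gpu-operator \
      --set driver.repoConfig.configMapName=repo-config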

@sfxworks (Author)

I have feature.node.kubernetes.io/system-os_release.ID: arch, though I do not have feature.node.kubernetes.io/system-os_release.VERSION_ID on any nodes (some Manjaro-based, some Arch-based). I cannot remember how I had this working before...

@DatCanCode

I just installed GPU Operator with Helm, chart version v23.3.1. This version uses the nvcr.io/nvidia/gpu-operator:devel-ubi8 image, which hits exactly this error:

1.6704367755844975e+09  ERROR   controller.clusterpolicy-controller     Reconciler error        {"name": "cluster-policy", "namespace": "", "error": "no matches for kind \"RuntimeClass\" in version \"node.k8s.io/v1beta1\""}

When I change GPU Operator to version v22.9.2, it uses the nvcr.io/nvidia/gpu-operator:v22.9.0 image and the error disappears. Can you please check it again, @cdesiniotis?
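
One way to confirm which operator image a release is actually running (a sketch; the deployment name follows the Helm release name, so adjust as needed):

    kubectl -n gpu-operator get deployment gpu-operator \
      -o jsonpath='{.spec.template.spec.containers[0].image}{"\n"}'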

@berlincount commented Jul 30, 2023

I'm also running into this issue ... on both release-23.03 and master branches. microk8s v1.27.2 on Ubuntu 22.04.2 LTS.

@acesir commented Aug 1, 2023

> I'm also running into this issue ... on both release-23.03 and master branches. microk8s v1.27.2 on Ubuntu 22.04.2 LTS.

Same error for us with EKS 1.27 and Ubuntu 22

@shnigam2

Same error with release 23.3.1. Any solution?

@robjcook commented Aug 13, 2024

Also running into this on Amazon Linux 2. Any known solution or workaround, or is something missing in the docs? Trying to override the API version or look at the DaemonSet values next.

release v24.6.1 - nvcr.io/nvidia/gpu-operator:devel-ubi8

ERROR   controller.clusterpolicy-controller     Reconciler error        {"name": "cluster-policy", "namespace": "", "error": "no matches for kind \"RuntimeClass\" in version \"node.k8s.io/v1beta1\""}

kubectl describe node GPU-NODE | grep system
feature.node.kubernetes.io/system-os_release.ID=amzn
feature.node.kubernetes.io/system-os_release.VERSION_ID=2
feature.node.kubernetes.io/system-os_release.VERSION_ID.major=2

node:
yum list installed | grep kernel
kernel.x86_64 5.10.220-209.869.amzn2 @amzn2extra-kernel-5.10
kernel-devel.x86_64 5.10.220-209.869.amzn2 @amzn2extra-kernel-5.10
kernel-headers.x86_64 5.10.220-209.869.amzn2 @amzn2extra-kernel-5.10

@tariq1890 (Contributor)

@robjcook it looks like you have deployed the local helm chart that is checked into the gpu-operator main branch. We don't recommend using that helm chart.

Please use the Helm chart from the official Helm repo, as instructed here.
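
For reference, the install flow from the official repo looks roughly like this (a sketch based on the getting-started page linked earlier in the thread; check that page for the current flags):

    helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
    helm repo update
    helm install --wait gpu-operator \
      -n gpu-operator --create-namespace \
      nvidia/gpu-operator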

@robjcook commented Oct 11, 2024

Worked past a few node toleration issues and switched to the Helm chart from the official Helm repo.

Now running into an issue where the operator seems to be looking for an image that does not exist and fails to pull:

ImagePullBackOff (Back-off pulling image "nvcr.io/nvidia/driver:550.90.07-amzn2")

Which image do you recommend for an Amazon Linux 2 node, and where do I specify it instead of letting the operator derive it dynamically from the node?

Edit: after digging through the documentation, I'm looking into this now:

https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/precompiled-drivers.html
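
On the "where to specify" question above: the driver image pieces live under the chart's driver values, and the operator appends the OS suffix it derives from the NFD labels (here -amzn2), so the composed tag must actually exist in the registry (a sketch; the value names are from the chart's values.yaml and the version is only an example):

    helm upgrade --install gpu-operator nvidia/gpu-operator -n gpu-operator \
      --set driver.repository=nvcr.io/nvidia \
      --set driver.image=driver \
      --set driver.version=550.90.07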
