GPU operator DKMS build failure on 22.04 #303

Open
VariableDeclared opened this issue Oct 4, 2024 · 1 comment

Comments

@VariableDeclared
Contributor

Summary

When deploying MicroK8s on an AWS machine running Ubuntu 22.04, enabling the GPU addon fails with a DKMS compile error:

/usr/src/nvidia-535.129.03/kernel/nvidia-uvm/uvm_perf_events_test.c: In function 'test_events':
/usr/src/nvidia-535.129.03/kernel/nvidia-uvm/uvm_perf_events_test.c:83:1: warning: the frame size of 1048 bytes is larger than 1024 bytes [-Wframe-larger-than=]
   83 | }
      | ^
/usr/src/nvidia-535.129.03/kernel/nvidia-uvm/uvm_va_block.c: In function 'uvm_va_block_check_logical_permissions':
/usr/src/nvidia-535.129.03/kernel/nvidia-uvm/uvm_va_block.c:10755:60: warning: implicit conversion from 'uvm_fault_type_t' to 'uvm_fault_access_type_t' [-Wenum-conversion]
10755 |     uvm_prot_t access_prot = uvm_fault_access_type_to_prot(access_type);
      |                                                            ^~~~~~~~~~~
/usr/src/nvidia-535.129.03/kernel/nvidia-uvm/uvm_va_block.c: In function 'block_cpu_fault_locked':
/usr/src/nvidia-535.129.03/kernel/nvidia-uvm/uvm_va_block.c:10890:53: warning: implicit conversion from 'uvm_fault_access_type_t' to 'uvm_fault_type_t' [-Wenum-conversion]
10890 |                                                     fault_access_type,
      |                                                     ^~~~~~~~~~~~~~~~~
make[2]: *** [/usr/src/linux-headers-6.8.0-1015-aws/Makefile:1925: /usr/src/nvidia-535.129.03/kernel] Error 2
make[1]: *** [Makefile:240: __sub-make] Error 2
make: *** [Makefile:82: modules] Error 2
Stopping NVIDIA persistence daemon...
Unloading NVIDIA driver kernel modules...
Unmounting NVIDIA driver rootfs...
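
The excerpt above shows only compiler warnings before make aborts, so the line that actually breaks the build is probably further up in the log. As a rough sketch, a small filter can separate hard failures from warnings. The function name `dkms_fatal_lines` is hypothetical and not part of MicroK8s or the GPU operator.

```shell
# Hypothetical helper: print only the lines of a build log that indicate
# a hard failure (compiler "error:" lines and make "*** ... Error N" aborts),
# skipping warnings such as -Wframe-larger-than and -Wenum-conversion.
# Reads the files given as arguments, or stdin when none are given.
dkms_fatal_lines() {
    grep -E ' (fatal )?error: |\*\*\* .*Error [0-9]+' "$@"
}

# Usage (assumption: the driver pod log is piped in; replace the pod name
# placeholder with the failing driver pod):
#   microk8s kubectl -n gpu-operator-resources logs <driver-pod> | dkms_fatal_lines
```

On the log shown here it would print only the three `make` lines, which confirms that the real compile error is not in the pasted excerpt.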

This is likely because the older operator deploys an older driver version that is missing the correct signatures for later kernels. With the latest operator, the deployment succeeds:
microk8s enable gpu --version 24.6.2

gpu-operator-resources   gpu-operator-node-feature-discovery-worker-pntfz   1/1   Running     0   9m3s
gpu-operator-resources   gpu-operator-node-feature-discovery-worker-xcgxn   1/1   Running     0   9m3s
gpu-operator-resources   gpu-operator-node-feature-discovery-worker-xxdlt   1/1   Running     0   9m3s
gpu-operator-resources   nvidia-container-toolkit-daemonset-hv4hc           1/1   Running     0   8m38s
gpu-operator-resources   nvidia-cuda-validator-cpkb7                        0/1   Completed   0   3m54s
gpu-operator-resources   nvidia-dcgm-exporter-s762v                         1/1   Running     0   8m38s
gpu-operator-resources   nvidia-device-plugin-daemonset-lh97z               1/1   Running     0   8m38s
gpu-operator-resources   nvidia-driver-daemonset-t84r4                      1/1   Running     0   8m44s
gpu-operator-resources   nvidia-operator-validator-8cnnk                    1/1   Running     0   8m38s
ingress                  nginx-ingress-microk8s-controller-f5v8r            1/1   Running     0   85m
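
To check a listing like the one above without reading every row, the following sketch prints any pod in a namespace that is neither Running nor Completed. The function name `unhealthy_pods` is hypothetical, and the column order (namespace, name, ready, status, restarts, age) is assumed to match the listing above.

```shell
# Hypothetical helper: read pod rows on stdin in the column order
# NAMESPACE NAME READY STATUS RESTARTS AGE and print "name status" for every
# pod in the given namespace whose status is neither Running nor Completed.
unhealthy_pods() {
    awk -v ns="$1" '$1 == ns && $4 != "Running" && $4 != "Completed" { print $2, $4 }'
}

# Usage (assumption: pods are listed across all namespaces):
#   microk8s kubectl get pods -A | unhealthy_pods gpu-operator-resources
```

For the listing above it prints nothing, since every pod in `gpu-operator-resources` is either Running or Completed.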

inspection-report-20241004_143256.tar.gz

Reproduction Steps

  1. Deploy a GPU-enabled machine: juju add-machine --constraints='instance-type=g4dn.xlarge root-disk=100G'
  2. Run microk8s enable gpu
  3. The daemonset crashes with a DKMS compile error
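
When reproducing, it may help to record which kernel headers and driver version the failing build used, since the suspected cause is a mismatch between the two. A minimal sketch that pulls both out of the build log, assuming the `/usr/src/linux-headers-<version>/` and `/usr/src/nvidia-<version>/` paths seen in the log above (the function name `build_versions` is hypothetical):

```shell
# Hypothetical helper: extract the kernel headers version and the NVIDIA
# driver version from the /usr/src paths that appear in a DKMS build log.
# Reads the files given as arguments, or stdin when none are given.
# A line containing both paths is reported as a kernel line only.
build_versions() {
    sed -n -E \
        -e 's|.*/usr/src/linux-headers-([^/]+)/.*|kernel \1|p' \
        -e 's|.*/usr/src/nvidia-([0-9.]+)/.*|driver \1|p' "$@" | sort -u
}

# Usage (assumption: the log was saved to build.log, a hypothetical file name):
#   build_versions build.log
```

For the log in this report, the paths correspond to driver 535.129.03 and kernel 6.8.0-1015-aws.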

Introspection Report

Can you suggest a fix?

Change the default GPU operator version to 24.6.2:

https://github.com/canonical/microk8s-core-addons/blob/main/addons/nvidia/enable#L216

Are you interested in contributing with a fix?

@VariableDeclared
Contributor Author

VariableDeclared commented Oct 4, 2024

Opened PR: #305

VariableDeclared added a commit to Barteus/demo-aws-mk8s-ckf-mlflow that referenced this issue Oct 7, 2024