When deploying MicroK8s on an Ubuntu 22.04 AWS machine, a DKMS compile error is thrown:
/usr/src/nvidia-535.129.03/kernel/nvidia-uvm/uvm_perf_events_test.c: In function 'test_events':
/usr/src/nvidia-535.129.03/kernel/nvidia-uvm/uvm_perf_events_test.c:83:1: warning: the frame size of 1048 bytes is larger than 1024 bytes [-Wframe-larger-than=]
83 | }
| ^
/usr/src/nvidia-535.129.03/kernel/nvidia-uvm/uvm_va_block.c: In function 'uvm_va_block_check_logical_permissions':
/usr/src/nvidia-535.129.03/kernel/nvidia-uvm/uvm_va_block.c:10755:60: warning: implicit conversion from 'uvm_fault_type_t' to 'uvm_fault_access_type_t' [-Wenum-conversion]
10755 | uvm_prot_t access_prot = uvm_fault_access_type_to_prot(access_type);
| ^~~~~~~~~~~
/usr/src/nvidia-535.129.03/kernel/nvidia-uvm/uvm_va_block.c: In function 'block_cpu_fault_locked':
/usr/src/nvidia-535.129.03/kernel/nvidia-uvm/uvm_va_block.c:10890:53: warning: implicit conversion from 'uvm_fault_access_type_t' to 'uvm_fault_type_t' [-Wenum-conversion]
10890 | fault_access_type,
| ^~~~~~~~~~~~~~~~~
make[2]: *** [/usr/src/linux-headers-6.8.0-1015-aws/Makefile:1925: /usr/src/nvidia-535.129.03/kernel] Error 2
make[1]: *** [Makefile:240: __sub-make] Error 2
make: *** [Makefile:82: modules] Error 2
Stopping NVIDIA persistence daemon...
Unloading NVIDIA driver kernel modules...
Unmounting NVIDIA driver rootfs...
This is likely due to the older operator deploying older driver versions that lack the correct signatures for later kernels. Deploying with the latest operator succeeds:
microk8s enable gpu --version 24.6.2
NAMESPACE              NAME                                             READY  STATUS     RESTARTS  AGE
gpu-operator-resources gpu-operator-node-feature-discovery-worker-pntfz 1/1 Running 0 9m3s
gpu-operator-resources gpu-operator-node-feature-discovery-worker-xcgxn 1/1 Running 0 9m3s
gpu-operator-resources gpu-operator-node-feature-discovery-worker-xxdlt 1/1 Running 0 9m3s
gpu-operator-resources nvidia-container-toolkit-daemonset-hv4hc 1/1 Running 0 8m38s
gpu-operator-resources nvidia-cuda-validator-cpkb7 0/1 Completed 0 3m54s
gpu-operator-resources nvidia-dcgm-exporter-s762v 1/1 Running 0 8m38s
gpu-operator-resources nvidia-device-plugin-daemonset-lh97z 1/1 Running 0 8m38s
gpu-operator-resources nvidia-driver-daemonset-t84r4 1/1 Running 0 8m44s
gpu-operator-resources nvidia-operator-validator-8cnnk 1/1 Running 0 8m38s
ingress nginx-ingress-microk8s-controller-f5v8r 1/1 Running 0 85m
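With the operator pods healthy, a quick smoke test can confirm that the device plugin actually advertises the GPU to workloads. This is a minimal sketch, not taken from the issue: the pod name, CUDA image tag, and file path are all illustrative.

```shell
# Write a pod manifest that requests one GPU and runs nvidia-smi once.
cat > /tmp/gpu-smoke-test.yaml <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvidia/cuda:12.2.0-base-ubuntu22.04
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1  # resource exposed by the nvidia-device-plugin daemonset
EOF

# Apply it only where a MicroK8s cluster is actually available.
if command -v microk8s >/dev/null 2>&1; then
  microk8s kubectl apply -f /tmp/gpu-smoke-test.yaml
  microk8s kubectl wait pod/gpu-smoke-test \
    --for=jsonpath='{.status.phase}'=Succeeded --timeout=120s
  microk8s kubectl logs gpu-smoke-test  # should print the nvidia-smi table
fi
```

If the driver daemonset built correctly, the logs show the familiar `nvidia-smi` table instead of a scheduling error about `nvidia.com/gpu` being unavailable.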
inspection-report-20241004_143256.tar.gz
Reproduction Steps
juju add-machine --constraints='instance-type=g4dn.xlarge root-disk=100G'
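The single juju command above can be expanded into a fuller reproduction. The sketch below wraps the steps in a function so nothing runs unintentionally; the machine id, snap invocation, and log locations are assumptions, not taken from the issue.

```shell
# Sketch of the reproduction, assuming a fresh Juju-managed AWS model.
reproduce_dkms_failure() {
  # 1. Add a GPU instance (g4dn.xlarge carries an NVIDIA T4).
  juju add-machine --constraints='instance-type=g4dn.xlarge root-disk=100G'

  # 2. SSH into it (machine id 0 is illustrative) and install MicroK8s.
  juju ssh 0 -- sudo snap install microk8s --classic

  # 3. Enable the GPU addon with the old default operator version;
  #    the DKMS build of the 535.129.03 driver fails against the
  #    6.8 AWS kernel, as shown in the log above.
  juju ssh 0 -- sudo microk8s enable gpu

  # 4. The compile errors surface in the driver daemonset pod logs.
  juju ssh 0 -- sudo microk8s kubectl logs -n gpu-operator-resources \
    -l app=nvidia-driver-daemonset
}
```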
Introspection Report
Can you suggest a fix?
Change the default version to 24.6.2
https://github.com/canonical/microk8s-core-addons/blob/main/addons/nvidia/enable#L216
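Until the default is bumped, pinning the version at enable time works around the problem; the eventual fix is a one-line change at the linked location. A hypothetical sketch of that change (the variable name and old value are illustrative; see the linked line for the real ones):

```diff
 # addons/nvidia/enable (hypothetical variable name)
-GPU_OPERATOR_VERSION="<old default>"
+GPU_OPERATOR_VERSION="24.6.2"
```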
Are you interested in contributing a fix?