GPU operator DKMS build failure on 22.04 #303

VariableDeclared · 2024-10-04T14:35:12Z

Summary

When deploying MicroK8s on a GPU-enabled AWS machine running Ubuntu 22.04, the driver build fails with a DKMS compile error:

/usr/src/nvidia-535.129.03/kernel/nvidia-uvm/uvm_perf_events_test.c: In function 'test_events':
/usr/src/nvidia-535.129.03/kernel/nvidia-uvm/uvm_perf_events_test.c:83:1: warning: the frame size of 1048 bytes is larger than 1024 bytes [-Wframe-larger-than=]
   83 | }
      | ^
/usr/src/nvidia-535.129.03/kernel/nvidia-uvm/uvm_va_block.c: In function 'uvm_va_block_check_logical_permissions':
/usr/src/nvidia-535.129.03/kernel/nvidia-uvm/uvm_va_block.c:10755:60: warning: implicit conversion from 'uvm_fault_type_t' to 'uvm_fault_access_type_t' [-Wenum-conversion]
10755 |     uvm_prot_t access_prot = uvm_fault_access_type_to_prot(access_type);
      |                                                            ^~~~~~~~~~~
/usr/src/nvidia-535.129.03/kernel/nvidia-uvm/uvm_va_block.c: In function 'block_cpu_fault_locked':
/usr/src/nvidia-535.129.03/kernel/nvidia-uvm/uvm_va_block.c:10890:53: warning: implicit conversion from 'uvm_fault_access_type_t' to 'uvm_fault_type_t' [-Wenum-conversion]
10890 |                                                     fault_access_type,
      |                                                     ^~~~~~~~~~~~~~~~~
make[2]: *** [/usr/src/linux-headers-6.8.0-1015-aws/Makefile:1925: /usr/src/nvidia-535.129.03/kernel] Error 2
make[1]: *** [Makefile:240: __sub-make] Error 2
make: *** [Makefile:82: modules] Error 2
Stopping NVIDIA persistence daemon...
Unloading NVIDIA driver kernel modules...
Unmounting NVIDIA driver rootfs...

This is likely because the older operator deploys an older driver version (535.129.03 in the log above) that is missing the correct signatures for later kernels (6.8.0-1015-aws here). Deploying with the latest operator succeeds:
microk8s enable gpu --version 24.6.2

gpu-operator-resources gpu-operator-node-feature-discovery-worker-pntfz 1/1 Running 0 9m3s
gpu-operator-resources gpu-operator-node-feature-discovery-worker-xcgxn 1/1 Running 0 9m3s
gpu-operator-resources gpu-operator-node-feature-discovery-worker-xxdlt 1/1 Running 0 9m3s
gpu-operator-resources nvidia-container-toolkit-daemonset-hv4hc 1/1 Running 0 8m38s
gpu-operator-resources nvidia-cuda-validator-cpkb7 0/1 Completed 0 3m54s
gpu-operator-resources nvidia-dcgm-exporter-s762v 1/1 Running 0 8m38s
gpu-operator-resources nvidia-device-plugin-daemonset-lh97z 1/1 Running 0 8m38s
gpu-operator-resources nvidia-driver-daemonset-t84r4 1/1 Running 0 8m44s
gpu-operator-resources nvidia-operator-validator-8cnnk 1/1 Running 0 8m38s
ingress nginx-ingress-microk8s-controller-f5v8r 1/1 Running 0 85m
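The workaround above can be sketched as a small script. This is only an illustration under assumptions: the helper names `count_unhealthy_pods` and `apply_workaround` are hypothetical, and disabling the addon before re-enabling it is an assumed step that the report does not describe. The pod check relies on STATUS being the fourth column of `kubectl get pods -A --no-headers`, as in the listing above.

```shell
#!/usr/bin/env bash
# Sketch only: helper names are hypothetical, not part of MicroK8s.

# Read a `kubectl get pods -A --no-headers` listing on stdin and print the
# number of pods whose STATUS (4th column) is neither Running nor Completed.
count_unhealthy_pods() {
  awk '$4 != "Running" && $4 != "Completed" { n++ } END { print n + 0 }'
}

# Re-enable the GPU addon with the operator version from this report, then
# print how many pods are not healthy. Not called automatically.
apply_workaround() {
  # Assumption: the addon was already enabled with the older default operator.
  microk8s disable gpu
  microk8s enable gpu --version 24.6.2
  microk8s kubectl get pods -A --no-headers | count_unhealthy_pods
}
```

Calling `apply_workaround` prints a count immediately after enabling. The operator pods took several minutes to settle in the listing above, so the count may need to be re-checked later.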

inspection-report-20241004_143256.tar.gz

Reproduction Steps

1. Deploy a GPU-enabled machine: juju add-machine --constraints='instance-type=g4dn.xlarge root-disk=100G'
2. Run: microk8s enable gpu
3. The driver daemonset crashes with a DKMS compile error.
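To confirm that a crash matches this report, the driver logs can be checked for the `make ... Error N` lines shown in the Summary. The function below is a minimal sketch. The name `has_dkms_build_failure` is hypothetical, and the daemonset name in the usage comment is inferred from the pod names in this issue rather than confirmed.

```shell
#!/usr/bin/env bash
# Sketch only: the function name is hypothetical.

# Return 0 if the log text on stdin contains a make failure line such as
#   make[2]: *** [...] Error 2
#   make: *** [Makefile:82: modules] Error 2
has_dkms_build_failure() {
  grep -Eq '^make(\[[0-9]+\])?: \*\*\* .* Error [0-9]+$'
}

# Example usage (assumes the namespace and daemonset name seen in this issue):
#   microk8s kubectl logs -n gpu-operator-resources daemonset/nvidia-driver-daemonset \
#     | has_dkms_build_failure && echo "DKMS build failed"
```

The pattern is anchored at the start of the line, so it will not match if the log lines carry timestamps or other prefixes.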

Introspection Report

Can you suggest a fix?

Change the default GPU operator version to 24.6.2:

https://github.com/canonical/microk8s-core-addons/blob/main/addons/nvidia/enable#L216

Are you interested in contributing with a fix?


VariableDeclared · 2024-10-04T14:37:15Z

Opened PR: #305

This was referenced Oct 4, 2024

Update GPU operator version #304

Closed

Update Default GPU Operator Version #305

Merged

VariableDeclared added a commit to Barteus/demo-aws-mk8s-ckf-mlflow that referenced this issue Oct 7, 2024

Update README.md

5619205

Add workaround for canonical/microk8s-core-addons#303

NohaIhab mentioned this issue Oct 9, 2024

Try the NIMs and KServe guide canonical/bundle-kubeflow#1077

Closed
