Error on PCI Passthrough using new L40 Openshift 4.1x #548

clrfuerst · 2023-07-11T16:20:48Z

1. Quick Debug Checklist

OpenShift 4.10
NVIDIA GPU Operator 23.3.2
OpenShift Virtualization 4.10.9

kubevirt-hyperconverged

spec:
  permittedHostDevices:
    pciHostDevices:
    - resourceName: "nvidia.com/GA102GL_A40"
      pciDeviceSelector: "10DE:2235"
      externalResourceProvider: true
    - resourceName: "nvidia.com/26b5"
      pciDeviceSelector: "10DE:26B5"
      externalResourceProvider: true

oc describe node XXXX
Capacity:
  nvidia.com/26b5: 1
Allocatable:
  nvidia.com/26b5: 1

1. Issue or feature description

Getting the following error when trying to use an L40 GPU with PCI passthrough to a virtual machine. The GPU is not assigned and the VM does not start.

From the nvidia-sandbox-device-plugin-daemonset:
2023/07/10 19:41:03 Nvidia device 0000:e2:00.0
2023/07/10 19:41:03 Iommu Group 128
2023/07/10 19:41:03 Device Id 26b5
2023/07/10 19:41:03 Error accessing file path "/sys/bus/mdev/devices": lstat /sys/bus/mdev/devices: no such file or directory
2023/07/10 19:41:03 Iommu Map map[128:[{0000:e2:00.0}]]
2023/07/10 19:41:03 Device Map map[26b5:[128]]
2023/07/10 19:41:03 vGPU Map map[]
2023/07/10 19:41:03 GPU vGPU Map map[]
2023/07/10 19:41:03 Error: Could not find device name for device id: 26b5
2023/07/10 19:41:03 DP Name 26b5
2023/07/10 19:41:03 Devicename 26b5
2023/07/10 19:41:03 26b5 Device plugin server ready

From the virt-launcher pod trying to allocate the device:
server error. command SyncVMI failed: "failed to create GPU host-devices: the number of GPU/s do not match the number of devices:\nGPU: [{26b5 nvidia.com/26b5 }]\nDevice: []"

{"component":"virt-launcher","level":"warning","msg":"PCI_RESOURCE_NVIDIA_COM_26B5 not set for resource nvidia.com/26b5","pos":"addresspool.go:50","timestamp":"2023-07-11T16:11:34.667518Z"}
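The warning above suggests that virt-launcher looks up an environment variable whose name is derived from the resource name. A minimal Python sketch of that mapping follows. It is inferred only from the pair visible in the log (`nvidia.com/26b5` and `PCI_RESOURCE_NVIDIA_COM_26B5`); the function name `resource_env_var` is hypothetical and this is not KubeVirt's actual code.

```python
def resource_env_var(resource_name: str, prefix: str = "PCI_RESOURCE") -> str:
    """Derive the environment variable name that appears in the warning.

    Assumption (inferred from the log line only): the resource name is
    upper-cased, and "/" and "." are replaced with "_".
    """
    name = resource_name.upper().replace("/", "_").replace(".", "_")
    return f"{prefix}_{name}"


if __name__ == "__main__":
    # Resource name taken from the HyperConverged configuration in this issue.
    print(resource_env_var("nvidia.com/26b5"))
```

If this mapping holds, a missing variable means the device plugin did not hand the PCI address to the pod under that resource name, which would be consistent with the empty `Device: []` list in the SyncVMI error.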

2. Steps to reproduce the issue

Try to launch a VM with an L40 GPU, rather than an A40 GPU, using PCI passthrough.

cdesiniotis · 2023-07-14T03:34:08Z

@clrfuerst can you try the latest kubevirt-gpu-device-plugin image, v1.2.2? Set sandboxDevicePlugin.version=v1.2.2 in ClusterPolicy. Note that the PCI ID database was updated in v1.2.2, so the L40 GPU should now be named by its device name rather than its device ID. You will have to update your HyperConverged configuration accordingly.
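For readers applying this suggestion, here is a rough Python sketch that builds a JSON merge patch for the `sandboxDevicePlugin.version` field and prints an `oc patch` command. The ClusterPolicy name `gpu-cluster-policy` and the helper name `sandbox_plugin_patch` are assumptions that do not appear in this thread; substitute the name of the ClusterPolicy on your cluster.

```python
import json

# Assumption: the ClusterPolicy name is not given in this thread.
CLUSTER_POLICY_NAME = "gpu-cluster-policy"


def sandbox_plugin_patch(version: str) -> str:
    """Return a JSON merge patch that sets spec.sandboxDevicePlugin.version."""
    return json.dumps({"spec": {"sandboxDevicePlugin": {"version": version}}})


if __name__ == "__main__":
    patch = sandbox_plugin_patch("v1.2.2")
    # Print the command instead of running it, so it can be reviewed first.
    print(f"oc patch clusterpolicy {CLUSTER_POLICY_NAME} --type merge -p '{patch}'")
```

Once the new plugin image is running, `oc describe node` should list the GPU under its new resource name. That name, rather than `nvidia.com/26b5`, is what the `resourceName` entry in the HyperConverged configuration needs to match.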

clrfuerst · 2023-07-14T15:02:00Z

Thank you for the pointer; this seems to have done the trick.

clrfuerst closed this as completed Jul 14, 2023
