Error on PCI Passthrough using new L40 Openshift 4.1x #548

Closed
clrfuerst opened this issue Jul 11, 2023 · 2 comments

@clrfuerst

1. Quick Debug Checklist

  • OpenShift 4.10
  • NVIDIA GPU Operator 23.3.2
  • OpenShift Virtualization 4.10.9

kubevirt-hyperconverged

spec:
  permittedHostDevices:
    pciHostDevices:
    - resourceName: "nvidia.com/GA102GL_A40"
      pciDeviceSelector: "10DE:2235"
      externalResourceProvider: true
    - resourceName: "nvidia.com/26b5"
      pciDeviceSelector: "10DE:26B5"
      externalResourceProvider: true

oc describe node XXXX
Capacity:
  nvidia.com/26b5: 1
Allocatable:
  nvidia.com/26b5: 1

1. Issue or feature description

I get the following errors when trying to pass an L40 GPU through to a virtual machine with PCI passthrough. The GPU is not assigned and the VM does not start.

Log from the nvidia-sandbox-device-plugin-daemonset:
2023/07/10 19:41:03 Nvidia device 0000:e2:00.0
2023/07/10 19:41:03 Iommu Group 128
2023/07/10 19:41:03 Device Id 26b5
2023/07/10 19:41:03 Error accessing file path "/sys/bus/mdev/devices": lstat /sys/bus/mdev/devices: no such file or directory
2023/07/10 19:41:03 Iommu Map map[128:[{0000:e2:00.0}]]
2023/07/10 19:41:03 Device Map map[26b5:[128]]
2023/07/10 19:41:03 vGPU Map map[]
2023/07/10 19:41:03 GPU vGPU Map map[]
2023/07/10 19:41:03 Error: Could not find device name for device id: 26b5
2023/07/10 19:41:03 DP Name 26b5
2023/07/10 19:41:03 Devicename 26b5
2023/07/10 19:41:03 26b5 Device plugin server ready

Errors from the virt-launcher pod trying to allocate the device:
server error. command SyncVMI failed: "failed to create GPU host-devices: the number of GPU/s do not match the number of devices:\nGPU: [{26b5 nvidia.com/26b5 }]\nDevice: []"

{"component":"virt-launcher","level":"warning","msg":"PCI_RESOURCE_NVIDIA_COM_26B5 not set for resource nvidia.com/26b5","pos":"addresspool.go:50","timestamp":"2023-07-11T16:11:34.667518Z"}

2. Steps to reproduce the issue

Launch a VM with an L40 GPU using PCI passthrough. The same setup works with an A40 GPU.

@cdesiniotis
Contributor

@clrfuerst can you try the latest kubevirt-gpu-device-plugin image, v1.2.2? Set sandboxDevicePlugin.version=v1.2.2 in ClusterPolicy. Note that the PCI ID database was updated in v1.2.2, so the L40 GPU should now be named with its device name rather than its device ID. You will have to update your HyperConverged configuration accordingly.
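
For reference, a rough sketch of what the two changes could look like. Several names here are assumptions, not taken from this thread: the ClusterPolicy name `gpu-cluster-policy`, the `openshift-cnv` namespace, and above all the resource name `nvidia.com/AD102GL_L40`, which is only a guess modeled on the existing `GA102GL_A40` entry. After the upgrade, check `oc describe node` and use whatever resource name the node actually advertises.

```yaml
# ClusterPolicy fragment: pin the sandbox device plugin to v1.2.2.
# The object name "gpu-cluster-policy" is an assumption; use your own.
apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  name: gpu-cluster-policy
spec:
  sandboxDevicePlugin:
    version: v1.2.2
---
# HyperConverged fragment: replace the device-ID based entry for the L40.
# The namespace is an assumption (the usual one for OpenShift Virtualization).
apiVersion: hco.kubevirt.io/v1beta1
kind: HyperConverged
metadata:
  name: kubevirt-hyperconverged
  namespace: openshift-cnv
spec:
  permittedHostDevices:
    pciHostDevices:
    # ASSUMED resource name; confirm it against the node's Capacity/Allocatable.
    - resourceName: "nvidia.com/AD102GL_L40"
      pciDeviceSelector: "10DE:26B5"
      externalResourceProvider: true
```

Any VM spec that still requests `nvidia.com/26b5` would need the same rename, since that resource should no longer be advertised once the plugin reports the device by name.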

@clrfuerst
Author

Thank you for the pointer, this seems to have done the trick.
