QAT plugin crashes when device is detected, but not fully configured #1571
Comments
The operator is not HW aware; it just deploys the device plugin DaemonSets based on user-provided resources. The CRD has two fields relevant to your questions:
There could be another NFD label rule for the VF module (assuming the related features are all configured as modules on distro kernels or loaded as DKMS), and that label can be added to the NFD label rule can also check PCI device (On a quick check I did not find when that was added, but it's there, e.g. in NFD v0.12 docs.)
Thanks, this is great information to have. Combining these, it is possible to ignore certain nodes, and also to ignore nodes that are not configured (such as for SR-IOV). Should this be added to some section of the documentation (such as the end of This section of the README in QAT)?
Can you do some testing with such rules? If that works fine, the project's NFD rules overlay can be updated, and the QAT example changed to use the new label. I think the relation between the label used in nodeSelector, the driver variant selected in the same YAML file, and what options there are for those, is better documented directly in the example file comments. The README can then have a more generic note about that, e.g. "Example file documents relations between labels given for nodeSelector, and the requested QAT attributes".
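To illustrate the relation between the nodeSelector label and the driver variant mentioned above, here is a minimal sketch of a QatDevicePlugin custom resource. It is not the project's example file. The label `feature.node.kubernetes.io/qat-vfio-ready` is hypothetical, and the image tag and driver lists are assumptions to be replaced with the values matching the deployed release and hardware.

```yaml
# Sketch only: field values are assumptions, not the project's shipped sample.
apiVersion: deviceplugin.intel.com/v1
kind: QatDevicePlugin
metadata:
  name: qatdeviceplugin-sample
spec:
  # Image tag is an assumption; use the release you actually deploy.
  image: intel/intel-qat-plugin:0.28.0
  # Driver variant the plugin binds VFs to. The nodeSelector below should
  # only match nodes where this driver is actually available.
  dpdkDriver: vfio-pci
  # VF kernel drivers to look for; this list is illustrative.
  kernelVfDrivers:
    - c6xxvf
    - 4xxxvf
  maxNumDevices: 1
  logLevel: 4
  nodeSelector:
    # Hypothetical label, assumed to be set by an NFD rule that checks
    # for both the QAT device and the vfio-pci driver.
    feature.node.kubernetes.io/qat-vfio-ready: "true"
```

With a selector like this, the operator-created DaemonSet would only schedule plugin pods on nodes carrying the label, so nodes with an installed but unconfigured device are skipped.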
We have plans to improve the documentation on HW initialization and troubleshooting (#1555) and we'll keep this feedback in mind. On the error itself:
'Permission denied' is often linked with the error where the plugin runs on an Ubuntu node which has AppArmor enabled. For that, the "fix" is to deploy the plugin with
Could we just print the error and not abort? It would improve the feedback loop for cases where a node should get QAT resources but is incorrectly configured.
For the NFD rule, we could also add checks for vfio-pci to be either loaded as a module (kernel.loadedmodule) or built-in (kernel.enabledmodule).
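A rule along those lines could look roughly like the following NodeFeatureRule. This is a sketch under stated assumptions, not the project's shipped rule: the rule name and the label `feature.node.kubernetes.io/qat-vfio-ready` are hypothetical, the PCI device IDs are illustrative, and the module spellings (`vfio_pci` for loaded modules, `vfio-pci` for built-ins) should be verified against what NFD reports on the target nodes.

```yaml
# Sketch only: names, label, and device IDs are assumptions for illustration.
apiVersion: nfd.k8s-sigs.io/v1alpha1
kind: NodeFeatureRule
metadata:
  name: qat-vfio-ready            # hypothetical name
spec:
  rules:
    - name: "QAT device present and vfio-pci available"
      labels:
        # Hypothetical label; pick whatever the nodeSelector will use.
        "feature.node.kubernetes.io/qat-vfio-ready": "true"
      # All entries in matchFeatures must match.
      matchFeatures:
        - feature: pci.device
          matchExpressions:
            vendor: {op: In, value: ["8086"]}
            # Illustrative device IDs; adjust to the QAT generations in use.
            device: {op: In, value: ["37c8", "4940"]}
            class: {op: In, value: ["0b40"]}
      # In addition, at least one of these alternatives must match:
      # vfio-pci loaded as a module, or built into the kernel.
      matchAny:
        - matchFeatures:
            - feature: kernel.loadedmodule
              matchExpressions:
                vfio_pci: {op: Exists}
        - matchFeatures:
            - feature: kernel.enabledmodule
              matchExpressions:
                vfio-pci: {op: Exists}
```

Such a rule only confirms that the device and driver are present. It does not verify that VFs have actually been enabled on the device, so a labelled node can still be misconfigured.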
Is this going to be a helpful selector? There are several ways in which SR-IOV/vfio-pci may not be in the configured state needed by the plugin. We can at least hope that with all of these present, the QAT plugin has a chance of initializing successfully and does not render the node unschedulable.
I thought EDIT: with NFD rule providing |
I'm not able to follow this part. A plugin failing won't make the node unschedulable (other than workloads requesting the plugin resources won't land there).
While the error reported above only crashes the pod (which can also lead to a node being labelled as degraded via a health monitor, for example), the (mis)configuration of QAT can cause a fatal node error (until the QAT plugin is removed). (I won't derail this conversation here, but I have seen this happen in my test cluster.) At the minimum we need to skip nodes that don't have a chance to be configured for the QAT plugin (i.e. SR-IOV not properly configured, IOMMU, etc.) and define nodes (via labels) to skip deployment to (for various other reasons, like being a control plane node). So apart from the NFD rules we will set that check the configuration (as above), is it possible to just set the label
Perhaps use the NFD Tainting Feature?
looks good, thanks for the update - can close this
@brgavino thanks for the confirmation, closing |
A QAT device may be present on a node, but may only be configured with the kernel driver - or SR-IOV mode may not have been enabled. The operator scans the available PCI devices but does not check whether any of the further prerequisites described in the QAT plugin pre-requisites section are met (possibly: vfio-pci driver loaded, VFs enabled on the QAT device, etc.).
This could be the case when a QAT device is present in the system but is not meant to be available for node resource allocation or exposed to the cluster, while other nodes may have configured QAT devices available.
It is unclear how to limit the operator's deployment of plugins so that it avoids nodes with installed but unconfigured devices, without disabling deployment of the QAT plugin for the whole cluster. In this case the operator should continue deploying plugins for other detected devices and avoid attempting to deploy plugins to nodes where the device is installed but unconfigured.
Please advise on the correct approach/behavior.
Example failure of the QAT plugin when the device is visible via lspci, but not intended to be configured on the node.