feat: implement MlxResetFW to reset the FW on VF changes #733
Thanks for your PR,
To skip the vendors CIs use one of:
Force-pushed from 1673a95 to 83927ff.
Force-pushed from 83927ff to 05f9ca2.
Force-pushed from 05f9ca2 to 7192b84.
Pull Request Test Coverage Report for Build 10179580977
💛 - Coveralls
Force-pushed from 7192b84 to 76a8f95.
/test-all
Force-pushed from 76a8f95 to f18352a.
Force-pushed from f18352a to 7202ecc.
Force-pushed from 7202ecc to ac6f6ff.
/test-all
Force-pushed from ac6f6ff to 796da5a.
@zeeke @adrianchiris the GH Action ci-triggers change seems to work.
/test-all
Force-pushed from 0bcc6fe to d4986d4.
Force-pushed from d4986d4 to 69b14ce.
/test-all
In general, this looks good.
My only concern is what happens if the reset fails for one of the PCI addresses and we return an error.
Also, if we reset the primary NIC of the system, we lose the API connection needed to reconcile, which can leave the system in a bad state...
    return err
}
if vars.MlxPluginFwReset {
    return p.helpers.MlxResetFW(pciAddressesToReset)
Returning an error here can be problematic if we have multiple PCI addresses and one of them fails (for example, the primary one).
If I understand you correctly, your idea would be to introduce an errs variable, append each error to it, and return it at the end of the function, right?
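For illustration, here is a minimal sketch of that errs pattern, assuming Go 1.20+ for `errors.Join`. The names `resetFW` and `resetAll` and the PCI addresses are hypothetical stand-ins, not the operator's actual helpers; the point is only that every address gets a reset attempt before any failure is reported.

```go
package main

import (
	"errors"
	"fmt"
)

// resetFW is a hypothetical stand-in for a per-device firmware reset.
// It fails for one hard-coded address to simulate a partial failure.
func resetFW(pciAddress string) error {
	if pciAddress == "0000:d8:00.0" {
		return fmt.Errorf("reset failed for %s", pciAddress)
	}
	return nil
}

// resetAll attempts the reset on every address and collects the failures
// instead of returning on the first one.
func resetAll(pciAddresses []string) error {
	var errs []error
	for _, addr := range pciAddresses {
		if err := resetFW(addr); err != nil {
			errs = append(errs, err)
		}
	}
	// errors.Join returns nil when errs is empty.
	return errors.Join(errs...)
}

func main() {
	err := resetAll([]string{"0000:3b:00.0", "0000:d8:00.0", "0000:5e:00.0"})
	fmt.Println(err)
}
```

With this shape, a failure on one device no longer prevents the remaining devices from being reset, and the caller still sees every error that occurred.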
Force-pushed from 69b14ce to 0908678.
Usually the reset of the fw should work. Also the code of the |
Force-pushed from 0908678 to d80add9.
Hi, can you please rebase this one?
LGTM from my side. Just needs a rebase.
Signed-off-by: Tobias Giese <tgiese@nvidia.com>
To not enable the new feature by default we want to add a feature flag first. Signed-off-by: Tobias Giese <tgiese@nvidia.com>
Force-pushed from d80add9 to 08b524e.
Rebased, thanks for the hint. Didn't notice the wrong commits.
/test-all
@SchSeba can you give this one a look? It would be great to get this one merged this week.
LGTM
nice work!
@tobiasgiese @adrianchiris do you know which models fail without this featureGate? We expect the Nvidia DGX H100 will fail to apply mstconfig settings without this change as well.
@SalDaniele this feature originated from specific servers utilizing the Intel 4th generation Xeon "Eagle Stream" platform. Why do you think the DGX H100 will also fail to apply mstconfig settings without this change?
We received a customer case (OCPBUGS-42246) involving the DGX H100 platform with a CX-7 NIC. According to the logs, the card fails to update the number of VFs from the sriov config daemon on reboot, causing a boot loop. A cold boot of the system allows the VFs to be configured as expected.
After further inspection, it is possible that it is due to a conflict between the SNO and the Nvidia Network Operator.
Todos:
mstfwreset
Manual test shows that the implementation is working: