Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: implement MlxResetFW to reset the FW on VF changes #733

Merged
merged 2 commits into from
Aug 12, 2024

Conversation

tobiasgiese
Copy link
Contributor

@tobiasgiese tobiasgiese commented Jul 11, 2024

Todos:

Manual test shows that the implementation is working:

sriov-network-config-daemon 2024-07-11T14:39:36.793959572Z    INFO    mellanox/mellanox_plugin.go:190 mellanox-plugin: resetFW()      {"cmd-args": ["-d", "0000:03:00.0", "-l", "3", "-y", "reset"]}
foo@bar:~$ sudo mlxconfig -d /dev/mst/mt4119_pciconf0 q | grep -e SRIOV_EN -e NUM_OF_VFS
        NUM_OF_VFS                                  4
        SRIOV_EN                                    True(1)

Copy link

Thanks for your PR,
To run vendors CIs use one of:

  • /test-all: To run all tests for all vendors.
  • /test-e2e-all: To run all E2E tests for all vendors.
  • /test-e2e-nvidia-all: To run all E2E tests for NVIDIA vendor.

To skip the vendors CIs use one of:

  • /skip-all: To skip all tests for all vendors.
  • /skip-e2e-all: To skip all E2E tests for all vendors.
  • /skip-e2e-nvidia-all: To skip all E2E tests for NVIDIA vendor.
    Best regards.

Copy link

Thanks for your PR,
To run vendors CIs use one of:

  • /test-all: To run all tests for all vendors.
  • /test-e2e-all: To run all E2E tests for all vendors.
  • /test-e2e-nvidia-all: To run all E2E tests for NVIDIA vendor.

To skip the vendors CIs use one of:

  • /skip-all: To skip all tests for all vendors.
  • /skip-e2e-all: To skip all E2E tests for all vendors.
  • /skip-e2e-nvidia-all: To skip all E2E tests for NVIDIA vendor.
    Best regards.

Copy link

Thanks for your PR,
To run vendors CIs use one of:

  • /test-all: To run all tests for all vendors.
  • /test-e2e-all: To run all E2E tests for all vendors.
  • /test-e2e-nvidia-all: To run all E2E tests for NVIDIA vendor.

To skip the vendors CIs use one of:

  • /skip-all: To skip all tests for all vendors.
  • /skip-e2e-all: To skip all E2E tests for all vendors.
  • /skip-e2e-nvidia-all: To skip all E2E tests for NVIDIA vendor.
    Best regards.

Copy link

Thanks for your PR,
To run vendors CIs use one of:

  • /test-all: To run all tests for all vendors.
  • /test-e2e-all: To run all E2E tests for all vendors.
  • /test-e2e-nvidia-all: To run all E2E tests for NVIDIA vendor.

To skip the vendors CIs use one of:

  • /skip-all: To skip all tests for all vendors.
  • /skip-e2e-all: To skip all E2E tests for all vendors.
  • /skip-e2e-nvidia-all: To skip all E2E tests for NVIDIA vendor.
    Best regards.

Copy link

Thanks for your PR,
To run vendors CIs use one of:

  • /test-all: To run all tests for all vendors.
  • /test-e2e-all: To run all E2E tests for all vendors.
  • /test-e2e-nvidia-all: To run all E2E tests for NVIDIA vendor.

To skip the vendors CIs use one of:

  • /skip-all: To skip all tests for all vendors.
  • /skip-e2e-all: To skip all E2E tests for all vendors.
  • /skip-e2e-nvidia-all: To skip all E2E tests for NVIDIA vendor.
    Best regards.

@coveralls
Copy link

coveralls commented Jul 15, 2024

Pull Request Test Coverage Report for Build 10179580977

Details

  • 1 of 63 (1.59%) changed or added relevant lines in 6 files are covered.
  • 6 unchanged lines in 2 files lost coverage.
  • Overall coverage decreased (-0.2%) to 43.816%

Changes Missing Coverage Covered Lines Changed/Added Lines %
pkg/daemon/daemon.go 1 7 14.29%
pkg/helper/mock/mock_helper.go 0 10 0.0%
pkg/plugins/mellanox/mellanox_plugin.go 0 10 0.0%
pkg/vendors/mellanox/mock/mock_mellanox.go 0 10 0.0%
cmd/sriov-network-config-daemon/start.go 0 13 0.0%
pkg/vendors/mellanox/mellanox.go 0 13 0.0%
Files with Coverage Reduction New Missed Lines %
controllers/drain_controller.go 1 68.06%
pkg/plugins/mellanox/mellanox_plugin.go 5 7.64%
Totals Coverage Status
Change from base Build 10178846427: -0.2%
Covered Lines: 6536
Relevant Lines: 14917

💛 - Coveralls

api/v1/sriovoperatorconfig_types.go Outdated Show resolved Hide resolved
pkg/vars/vars.go Outdated Show resolved Hide resolved
Copy link

Thanks for your PR,
To run vendors CIs use one of:

  • /test-all: To run all tests for all vendors.
  • /test-e2e-all: To run all E2E tests for all vendors.
  • /test-e2e-nvidia-all: To run all E2E tests for NVIDIA vendor.

To skip the vendors CIs use one of:

  • /skip-all: To skip all tests for all vendors.
  • /skip-e2e-all: To skip all E2E tests for all vendors.
  • /skip-e2e-nvidia-all: To skip all E2E tests for NVIDIA vendor.
    Best regards.

@tobiasgiese
Copy link
Contributor Author

/test-all

Copy link

Thanks for your PR,
To run vendors CIs use one of:

  • /test-all: To run all tests for all vendors.
  • /test-e2e-all: To run all E2E tests for all vendors.
  • /test-e2e-nvidia-all: To run all E2E tests for NVIDIA vendor.

To skip the vendors CIs use one of:

  • /skip-all: To skip all tests for all vendors.
  • /skip-e2e-all: To skip all E2E tests for all vendors.
  • /skip-e2e-nvidia-all: To skip all E2E tests for NVIDIA vendor.
    Best regards.

Copy link

Thanks for your PR,
To run vendors CIs use one of:

  • /test-all: To run all tests for all vendors.
  • /test-e2e-all: To run all E2E tests for all vendors.
  • /test-e2e-nvidia-all: To run all E2E tests for NVIDIA vendor.

To skip the vendors CIs use one of:

  • /skip-all: To skip all tests for all vendors.
  • /skip-e2e-all: To skip all E2E tests for all vendors.
  • /skip-e2e-nvidia-all: To skip all E2E tests for NVIDIA vendor.
    Best regards.

Copy link

Thanks for your PR,
To run vendors CIs use one of:

  • /test-all: To run all tests for all vendors.
  • /test-e2e-all: To run all E2E tests for all vendors.
  • /test-e2e-nvidia-all: To run all E2E tests for NVIDIA vendor.

To skip the vendors CIs use one of:

  • /skip-all: To skip all tests for all vendors.
  • /skip-e2e-all: To skip all E2E tests for all vendors.
  • /skip-e2e-nvidia-all: To skip all E2E tests for NVIDIA vendor.
    Best regards.

@tobiasgiese
Copy link
Contributor Author

/test-all

Copy link

Thanks for your PR,
To run vendors CIs use one of:

  • /test-all: To run all tests for all vendors.
  • /test-e2e-all: To run all E2E tests for all vendors.
  • /test-e2e-nvidia-all: To run all E2E tests for NVIDIA vendor.

To skip the vendors CIs use one of:

  • /skip-all: To skip all tests for all vendors.
  • /skip-e2e-all: To skip all E2E tests for all vendors.
  • /skip-e2e-nvidia-all: To skip all E2E tests for NVIDIA vendor.
    Best regards.

@tobiasgiese
Copy link
Contributor Author

@zeeke @adrianchiris the GH Action ci-triggers change seems to work.
After a rebase and push no new comment has been added.
But after closing and reopening there is one. 👍🏻

@tobiasgiese
Copy link
Contributor Author

/test-all

@tobiasgiese tobiasgiese force-pushed the impl-mlx-fw-reset branch 3 times, most recently from 0bcc6fe to d4986d4 Compare July 24, 2024 12:09
@tobiasgiese
Copy link
Contributor Author

/test-all

@tobiasgiese tobiasgiese changed the title [WIP] feat: implement MlxResetFW to reset the FW on VF changes feat: implement MlxResetFW to reset the FW on VF changes Jul 24, 2024
Copy link
Collaborator

@SchSeba SchSeba left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in general looks good.

I only have an issue on what can be the case if one of the pciAddress for reset doesn't work and we return an error.

and if we reset the primary nic of the system we will not have an API connection to reconcile and we can leave the system in a bad state...

pkg/consts/constants.go Show resolved Hide resolved
pkg/daemon/daemon.go Outdated Show resolved Hide resolved
return err
}
if vars.MlxPluginFwReset {
return p.helpers.MlxResetFW(pciAddressesToReset)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

return an error here can be problematic if we have multiple PCIAddress and one them them failed (for example the primary one)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I understand you correctly your idea would be to introduce a errs variable to append the errors and return it at the end of the function, right?

@tobiasgiese
Copy link
Contributor Author

and if we reset the primary nic of the system we will not have an API connection to reconcile and we can leave the system in a bad state...

Usually the reset of the fw should work. Also the code of the mstfwrest binary is tested. Additionally, we have a feature gate top opt-in, which means this is not enabled for the default usage. IMHO we should rely on the robustness of the reset.
Or have you already had experience with the reset and that's why concerns?
WDYT @adrianchiris

Dockerfile.sriov-network-config-daemon Show resolved Hide resolved
cmd/sriov-network-config-daemon/start.go Outdated Show resolved Hide resolved
pkg/daemon/daemon.go Outdated Show resolved Hide resolved
pkg/vendors/mellanox/mellanox.go Show resolved Hide resolved
pkg/vendors/mellanox/mellanox.go Outdated Show resolved Hide resolved
@SchSeba
Copy link
Collaborator

SchSeba commented Aug 4, 2024

Hi can you please rebase this one

@adrianchiris
Copy link
Collaborator

LGTM from my sided. just needs rebase

Signed-off-by: Tobias Giese <tgiese@nvidia.com>
To not enable the new feature by default we want to add a feature flag first.

Signed-off-by: Tobias Giese <tgiese@nvidia.com>
@tobiasgiese
Copy link
Contributor Author

Rebased, thanks for the hint. Didn't noticed the wrong commits

@tobiasgiese
Copy link
Contributor Author

/test-all

@adrianchiris
Copy link
Collaborator

@SchSeba can you give this one a look ? would be great to get this one merged this week.

Copy link
Collaborator

@SchSeba SchSeba left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

nice work!

@SchSeba SchSeba merged commit 88017db into k8snetworkplumbingwg:master Aug 12, 2024
13 of 14 checks passed
@tobiasgiese tobiasgiese deleted the impl-mlx-fw-reset branch August 12, 2024 12:01
@SalDaniele
Copy link
Contributor

@tobiasgiese @adrianchiris do you know which models fail without this featureGate? We expect the Nvidia DGXH100 will fail to apply mstconfig settings without this change as well.

@adrianchiris
Copy link
Collaborator

@SalDaniele this feature originated from specific servers utilizing intel 4th generation Xeon "Eagle Stream" platform.
where upon reboot it does not send PCIe PERST# signal to the NIC.

why do you think DGX H100 will also fail to apply mstconfig settings without this change ?

@SalDaniele
Copy link
Contributor

We received a customer case (OCPBUGS-42246) involving the DGX H100 platform with a cx-7 nic. Inspecting the logs, the card fails to update num vfs from the sriov config daemon on reboot, causing a boot loop. Cold boot of the system allows the vfs to be configured as expected.

@SalDaniele
Copy link
Contributor

After further inspection, it is possible it is due to a conflict between the SNO and Nvidia Network Operator

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants