Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix a couple issues in change_lag_lacp_timer #15778

Open
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

arista-nwolfe
Copy link
Contributor

#14469 was introduced to increase the LACP timeout of EOS neighbors during QOS tests.
This was required because the QOS tests would disable TX credits on the test's dest ports, without TX credits LACP packets won't egress and eventually the LAG will be brought down by the EOS neighbor due to LACP timeout.

This change adds a couple improvements to that existing PR:
-Updated to work with single-asic LCs
-Updated to change the LACP timeout multiplier for all 3 dst_port_id neighbors.

Also made testQosSaiPfcXonLimit use fixture change_lag_lacp_timer

Type of change

  • Bug fix
  • Testbed and Framework(new/improvement)
  • Test case(new/improvement)

Back port request

  • 202012
  • 202205
  • 202305
  • 202311
  • 202405

Updated to work with single-asic LCs
Updated to work on all 3 dst_port_ids
Also made testQosSaiPfcXonLimit use fixture change_lag_lacp_timer
@rlhui rlhui requested a review from wenyiz2021 November 27, 2024 22:22
for port_ch, port_intf in dst_mgfacts['minigraph_portchannels'].items():
for member in port_intf['members']:
if member in dst_interfaces:
lag_names.append( port_ch )
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we get lag_names on all dst device, regardless of single-asic/multi-asic?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some QOS tests send packets to multiple destinations dst_port_id, dst_port_2_id, dst_port_3_id
We disable TX on all 3 of these ports:
https://github.com/sonic-net/sonic-mgmt/blob/master/tests/saitests/py3/sai_qos_tests.py#L2543
And any of these 3 ports could be a LAG so therefor needs to have the LACP multiplier increased on the neighboring device.

@wenyiz2021
Copy link
Contributor

@arista-nwolfe , could you paste the error/failures without this PR?
also can you run it on a T0 topo as well, to make sure it won't fail on single-asic device?

@arista-nwolfe
Copy link
Contributor Author

@arista-nwolfe , could you paste the error/failures without this PR? also can you run it on a T0 topo as well, to make sure it won't fail on single-asic device?

The failure will likely manifest as something like:
AssertionError: unexpectedly PFC counter not increase, after send a few packets to trigger PFC

This is because one of the destination port-channels went down too early causing the packets that we enqueued to be dropped instead of stored in the destination port's buffer.
Without the packet in the buffer we don't hit the xoff threshold and don't trigger PFCs.

@arista-nwolfe
Copy link
Contributor Author

@arista-nwolfe , could you paste the error/failures without this PR? also can you run it on a T0 topo as well, to make sure it won't fail on single-asic device?

Regarding the T0 results, all this code is guarded to only run on broadcom-dnx:
dutTestParams["basicParams"]["platform_asic"] == "broadcom-dnx"
So because we have no broadcom-dnx platforms at the T0 level it should have no effect.

@bingwang-ms
Copy link
Collaborator

@XuChen-MSFT Can you please help review?

for port_ch, port_intf in dst_mgfacts['minigraph_portchannels'].items():
for member in port_intf['members']:
if member in dst_interfaces:
lag_names.append( port_ch )
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible for any two dst ports to be members of the same port channel? Then it would be better to use a set rather than a list as the lag_names variable type.

@wenyiz2021
Copy link
Contributor

@arista-nwolfe , could you paste the error/failures without this PR? also can you run it on a T0 topo as well, to make sure it won't fail on single-asic device?

Regarding the T0 results, all this code is guarded to only run on broadcom-dnx: dutTestParams["basicParams"]["platform_asic"] == "broadcom-dnx" So because we have no broadcom-dnx platforms at the T0 level it should have no effect.

oh make sense, I ignored the dnx check, so it will only impact T2 layer.
This fix will also not impact if all LCs are MA, right? it only impact single-asic&MA combination
functionally LGTM, please resolve Xu Chen's comment.

@arista-nwolfe
Copy link
Contributor Author

@arista-nwolfe , could you paste the error/failures without this PR? also can you run it on a T0 topo as well, to make sure it won't fail on single-asic device?

Regarding the T0 results, all this code is guarded to only run on broadcom-dnx: dutTestParams["basicParams"]["platform_asic"] == "broadcom-dnx" So because we have no broadcom-dnx platforms at the T0 level it should have no effect.

oh make sense, I ignored the dnx check, so it will only impact T2 layer. This fix will also not impact if all LCs are MA, right? it only impact single-asic&MA combination functionally LGTM, please resolve Xu Chen's comment.

The only part that will impact a fully multi-asic system is the increasing of LACP multiplier for dst_port_2_id and dst_port_3_id. Previously it was only doing this for dst_port_id.

@rlhui
Copy link

rlhui commented Dec 13, 2024

/azp run

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@wenyiz2021
Copy link
Contributor

please resolve precommit check

tests/qos/qos_sai_base.py:2604:13: F841 local variable 'src_dut' is assigned to but never used
tests/qos/qos_sai_base.py:2615:42: E201 whitespace after '('
tests/qos/qos_sai_base.py:2615:50: E202 whitespace before ')'

@rlhui
Copy link

rlhui commented Dec 18, 2024

please resolve precommit check

tests/qos/qos_sai_base.py:2604:13: F841 local variable 'src_dut' is assigned to but never used tests/qos/qos_sai_base.py:2615:42: E201 whitespace after '(' tests/qos/qos_sai_base.py:2615:50: E202 whitespace before ')'

@arista-nwolfe

@mssonicbld
Copy link
Collaborator

/azp run

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@rlhui rlhui added the chassis label Dec 18, 2024
@rlhui
Copy link

rlhui commented Dec 18, 2024

@ansrajpu-git would you please also review this? thanks.

@mssonicbld
Copy link
Collaborator

/azp run

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@arlakshm
Copy link
Contributor

/Azp run

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@ansrajpu-git
Copy link
Contributor

@arlakshm @arista-nwolfe , adding the same fixture for few more qos tests. Please review PR # #16224

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Status: No status
Development

Successfully merging this pull request may close these issues.

8 participants