Fix orch stuck when removing vlan member (#3294) #3295
What I did
Fixes #3294
The root cause of this issue is that the two data structures holding VLAN member info in orchagent are not in sync.
Why I did it
Fix the bug
How I did it
The return value of setPortPvid() does not matter here, so we ignore it.
This keeps "m_members" and "m_portVlanMember" in sync at all times.
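The idea can be illustrated with a minimal sketch. It is not the orchagent code: `VlanState`, `removeMember` and the `reset_pvid` callback are hypothetical names, with the callback standing in for setPortPvid(). The point is only that the result of the PVID reset is discarded, so both containers are always updated together.

```cpp
#include <map>
#include <set>
#include <string>

// Hypothetical, simplified state; not the real orchagent classes.
struct VlanState {
    std::set<std::string> vlan_members;                          // role of m_members (one VLAN)
    std::map<std::string, std::set<std::string>> port_to_vlans;  // role of m_portVlanMember
};

// Removes `port` from `vlan` in both containers. `reset_pvid` stands in for
// setPortPvid(); its result is deliberately discarded so that a failure there
// cannot leave one container updated and the other not.
template <typename ResetPvid>
void removeMember(VlanState& st, const std::string& vlan,
                  const std::string& port, ResetPvid reset_pvid) {
    (void)reset_pvid(port);  // may fail, e.g. if the port is already a router interface
    st.vlan_members.erase(port);
    auto it = st.port_to_vlans.find(port);
    if (it != st.port_to_vlans.end()) {
        it->second.erase(vlan);  // erase by key is a no-op when the key is absent
    }
}
```

Calling `removeMember` a second time with the same arguments is harmless in this sketch, because both erases are by key.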
How I verified it
Details if related
Consider the case where one deletes an interface from a VLAN and then makes that interface a router interface (by adding an IP address to it).
We have no way to ensure that these two config changes arrive at Orch in order.
It is possible that "creating a router interface" arrives first. If so, "removing vlan member" will fail.
There are two data structures storing the VLAN member info in orchagent. (They should be in sync at all times.)
"class Port" in port.h
An instance of "class Port" is a VLAN interface. Its "m_members" is a set of VLAN members, like EthernetXX, EthernetYY...
"class PortsOrch" in portsorch.h
In "class PortsOrch", "m_portVlanMember" is keyed by port, like EthernetXX. Each entry is a map of VLAN info, like Vlan100, Vlan200...
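As a rough model of these two views, the sketch below uses simplified stand-in types. `VlanPort`, `VlanMemberInfo`, `PortVlanMemberMap` and `viewsAgree` are hypothetical names, and the real definitions in port.h and portsorch.h carry many more fields. It shows what "in sync" means: for a given port and VLAN, both containers must give the same answer.

```cpp
#include <map>
#include <set>
#include <string>

// Hypothetical, simplified stand-ins for the orchagent types.
struct VlanMemberInfo {
    int vlan_id;  // placeholder payload, not the real entry layout
};

struct VlanPort {
    std::string alias;              // e.g. "Vlan100"
    std::set<std::string> members;  // plays the role of Port::m_members
};

// Plays the role of PortsOrch::m_portVlanMember (assumed shape):
// port alias -> (VLAN alias -> member info).
using PortVlanMemberMap =
    std::map<std::string, std::map<std::string, VlanMemberInfo>>;

// Returns true when both views agree on whether `port` belongs to `vlan`.
bool viewsAgree(const VlanPort& vlan, const PortVlanMemberMap& table,
                const std::string& port) {
    const bool in_vlan_set = vlan.members.count(port) > 0;
    const auto it = table.find(port);
    const bool in_port_map =
        it != table.end() && it->second.count(vlan.alias) > 0;
    return in_vlan_set == in_port_map;
}
```

A check like `viewsAgree` returns false exactly in the state this PR describes: the port has been erased from one container while the other still lists it.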
Please take a look at PortsOrch::removeVlanMember().
If setPortPvid() returns failure (because the port is already a router interface), "m_members" and "m_portVlanMember" end up out of sync.
On the next entry into removeVlanMember() with the same params:
The iterator "vlan_member" points to the end, but the assert() has no effect because the NOS is a release build. (In a debug build, the assert() would trigger abort().)
When m_portVlanMember[port.m_alias].erase(vlan_member) erases this end iterator, the behavior is undefined; in practice the C++ standard library gets stuck and occupies 100% CPU.
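For illustration only, here is a defensive pattern for the erase step. It is not the change made in this PR. `VlanMap` and `eraseVlanIfPresent` are hypothetical names, and the map stands in for one port's entry in m_portVlanMember. Because assert() is compiled out when NDEBUG is defined, the end-iterator check is written as an ordinary runtime branch.

```cpp
#include <map>
#include <string>

// Hypothetical stand-in for one port's entry in m_portVlanMember:
// VLAN alias -> some member id.
using VlanMap = std::map<std::string, int>;

// Erases `vlan` from `vlans` only if it is present. Passing end() to
// erase(iterator) is undefined behavior, so the check is a real runtime
// branch rather than an assert() that release builds compile out.
bool eraseVlanIfPresent(VlanMap& vlans, const std::string& vlan) {
    auto it = vlans.find(vlan);
    if (it == vlans.end()) {
        return false;  // nothing to erase, e.g. removed by an earlier call
    }
    vlans.erase(it);
    return true;
}
```

With this shape, a repeated removal request returns false instead of reaching erase() with an invalid iterator.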