backing: Back-off from backing on approval checking lag #2908
If approval checking falls behind a certain threshold, the network could not process the assignments and approvals needed to approve the block fast enough, so we need to back off from creating new work and give the approval subsystems a chance to catch up.
Continuously creating new work is a bad idea because of how the approval subsystems work: if the system is slow to process the assignments and approvals for the current block, either because we are behind on work from previous blocks or because the network is slow, validators will simply trigger new tranches, which in turn causes more delays. That creates the conditions for the system to fall further behind and never catch up.
Hence we need a mechanism that, instead of letting the node fall more and more behind, allows the system to catch up automatically and return to working in optimal conditions. This PR achieves that by abstaining from backing new candidates if the node is behind on approvals beyond a certain threshold.
Signed-off-by: Alexandru Gheorghe <alexandru.gheorghe@parity.io>
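The back-off rule described above can be sketched roughly as follows. This is a minimal illustration and not the code from the PR: the names `MAX_APPROVAL_CHECKING_LAG`, `BackingDecision`, and `backing_decision`, as well as the threshold value of 10 blocks, are assumptions made up for the example.

```rust
/// Hypothetical threshold, in blocks. The discussion only says
/// "a certain threshold"; the value 10 is an assumption for illustration.
const MAX_APPROVAL_CHECKING_LAG: u32 = 10;

/// What the node does with a new candidate it could second.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum BackingDecision {
    /// Approval checking is keeping up, so the node seconds as usual.
    Second,
    /// Approval checking is too far behind, so the node creates no new work.
    BackOff,
}

/// Decide whether to second a new candidate, given how many blocks
/// approval checking currently lags behind (hypothetical helper).
fn backing_decision(approval_checking_lag: u32) -> BackingDecision {
    if approval_checking_lag > MAX_APPROVAL_CHECKING_LAG {
        BackingDecision::BackOff
    } else {
        BackingDecision::Second
    }
}
```

With these assumed values, a lag at or below the threshold keeps the node backing, and any lag above it makes the node abstain until approvals catch up.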
With this change we still pay the cost of executing candidates in backing and handling statement gossip, even though the candidates are not put on chain. Even if there is some parameter and selection of backed candidates, this approach only considers the view of a single node.
Ideally, if we include finality proofs on chain, we can let the runtime do the job of selecting which candidates to back (for example, always back system parachains).
As a smaller step, I would recommend skipping backing on the core the node is assigned to when the approval checking lag is too high. We are experimenting with decreasing the backing group size. With a backing group of 3 we can reasonably expect this form of back pressure to be efficient and tied to the actual network performance/load, but it still falls short of being able to back only specific candidates.
Undeniably there is a need for a back-off mechanism. And if there is a back-off mechanism, it should definitely limit the entry point for new candidates, which is backing. I think pretty much everyone agrees on those two aspects. As Sandreim points out:
This is also a problem because it effectively makes it an opt-in back-off scheme. Malicious nodes can continue backing. This becomes an issue when all honest nodes back off due to a large approval lag / big approval queue while malicious nodes continue backing and lengthen this backed-off state on purpose, because it allows them to monopolize backing. Fundamentally, this issue is very similar to the discussions around finality lag and block-authoring back-off, and I have very similar fears. I'm definitely not saying that finality proofs are the only way, but they are the nicest and cleanest way. They allow for easy audits and hopefully fewer mistakes on weird edge cases, and they remove concerns around balancing opt-in systems that can be abused.
Try 2
Changes:
Addressing concerns:
Use on-chain finality proofs.
I agree that having finality proofs on chain would allow us to implement this in the cleanest and safest way, since we could prove that all nodes respected the behaviour. However, that is a longer-term solution that would take significant effort and resources to implement. In the meantime, I do think this short-term solution would give us enough benefit, and it won't make the situation worse.
What if malicious nodes don't comply? Would the mechanism achieve its purpose?
Our network assumes 2/3 honest validators, so I would say that should be enough to allow nodes to catch up on work. We have bigger problems if we can't handle approving work backed by only 1/3 of the nodes.
Can malicious nodes control/lengthen this state somehow?
What will non-compliant nodes gain from this?
Not much. This only prevents honest validators from Seconding; they would still be allowed to issue Valid statements once a candidate has been Seconded (by malicious nodes). So there is no change to the mechanism for checking the integrity of a candidate, nor to era-point rewards, since Seconded and Valid statements are rewarded the same.
How does this affect the liveness of System Parachains?
Since System Parachains do important work on behalf of the network, this PR proposes that we make an exception for System Parachains and continue seconding them.
@Overkillus @sandreim @eskimor Let me know what you think.
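To make the proposed rules concrete, here is a small sketch of how the statement-level policy could look under the assumptions stated in this comment: Valid statements are never blocked, Seconded statements are blocked while the lag exceeds the threshold, and System Parachains are exempt. The types `BackOffPolicy` and `Statement`, the `allows` method, and the `is_system_para` flag are hypothetical names for illustration, not the PR's actual API.

```rust
/// The two statement kinds discussed in the comment.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Statement {
    Seconded,
    Valid,
}

/// Hypothetical policy object; `max_lag` is the approval-checking lag
/// (in blocks) above which the node stops seconding.
struct BackOffPolicy {
    max_lag: u32,
}

impl BackOffPolicy {
    /// Returns true if the node may issue `statement` for a candidate.
    /// `is_system_para` is assumed to be supplied by the caller.
    fn allows(&self, statement: Statement, approval_lag: u32, is_system_para: bool) -> bool {
        match statement {
            // Checking candidates seconded by others is never blocked.
            Statement::Valid => true,
            // Seconding is blocked under lag, except for System Parachains.
            Statement::Seconded => is_system_para || approval_lag <= self.max_lag,
        }
    }
}
```

Under this sketch, a lagging honest node still validates candidates seconded by others and keeps seconding System Parachain candidates, but creates no new work for other parachains.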
As mentioned, this is only an interim solution, which should be improved once we have finality proofs on chain. Hence I would love comments in the code referencing the corresponding ticket, and a hint in the ticket that this code should be replaced once we have proofs. For future readers it should be 100% clear when this code becomes obsolete: anyone implementing on-chain proofs should know that this code can then be removed or replaced.
@alexggh do we still want this?
TODO:
[ ] Is approval checking lag the right signal for deciding to back off on backing?
[ ] Test behaviour in real-world conditions