Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cancel Timeout Tasks and Fail tests if too many Timeouts #2204

Merged
merged 16 commits into from
Dec 15, 2023
Merged

Conversation

bfish713
Copy link
Collaborator

@bfish713 bfish713 commented Dec 12, 2023

Closes #2165

This PR:

  • Cancels the consensus timeout task as soon as we enter a new view (before we start the next timeout)
  • Keeps track of the view sync timeout task, and cancels it whenever we send a new view sync vote
  • Emits a ViewTimedout event to the application when the consensus timeout task expires
  • Fixes the test infra to properly account for view timeouts and fixes an issue where views were never counted as failed

This PR does not:

  • Properly set the num_failed_views parameter for all tests. I specifically set it very high so the tests will pass. This is consistent with prior behavior where the test infra never counted any failed views (num_failed_views was effectively infinite).

Key places to review:

  • Timeout cancellation logic in consensus.rs and view_sync.rs
  • Changes to OverallSafetyTask to fix bugs in the test harness

@bfish713 bfish713 linked an issue Dec 12, 2023 that may be closed by this pull request
2 tasks
@bfish713 bfish713 changed the title Bf/timeout task Cancel Timeout Tasks and Fail tests if too many Timeouts Dec 12, 2023
@bfish713 bfish713 marked this pull request as ready for review December 13, 2023 17:38
@bfish713 bfish713 self-assigned this Dec 13, 2023
@bfish713 bfish713 added this to the Sprint 6 milestone Dec 13, 2023
Copy link
Member

@shenkeyao shenkeyao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! I left some minor comments.

Comment on lines 345 to 348
#[cfg(async_executor_impl = "async-std")]
timeout_task.cancel().await;
#[cfg(async_executor_impl = "tokio")]
timeout_task.abort();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Can we add a helper function to cancel the timeout task depending on their async executor, and use it to simplify places like this?

Comment on lines 236 to 239
let num_decided = self.success_nodes.len();
let num_failed = self.failed_nodes.len();
let remaining_nodes = total_num_nodes - (num_decided + num_failed);
remaining_nodes + num_decided >= threshold
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like this function could be simplified as total_num_nodes - num_failed >= threshold, i.e., we don't need to calculate num_decided and remaining_nodes.

@@ -108,7 +108,10 @@ async fn test_with_failures_half_f() {
metadata.spinning_properties = SpinningTaskDescription {
node_changes: vec![(5, dead_nodes)],
};

// TODO: this should only have 3 failures for each down leader, investigate why it fails additional views
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, this is strange. It made sense before we set num_successful_views to 22, but now those 3 nodes should each fail only once.

@bfish713 bfish713 requested a review from shenkeyao December 14, 2023 14:24
shenkeyao
shenkeyao previously approved these changes Dec 14, 2023
@bfish713 bfish713 requested a review from shenkeyao December 15, 2023 15:52
@bfish713 bfish713 merged commit 8b4d979 into main Dec 15, 2023
7 of 8 checks passed
@bfish713 bfish713 deleted the bf/timeout-task branch December 15, 2023 17:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

STABILITY - Cancel Timeout Tasks properly
2 participants