
Change priority for scheduling reroute during timeout #16445

Open · wants to merge 5 commits into base: main
Conversation

@imRishN (Member) commented Oct 23, 2024

Description

This PR lowers the priority of the reroute scheduled after an allocator timeout from HIGH to NORMAL. Consistently submitting HIGH-priority reroutes can starve NORMAL-priority tasks, and NORMAL is sufficient for healthy clusters. For clusters in a degraded state where NORMAL-priority tasks are themselves being starved, this PR adds a new dynamic cluster setting to raise the priority of the reroute task so that shards can still be allocated in such scenarios.
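The starvation argument can be illustrated with a minimal, self-contained sketch (plain Java, not the OpenSearch task queue or its `Priority` class): an executor that always prefers HIGH tasks never gets to a waiting NORMAL task while HIGH reroutes keep arriving.

```java
import java.util.Comparator;
import java.util.PriorityQueue;

public class RerouteStarvation {
    // Simplified stand-in for a task priority: lower ordinal is served first.
    enum Priority { HIGH, NORMAL }

    // seq gives FIFO ordering among tasks of equal priority.
    record Task(Priority priority, long seq) {}

    public static void main(String[] args) {
        PriorityQueue<Task> queue = new PriorityQueue<>(
            Comparator.comparing(Task::priority).thenComparingLong(Task::seq));

        long seq = 0;
        // A NORMAL-priority task is already waiting (e.g. a settings update).
        queue.add(new Task(Priority.NORMAL, seq++));

        int normalServed = 0;
        // Each tick: one new HIGH reroute arrives (another allocator timeout),
        // and the executor serves exactly one task from the queue.
        for (int tick = 0; tick < 10; tick++) {
            queue.add(new Task(Priority.HIGH, seq++));
            Task served = queue.poll();
            if (served.priority() == Priority.NORMAL) {
                normalServed++;
            }
        }
        // The waiting NORMAL task is never served while HIGH work keeps arriving.
        System.out.println("NORMAL tasks served: " + normalServed); // prints 0
    }
}
```

With the reroute lowered to NORMAL, it competes fairly with other NORMAL tasks instead of preempting them on every timeout.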

Related Issues

Resolves #[Issue number to be closed when this PR is merged]

Check List

  • [ ] Functionality includes testing.
  • [ ] API changes companion pull request created, if applicable.
  • [ ] Public documentation issue/PR created, if applicable.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on the Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: Rishab Nahata <rnnahata@amazon.com>
Contributor

❌ Gradle check result for 5e83a92: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Comment on lines 346 to 347

- "reroute after existing shards allocator timed out",
- Priority.HIGH,
+ "reroute after existing shards allocator [R] timed out",
+ Priority.NORMAL,
Collaborator

Should we have a separate priority for primary vs replica?

Member Author

NORMAL also seems right for PSA. But during genuine cluster issues, which can be identified with appropriate monitoring, we might need to raise it to HIGH. I will update the PR with a setting for ESA, similar to the one for BSA, to raise the reroute priority. Wdyt?
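The dynamic setting the author proposes is not shown in this thread. As a rough, self-contained sketch of its shape (plain Java, not the actual OpenSearch `Setting`/`ClusterSettings` API; the key name and the set of allowed priorities are hypothetical), a dynamically updatable reroute-priority setting might look like:

```java
import java.util.concurrent.atomic.AtomicReference;

public class ReroutePrioritySetting {
    // Simplified stand-in for a task priority enum.
    enum Priority { URGENT, HIGH, NORMAL, LOW }

    // Hypothetical setting key; the actual PR may use a different name.
    static final String KEY =
        "cluster.routing.allocation.shards_allocator.schedule_reroute.priority";

    // Defaults to NORMAL, matching the new default behavior in this PR.
    private final AtomicReference<Priority> value = new AtomicReference<>(Priority.NORMAL);

    // Dynamic-update path: parse and validate the raw string, then swap
    // atomically so concurrent readers always see a consistent value.
    void update(String raw) {
        Priority parsed = Priority.valueOf(raw.trim().toUpperCase());
        if (parsed != Priority.NORMAL && parsed != Priority.HIGH && parsed != Priority.URGENT) {
            throw new IllegalArgumentException("unsupported reroute priority: " + parsed);
        }
        value.set(parsed);
    }

    Priority current() {
        return value.get();
    }
}
```

In OpenSearch itself this would instead be registered as a dynamic node-scope `Setting` with a consumer callback, but the sketch captures the idea: operators can raise the reroute priority back to HIGH on a degraded cluster without a restart.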

Collaborator

@Bukhtawar left a comment

Let's update the PR description

Member Author

imRishN commented Oct 23, 2024

> Let's update the PR description

Updated

Signed-off-by: Rishab Nahata <rnnahata@amazon.com>
Contributor

❌ Gradle check result for 6a448d0: FAILURE

Signed-off-by: Rishab Nahata <rnnahata@amazon.com>
imRishN changed the title from "Change priority for scheduling reroute in timeout" to "Change priority for scheduling reroute during timeout" on Oct 23, 2024
Contributor

❌ Gradle check result for 825a983: FAILURE

Signed-off-by: Rishab Nahata <rnnahata@amazon.com>
Contributor

❌ Gradle check result for 5368e7f: FAILURE

Signed-off-by: Rishab Nahata <rnnahata@amazon.com>
Contributor

❌ Gradle check result for 2ba604d: FAILURE
