
Change priority for scheduling reroute during timeout #16445

Open · wants to merge 5 commits into base: main
Conversation

@imRishN (Member) commented Oct 23, 2024

Description

This PR lowers the priority of the reroute scheduled after an allocator timeout from HIGH to NORMAL. Consistently submitting HIGH-priority reroutes can starve NORMAL-priority tasks, and NORMAL is sufficient for healthy clusters. For clusters in a degraded state where NORMAL-priority tasks are themselves being starved, this PR adds a new dynamic cluster setting to raise the priority of the reroute task so that shards can still be allocated in such scenarios.
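The starvation argument can be illustrated with a minimal, self-contained sketch (plain Java, not the OpenSearch task queue or its `Priority` class): an executor that always prefers HIGH tasks never gets to a waiting NORMAL task while HIGH reroutes keep arriving.

```java
import java.util.Comparator;
import java.util.PriorityQueue;

public class RerouteStarvation {
    // Simplified stand-in for a task priority: lower ordinal is served first.
    enum Priority { HIGH, NORMAL }

    // seq gives FIFO ordering among tasks of equal priority.
    record Task(Priority priority, long seq) {}

    public static void main(String[] args) {
        PriorityQueue<Task> queue = new PriorityQueue<>(
            Comparator.comparing(Task::priority).thenComparingLong(Task::seq));

        long seq = 0;
        // A NORMAL-priority task is already waiting (e.g. a settings update).
        queue.add(new Task(Priority.NORMAL, seq++));

        int normalServed = 0;
        // Each tick: one new HIGH reroute arrives (another allocator timeout),
        // and the executor serves exactly one task from the queue.
        for (int tick = 0; tick < 10; tick++) {
            queue.add(new Task(Priority.HIGH, seq++));
            Task served = queue.poll();
            if (served.priority() == Priority.NORMAL) {
                normalServed++;
            }
        }
        // The waiting NORMAL task is never served while HIGH work keeps arriving.
        System.out.println("NORMAL tasks served: " + normalServed); // prints 0
    }
}
```

With the reroute lowered to NORMAL, it competes fairly with other NORMAL tasks instead of preempting them on every timeout.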

Related Issues

Resolves #[Issue number to be closed when this PR is merged]

Check List

  • [ ] Functionality includes testing.
  • [ ] API changes companion pull request created, if applicable.
  • [ ] Public documentation issue/PR created, if applicable.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on the Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: Rishab Nahata <rnnahata@amazon.com>
Contributor

❌ Gradle check result for 5e83a92: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Comment on lines 346 to 347

- "reroute after existing shards allocator timed out",
- Priority.HIGH,
+ "reroute after existing shards allocator [R] timed out",
+ Priority.NORMAL,
Collaborator

Should we have a separate priority for primary vs replica?

Member Author

NORMAL also seems right for PSA. But during genuine cluster issues, which can be identified with appropriate monitoring, we might need to raise it to HIGH. I will update the PR with a setting for ESA, similar to the one for BSA, to raise the reroute priority. Wdyt?
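The dynamic setting the author proposes is not shown in this thread. As a rough, self-contained sketch of its shape (plain Java, not the actual OpenSearch `Setting`/`ClusterSettings` API; the key name and the set of allowed priorities are hypothetical), a dynamically updatable reroute-priority setting might look like:

```java
import java.util.concurrent.atomic.AtomicReference;

public class ReroutePrioritySetting {
    // Simplified stand-in for a task priority enum.
    enum Priority { URGENT, HIGH, NORMAL, LOW }

    // Hypothetical setting key; the actual PR may use a different name.
    static final String KEY =
        "cluster.routing.allocation.shards_allocator.schedule_reroute.priority";

    // Defaults to NORMAL, matching the new default behavior in this PR.
    private final AtomicReference<Priority> value = new AtomicReference<>(Priority.NORMAL);

    // Dynamic-update path: parse and validate the raw string, then swap
    // atomically so concurrent readers always see a consistent value.
    void update(String raw) {
        Priority parsed = Priority.valueOf(raw.trim().toUpperCase());
        if (parsed != Priority.NORMAL && parsed != Priority.HIGH && parsed != Priority.URGENT) {
            throw new IllegalArgumentException("unsupported reroute priority: " + parsed);
        }
        value.set(parsed);
    }

    Priority current() {
        return value.get();
    }
}
```

In OpenSearch itself this would instead be registered as a dynamic node-scope `Setting` with a consumer callback, but the sketch captures the idea: operators can raise the reroute priority back to HIGH on a degraded cluster without a restart.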

Collaborator

@Bukhtawar left a comment

Let's update the PR description

Member Author

imRishN commented Oct 23, 2024

> Let's update the PR description

Updated

Signed-off-by: Rishab Nahata <rnnahata@amazon.com>
Contributor

❌ Gradle check result for 6a448d0: FAILURE

Signed-off-by: Rishab Nahata <rnnahata@amazon.com>
imRishN changed the title from "Change priority for scheduling reroute in timeout" to "Change priority for scheduling reroute during timeout" on Oct 23, 2024
Contributor

❌ Gradle check result for 825a983: FAILURE

Signed-off-by: Rishab Nahata <rnnahata@amazon.com>
Contributor

❌ Gradle check result for 5368e7f: FAILURE

Signed-off-by: Rishab Nahata <rnnahata@amazon.com>
Contributor

❌ Gradle check result for 2ba604d: FAILURE
