
retry_on_status setting does not work as expected with requests that should not be retried immediately #197

Open
artem-shelkovnikov opened this issue Mar 22, 2024 · 3 comments


@artem-shelkovnikov
Member

Steps to reproduce:

Happens for me on 8.12 and 8.13 clusters with clients of the same versions; I assume it happens with earlier versions as well.

  1. Have a cluster that's rather small (I used a 2GB RAM Elasticsearch deployment in Elastic Cloud).
  2. Have a script that bombards this cluster with bulk requests, which will eventually make the cluster return 429s.
  3. Have another script that just runs searches against this cluster with an Elasticsearch client initialised with the default retry_on_status - presumably, it should retry 429s.

What happens is that the client in step 3 does retry the 429, but fails almost immediately: it does not sleep with any sort of backoff strategy, so it retries a couple of times in a row within a second and gives up.
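For illustration, a minimal sketch of the search script from step 3 might look like the following (endpoint, credentials, and index name are placeholders; the bulk-flooding script from step 2 runs separately):

# Hypothetical reproduction sketch for step 3; endpoint, API key, and index
# name are placeholders. It relies on the client's default retry_on_status,
# which (per this issue) includes 429.
from elasticsearch import Elasticsearch

client = Elasticsearch(
    "https://my-small-cluster.es.example.com:9243",  # placeholder endpoint
    api_key="<redacted>",                            # placeholder credentials
)

# While the step-2 script saturates the cluster with bulk requests, this
# search gets 429s; the client retries them back-to-back without sleeping
# and gives up within about a second.
resp = client.search(index="my-index", query={"match_all": {}})
print(resp["hits"]["total"])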

Expected behaviour
The client allows retrying with a backoff strategy. Additionally, if the client is closed, it would be great if the pending retry immediately woke up and exited without retrying (see elastic/elasticsearch-py#2484).
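As an illustration of that second point (this is only a sketch, not the transport's actual API), the backoff sleep could wait on a close event instead of sleeping unconditionally:

# Sketch only: wait on a hypothetical "closed" event instead of a plain sleep,
# so that closing the client wakes the pending retry immediately.
import asyncio

async def sleep_or_close(delay: float, closed: asyncio.Event) -> bool:
    """Sleep up to `delay` seconds; return True if the client was closed first."""
    try:
        await asyncio.wait_for(closed.wait(), timeout=delay)
        return True   # woken early by close(): stop retrying
    except asyncio.TimeoutError:
        return False  # full backoff elapsed: go ahead and retry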

@pquentin
Member

I transferred this issue from elasticsearch-py as retries are implemented in the transport. The implementation itself is simple enough; without considering tests for now, this diff is enough:

diff --git a/elastic_transport/_async_transport.py b/elastic_transport/_async_transport.py
index 1e66f74..20e6d6a 100644
--- a/elastic_transport/_async_transport.py
+++ b/elastic_transport/_async_transport.py
@@ -47,6 +47,7 @@ from ._transport import (
     NOT_DEAD_NODE_HTTP_STATUSES,
     Transport,
     TransportApiResponse,
+    backoff_time,
     validate_sniffing_options,
 )
 from .client_utils import resolve_default
@@ -366,6 +367,7 @@ class AsyncTransport(Transport):
                 if not retry or attempt >= max_retries:
                     return TransportApiResponse(resp.meta, body)
                 else:
+                    await asyncio.sleep(backoff_time(attempt))
                     _logger.warning(
                         "Retrying request after non-successful status %d (attempt %d of %d)",
                         resp.meta.status,
diff --git a/elastic_transport/_transport.py b/elastic_transport/_transport.py
index bf1de58..9621806 100644
--- a/elastic_transport/_transport.py
+++ b/elastic_transport/_transport.py
@@ -86,6 +86,10 @@ class TransportApiResponse(NamedTuple):
     meta: ApiResponseMeta
     body: Any
 
+def backoff_time(attempts: int, base: int = 1, cap: int = 60):
+    # Full Jitter from https://aws.amazon.com/blogs/architecture/exponential-backoff-and-jitter/
+    return random.uniform(0, min(cap, base * 2 ** (attempts - 1)))
+
 
 class Transport:
     """
@@ -444,6 +448,7 @@ class Transport:
                 if not retry or attempt >= max_retries:
                     return TransportApiResponse(resp.meta, body)
                 else:
+                    time.sleep(backoff_time(attempt))
                     _logger.warning(
                         "Retrying request after non-successful status %d (attempt %d of %d)",
                         resp.meta.status,

However, I have two concerns:

  • Is it considered a breaking change to suddenly sleep between retries? This introduces sleeps for all retries: on status on the same node, but also on timeouts/connection errors on another node.
  • Are the sleeping defaults good enough? I'm using the "Full Jitter" algorithm from https://aws.amazon.com/blogs/architecture/exponential-backoff-and-jitter/ with base = 1 and cap = 60. If all retries get exhausted, 95% of total sleep times will be between ~60s and ~180s, with a median of ~120s.

[Total sleep distribution plot]
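Those figures can be sanity-checked with a small Monte Carlo sketch of the backoff_time function from the diff above; the retry count used here is an assumption for illustration, not the transport's default:

# Monte Carlo estimate of total sleep when all retries are exhausted, using
# the Full Jitter formula from the diff. retries=10 is an assumed value.
import random
import statistics

def backoff_time(attempts: int, base: int = 1, cap: int = 60) -> float:
    # Full Jitter: uniform between 0 and the (capped) exponential backoff.
    return random.uniform(0, min(cap, base * 2 ** (attempts - 1)))

def total_sleep(retries: int) -> float:
    return sum(backoff_time(attempt) for attempt in range(1, retries + 1))

samples = sorted(total_sleep(retries=10) for _ in range(100_000))
print("median:", round(statistics.median(samples), 1))
print("2.5th percentile:", round(samples[int(0.025 * len(samples))], 1))
print("97.5th percentile:", round(samples[int(0.975 * len(samples))], 1))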

@artem-shelkovnikov
Member Author

Is it considered a breaking change to suddenly sleep between retries? This introduces sleeps for all retries: on status on the same node, but also on timeouts/connection errors on another node.

For me it's hard to say - if we're concerned about it, it might be worth making this configurable and having the default setting emulate the previous behaviour?

Are the sleeping defaults good enough? I'm using the "Full Jitter" algorithm from https://aws.amazon.com/blogs/architecture/exponential-backoff-and-jitter/ with base = 1 and cap = 60. If all retries get exhausted, 95% of total sleep times will be between ~60s and ~180s, with a median of ~120s.

I think so, especially since the number of retries is configurable.

@pquentin
Member

Thanks Artem! I also got feedback from @miguelgrinberg:

I'd say that the backoff_time function should be configurable, or at the least the base and cap arguments to it should be configurable. And maybe the default should be for them to be 0 so that the behavior does not change unless the user explicitly wants to use delays between retries.

I'm planning to make base and cap configurable (with better names like initial_sleep and maximum_sleep - to be confirmed), and will make sure that the default behavior does not change for now (no sleep between retries).
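A rough sketch of what that could look like (the parameter names are the provisional ones above and may change; defaults of 0 keep today's no-sleep behavior):

# Hypothetical configurable backoff; initial_sleep/maximum_sleep are the
# provisional names from the comment above, not a released API.
import random

def backoff_time(attempts: int, initial_sleep: float = 0.0, maximum_sleep: float = 0.0) -> float:
    if initial_sleep <= 0 or maximum_sleep <= 0:
        return 0.0  # defaults: no sleep between retries, same as today
    # Full Jitter, scaled by the configured initial and maximum sleep times
    return random.uniform(0, min(maximum_sleep, initial_sleep * 2 ** (attempts - 1)))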

It would also be nice to be able to say which status codes need a sleep between retries, but this is maybe overengineering.

I'll wait for a user to complain before having some retries that sleep and others that don't. It is already possible to configure which status codes are retried and which are not.
