
retry_on_status setting does not work as expected with requests that should not be retried immediately #197

Open
artem-shelkovnikov opened this issue Mar 22, 2024 · 3 comments


@artem-shelkovnikov
Member

Steps to reproduce:

Happens for me on 8.12 and 8.13 clusters with clients of the same versions; I assume it happens with earlier versions as well.

  1. Have a cluster that's rather small (I used a 2GB RAM Elasticsearch deployment in Elastic Cloud).
  2. Have a script that bombards this cluster with bulk requests, which will eventually make the cluster return 429s.
  3. Have another script that just runs searches against this cluster with an Elasticsearch client initialised with the default retry_on_status - presumably, it should retry 429s.

What happens is that the client in step 3 does retry the 429, but fails almost immediately: it does not sleep with any sort of backoff strategy, so it retries a couple of times in a row within a second and gives up.
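For illustration, a minimal sketch of the search script from step 3 might look like the following (endpoint, credentials, and index name are placeholders; the bulk-flooding script from step 2 runs separately):

# Hypothetical reproduction sketch for step 3; endpoint, API key, and index
# name are placeholders. It relies on the client's default retry_on_status,
# which (per this issue) includes 429.
from elasticsearch import Elasticsearch

client = Elasticsearch(
    "https://my-small-cluster.es.example.com:9243",  # placeholder endpoint
    api_key="<redacted>",                            # placeholder credentials
)

# While the step-2 script saturates the cluster with bulk requests, this
# search gets 429s; the client retries them back-to-back without sleeping
# and gives up within about a second.
resp = client.search(index="my-index", query={"match_all": {}})
print(resp["hits"]["total"])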

Expected behaviour
The client allows retrying with a backoff strategy. Additionally, if the client is closed, it would be great if the pending retry immediately woke up and exited without retrying (see elastic/elasticsearch-py#2484).
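As an illustration of that second point (this is only a sketch, not the transport's actual API), the backoff sleep could wait on a close event instead of sleeping unconditionally:

# Sketch only: wait on a hypothetical "closed" event instead of a plain sleep,
# so that closing the client wakes the pending retry immediately.
import asyncio

async def sleep_or_close(delay: float, closed: asyncio.Event) -> bool:
    """Sleep up to `delay` seconds; return True if the client was closed first."""
    try:
        await asyncio.wait_for(closed.wait(), timeout=delay)
        return True   # woken early by close(): stop retrying
    except asyncio.TimeoutError:
        return False  # full backoff elapsed: go ahead and retry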

@pquentin
Member

I transferred this issue from elasticsearch-py as retries are implemented in the transport. The implementation itself is simple enough; without considering tests for now, this diff is enough:

diff --git a/elastic_transport/_async_transport.py b/elastic_transport/_async_transport.py
index 1e66f74..20e6d6a 100644
--- a/elastic_transport/_async_transport.py
+++ b/elastic_transport/_async_transport.py
@@ -47,6 +47,7 @@ from ._transport import (
     NOT_DEAD_NODE_HTTP_STATUSES,
     Transport,
     TransportApiResponse,
+    backoff_time,
     validate_sniffing_options,
 )
 from .client_utils import resolve_default
@@ -366,6 +367,7 @@ class AsyncTransport(Transport):
                 if not retry or attempt >= max_retries:
                     return TransportApiResponse(resp.meta, body)
                 else:
+                    await asyncio.sleep(backoff_time(attempt))
                     _logger.warning(
                         "Retrying request after non-successful status %d (attempt %d of %d)",
                         resp.meta.status,
diff --git a/elastic_transport/_transport.py b/elastic_transport/_transport.py
index bf1de58..9621806 100644
--- a/elastic_transport/_transport.py
+++ b/elastic_transport/_transport.py
@@ -86,6 +86,10 @@ class TransportApiResponse(NamedTuple):
     meta: ApiResponseMeta
     body: Any
 
+def backoff_time(attempts: int, base: int = 1, cap: int = 60):
+    # Full Jitter from https://aws.amazon.com/blogs/architecture/exponential-backoff-and-jitter/
+    return random.uniform(0, min(cap, base * 2 ** (attempts - 1)))
+
 
 class Transport:
     """
@@ -444,6 +448,7 @@ class Transport:
                 if not retry or attempt >= max_retries:
                     return TransportApiResponse(resp.meta, body)
                 else:
+                    time.sleep(backoff_time(attempt))
                     _logger.warning(
                         "Retrying request after non-successful status %d (attempt %d of %d)",
                         resp.meta.status,

However, I have two concerns:

  • Is it considered a breaking change to suddenly sleep between retries? This introduces sleeps for all retries: on status on the same node, but also on timeouts/connection errors on another node.
  • Are the sleeping defaults good enough? I'm using the "Full Jitter" algorithm from https://aws.amazon.com/blogs/architecture/exponential-backoff-and-jitter/ with base = 1 and cap = 60. If all retries get exhausted, 95% of total sleep times will be between ~60s and ~180s, with a median of ~120s.

[Total sleep distribution plot]
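Those figures can be sanity-checked with a small Monte Carlo sketch of the backoff_time function from the diff above; the retry count used here is an assumption for illustration, not the transport's default:

# Monte Carlo estimate of total sleep when all retries are exhausted, using
# the Full Jitter formula from the diff. retries=10 is an assumed value.
import random
import statistics

def backoff_time(attempts: int, base: int = 1, cap: int = 60) -> float:
    # Full Jitter: uniform between 0 and the (capped) exponential backoff.
    return random.uniform(0, min(cap, base * 2 ** (attempts - 1)))

def total_sleep(retries: int) -> float:
    return sum(backoff_time(attempt) for attempt in range(1, retries + 1))

samples = sorted(total_sleep(retries=10) for _ in range(100_000))
print("median:", round(statistics.median(samples), 1))
print("2.5th percentile:", round(samples[int(0.025 * len(samples))], 1))
print("97.5th percentile:", round(samples[int(0.975 * len(samples))], 1))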

@artem-shelkovnikov
Member Author

Is it considered a breaking change to suddenly sleep between retries? This introduces sleeps for all retries: on status on the same node, but also on timeouts/connection errors on another node.

For me it's hard to say - if we're concerned about it, it might be worth making this configurable and having the default setting emulate the previous behaviour?

Are the sleeping defaults good enough? I'm using the "Full Jitter" algorithm from https://aws.amazon.com/blogs/architecture/exponential-backoff-and-jitter/ with base = 1 and cap = 60. If all retries get exhausted, 95% of total sleep times will be between ~60s and ~180s, with a median of ~120s.

I think so, especially since the number of retries is configurable.

@pquentin
Member

Thanks Artem! I also got feedback from @miguelgrinberg:

I'd say that the backoff_time function should be configurable, or at the least the base and cap arguments to it should be configurable. And maybe the default should be for them to be 0 so that the behavior does not change unless the user explicitly wants to use delays between retries.

I'm planning to make base and cap configurable (with better names like initial_sleep and maximum_sleep - to be confirmed), and will make sure that the default behavior does not change for now (no sleep between retries).
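A rough sketch of what that could look like (the parameter names are the provisional ones above and may change; defaults of 0 keep today's no-sleep behavior):

# Hypothetical configurable backoff; initial_sleep/maximum_sleep are the
# provisional names from the comment above, not a released API.
import random

def backoff_time(attempts: int, initial_sleep: float = 0.0, maximum_sleep: float = 0.0) -> float:
    if initial_sleep <= 0 or maximum_sleep <= 0:
        return 0.0  # defaults: no sleep between retries, same as today
    # Full Jitter, scaled by the configured initial and maximum sleep times
    return random.uniform(0, min(maximum_sleep, initial_sleep * 2 ** (attempts - 1)))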

It would also be nice to be able to say which status codes need a sleep between retries, but this is maybe overengineering.

I'll wait for a user to complain before having some retries that sleep and others that don't. It is already possible to configure which status codes are retried and which are not.
