gRPC: Allow retries of up to MAX_MSG_SIZE #347

Merged
merged 1 commit into main from daver/grpc_retry_buffer_size on Jun 4, 2024

Conversation

daverigby (Contributor) commented on May 15, 2024

Problem

gRPC has a built-in retry mechanism [1] which we configure to automatically retry on UNAVAILABLE status responses from Pinecone.
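
For context, a retry policy like this is supplied to gRPC as a service config on the channel. Below is a minimal sketch of that kind of configuration, with a placeholder target and illustrative retry parameters rather than the Pinecone client's exact values:

    import json
    import grpc

    # Illustrative retry policy, passed via the "grpc.service_config" channel
    # option. The service name, attempt counts and backoff values are
    # placeholders, not the Pinecone client's actual configuration.
    service_config = json.dumps({
        "methodConfig": [{
            "name": [{"service": "VectorService"}],
            "retryPolicy": {
                "maxAttempts": 4,
                "initialBackoff": "0.1s",
                "maxBackoff": "1s",
                "backoffMultiplier": 2,
                "retryableStatusCodes": ["UNAVAILABLE"],
            },
        }]
    })

    channel = grpc.secure_channel(
        "example.pinecone.io:443",  # placeholder target
        grpc.ssl_channel_credentials(),
        options=[("grpc.service_config", service_config)],
    )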

However, it has been observed that the VectorService/Upsert method is not being retried automatically, causing an exception to be thrown to the application:

    Traceback (most recent call last):
      File ".venv/lib/python3.11/site-packages/pinecone/grpc/base.py", line 150, in wrapped
        return func(
               ^^^^^
      File ".venv/lib64/python3.11/site-packages/grpc/_channel.py", line 1181, in __call__
        return _end_unary_response_blocking(state, call, False, None)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File ".venv/lib64/python3.11/site-packages/grpc/_channel.py", line 1006, in _end_unary_response_blocking
        raise _InactiveRpcError(state)  # pytype: disable=not-instantiable
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
        status = StatusCode.UNAVAILABLE
        details = "unavailable"
        debug_error_string = "UNKNOWN:Error received from peer ipv4:34.223.120.220:443 {created_time:"2024-05-10T11:54:43.047741403+00:00", grpc_status:14, grpc_message:"unavailable"}"

Enabling gRPC's tracing [2] by setting the env vars GRPC_VERBOSITY=debug GRPC_TRACE=all (warning: this is very verbose!) highlighted that when we do get a StatusCode.UNAVAILABLE, retry is not considered because the request is too large ("committing" in this context effectively disables retry attempts):

    0514 14:00:43.870499051 4093173 retry_filter_legacy_call_data.cc:1855] chand=0x7ff708006080 calld=0x56377b0b11e0: exceeded retry buffer size, committing
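
(A quick sketch of enabling that tracing from Python, assuming the variables are set before gRPC initialises; exporting them in the shell before starting the process works equally well.)

    import os

    # gRPC reads these environment variables when its core initialises, so set
    # them before importing grpc (or export them in the shell beforehand).
    os.environ["GRPC_VERBOSITY"] = "debug"
    os.environ["GRPC_TRACE"] = "all"  # warning: extremely verbose output

    import grpc  # noqa: E402 - deliberately imported after setting the env vars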

As per gRPC's options [3], the maximum buffer size is controlled via:

    /** Per-RPC retry buffer size, in bytes. Default is 256 KiB. */
    #define GRPC_ARG_PER_RPC_RETRY_BUFFER_SIZE "grpc.per_rpc_retry_buffer_size"

Given that Upsert messages are frequently larger than 256 KiB (it is common to batch up to the 2 MB limit), we fail to retry any batch larger than 256 KiB.

Solution

Address this by increasing the retry buffer size to match the maximum message size we support (currently 128 MB), which is more than sufficient to retry any UpsertRequest.
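
A minimal sketch of the resulting channel options (the option string comes from the gRPC header quoted above; MAX_MSG_SIZE, the target and the surrounding names are illustrative rather than the client's actual code):

    import grpc

    MAX_MSG_SIZE = 128 * 1024 * 1024  # 128 MB -- current maximum message size

    # Let the per-RPC retry buffer hold a full-size message so that large
    # Upsert requests remain eligible for retry instead of being "committed".
    options = [
        ("grpc.max_send_message_length", MAX_MSG_SIZE),
        ("grpc.max_receive_message_length", MAX_MSG_SIZE),
        ("grpc.per_rpc_retry_buffer_size", MAX_MSG_SIZE),
    ]

    channel = grpc.secure_channel(
        "example.pinecone.io:443",  # placeholder target
        grpc.ssl_channel_credentials(),
        options=options,
    )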

Type of Change

  • Bug fix (non-breaking change which fixes an issue)

Test Plan

There is no existing test infrastructure to automate testing of this (no way to do error injection); manually verified that previously seen (intermittent) UNAVAILABLE responses are now correctly retried.

[1]: https://grpc.io/docs/guides/retry/
[2]: https://github.com/grpc/grpc/blob/master/doc/environment_variables.md
[3]: https://github.com/grpc/grpc/blob/befeeba0f57c6ed3608935d8317fd26289e7e080/include/grpc/impl/channel_arg_names.h#L321
daverigby requested a review from jhamon on May 15, 2024
jhamon merged commit 58d27ae into main on Jun 4, 2024
81 checks passed
jhamon deleted the daver/grpc_retry_buffer_size branch on June 4, 2024