gRPC: Allow retries of up to MAX_MSG_SIZE #347
Merged
Problem
gRPC has a built-in retry mechanism [1], which we configure to automatically retry requests that receive an UNAVAILABLE status from Pinecone.
However, we have observed that the VectorService/Upsert method is not retried automatically; instead, an exception is thrown to the application:
Enabling gRPC's tracing [2] by setting the environment variables 'GRPC_VERBOSITY=debug GRPC_TRACE=all' (warning: this is very verbose!) showed that when we do get a StatusCode.UNAVAILABLE, no retry is considered because the request is too large ("committing" in this context means the call can no longer be retried):
As per gRPC's options [3], the max buffer size is controlled via:
Given that Upsert messages are frequently larger than 256 KiB (it is common to batch up to the 2 MB limit), we fail to retry any batch larger than 256 KiB.
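To make the size mismatch concrete, here is a rough sketch of the comparison. It is a simplified model, not gRPC's actual accounting: it assumes the payload size alone is compared against the retry buffer, and it treats the 2 MB batch limit as 2 MiB. The names `RETRY_BUFFER_DEFAULT`, `UPSERT_BATCH_LIMIT` and `fits_retry_buffer` are hypothetical.

```python
# Simplified model: a request stays retryable only while its buffered
# bytes fit in the per-RPC retry buffer. Real gRPC accounting also
# includes some overhead beyond the payload, so treat this as a rough guide.
RETRY_BUFFER_DEFAULT = 256 * 1024       # 256 KiB, the default cited in this PR
UPSERT_BATCH_LIMIT = 2 * 1024 * 1024    # 2 MB batch limit (assumed binary units)


def fits_retry_buffer(request_bytes: int,
                      buffer_bytes: int = RETRY_BUFFER_DEFAULT) -> bool:
    """Return True if a request of this size could still be retried."""
    return request_bytes <= buffer_bytes


if __name__ == "__main__":
    for size in (100 * 1024, 300 * 1024, UPSERT_BATCH_LIMIT):
        print(f"{size} bytes retryable: {fits_retry_buffer(size)}")
```

Under this model a full-size Upsert batch is eight times larger than the default buffer, so it is committed before any retry can happen.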
Solution
Address this by setting the retry buffer size to the maximum message size we support (MAX_MSG_SIZE, currently 128 MB), which is more than sufficient to retry any UpsertRequest.
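A minimal sketch of what such a channel configuration could look like in Python follows. It assumes the relevant channel argument is `grpc.per_rpc_retry_buffer_size` and that retries are configured through a JSON service config; the retry policy values and the helper name `build_channel_options` are illustrative and are not taken from this PR's diff.

```python
import json
from typing import Any, List, Tuple

MAX_MSG_SIZE = 128 * 1024 * 1024  # 128 MB, as stated in this PR


def build_channel_options(max_msg_size: int = MAX_MSG_SIZE) -> List[Tuple[str, Any]]:
    """Build gRPC channel options with a retry buffer as large as the max message."""
    # Illustrative retry policy: the backoff and attempt values are assumptions.
    service_config = {
        "methodConfig": [
            {
                "name": [{}],  # an empty name entry applies to all methods
                "retryPolicy": {
                    "maxAttempts": 5,
                    "initialBackoff": "0.1s",
                    "maxBackoff": "1s",
                    "backoffMultiplier": 2,
                    "retryableStatusCodes": ["UNAVAILABLE"],
                },
            }
        ]
    }
    return [
        ("grpc.max_send_message_length", max_msg_size),
        ("grpc.max_receive_message_length", max_msg_size),
        ("grpc.enable_retries", 1),
        ("grpc.service_config", json.dumps(service_config)),
        # Without this, requests above the default buffer size are
        # committed on first send and never retried.
        ("grpc.per_rpc_retry_buffer_size", max_msg_size),
    ]
```

The resulting list would be passed as the `options` argument when creating the channel, for example `grpc.secure_channel(target, credentials, options=build_channel_options())`. Tying the buffer to the same constant as the message limits keeps the two from drifting apart if the limit changes later.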
Type of Change
Test Plan
There is no existing test infrastructure to automate testing of this (we have no way to do error injection). Manually verified that the intermittent UNAVAILABLE responses seen previously are now retried correctly.