Title: GRPC requests erroring with upstream_reset_after_response_started{remote_reset} after timeout in envoy v1.31.0
Description:
Apologies if the repro steps here are a bit bare; we are still trying to find a reasonable repro, but I wanted to open an issue in case there is an obvious related discussion about this that I missed. My ask is this: is there something that went out in 1.31.0 that may have caused issues with how Envoy handles gRPC?
We recently upgraded from Envoy 1.30.3 to 1.31.0. After this release we saw the following behavior in our mesh:
- gRPC responses were erroring with `upstream_reset_after_response_started{remote_reset}` and timing out at the max timeout of 30000 ms.
- Normally these requests take <10 ms and are not very large in bandwidth.
Reverting to 1.30.3 seems to have "fixed" the issue. Our current running theory is that:
- Envoy saw headers from the upstream.
- Envoy was then waiting for trailers (hence the `-/-` status); since the status is empty, it looks like the trailers never arrived with the grpc-status code.
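One way to test the missing-trailer theory is to log the `grpc-status` trailer explicitly in Envoy's access log; `%TRAILER(X)%` and `%RESPONSE_FLAGS%` are standard access log command operators, and the operator prints `-` when the upstream never sent that trailer. A minimal sketch of such a log config on the HTTP connection manager (the format string itself is illustrative):

```yaml
# Sketch: stdout access log that records the response flags and the
# grpc-status trailer; "-" for grpc-status means the trailer was absent.
access_log:
- name: envoy.access_loggers.stdout
  typed_config:
    "@type": type.googleapis.com/envoy.extensions.access_loggers.stream.v3.StdoutAccessLog
    log_format:
      text_format_source:
        inline_string: >
          [%START_TIME%] "%REQ(:METHOD)% %REQ(:PATH)%"
          flags=%RESPONSE_FLAGS% status=%RESPONSE_CODE%
          grpc-status=%TRAILER(grpc-status)%
          duration=%DURATION%ms upstream=%UPSTREAM_HOST%
```

With this in place, affected requests should show `UR` in the flags and `grpc-status=-` if the reset really does arrive before any trailers.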
Here's the log line of a "normal" request (relevant parts):
I don't see any other reports of similar behavior. Can you provide some details on your Envoy configuration? Can you correlate the request/response on the upstream server with request/response from Envoy? That is, where is the 30 seconds being spent? A packet trace might be useful.
@zuercher unfortunately this issue has been very hard to reproduce. It only occurs after some time in one of our prod applications and I have been unable to create a test environment that does the same thing.
For now I can share a few more suspicions in case it helps with triaging:
- We suspect it may be related to some of the HTTP/2 changes that went out in 1.31.0, like the oghttp2 flag. GHSA-qc52-r4x5-9w37 looks interesting, as that flag was reverted in the recent Envoy release; I'm wondering what issues were observed there and whether they sound similar to what I've described.
- The issue only occurs when the envoy -> application connection is HTTP/2.
- Our traces show that the application responds quickly and the 30 s is spent in Envoy: the app has already responded, but Envoy waits until the timeout fires. The timeout is set in the gRPC client.
It's possible this is related. It would be useful to know if upgrading to 1.31.2 (or disabling oghttp2 in your build using the runtime flag) resolves the issue.
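For reference, disabling oghttp2 at runtime can be done with a static runtime layer in the bootstrap; this is a sketch, and the flag name below (`envoy.reloadable_features.http2_use_oghttp2`) should be verified against the release notes for your exact Envoy build:

```yaml
# Sketch: bootstrap runtime override that switches the HTTP/2 codec
# back to nghttp2 by disabling the oghttp2 reloadable feature.
layered_runtime:
  layers:
  - name: static_layer
    static_layer:
      envoy.reloadable_features.http2_use_oghttp2: false
```

If the resets stop with this override in place, that would strongly implicate the codec change rather than anything in the mesh configuration.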
One more data point: the client is grpc-java-netty/1.62.2.