
GRPC requests erroring with upstream_reset_after_response_started{remote_reset} after timeout in envoy v1.31.0 #36188

Open
shulin-sq opened this issue Sep 18, 2024 · 3 comments
Labels
area/grpc bug triage Issue requires triage

Comments

@shulin-sq
Contributor

shulin-sq commented Sep 18, 2024

Title: GRPC requests erroring with upstream_reset_after_response_started{remote_reset} after timeout in envoy v1.31.0

Description:
Apologies if the repro steps here are a bit bare; we are still trying to find a reliable repro, but I wanted to open an issue in case there is an obvious related discussion that I missed. My ask is this: did something go out in 1.31.0 that may have changed how Envoy handles gRPC?

We recently upgraded from envoy 1.30.3 to 1.31.0. After this release we saw this behavior in our mesh:

{
  "attributes": {
    "envoy_status_code": "200",
    "bytes_received": 99,
    "timing": {
      "duration": 30001,
      "request_duration": 0,
      "response_duration": 7
    },
    "bytes_sent": 0,
    "size": 2612,
    "response": {
      "response_code_details": "upstream_reset_after_response_started{remote_reset}",
      "grpc-status": "-/-", # this format is "grpc-status": "%RESP(GRPC-STATUS)%/%TRAILER(GRPC-STATUS)%"
      "response_code": 200,
      "x-envoy-upstream-service-time": "6",
      "flags": "UR"
    }
  }
}

gRPC responses were erroring with upstream_reset_after_response_started{remote_reset} and timing out at the maximum timeout of 30000 ms.

Normally these requests take <10 ms and are not bandwidth-intensive.

Reverting to 1.30.3 seems to have "fixed" the issue. Our current running theory is that:

  • Envoy saw headers from the upstream
  • Envoy was waiting for trailers (thus the status is -/-), but given that the status is empty, it looks like the trailer did not include the status code.

Here's the log line of a "normal" request (relevant parts):

{
  "attributes": {
    "envoy_status_code": "200",
    "bytes_received": 100,
    "timing": {
      "duration": 11,
      "response_tx_duration": 0,
      "request_duration": 0,
      "response_duration": 11
    },
    "bytes_sent": 80,
    "size": 2565,
    "response": {
      "response_code_details": "via_upstream",
      "grpc-status": "-/0", # this format is "grpc-status": "%RESP(GRPC-STATUS)%/%TRAILER(GRPC-STATUS)%"
      "response_code": 200,
      "x-envoy-upstream-service-time": "10",
      "flags": "-"
    }
  }
}

The client is grpc-java-netty/1.62.2.
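
For context, the grpc-status field above comes from a JSON access-log format string. Below is a minimal sketch of a file access log that would produce fields in this shape; it is illustrative only, and the logger, output path, and exact field set are assumptions rather than our real config:

access_log:
- name: envoy.access_loggers.file
  typed_config:
    "@type": type.googleapis.com/envoy.extensions.access_loggers.file.v3.FileAccessLog
    path: /dev/stdout   # illustrative output path, not our actual sink
    log_format:
      json_format:
        response_code: "%RESPONSE_CODE%"
        response_code_details: "%RESPONSE_CODE_DETAILS%"
        flags: "%RESPONSE_FLAGS%"
        duration: "%DURATION%"
        x-envoy-upstream-service-time: "%RESP(X-ENVOY-UPSTREAM-SERVICE-TIME)%"
        # a "-" on either side of the slash means the header/trailer was absent,
        # so "-/-" indicates no GRPC-STATUS in either response headers or trailers
        grpc-status: "%RESP(GRPC-STATUS)%/%TRAILER(GRPC-STATUS)%"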

shulin-sq added the bug and triage (Issue requires triage) labels Sep 18, 2024
@zuercher
Member

I don't see any other reports of similar behavior. Can you provide some details on your Envoy configuration? Can you correlate the request/response on the upstream server with the request/response seen by Envoy? That is, where is the 30 seconds being spent? A packet trace might be useful.

@shulin-sq
Contributor Author

@zuercher Unfortunately, this issue has been very hard to reproduce. It only occurs after some time in one of our prod applications, and I have been unable to create a test environment that reproduces it.

For now I can share a few more suspicions in case they help with triaging:

  • We suspect it may be related to some of the HTTP/2 changes that went out in 1.31.0, such as the oghttp2 flag. GHSA-qc52-r4x5-9w37 looks interesting, since that flag was reverted in the most recent Envoy release; I'm wondering what issues were observed there and whether they sound similar to what I've described.
  • The issue only occurs when the envoy -> application connection is HTTP/2.

Our traces show that the application responds quickly and the 30 s is spent in Envoy: the app has already responded, but Envoy waits until the timeout. The timeout is set in the gRPC client.

@zuercher
Member

It's possible this is related. It would be useful to know if upgrading to 1.31.2 (or disabling oghttp2 in your build using the runtime flag) resolves the issue.
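
For reference, a minimal sketch of what a bootstrap runtime override to disable the new codec might look like, assuming the flag in this release is envoy.reloadable_features.http2_use_oghttp2 (please verify the exact name against your Envoy version's runtime features):

layered_runtime:
  layers:
  - name: static_layer_0
    static_layer:
      # assumed flag name; confirm it exists in your build before relying on it
      envoy.reloadable_features.http2_use_oghttp2: false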
