QUIC proxy peering #47587
Conversation
Force-pushed from 7163836 to 5fcc162 (Compare)
// Sent from the server to the client as a response to a DialRequest. The
// message is likewise sent in protobuf binary format, prefixed by its length
// encoded as a little endian uint32.
nit: conventionally, data sent over the wire is big endian. gRPC performs length prefixing using big endian uint32s. Unless there is some compelling reason not to, I'd suggest sticking with that convention.
Counterpoint: it's 2024.
gRPC over HTTP/2 uses big endian for length prefixes because the HTTP/2 spec uses big endian and that's just how they happened to write the spec; protobuf itself uses little endian for any fixed-length integers, so "convention" should clearly not be a factor in any new protocol.
Alright, ignoring convention... little endian byte order is an affront to god and nature and has no place in a civilized codebase. Especially for the case of a home-brewed API that we might be called upon to debug at some point, since visually parsing little endian data is annoying/weird.
I also have a slight preference for big endian for over-the-wire data, if nothing else because I would expect it to be the case. That said, as long as it's well documented I'm OK with it.
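For illustration, a minimal Go sketch of the framing being debated, assuming nothing beyond what the quoted comment describes: one protobuf message prefixed by its length as a fixed-width uint32. It writes the prefix big endian, per the reviewers' preference; the helper name is hypothetical and not part of the PR.

```go
package framing

import (
	"encoding/binary"
	"io"

	"google.golang.org/protobuf/proto"
)

// writeFramed marshals msg and writes it prefixed by its length encoded as a
// big endian uint32 (swap in binary.LittleEndian for the byte order the
// quoted comment describes).
func writeFramed(w io.Writer, msg proto.Message) error {
	payload, err := proto.Marshal(msg)
	if err != nil {
		return err
	}
	var prefix [4]byte
	binary.BigEndian.PutUint32(prefix[:], uint32(len(payload)))
	if _, err := w.Write(prefix[:]); err != nil {
		return err
	}
	_, err = w.Write(payload)
	return err
}
```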
Quick (no pun intended) look at API and docs.
lib/proxy/peer/quic.go
Outdated
If the status is ok (signifying no error) then the stream will stay open,
carrying the data for the connection between the user and the agent, otherwise
the stream will be closed. For sanity's sake, the size of both messages is
limited and any oversized message is treated as a protocol violation.
How would it know that a message is oversized, though?
We read the size first; if the message is oversized we just close the stream.
I thought we capped at uint32, but later I saw we have a limit on top of the uint32, which then made this line make sense. Maybe we should mention that we have an arbitrary limit, so the messages can't use the full uint32 length?
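To make the check concrete, here's a sketch of the read side under the same framing assumptions, with a hypothetical `maxDialMessageSize` standing in for the PR's actual limit: the 4-byte prefix is read first, and a declared size over the cap is rejected before any of the payload is read.

```go
package framing

import (
	"encoding/binary"
	"fmt"
	"io"

	"google.golang.org/protobuf/proto"
)

// maxDialMessageSize is an arbitrary cap, deliberately far below the 4 GiB
// that a uint32 prefix could express (the value here is made up).
const maxDialMessageSize = 128 * 1024

// readFramed reads one length-prefixed protobuf message, treating an
// oversized declared length as a protocol violation.
func readFramed(r io.Reader, msg proto.Message) error {
	var prefix [4]byte
	if _, err := io.ReadFull(r, prefix[:]); err != nil {
		return err
	}
	size := binary.BigEndian.Uint32(prefix[:])
	if size > maxDialMessageSize {
		// Protocol violation: the caller is expected to close the stream.
		return fmt.Errorf("message size %d exceeds limit %d", size, maxDialMessageSize)
	}
	payload := make([]byte, size)
	if _, err := io.ReadFull(r, payload); err != nil {
		return err
	}
	return proto.Unmarshal(payload, msg)
}
```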
Server review. I think you would benefit from someone who understands QUIC (if we have any), but I did my best.
I'll wait for replies before catching up to the client, so you have time to catch up to it all.
lib/proxy/peer/quicserver.go
Outdated
	_, err := io.Copy(nodeConn, st)
	return trace.Wrap(err)
})
_ = eg.Wait()
Log the error?
I'll give it a shot; I suspect it would be pretty spammy tho.
Same for debug or trace level.
I think even at debug this is going to be spammy-ish and produce more confusion than needed. Or perhaps a `qerr.ApplicationError` with `Application error 0x0` is a benign error that shouldn't be considered an error at all?
2024-10-31T12:29:56-04:00 DEBU [PROXY:QPE] error accepting a stream pid:26284.1 remote_addr:127.0.0.1:5021 internal_id:3cce12a9-5fc5-410d-999c-b5268e80a947 error:[Application error 0x0 (remote)] quic/server.go:244
2024-10-31T12:29:56-04:00 DEBU [PROXY:QPE] done forwarding data pid:26284.1 remote_addr:127.0.0.1:5021 internal_id:3cce12a9-5fc5-410d-999c-b5268e80a947 stream_id:0 error:[
ERROR REPORT:
Original Error: *qerr.ApplicationError Application error 0x0 (remote)
Stack Trace:
github.com/gravitational/teleport/lib/proxy/peer/quic/server.go:418 github.com/gravitational/teleport/lib/proxy/peer/quic.(*Server).handleStream.func4
golang.org/x/sync@v0.8.0/errgroup/errgroup.go:78 golang.org/x/sync/errgroup.(*Group).Go.func1
runtime/asm_arm64.s:1223 runtime.goexit
User Message: Application error 0x0 (remote)] quic/server.go:421
Errors with a zero code should now be appropriately ignored in log messages; stream and connection closing before the connection is established (or successfully fails(???)) now use a nonzero error code, which is still quite nonspecific, but we can figure that out at a later time.
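A sketch of what ignoring zero-code errors might look like, assuming the quic-go `*quic.ApplicationError` type (which is what the `qerr.ApplicationError` in the log above surfaces as); the helper names are hypothetical and the PR's actual handling may differ.

```go
package quicdebug

import (
	"context"
	"errors"
	"log/slog"

	"github.com/quic-go/quic-go"
)

// isBenignClose reports whether err is an application-level close with error
// code 0, i.e. the peer shutting the connection or stream down normally.
func isBenignClose(err error) bool {
	var appErr *quic.ApplicationError
	return errors.As(err, &appErr) && appErr.ErrorCode == 0
}

// logForwardingDone logs the outcome of a forwarding goroutine, omitting the
// error field for benign closes to keep debug logs readable.
func logForwardingDone(ctx context.Context, log *slog.Logger, err error) {
	if err == nil || isBenignClose(err) {
		log.DebugContext(ctx, "done forwarding data")
		return
	}
	log.DebugContext(ctx, "done forwarding data", "error", err)
}
```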
lib/proxy/peer/quicserver.go
Outdated
// available streams during a connection (so we can set it to 0)
st, err := c.AcceptStream(context.Background())
if err != nil {
	log.DebugContext(c.Context(), "Got an error accepting a stream.", "error", err)
As a general comment, I'd rather this function returned an error and the caller made the choice to swallow it.
The only possible error is caused by the connection getting closed - which is also why the log line is at debug level. I'm not convinced that moving the error logging one layer above will do much for the clarity of the code, at least while this is the only exit point for the function.
Moving the logging up makes this behave like a regular erroring function, which is already a valuable readability improvement IMO.
The logging happens before the defers; logging after returning would mean that the log line is related to an error that happened potentially much earlier, seeing as by then we're waiting for the per-connection waitgroup to end.
That is not typically an issue - a function errors, runs defers, returns the error and then the error gets logged by the caller. I won't push further but I do think a regular erroring func tends to be simpler to follow than the unusual over-swallowing of errors.
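As a readability illustration, a sketch (hypothetical names, quic-go `Connection` interface as used in the quoted snippet) of the shape being suggested: the accept loop returns its terminal error and the caller owns the decision to log it at debug level and swallow it.

```go
package quicdebug

import (
	"context"
	"log/slog"

	"github.com/quic-go/quic-go"
)

// acceptStreams accepts streams until the connection is closed and returns
// the error that ended the loop.
func acceptStreams(conn quic.Connection, handle func(quic.Stream)) error {
	for {
		st, err := conn.AcceptStream(context.Background())
		if err != nil {
			return err
		}
		go handle(st)
	}
}

// serveConn is the caller: it logs the accept error at debug level and moves
// on, since the loop only ever ends because the connection was closed.
func serveConn(log *slog.Logger, conn quic.Connection, handle func(quic.Stream)) {
	err := acceptStreams(conn, handle)
	log.DebugContext(conn.Context(), "Stopped accepting streams.", "error", err)
}
```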
syntax = "proto3";

package teleport.quicpeering.v1alpha;
Maybe:
- package teleport.quicpeering.v1alpha;
+ package teleport.quicpeering.v1alpha1;
Bold of you to assume there will be only one alpha... ;)
`v1alpha` seems to edge out (slightly) `v1alpha1` in terms of Google search result count - I wouldn't be opposed to `v1alpha2` as a followup to `v1alpha`, either.
I guess it really depends on whether we expect more alphas. If yes, then go v1alpha1. Otherwise we can do v1alpha to v1alpha2 like you said.
Force-pushed from 5fcc162 to a6678e0 (Compare)
Force-pushed from a6678e0 to 5fad0d5 (Compare)
Thanks for the added documentation/context. Very helpful!
@codingllama after shuffling some types and functions around, QUIC proxy peering is now in its own package
@espadolini - this PR will require admin approval to merge due to its size. Consider breaking it up into a series of smaller changes.
Ran through a few tests on this branch again and noticed one last log spam nit. I was seeing the following in the proxy logs at the conclusion of every tsh ssh session:
2024-11-07T11:56:47-05:00 DEBU [PROXY:QPE] done forwarding data pid:18690.1 remote_addr:127.0.0.1:4021 internal_id:401212a6-d467-487d-8424-06bbe1bdf215 stream_id:0 error:[
ERROR REPORT:
Original Error: *trace.ConnectionProblemError use of closed network connection
Stack Trace:
github.com/gravitational/teleport/api@v0.0.0/utils/sshutils/chconn.go:141 github.com/gravitational/teleport/api/utils/sshutils.(*ChConn).Read
github.com/gravitational/teleport/lib/reversetunnel/conn_metric.go:80 github.com/gravitational/teleport/lib/reversetunnel.(*metricConn).Read
io/io.go:429 io.copyBuffer
io/io.go:388 io.Copy
github.com/gravitational/teleport/lib/proxy/peer/quic/server.go:389 github.com/gravitational/teleport/lib/proxy/peer/quic.(*Server).handleStream.func3
golang.org/x/sync@v0.8.0/errgroup/errgroup.go:78 golang.org/x/sync/errgroup.(*Group).Go.func1
runtime/asm_arm64.s:1223 runtime.goexit
User Message: use of closed network connection] quic/server.go:415
@codingllama @fspmarshall friendly ping
Apologies for the delay. Here's a pass on the "easy" parts, I'm still giving the server/client parts a proper look.
I probably said this before, but a non-trivial 2.5k-line, 45-commit PR is pretty difficult to hold in one's head and reason about. This would benefit from being split into human-friendly sized PRs.
Force-pushed from 54d37e0 to 23901af (Compare)
@codingllama I've split off the code reorganization parts in #48836, but I don't really see a sensible way to split off the client and server parts of the QUIC implementation without it being somewhat meaningless. I've rebased the commits tho; they should be far more manageable now.
Force-pushed from 99caa90 to 2d1239f (Compare)
Force-pushed from 5cc26a9 to 9422a05 (Compare)
Force-pushed from 2d1239f to 5a736b3 (Compare)
Force-pushed from 5a736b3 to 12138b1 (Compare)
lib/proxy/peer/quic/quic_test.go
Outdated
require.NoError(t, conn.Close())
t.Log("closed")
<-pipeClose
Wait with a timeout so it fails cleanly on a delay?
Would it be a clean failure, or a timeout that doesn't respect the test timeout as specified by the harness?
The global test timeout is usually too high, so I think it's worth adding an explicit 1s-2s here in case it locks for some reason. (As of now obviously this works, so it would have little effect.)
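A sketch of the explicit wait being suggested: a small helper (hypothetical, not part of the PR) that fails the test after a short timeout instead of hanging until the harness gives up.

```go
package quictest

import (
	"testing"
	"time"
)

// waitOrFail waits for ch to yield or close within timeout, failing the test
// cleanly instead of blocking until the global test timeout.
func waitOrFail[T any](t *testing.T, ch <-chan T, timeout time.Duration) {
	t.Helper()
	select {
	case <-ch:
	case <-time.After(timeout):
		t.Fatal("timed out waiting for channel")
	}
}
```

At the call site this would replace the bare `<-pipeClose` with something like `waitOrFail(t, pipeClose, 2*time.Second)`.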
	return
}

var eg errgroup.Group
Set the group limit to 2, to be explicit/safe.
Should we use the group context to cancel in-flight goroutines, or is that redundant? This probably deserves a brief comment.
We call Go twice in succession and then immediately call Wait; we don't move or reuse the errgroup, and there's no errgroup context. I'm not entirely sure what the comment should be.
Fair enough, but do consider setting the limit.
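A sketch of the quoted pattern with the suggested explicit limit, where `st` and `nodeConn` stand in for the stream and upstream connection from the quoted code; the teardown (closing each end to unblock the opposite copy) is simplified relative to whatever the PR actually does.

```go
package quicdebug

import (
	"io"
	"net"

	"github.com/quic-go/quic-go"
	"golang.org/x/sync/errgroup"
)

// forwardStream copies data in both directions between the QUIC stream and
// the upstream connection using exactly two goroutines.
func forwardStream(st quic.Stream, nodeConn net.Conn) error {
	var eg errgroup.Group
	// Only the two copies below ever run; the explicit limit documents and
	// enforces that.
	eg.SetLimit(2)
	eg.Go(func() error {
		// When the stream side ends, close the upstream connection so the
		// opposite copy unblocks.
		defer nodeConn.Close()
		_, err := io.Copy(nodeConn, st)
		return err
	})
	eg.Go(func() error {
		// Closing the stream signals EOF to the peer once this direction is done.
		defer st.Close()
		_, err := io.Copy(st, nodeConn)
		return err
	})
	return eg.Wait()
}
```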
@espadolini See the table below for backport results.
This PR adds experimental support for a new QUIC-based transport for proxy peering. Support is enabled by setting the `TELEPORT_UNSTABLE_QUIC_PROXY_PEERING` envvar to `yes`; proxies that opt into the experimental feature will advertise their support by heartbeating with the `teleport.internal/proxy-peer-quic` label set to `yes`, and will exclusively use the QUIC transport to connect through proxies that carry the same label.

Proxies using the QUIC transport for proxy peering expect to be able to bind a UDP socket on the `peer_listen_addr` address, to send UDP packets to other QUIC-enabled proxies on their `peer_public_addr`, and to receive packets sent to their own `peer_public_addr`. Enabling or disabling the QUIC transport for an existing proxy (in the host ID sense) is unsupported and will lead to very confusing behavior, and the same can be said for restarting a proxy using the QUIC transport; the feature is strictly for environments where new proxies are rolled out (like a Kubernetes deployment).

The protocol and mode of operation are currently described in the package doc in `lib/proxy/peer/quic/quic.go`.