Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

QUIC proxy peering #47587

Merged
merged 9 commits into from
Nov 14, 2024
Merged

QUIC proxy peering #47587

merged 9 commits into from
Nov 14, 2024

Conversation

espadolini
Copy link
Contributor

@espadolini espadolini commented Oct 15, 2024

This PR adds experimental support for a new QUIC-based transport for proxy peering. Support is enabled by setting the TELEPORT_UNSTABLE_QUIC_PROXY_PEERING envvar to yes and proxies that opt into the experimental feature will advertise their support by heartbeating with the teleport.internal/proxy-peer-quic label set to yes, and will exclusively use the QUIC transport to connect through proxies that carry the same label.

Proxies using the QUIC transport for proxy peering expect to be able to bind a UDP socket on the peer_listen_addr address and expect to be able to send UDP packets to other QUIC-enabled proxies on their peer_public_addr and to receive packets sent to their own peer_public_addr. Enabling or disabling the QUIC transport for an existing proxy (in the host ID sense) is unsupported and will lead to very confusing behavior, and the same can be said for restarting a proxy using the QUIC transport; the feature is strictly for environments where new proxies are rolled out (like a Kubernetes deployment).

The protocol and mode of operation is currently described in the package doc in lib/proxy/peer/quic/quic.go.

@espadolini espadolini added the no-changelog Indicates that a PR does not require a changelog entry label Oct 15, 2024
lib/proxy/peer/quicserver.go Outdated Show resolved Hide resolved
lib/proxy/peer/quicclient.go Outdated Show resolved Hide resolved
@espadolini espadolini force-pushed the espadolini/quic-proxy-peering branch from 7163836 to 5fcc162 Compare October 16, 2024 16:24

// Sent from the server to the client as a response to a DialRequest. The
// message is likewise sent in protobuf binary format, prefixed by its length
// encoded as a little endian uint32.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: conventionally data sent over the wire is big endian. GRPC performs length prefixing using big endian uint32s. unless there is some compelling reason to not to, I'd suggest sticking with that convention.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Counterpoint: it's 2024.

gRPC over HTTP/2 uses big endian for length prefixes because the HTTP/2 spec uses big endian and that's just how they happened to write the spec; protobuf itself uses little endian for any fixed-length integers, so "convention" should clearly not be a factor in any new protocol.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alright, ignoring convention... little endian byte order is an affront to god and nature and has no place in a civilized codebase. Especially for the case of a home-brewed API that we might be called upon to debug at some point, since visually parsing little endian data is annoying/weird.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also have a slight preference for big endian for over-the-wire data, if nothing else because I would expect it to be the case. That said, as long as it's well documented I'm OK with it.

Copy link
Contributor

@codingllama codingllama left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Quick (no pun intended) look at API and docs.

proto/teleport/quicpeering/v1alpha/dial.proto Outdated Show resolved Hide resolved
proto/teleport/quicpeering/v1alpha/dial.proto Outdated Show resolved Hide resolved
proto/teleport/quicpeering/v1alpha/dial.proto Outdated Show resolved Hide resolved
proto/teleport/quicpeering/v1alpha/dial.proto Show resolved Hide resolved
proto/teleport/quicpeering/v1alpha/dial.proto Show resolved Hide resolved
If the status is ok (signifying no error) then the stream will stay open,
carrying the data for the connection between the user and the agent, otherwise
the stream will be closed. For sanity's sake, the size of both messages is
limited and any oversized message is treated as a protocol violation.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How would it know that a message is oversized, though?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We read the size first, if the message is oversized we just close the stream.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought we capped at uint32, but later I saw we have a limit on top of the uint32, which then made this line make sense. Maybe we should mention that we have an arbitrary limit, so the messages can't use the full uint32 length?

lib/proxy/peer/quic.go Outdated Show resolved Hide resolved
lib/proxy/peer/quic.go Outdated Show resolved Hide resolved
lib/proxy/peer/quicclient.go Outdated Show resolved Hide resolved
lib/service/servicecfg/proxy.go Outdated Show resolved Hide resolved
Copy link
Contributor

@codingllama codingllama left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Server review. I think you would benefit from someone who understands QUIC (if we have any), but I did my best.

I'll wait for replies before catching up to the client, so you have time to catch up to it all.

lib/proxy/peer/quicserver.go Outdated Show resolved Hide resolved
lib/proxy/peer/quicserver.go Outdated Show resolved Hide resolved
lib/proxy/peer/quicserver.go Outdated Show resolved Hide resolved
lib/proxy/peer/quicserver.go Outdated Show resolved Hide resolved
lib/proxy/peer/quicserver.go Outdated Show resolved Hide resolved
lib/proxy/peer/quicserver.go Outdated Show resolved Hide resolved
lib/proxy/peer/quicserver.go Outdated Show resolved Hide resolved
_, err := io.Copy(nodeConn, st)
return trace.Wrap(err)
})
_ = eg.Wait()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Log the error?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll give it a shot, I suspect it would be pretty spammy tho.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same for debug or trace level.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think even at debug this is going to be spammmyish and produce more confusion than needed. Or perhaps a qerr.ApplicationError Application error 0x0 is a benign error that shouldn't be considered an error at all?

2024-10-31T12:29:56-04:00 DEBU [PROXY:QPE] error accepting a stream pid:26284.1 remote_addr:127.0.0.1:5021 internal_id:3cce12a9-5fc5-410d-999c-b5268e80a947 error:[Application error 0x0 (remote)] quic/server.go:244
2024-10-31T12:29:56-04:00 DEBU [PROXY:QPE] done forwarding data pid:26284.1 remote_addr:127.0.0.1:5021 internal_id:3cce12a9-5fc5-410d-999c-b5268e80a947 stream_id:0 error:[
ERROR REPORT:
Original Error: *qerr.ApplicationError Application error 0x0 (remote)
Stack Trace:
	github.com/gravitational/teleport/lib/proxy/peer/quic/server.go:418 github.com/gravitational/teleport/lib/proxy/peer/quic.(*Server).handleStream.func4
	golang.org/x/sync@v0.8.0/errgroup/errgroup.go:78 golang.org/x/sync/errgroup.(*Group).Go.func1
	runtime/asm_arm64.s:1223 runtime.goexit
User Message: Application error 0x0 (remote)] quic/server.go:421```

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Errors with a zero code should now be appropriately ignored in log messages; stream and connection closing before the connection is established (or successfully fails(???)) now use a nonzero error code which is still quite aspecific but we can figure that out at a later time.

// available streams during a connection (so we can set it to 0)
st, err := c.AcceptStream(context.Background())
if err != nil {
log.DebugContext(c.Context(), "Got an error accepting a stream.", "error", err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As a general comment, I'd rather this function returned an error and the caller made the choice to swallow it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only possible error is caused by the connection getting closed - which is also why the log line is at debug level. I'm not convinced that moving the error logging one layer above will do much for the clarity of the code, at least while this is the only exit point for the function.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moving the logging up makes this behave like a regular erroring function, which is already a valuable readability improvement IMO.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The logging happens before the defers, logging after returning would mean that the log line is related to an error that happened potentially much earlier, seeing as now we're waiting for the per-connection waitgroup to end.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is not typically an issue - a function errors, runs defers, returns the error and then the error gets logged by the caller. I won't push further but I do think a regular erroring func tends to be simpler to follow than the unusual over-swallowing of errors.


syntax = "proto3";

package teleport.quicpeering.v1alpha;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe:

Suggested change
package teleport.quicpeering.v1alpha;
package teleport.quicpeering.v1alpha1;

Bold of you to assume there will be only one alpha... ;)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

v1alpha seems to edge out (slightly) v1alpha1 in terms of google search result count - I wouldn't be opposed to v1alpha2 as a followup to v1alpha, either.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess it really depends on whether we expect more alphas. If yes, then go v1alpha1. Otherwise we can do v1alpha to v1alpha2 like you said.

@espadolini espadolini force-pushed the espadolini/quic-proxy-peering branch from 5fcc162 to a6678e0 Compare October 16, 2024 23:45
@espadolini espadolini changed the base branch from master to espadolini/proxypeer-slog October 16, 2024 23:45
Base automatically changed from espadolini/proxypeer-slog to master October 17, 2024 13:14
@espadolini espadolini force-pushed the espadolini/quic-proxy-peering branch from a6678e0 to 5fad0d5 Compare October 17, 2024 14:16
Copy link
Contributor

@codingllama codingllama left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the added documentation/context. Very helpful!

proto/teleport/quicpeering/v1alpha/dial.proto Outdated Show resolved Hide resolved
lib/service/service.go Outdated Show resolved Hide resolved
@espadolini espadolini marked this pull request as ready for review October 29, 2024 17:03
@espadolini
Copy link
Contributor Author

@codingllama after shuffling some types and functions around, QUIC proxy peering is now in its own package

@public-teleport-github-review-bot

@espadolini - this PR will require admin approval to merge due to its size. Consider breaking it up into a series smaller changes.

lib/proxy/peer/internal/tls.go Outdated Show resolved Hide resolved
lib/proxy/peer/quic/client.go Show resolved Hide resolved
lib/proxy/peer/quic/client.go Outdated Show resolved Hide resolved
lib/proxy/peer/quic/client.go Show resolved Hide resolved
lib/proxy/peer/quic/client.go Show resolved Hide resolved
lib/proxy/peer/quic/client.go Outdated Show resolved Hide resolved
lib/proxy/peer/quic/client.go Outdated Show resolved Hide resolved
lib/proxy/peer/quic/client.go Outdated Show resolved Hide resolved
lib/proxy/peer/quic/server.go Outdated Show resolved Hide resolved
lib/proxy/peer/quic/server.go Show resolved Hide resolved
Copy link
Contributor

@rosstimothy rosstimothy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ran through a few tests on this branch again and noticed one last log spam nit. I was seeing the following in the proxy logs at the conclusion of every tsh ssh session:

2024-11-07T11:56:47-05:00 DEBU [PROXY:QPE] done forwarding data pid:18690.1 remote_addr:127.0.0.1:4021 internal_id:401212a6-d467-487d-8424-06bbe1bdf215 stream_id:0 error:[
ERROR REPORT:
Original Error: *trace.ConnectionProblemError use of closed network connection
Stack Trace:
	github.com/gravitational/teleport/api@v0.0.0/utils/sshutils/chconn.go:141 github.com/gravitational/teleport/api/utils/sshutils.(*ChConn).Read
	github.com/gravitational/teleport/lib/reversetunnel/conn_metric.go:80 github.com/gravitational/teleport/lib/reversetunnel.(*metricConn).Read
	io/io.go:429 io.copyBuffer
	io/io.go:388 io.Copy
	github.com/gravitational/teleport/lib/proxy/peer/quic/server.go:389 github.com/gravitational/teleport/lib/proxy/peer/quic.(*Server).handleStream.func3
	golang.org/x/sync@v0.8.0/errgroup/errgroup.go:78 golang.org/x/sync/errgroup.(*Group).Go.func1
	runtime/asm_arm64.s:1223 runtime.goexit
User Message: use of closed network connection] quic/server.go:415

@espadolini
Copy link
Contributor Author

@codingllama @fspmarshall friendly ping

Copy link
Contributor

@codingllama codingllama left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apologies for the delay. Here's a pass on the "easy" parts, I'm still giving the server/client parts a proper look.

I probably said this before, but a non-trivial 2.5k lines, 45-commits PR is pretty difficult to hold in ones head/reason about. This would benefit from being split into human-friendly sized PRs.

lib/reversetunnelclient/peer.go Outdated Show resolved Hide resolved
lib/proxy/peer/internal/clientconn.go Outdated Show resolved Hide resolved
lib/proxy/peer/internal/clientconn.go Outdated Show resolved Hide resolved
lib/proxy/peer/internal/tls.go Outdated Show resolved Hide resolved
lib/proxy/peer/internal/tls.go Outdated Show resolved Hide resolved
lib/proxy/peer/client.go Outdated Show resolved Hide resolved
lib/proxy/peer/client.go Outdated Show resolved Hide resolved
lib/proxy/peer/quic/quic.go Show resolved Hide resolved
lib/proxy/peer/quic/quic.go Show resolved Hide resolved
lib/proxy/peer/quic/quic.go Outdated Show resolved Hide resolved
@espadolini espadolini changed the base branch from master to espadolini/quic-proxy-peering-preparation November 12, 2024 19:09
@espadolini
Copy link
Contributor Author

@codingllama I've split off the code reorganization parts in #48836, but I don't really see a sensible way to split off the client and server parts of the QUIC implementation without it being somewhat meaningless. I've rebased the commits tho, they should be far more manageable now.

@espadolini espadolini force-pushed the espadolini/quic-proxy-peering branch from 99caa90 to 2d1239f Compare November 12, 2024 19:27
@espadolini espadolini changed the base branch from espadolini/quic-proxy-peering-preparation to espadolini/proxy-peering-ping November 12, 2024 19:31
@espadolini espadolini force-pushed the espadolini/proxy-peering-ping branch from 5cc26a9 to 9422a05 Compare November 13, 2024 16:42
Base automatically changed from espadolini/proxy-peering-ping to master November 13, 2024 18:04
@espadolini espadolini force-pushed the espadolini/quic-proxy-peering branch from 2d1239f to 5a736b3 Compare November 14, 2024 09:09
@espadolini espadolini force-pushed the espadolini/quic-proxy-peering branch from 5a736b3 to 12138b1 Compare November 14, 2024 09:25
lib/proxy/peer/quic/client.go Show resolved Hide resolved
lib/proxy/peer/quic/client.go Show resolved Hide resolved
lib/proxy/peer/quic/client.go Show resolved Hide resolved
lib/proxy/peer/quic/client.go Outdated Show resolved Hide resolved
lib/proxy/peer/quic/client.go Show resolved Hide resolved
lib/proxy/peer/quic/quic_test.go Outdated Show resolved Hide resolved
lib/proxy/peer/quic/quic_test.go Outdated Show resolved Hide resolved

require.NoError(t, conn.Close())
t.Log("closed")
<-pipeClose
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wait with a timeout so it fails cleanly on a delay?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be a clean failure, or a timeout that doesn't respect the test timeout as specified by the harness?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The global test timeout is usually too high, so I think it's worth adding an explicit 1s-2s here in case it locks for some reason. (As of now obviously this works, so it would have little effect.)

lib/proxy/peer/quic/quic_test.go Outdated Show resolved Hide resolved
lib/proxy/peer/quic/quic_test.go Outdated Show resolved Hide resolved
return
}

var eg errgroup.Group
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Set the group limit to 2, to be explicit/safe.

Should we use the group context to cancel in-flight goroutines, or is that redundant? This probably deserves a brief comment.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We call Go twice in succession and then immediately call Wait, we don't move or reuse the errgroup, and there's no errgroup context, I'm not entirely sure what the comment should be.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair enough, but do consider setting the limit.

lib/proxy/peer/quic/server.go Outdated Show resolved Hide resolved
@espadolini espadolini enabled auto-merge November 14, 2024 18:33
@espadolini espadolini added this pull request to the merge queue Nov 14, 2024
Merged via the queue into master with commit bb5e64b Nov 14, 2024
41 checks passed
@espadolini espadolini deleted the espadolini/quic-proxy-peering branch November 14, 2024 18:59
@public-teleport-github-review-bot

@espadolini See the table below for backport results.

Branch Result
branch/v17 Create PR

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport/branch/v17 no-changelog Indicates that a PR does not require a changelog entry size/xl
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants