[RFC] eBPF offload consideration #360

Draft · levaitamas wants to merge 1 commit into master from server-ebpf-offload

Conversation

levaitamas
Contributor

Hi,

@rg0now and I have been investigating how to boost pion/turn performance with eBPF. As a first step, we implemented an eBPF/XDP offload for UDP channel bindings, which lets pion/turn offload ChannelData processing to the kernel. Below we present the implementation details and early results, and open a discussion on whether eBPF offload should be considered for pion/turn.

Implementation details

How does it work?

The XDP offload handles ChannelData messages only. The userspace TURN server remains responsible for everything else, from establishing channel bindings to handling requests. The offload is activated after a successful channel binding, in the method Allocation.AddChannelBind: the userspace TURN server sends peer and client info (5-tuples and channel id) to the XDP program via an eBPF map. From that point on, the XDP program can detect channel data coming from the peer or from the client. When a channel binding is removed, the corresponding entries are deleted from the eBPF maps, so that channel is no longer offloaded.

Changes to pion/turn

New: We introduce a new internal offload package that manages the offload mechanisms. Currently there are two implementations: XDPOffload, which uses XDP, and NullOffload, for testing purposes.
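
To make the structure more concrete, here is a minimal sketch of what the shared offload abstraction could look like; only the package name and the XDPOffload/NullOffload names come from the actual code, the interface and its method names are hypothetical:

```go
// Hypothetical sketch of the internal offload package structure. Only the
// package name and the XDPOffload/NullOffload type names come from the
// description above; the interface and its methods are assumptions.
package offload

import "net"

// Engine abstracts an offload mechanism so the TURN server can switch
// between a kernel offload and a no-op implementation.
type Engine interface {
	// Upsert installs a client<->peer forwarding entry for a channel binding.
	Upsert(client, peer net.UDPAddr, channelID uint32) error
	// Remove deletes the forwarding entry when the binding goes away.
	Remove(client, peer net.UDPAddr, channelID uint32) error
	// Close tears down the offload engine.
	Close() error
}

// XDPOffload would implement Engine using an XDP program and eBPF maps;
// NullOffload would implement it as a no-op for testing.
```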

Changed: The kernel offload complicates lifecycle management, since the eBPF/XDP offload outlives TURN server objects. This calls for new public methods in package turn to manage the offload engine's lifetime: InitOffload starts the offload engine (e.g., loads the XDP program and creates the eBPF maps) and ShutdownOffload removes it. Note that these methods should be called by the application, as shown in the server_test.go benchmark.

Once everything is set up, channel-binding offload management happens in Allocation.AddChannelBind and Allocation.DeleteChannelBind, with no change in their usage.
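
A rough application-side lifecycle sketch, assuming the InitOffload/ShutdownOffload names proposed above (their exact signatures, and the rest of the server configuration, are illustrative only):

```go
// Hypothetical application-side lifecycle sketch. InitOffload and
// ShutdownOffload are the methods proposed in this RFC; their exact
// signatures are assumptions, as is the server configuration below.
package main

import (
	"log"
	"net"

	"github.com/pion/turn/v3"
)

func main() {
	// Start the offload engine before creating TURN servers; it outlives them.
	if err := turn.InitOffload(); err != nil {
		log.Fatalf("failed to init offload: %v", err)
	}
	defer turn.ShutdownOffload()

	conn, err := net.ListenPacket("udp4", "0.0.0.0:3478")
	if err != nil {
		log.Fatal(err)
	}

	s, err := turn.NewServer(turn.ServerConfig{
		Realm: "pion.ly",
		AuthHandler: func(username, realm string, srcAddr net.Addr) ([]byte, bool) {
			return turn.GenerateAuthKey(username, realm, "password"), true
		},
		PacketConnConfigs: []turn.PacketConnConfig{{
			PacketConn: conn,
			RelayAddressGenerator: &turn.RelayAddressGeneratorStatic{
				RelayAddress: net.ParseIP("127.0.0.1"),
				Address:      "0.0.0.0",
			},
		}},
	})
	if err != nil {
		log.Fatal(err)
	}
	defer s.Close()

	// Channel bindings created from here on are offloaded automatically in
	// Allocation.AddChannelBind; serve until shutdown.
	select {}
}
```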

eBPF/XDP details

The XDP part consists of a program that describes the packet processing logic to be executed when the network interface receives a packet. The XDP program uses eBPF maps to communicate with the userspace TURN server.
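
For illustration, this is roughly how a userspace loader based on cilium/ebpf attaches an XDP program and gets hold of its maps; the object file, program, and interface names below are placeholders, not the actual names used in this PR (which generates its bindings with bpf2go):

```go
// Minimal userspace loader sketch using cilium/ebpf. The object file,
// program name and interface name are placeholders.
package main

import (
	"log"
	"net"

	"github.com/cilium/ebpf"
	"github.com/cilium/ebpf/link"
)

func main() {
	// Load a compiled eBPF ELF (placeholder name) containing the XDP program
	// and its maps.
	spec, err := ebpf.LoadCollectionSpec("turn_offload.o")
	if err != nil {
		log.Fatal(err)
	}
	coll, err := ebpf.NewCollection(spec)
	if err != nil {
		log.Fatal(err)
	}
	defer coll.Close()

	// Attach the XDP program to a network interface (placeholder: eth0).
	iface, err := net.InterfaceByName("eth0")
	if err != nil {
		log.Fatal(err)
	}
	l, err := link.AttachXDP(link.XDPOptions{
		Program:   coll.Programs["xdp_prog"], // placeholder program name
		Interface: iface.Index,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer l.Close()

	// The TURN server can now talk to the program through coll.Maps[...],
	// e.g. inserting channel-binding entries after AddChannelBind.
}
```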

Maps: The XDP offload uses the following maps to keep track of connections, store statistics, and to aid traffic redirects between interfaces:

| name | key | value | function |
|------|-----|-------|----------|
| turn_server_downstream_map | peer 5-tuple | client 5-tuple + channel-id | match peer -> client traffic |
| turn_server_upstream_map | client 5-tuple + channel-id | peer 5-tuple | match client -> peer traffic |
| turn_server_stats_map | 5-tuple + channel-id | stats (#pkts, #bytes) | traffic statistics per connection (5-tuple and channel-id) |
| turn_server_interface_ip_addresses_map | interface index | IPv4 address | interface IP addresses for redirects |
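
For reference, the key/value layouts above could be mirrored in Go roughly as follows; the field names, widths, and ordering are assumptions, the authoritative definitions live in the C side of the XDP program:

```go
// Hypothetical Go mirror of the map key/value layouts listed above.
// Field names, widths and ordering are assumptions.
package offload

// FiveTuple identifies a UDP connection (IPv4 only, matching the offload).
type FiveTuple struct {
	SrcIP    uint32 // IPv4 address, network byte order
	DstIP    uint32 // IPv4 address, network byte order
	SrcPort  uint16
	DstPort  uint16
	Protocol uint8
	_        [3]byte // padding to match a C struct layout
}

// ChannelTarget is the value in turn_server_downstream_map and the key in
// turn_server_upstream_map: a client 5-tuple plus the channel id.
type ChannelTarget struct {
	Tuple     FiveTuple
	ChannelID uint32
}

// Stats is the per-connection counter stored in turn_server_stats_map.
type Stats struct {
	Packets uint64
	Bytes   uint64
}
```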

XDP Program: The XDP program receives all packets as they arrive at the network interface. It filters IPv4/UDP packets (caveat: VLAN and other tunneling options are not supported) and checks whether a packet belongs to any channel binding (i.e., it checks the 5-tuple and channel-id). If there is a match, the program does the ChannelData handling: it updates the 5-tuple, adds or removes the ChannelData header, keeps track of statistics, and finally redirects the packet to the corresponding network interface. All other, non-ChannelData packets are passed to the network stack for further processing (e.g., channel refresh messages and other STUN/TURN traffic go to the userspace TURN server).
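
The ChannelData framing that the program adds or strips is the standard 4-byte TURN header (channel number plus length, RFC 5766 §11.4); a small illustrative helper, not part of the PR:

```go
// The 4-byte TURN ChannelData header that the XDP program prepends on the
// peer->client path and strips on the client->peer path (RFC 5766 §11.4).
// This helper is illustrative only.
package main

import "encoding/binary"

// prependChannelData frames a UDP payload as a ChannelData message:
// 2 bytes channel number, 2 bytes payload length, then the payload.
func prependChannelData(channelID uint16, payload []byte) []byte {
	out := make([]byte, 4+len(payload))
	binary.BigEndian.PutUint16(out[0:2], channelID)
	binary.BigEndian.PutUint16(out[2:4], uint16(len(payload)))
	copy(out[4:], payload)
	return out
}
```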

Results

CPU profiling

Early results are promising. CPU profiling with the benchmark (#298) shows that server.ReadLoop(), which ran for 47.9 s without the offload, runs for only 0.96 s with the XDP offload.

Flame graph without the offload: [No_offload flame graph image]

Flame graph with XDP offload: [XDP_offload flame graph image]

Microbenchmark with simple-server

Measurements with iperf, turncat (our in-house TURN proxy), and the simple-server example show an outstanding (150x!) delay reduction and a significant (6x) bandwidth boost.

Measurement setup

Delay results

| delay [ms] | simple | multi | xdp |
|------------|--------|-------|-----|
| avg | 3.944 | 4.311 | 0.033 |
| min | 3.760 | 0.473 | 0.023 |
| median | 3.914 | 4.571 | 0.027 |
| max | 4.184 | 5.419 | 0.074 |

Bandwidth results

Note
iperf stalls at ~220 kpps; we assume 1+ Mpps would be achievable with a more powerful load generator.

| [pps] | simple | multi | xdp |
|-------|--------|-------|-----|
| avg | 36493 | 96152 | 227378 |
| min | 35241 | 91856 | 222567 |
| median | 36617 | 96843 | 227783 |
| max | 37545 | 99455 | 233559 |

Discussion

  • XDP offload is straightforward for UDP connections, but cumbersome for TCP and TLS. Fortunately, the eBPF ecosystem provides other options: tc and sockmap are potential alternatives with a reasonable complexity-performance trade-off.
    • Yet we need to coordinate the different offload mechanisms for the different connection types.
    • In addition, offload mechanisms introduce a new lifecycle management concern: these mechanisms outlive TURN server objects.
  • The eBPF objects need to be built and distributed, which makes the build process more complex.
    • New dependency: cilium/ebpf.
    • The build process gets more complex: eBPF objects are built via go generate; how should this be integrated with the current build process (e.g., add a Makefile)?
  • Monitoring is not trivial due to the lifetime of XDP objects and because in XDP connections are identified by 5-tuples, so we lose the notion of 'listeners'.
    • Therefore, the current monitoring implementation is rudimentary. The bytes and packets sent via a 5-tuple are stored in a statistics eBPF map. We update the counters in the statistics map, but we never delete from it. There is no interface exposed for querying statistics (one can use bpftool to dump the map content); see the sketch after this list.
  • XDP limitations: bpf_redirect(), which handles packet redirects in eBPF/XDP, only supports redirects to NIC egress queues. This prevents supporting scenarios where clients exchange traffic in a server-local 'loop'.
    • We disabled the XDP offload for host-local redirects. We also had some weird issues with forwarding traffic from NICs using the native xdp driver to NICs using the xdpgeneric driver (except for the lo interface).
    • A packet size limit is set in the XDP program to prevent fragmentation; currently the limit is 1480 bytes.
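
Regarding the monitoring item above: the statistics currently live only in the eBPF map, so a future query interface would have to read it from Go. A sketch of what that could look like with cilium/ebpf (the pin path and the Go key/value types are assumptions mirroring the map description above):

```go
// Hypothetical sketch of querying turn_server_stats_map from Go using
// cilium/ebpf. The pin path and the key/value Go types are assumptions;
// today the map can only be inspected externally (e.g., with bpftool).
package main

import (
	"fmt"
	"log"

	"github.com/cilium/ebpf"
)

type statsKey struct {
	SrcIP, DstIP     uint32
	SrcPort, DstPort uint16
	Protocol         uint8
	_                [3]byte
	ChannelID        uint32
}

type statsValue struct {
	Packets uint64
	Bytes   uint64
}

func main() {
	// Open the statistics map, assuming it is pinned under bpffs
	// (placeholder path).
	m, err := ebpf.LoadPinnedMap("/sys/fs/bpf/turn_server_stats_map", nil)
	if err != nil {
		log.Fatal(err)
	}
	defer m.Close()

	var k statsKey
	var v statsValue
	it := m.Iterate()
	for it.Next(&k, &v) {
		fmt.Printf("channel %d: %d pkts, %d bytes\n", k.ChannelID, v.Packets, v.Bytes)
	}
	if err := it.Err(); err != nil {
		log.Fatal(err)
	}
}
```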

Member

Does this need to be in-tree?

Contributor Author

Good point! These files are generated by bpf2go during go generate. The objects can be reused, but generating them requires kernel headers, a C compiler, etc. These tools might not be available to most users, and I'm also not sure whether a single go get github.com/pion/turn/v3 would trigger go generate. Bundling the generated objects is a common solution for this; e.g., the cilium/ebpf examples at https://github.com/cilium/ebpf/tree/main/examples contain the eBPF objects too.
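
For reference, a typical bpf2go setup looks roughly like this; the file and identifier names are placeholders, not necessarily those used in this branch:

```go
// Typical bpf2go usage (illustrative; file and identifier names are
// placeholders). The directive compiles the C source and emits Go bindings
// plus the .o objects, which is what would need to be committed or bundled.
//go:generate go run github.com/cilium/ebpf/cmd/bpf2go xdp ./xdp/turn_offload.c
package offload
```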

Member

I think this is very useful for attackers: an attacker could create a PR that updates the .o file with a maliciously built version alongside non-harmful code changes.

@Sean-Der
Member

This is pretty magical! Great work :)

I am in support of adding this; I think people will find this useful.

@Sean-Der
Member

I added you to the repo @levaitamas

Unfortunately, these days I don't have much bandwidth to get involved, but I would love to support you wherever I can. If you want me to add other developers so you can work together, I'm happy to do that.

@levaitamas
Contributor Author

Thanks @Sean-Der! I appreciate being added to the repo, since that will definitely make it easier to support the eBPF offload once it gets integrated. I would also recommend adding @rg0now. He has a great understanding of the pion ecosystem and has already made impactful contributions (e.g., multi-threaded UDP support).

@Sean-Der
Member

Done! That was a major oversight that @rg0now wasn’t in already :(

@rg0now
Contributor

rg0now commented Nov 17, 2023

CC @stv0g

@stv0g
Member

stv0g commented Nov 17, 2023

Great work! I am also in support of getting this in 👍🏻

levaitamas force-pushed the server-ebpf-offload branch 2 times, most recently from 4ac0f62 to 5304857 on November 24, 2023, 12:33
levaitamas force-pushed the server-ebpf-offload branch 2 times, most recently from b59fd8f to 07bfae9 on December 1, 2023, 14:26
levaitamas force-pushed the server-ebpf-offload branch 3 times, most recently from 120127b to 4b776f2 on June 17, 2024, 21:19
@BertoldVdb

BertoldVdb commented Jun 27, 2024

I was going to make a PR that adds a user-configurable callback at these locations to allow configuring external network accelerators, but I see you already did it. Thanks!

https://github.com/l7mp/turn/blob/4b776f2d67b2256552f8298f450b4b0640b17183/internal/allocation/allocation.go#L128
and
https://github.com/l7mp/turn/blob/4b776f2d67b2256552f8298f450b4b0640b17183/internal/allocation/allocation.go#L171

levaitamas force-pushed the server-ebpf-offload branch 2 times, most recently from 4b776f2 to c5bdc92 on July 5, 2024, 15:18