Evaluate Profile-Guided Optimization (PGO) and LLVM BOLT #287
zamazan4ik started this conversation in Ideas
Hi!
Recently I checked Profile-Guided Optimization (PGO) improvements on multiple projects. The results are here. According to the tests (not only mine), PGO can help achieve better performance. That's why I think trying to optimize Rathole with PGO could be a good idea. However, I didn't expect huge improvements, since most of the code is IO-bound and relies on the OS's internal network stack. So I did some benchmarks on Rathole and want to share my results here.
Test environment

- `rustc 1.72.0 (5680fa18f 2023-08-23)`
- Rathole at commit `d2fe586f7b3caceda542ef5030be0257d6d1401c` in the `main` branch

Also, here is my `iperf3 -v`:

Benchmark method
I use the methodology from https://github.com/rapiz1/rathole/blob/main/docs/benchmark.md with the `iperf3` variant. Measurements are done for TCP mode. Turbo Boost is disabled. All binaries (`rathole` and `iperf3`) are pinned to different cores via `taskset`, like `taskset -c 0 rathole_release rathole_server.toml` and `taskset -c 3 iperf3 -c 127.0.0.1 -p 5202 -t 60` - this is done to reduce CPU-scheduler noise.

Release `rathole` is built with `cargo build --release`. Release + PGO is built with `cargo-pgo` (see the link at the end): `cargo pgo build` + collect profiles on the benchmark + `cargo pgo optimize build`.

As a training set, I used the same benchmark load with `iperf3`. I built two PGO `rathole` versions: client-optimized and server-optimized. For each version, I collected the corresponding workload (for the server, I ran the instrumented Rathole on the server side; for the client, on the client side, respectively). I didn't test merging the profiles into one but expect the results would be the same.

Results
The results are presented in the `iperf3` format (partially cut). All measurements are done on the same hardware/software, with the same background noise (as far as I can guarantee, of course).

Server mode
Rathole Release in the server mode (4 measurements):
Rathole Release + PGO-optimized in the server mode:
I've rechecked the results above multiple times (running the binaries in a different order, at different times, etc.) - the PGO-optimized binary is consistently faster than the usual Release build. In both modes, the `rathole` server was capped at 100% CPU on one core.

Rathole in the client mode
Well, here is another story. I didn't find a way to make Rathole in client mode CPU-bound in my setup. It could probably be done by downclocking one CPU core or with some cgroup magic - I just didn't spend much time on it. Instead, I measured the CPU consumption of the different binaries under the same benchmark as above.
According to my multiple runs, the Release + PGO-optimized build consumes 0.5-1.5% less CPU time than the usual Release build.
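One simple way to reproduce this CPU-consumption comparison (a sketch, not necessarily the exact method used above; the binary and config names `rathole_release`, `rathole_pgo`, and `rathole_client.toml` are placeholders) is to read the accumulated user/system CPU time of the client process after the benchmark run:

```shell
# Run one client build under the benchmark load and report its CPU time.
taskset -c 0 ./rathole_release rathole_client.toml &
RATHOLE_PID=$!
taskset -c 3 iperf3 -c 127.0.0.1 -p 5202 -t 60
# user and system CPU time accumulated by the rathole client so far:
ps -o utime=,stime= -p "$RATHOLE_PID"
kill "$RATHOLE_PID"
# ...then repeat with ./rathole_pgo and compare the two readings.
```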
Possible future steps
I can suggest the following action points:
Maybe testing Post-Link Optimization techniques (like LLVM BOLT) would be interesting too, but I recommend starting with regular PGO.
For Rust projects, I suggest performing the PGO optimization with cargo-pgo.
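For completeness, a typical sampling-based LLVM BOLT flow (separate from the instrumented PGO above) looks roughly like this. This is a sketch under assumptions: it requires `llvm-bolt`/`perf2bolt` and Linux `perf` with LBR support, the flags are the commonly documented BOLT options (exact names vary between BOLT versions), and the binary paths are placeholders:

```shell
# 1. link with relocations so BOLT can rewrite the binary afterwards
RUSTFLAGS="-C link-args=-Wl,--emit-relocs" cargo build --release

# 2. record a profile while the training workload (e.g. the iperf3
#    benchmark) runs against the binary, then stop it
perf record -e cycles:u -j any,u -o perf.data -- \
    ./target/release/rathole rathole_server.toml

# 3. convert the perf profile and produce the BOLT-optimized binary
perf2bolt -p perf.data -o rathole.fdata ./target/release/rathole
llvm-bolt ./target/release/rathole -o rathole.bolt \
    -data=rathole.fdata -reorder-blocks=ext-tsp -reorder-functions=hfsort \
    -split-functions -split-all-cold
```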