Skip to content

Latest commit

 

History

History
370 lines (284 loc) · 16.5 KB

BENCHMARKING.md

File metadata and controls

370 lines (284 loc) · 16.5 KB

Benchmarking Wild

Benchmarking against other linkers

If you decide to benchmark Wild against other linkers, in order to make it a fair comparison, you should ensure that the other linkers aren't doing work on something that Wild doesn't support. In particular:

  • Wild defaults to --gc-sections, so for a fair comparison, that should be passed to all the linkers.
  • Wild defaults to -z now, so best to pass that to all linkers.
  • Wild doesn't support linker plugins and various other flags. Generally use of these flags will result in a warning.

How to benchmark

Preparing the "run-with" files

For benchmarking the linker, it's preferable to run just the linker, not the whole build process.

The way to do that is by capturing the linker invocation so that it can be rerun. Wild has a built-in way to do that.

You can benchmark linking of either a debug or a release build of a crate, this depends on what comparisons you wish to make, or what change in wild you want to quantify.

Follow-these steps:

  • Chose the crate that you wish to use in your benchmark, clone it, cd into it's root directory and make sure it builds with cargo build (for a rust project)
  • Clean the build using cargo clean
  • To force the build of your chosen crate to link using wild, we have a couple of options:
    • Prefix the cargo build command with RUSTFLAGS="-Clinker=clang -Clink-arg=--ld-path=wild"
    • Modify (or add) the .cargo/config.toml file in your chosen crate (example for ripgrep)
  [target.x86_64-unknown-linux-gnu]
linker = "/usr/bin/clang"

rustflags = [
    "-C", "link-arg=--ld-path=wild"
]
  • Make sure that you have a version of wild in your $PATH so that it will be used (try which wild to check)
  • Run WILD_SAVE_BASE=/tmp/wild/ripgrep cargo build in the crate's root directory (include RUSTFLAGS as above if you have chosen that method)
  • You will get a few numbered subdirectories in /tmp/wild/ripgrep as part of the build process.
    • Directories will be created for builds of build scripts, proc macros and crate binaries built
    • Usually the last numbered subdirectory will be the build of crate's binary (if a single binary is built)
    • You can check what each file is linking using tail -n 1 /tmp/wild/ripgrep/*/run-with
    • In the case of ripgrep it is '6'
  • You can then run /tmp/wild/ripgrep/6/run-with wild and that will rerun the link with wild

When you run run-with wild, the linker may print warnings for unsupported flags. It's a good idea to edit the run-with script to change / delete these flags. This will make comparison with other linkers more fair, since some of these unsupported flags may involve other linkers doing significant amounts of extra work.

Run benchmark with hyperfine

Let's benchmark the linking stage between ld, mold and wild, discarding the first two runs of each to reduce the effects of cache warmup

hyperfine --warmup 2 '/tmp/wild/ripgrep/6/run-with ld' '/tmp/wild/ripgrep/6/run-with mold' '/tmp/wild/ripgrep/6/run-with wild'

That should produce output similar to this (with different values):

Benchmark 1: /tmp/wild/ripgrep/6/run-with ld
  Time (mean ± σ):     954.1 ms ±  13.6 ms    [User: 683.4 ms, System: 268.8 ms]
  Range (min … max):   920.6 ms … 970.7 ms    10 runs
 
Benchmark 2: /tmp/wild/ripgrep/6/run-with mold
  Time (mean ± σ):     146.1 ms ±   3.6 ms    [User: 52.0 ms, System: 2.4 ms]
  Range (min … max):   139.1 ms … 154.7 ms    19 runs
 
Benchmark 3: /tmp/wild/ripgrep/6/run-with wild
  Time (mean ± σ):      87.7 ms ±   2.8 ms    [User: 2.4 ms, System: 2.0 ms]
  Range (min … max):    81.5 ms …  92.5 ms    34 runs
 
Summary
  /tmp/wild/ripgrep/6/run-with wild ran
    1.67 ± 0.07 times faster than /tmp/wild/ripgrep/6/run-with mold
   10.88 ± 0.38 times faster than /tmp/wild/ripgrep/6/run-with ld

Run benchmark with poop

An alternative tool to hyperfine, that reports some additional metrics is poop.

Like hyperfine it takes a number of commands and runs each a number of times and gathers statistics about each tune.

poop '/tmp/wild/ripgrep/6/run-with ld' '/tmp/wild/ripgrep/6/run-with mold' '/tmp/wild/ripgrep/6/run-with wild'

It should produce output similar to this (with different numbers!):

Benchmark 1 (5 runs): /tmp/wild/ripgrep/6/run-with ld
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          1.18s  ±  335ms     926ms … 1.68s           0 ( 0%)        0%
  peak_rss            288MB ±  276KB     287MB …  288MB          1 (20%)        0%
  cpu_cycles         2.51G  ±  341M     2.28G  … 3.06G           0 ( 0%)        0%
  instructions       3.93G  ± 9.54K     3.93G  … 3.93G           0 ( 0%)        0%
  cache_references   98.7M  ± 2.59M     96.4M  …  102M           0 ( 0%)        0%
  cache_misses       41.9M  ± 2.52M     40.3M  … 46.3M           0 ( 0%)        0%
  branch_misses      9.77M  ±  223K     9.62M  … 10.2M           0 ( 0%)        0%

Benchmark 2 (31 runs): /tmp/wild/ripgrep/6/run-with mold
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           165ms ± 27.2ms     149ms …  280ms          2 ( 6%)        ⚡- 86.0% ±  9.9%
  peak_rss           7.84MB ± 96.3KB    7.60MB … 8.00MB         11 (35%)        ⚡- 97.3% ±  0.0%
  cpu_cycles         2.01G  ± 38.6M     1.97G  … 2.16G           2 ( 6%)        ⚡- 19.9% ±  4.8%
  instructions       1.99G  ± 3.12M     1.98G  … 1.99G           3 (10%)        ⚡- 49.3% ±  0.1%
  cache_references   44.8M  ±  250K     44.4M  … 45.6M           1 ( 3%)        ⚡- 54.6% ±  0.9%
  cache_misses       21.6M  ±  461K     21.3M  … 23.6M           3 (10%)        ⚡- 48.4% ±  2.3%
  branch_misses      7.17M  ± 37.7K     7.07M  … 7.25M           1 ( 3%)        ⚡- 26.6% ±  0.8%

Benchmark 3 (56 runs): /tmp/wild/ripgrep/6/run-with wild
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          89.1ms ± 3.14ms    83.0ms … 96.6ms          0 ( 0%)        ⚡- 92.4% ±  7.0%
  peak_rss           3.82MB ± 50.7KB    3.80MB … 3.93MB         10 (18%)        ⚡- 98.7% ±  0.0%
  cpu_cycles         1.26G  ± 15.1M     1.21G  … 1.31G           7 (13%)        ⚡- 49.6% ±  3.4%
  instructions       1.21G  ±  529K     1.21G  … 1.22G           5 ( 9%)        ⚡- 69.1% ±  0.0%
  cache_references   33.9M  ±  467K     32.9M  … 34.9M           0 ( 0%)        ⚡- 65.7% ±  0.8%
  cache_misses       14.4M  ±  187K     14.1M  … 14.9M           0 ( 0%)        ⚡- 65.6% ±  1.5%
  branch_misses      3.49M  ± 7.86K     3.47M  … 3.51M           0 ( 0%)        ⚡- 64.2% ±  0.6%

NOTE: Both mold and wild fork a child process and perform linking in it. Thus, the values for peak_rss, User and System corresponds to the parent process only, and hence are not representative of real use by the linker.

NOTE: poop uses the first command as the reference the others are compared against, so if focusing on wild, you might want to re-order the commands and invoke poop thus:

poop '/tmp/wild/ripgrep/6/run-with wild' '/tmp/wild/ripgrep/6/run-with mold' '/tmp/wild/ripgrep/6/run-with ld'

Comparisons

Using this method, you can benchmark:

  • between Wild and one or more other linkers
  • between different options passed to Wild - You can pass arbitrary additional arguments to run-with. The first argument needs to be the name of the linker to use. All additional arguments are passed through to the linker as-is

Caching

The use of the linux file system cache affects linker performance, as there is a lot of reasonably large files read and written. In a normal build, the object files being linked would be written previously by the compiler and may well be in the file cache. With this benchmarking method we skip the previous build steps and the linker incurs the penalty of reading those files into cache the first time they are read.

To reduce the effect this has on benchmarked time we run hyperfine with the --warmup 2 option, and the results of the first two runs are not used in the calculations.

Disk write bottlenecks

When benchmarking, if the output file is being written to persistent storage (hard disk or SSD), the writes can build up and cause the linkers to block. Worse, writes from a previous linker invocation might contribute to this backlog. Whether this happens depends on how much RAM you have free and also your kernel settings. For example, if you run cat /proc/sys/vm/dirty_ratio that will show the percentage of reclaimable memory that is allowed to be dirty (needing writing) before further writes will block. If that shows zero, then cat /proc/sys/vm/dirty_bytes will show the same, but as an absolute number of bytes. On some systems, the the absolute dirty byte limit might be set as low as 256MiB, meaning that if we're writing a large output file, we can easily hit this limit. You could increase this limit, or switch to using dirty_ratio of say 20% instead, but it might be better to just take the filesystem out of the equation and write the output to a tmpfs instead. See next section.

Tmpfs

As discussed in the last section, writing to a physical disk can cause inconsistent benchmark results. It can also contribute wearing out your SSD. For these reasons, it's recommended to benchmark with the output file on tmpfs.

If you don't already have a suitable tmpfs to use, you can create one something like the following:

sudo mkdir /benchmark
sudo mount -t tmpfs none /benchmark

Then when running the benchmark, set the output file to be on this filesystem. e.g.:

OUT=/benchmark/out hyperfine --warmup 2 '/tmp/wild/ripgrep/6/run-with ld' '/tmp/wild/ripgrep/6/run-with mold' '/tmp/wild/ripgrep/6/run-with wild'

Watch out for thermal throttling

If your CPUs get hot while running the benchmark, this can cause inconsistent results. You can check for throttle events by looking for increases in /sys/devices/system/cpu/cpu*/thermal_throttle/package_throttle_count and /sys/devices/system/cpu/cpu*/thermal_throttle/core_throttle_count between when you start the benchmark and when you finish. Ideally, these should be unchanged.

One thing that can help is if you have a way to turn your fans to maximum before you start the benchmark run.

Another possibility is to give the CPUs a chance to cool down between each run, e.g. by sleeping. With hyperfine, you can do this by adding an argument like --prepare "sleep 2". You might need to experiment with the duration of the sleep.

What to benchmark

rustc

When building rustc, most of the rustc code goes into a shared object called rustc-driver. This shared object is about 230 MiB without debug info and 462 MiB with debug info. While not as large as some binaries, this is still a pretty reasonable size, making it good for benchmarking. It's also an interesting benchmark because it's a shared object rather than an executable.

Build rustc as per the instructions on the rustc-dev-guide. Before building, edit or create config.toml in your rust directory to contain:

[rust]
use-lld = "self-contained"

Once you've successfully built rustc, build it again, but using wild as the linker. In the following command, replace $HOME/work/wild with the path to the directory containing the wild repo. You'll need to have already built wild with cargo build --release.

touch compiler/rustc_driver/src/lib.rs
WILD_SAVE_BASE=$HOME/tmp/rustc-link PATH=$HOME/work/wild/fakes:$PATH ./x build --keep-stage 1

You should now have a few subdirectories under $HOME/tmp/rustc-link. You can identify which one is rustc_driver by looking at the last line of the run-with script in each directory.

If the directory $HOME/tmp/rustc-link didn't get created, then most likely wild wasn't used to link. You can check what linker was used with readelf. e.g.:

readelf -p .comment build/x86_64-unknown-linux-gnu/stage1/bin/rustc

String dump of section '.comment':
  [     0]  GCC: (Ubuntu 12.3.0-1ubuntu1~22.04) 12.3.0
  [    2c]  rustc version 1.83.0-beta.1 (0125edf41 2024-10-14)
  [    5f]  Linker: Wild version 0.3.0

You can also verify the PATH setting, by running something like the following:

PATH=$HOME/work/wild/fakes:$PATH ld.lld --version
Wild version 0.3.0 (compatible with GNU linkers)

Other tools

  • poop - gives a lot of measurements other than just time. Note that the peak_rss measurement won't be accurate for wild and mold unless you include the --no-fork argument to the linker.

Profiling

--time

To figure out where wild is spending time, the first option is to run with --time. It's recommended to combine this with --no-fork. For example:

~/tmp/rustc-link/0/run-with target/release/wild --strip-debug --time --no-fork
┌───    3.84 Open input files
├───    7.45 Split archives
├───    9.59 Parse input files
│ ┌───    2.91 Parse version script
│ ├───   16.67 Read symbols
│ ├───   15.21 Populate symbol map
├─┴─   37.68 Build symbol DB
│ ┌───   29.02 Resolve symbols
│ ├───   33.59 Resolve sections
│ ├───    2.20 Assign section IDs
│ ├───   15.39 Merge strings
│ ├───    0.04 Canonicalise undefined symbols
│ ├───    4.63 Resolve alternative symbol definitions
├─┴─   84.97 Symbol resolution
│ ┌───   76.63 Find required sections
│ ├───    0.16 Merge dynamic symbol definitions
│ ├───   18.74 Finalise per-object sizes
│ ├───    0.12 Apply non-addressable indexes
│ ├───    0.06 Compute total section sizes
│ ├───    0.01 Compute segment layouts
│ ├───    0.00 Compute per-alignment offsets
│ ├───    0.14 Compute per-group start offsets
│ ├───    0.00 Compute merged string section start addresses
│ ├───   18.10 Assign symbol addresses
│ ├───    0.30 Update dynamic symbol resolutions
├─┴─  114.85 Layout
│ ┌───    0.00 Wait for output file creation
│ │ ┌───    0.63 Split output buffers by group
│ ├─┴─  157.42 Write data to file
│ ├───   15.05 Sort .eh_frame_hdr
├─┴─  172.71 Write output file
│ ┌───   14.45 Unmap output file
│ ├───    7.27 Drop layout
│ ├───    0.01 Drop symbol DB
│ ├───   23.35 Drop input data
├─┴─   45.15 Shutdown
└─  481.09 Link

Samply

To look for hot functions and to check how the work distribution looks between threads, you can use samply.

For this to be useful, you likely want optimisations and debug info. We have an opt-debug profile set up for this purpose.

cargo build --profile opt-debug
~/tmp/rustc-link/0/run-with samply record target/opt-debug/wild --strip-debug

The result will look something like this. This is using the Firefox profiler, so you'll need to open that link in Firefox.

One thing you'll likely notice when looking at the flamegraph is that there's lots of rayon stuff and that makes it hard to see what's going on. The issue is that rayon uses recursion and the exact sequence of calls it goes through before it gets to our code varies. The trick to seeing through this is to collapse that recursion. For example, find rayon::iter::plumbing::bridge_producer_consumer::helper, right click and select Collapse recursion (or 'r'). If there's any extra rayon stack frames that you'd like to ignore, you can select them and press 'm' to merge them.

Heap profiling with dhat

Build with profiling enabled:

cargo build --profile opt-debug --features dhat

Then run the linker on some input. e.g:

~/tmp/rustc-link/0/run-with target/opt-debug/wild --no-fork

This should print some stats on exit. e.g.:

dhat: Total:     250,699,127 bytes in 130,224 blocks
dhat: At t-gmax: 111,265,627 bytes in 14,117 blocks
dhat: At t-end:  96,320 bytes in 109 blocks
dhat: The data has been saved to dhat-heap.json, and is viewable with dhat/dh_view.html

You can then upload dhat-heap.json to the online dhat viewer.

For more details, see the dhat docs.