Enable Link-Time Optimization (LTO) #1

Open
zamazan4ik opened this issue Oct 28, 2024 · 2 comments

Comments

@zamazan4ik

Hi!

I noticed that Link-Time Optimization (LTO) is not enabled in the project's Cargo.toml. I suggest switching it on, since it usually reduces the binary size (always a good thing to have) and will likely improve the application's performance a bit.

I suggest enabling LTO only for Release builds, so as not to hurt the developer experience while working on the project: LTO adds extra time to the compilation. If you think a regular Release build should not be affected by such a change either, then I suggest adding a separate dist or release-lto profile that enables LTO on top of the regular release optimizations. Such a profile makes life easier for maintainers and anyone else who wants to build the most performant version of the application. Using ThinLTO should also help reduce the build-time overhead of LTO. E.g., check the cargo-outdated Release profile.

Basically, it can be enabled with the following lines:

[profile.release]
lto = true

I ran a quick test (Fedora 40) by adding lto = true to the Release profile. The binary size dropped from 9 MiB to 8 MiB. You may also be interested in tweaking other options, such as codegen-units.
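For illustration, here is a minimal sketch of the ThinLTO variant mentioned above, combined with a codegen-units tweak. The specific values are assumptions, not settings measured for this project:

```toml
[profile.release]
# ThinLTO: typically cheaper to build than lto = true ("fat" LTO),
# usually at the cost of somewhat smaller size and speed gains.
lto = "thin"
# Assumption: a single codegen unit can give the optimizer more room,
# but it reduces build parallelism and lengthens compile times.
codegen-units = 1
```

Whether either setting pays off depends on the workload, so it is worth comparing build time and binary size before and after.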

Thank you.

@vladkens
Owner

vladkens commented Oct 29, 2024

Hi, @zamazan4ik. Thank you for raising this.

I also found an interesting article on the topic: https://tech.dreamleaves.org/trimming-down-a-rust-binary-in-half/

And I did some tests here:

Build fresh and with cache

lto = true

> docker build --no-cache -t ogp .
 => [builder 2/4] RUN /scripts/build cook 160.9s
 => [builder 4/4] RUN /scripts/build final ogp 59.7s
# = 220s

> docker build -t ogp .
 => CACHED [builder 2/4] RUN /scripts/build cook 0.0s
 => [builder 4/4] RUN /scripts/build final ogp 61.2s

-rwxr-xr-x    1 root     root        6.4M Oct 29 19:17 ogp

lto = false

> docker build --no-cache -t ogp .
 => [builder 2/4] RUN /scripts/build cook 224.9s
 => [builder 4/4] RUN /scripts/build final ogp 10.5s
# = 234s

> docker build -t ogp .
 => CACHED [builder 2/4] RUN /scripts/build cook 0.0s
 => [builder 4/4] RUN /scripts/build final ogp 11.0s

ls -lah | grep ogp
-rwxr-xr-x    1 root     root        7.4M Oct 29 17:59 ogp

Bench Docker in QEMU (Apple ARM)

  1. static healthcheck endpoint
  2. SVG generation from a template
  3. PNG rendering (with an extra request to load the image from the same server)

lto = true

> wrk -t4 -c500 -d30s 'http://localhost:8080/health'
  4 threads and 500 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     4.24ms    2.34ms  56.16ms   94.80%
    Req/Sec    14.78k     3.15k   21.86k    62.17%
  1765793 requests in 30.04s, 207.13MB read
  Socket errors: connect 253, read 102, write 0, timeout 0
Requests/sec:  58789.28
Transfer/sec:      6.90MB

> wrk -t4 -c500 -d30s 'http://localhost:8080/v0/svg?title=&author=&photo=http://localhost:8080/assets/favicon.svg&url=&theme=default'
  4 threads and 500 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     5.80ms    2.23ms  60.54ms   80.84%
    Req/Sec    10.69k     1.76k   14.47k    63.50%
  1277076 requests in 30.03s, 1.30GB read
  Socket errors: connect 253, read 104, write 0, timeout 0
Requests/sec:  42527.60
Transfer/sec:     44.29MB

> wrk -t4 -c500 -d30s 'http://localhost:8080/v0/png?title=&author=&photo=http://localhost:8080/assets/favicon.svg&url=&theme=default'
  4 threads and 500 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     1.25s   228.45ms   1.98s    75.75%
    Req/Sec    48.37     20.66   128.00     71.38%
  5790 requests in 30.09s, 515.68MB read
  Socket errors: connect 253, read 154, write 0, timeout 0
Requests/sec:    192.43
Transfer/sec:     17.14MB

lto = false

> wrk -t4 -c500 -d30s 'http://localhost:8080/health'
  4 threads and 500 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     4.36ms    2.60ms 124.67ms   95.54%
    Req/Sec    14.40k     2.08k   19.10k    74.25%
  1720145 requests in 30.03s, 201.78MB read
  Socket errors: connect 253, read 113, write 0, timeout 0
Requests/sec:  57287.87
Transfer/sec:      6.72MB

> wrk -t4 -c500 -d30s 'http://localhost:8080/v0/svg?title=&author=&photo=http://localhost:8080/assets/favicon.svg&url=&theme=default'
  4 threads and 500 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     6.09ms    2.47ms  59.52ms   82.47%
    Req/Sec    10.19k     1.62k   14.74k    69.67%
  1217635 requests in 30.03s, 1.24GB read
  Socket errors: connect 253, read 80, write 0, timeout 0
Requests/sec:  40552.17
Transfer/sec:     42.23MB

> wrk -t4 -c500 -d30s 'http://localhost:8080/v0/png?title=&author=&photo=http://localhost:8080/assets/favicon.svg&url=&theme=default'
  4 threads and 500 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     1.25s   240.14ms   1.98s    75.39%
    Req/Sec    48.22     21.55   120.00     67.28%
  5779 requests in 30.09s, 514.70MB read
  Socket errors: connect 253, read 98, write 0, timeout 1
Requests/sec:    192.04
Transfer/sec:     17.10MB

Bench Docker on Linux

lto = true

wrk -t4 -c500 -d30s 'http://localhost:8080/health'
Requests/sec:  42151.69
Transfer/sec:      4.94MB

wrk -t4 -c500 -d30s 'http://localhost:8080/v0/svg?title=&author=&photo=http://localhost:8080/assets/favicon.svg&url=&theme=default'
Requests/sec:  14038.25
Transfer/sec:     14.62MB

wrk -t4 -c500 -d30s 'http://localhost:8080/v0/png?title=&author=&photo=http://localhost:8080/assets/favicon.svg&url=&theme=default'
Requests/sec:    108.16
Transfer/sec:      9.63MB

lto = false

wrk -t4 -c500 -d30s 'http://localhost:8080/health'
Requests/sec:  46748.78
Transfer/sec:      5.48MB

wrk -t4 -c500 -d30s 'http://localhost:8080/v0/svg?title=&author=&photo=http://localhost:8080/assets/favicon.svg&url=&theme=default'
Requests/sec:  12905.99
Transfer/sec:     13.44MB

wrk -t4 -c500 -d30s 'http://localhost:8080/v0/png?title=&author=&photo=http://localhost:8080/assets/favicon.svg&url=&theme=default'
Requests/sec:    106.59
Transfer/sec:      9.49MB

Fin

The speed of the health (ping) request is about the same, SVG generation is ~5% faster with lto = true (about 9% in the Linux run), and PNG rendering takes effectively the same time. The final build step with LTO takes about 6x longer (roughly 60 s vs. 11 s).

Given that, enabling LTO by default in the supplied Dockerfile doesn't seem worthwhile for this program.

Btw, I added a CARGO_PROFILE_RELEASE_LTO build arg to the Dockerfile; it can be used like this:

docker build -t ogp --build-arg CARGO_PROFILE_RELEASE_LTO=true .

@zamazan4ik
Author

Thanks a lot for the tests!

> The speed of the health (ping) request is about the same, SVG generation is ~5% faster with lto = true (about 9% in the Linux run), and PNG rendering takes effectively the same time. The final build step with LTO takes about 6x longer (roughly 60 s vs. 11 s). Given that, enabling LTO by default in the supplied Dockerfile doesn't seem worthwhile for this program.

Yeah, if that build-time overhead matters, I don't think it makes much sense to enable LTO in the default Release profile. As a possible mitigation, we could create a dedicated heavy-release profile with LTO enabled. Users who are willing to trade more build time for a slightly faster binary could then simply pick that profile instead of enabling LTO manually. In the future, other heavy optimizations could go into the same profile.
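A minimal sketch of what such an opt-in profile could look like in Cargo.toml. The name heavy-release is the one floated above; it is hypothetical and does not exist in the repository:

```toml
# Hypothetical opt-in profile; the default release profile stays unchanged.
[profile.heavy-release]
inherits = "release"  # reuse the regular release settings as the base
lto = true            # full ("fat") LTO, as in the tests above
```

It would be selected with cargo build --profile heavy-release, and Cargo would place the resulting binary under target/heavy-release/ rather than target/release/.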

Anyway, your way with CARGO_PROFILE_RELEASE_LTO is also a viable option if it's visible to users.
