Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

zstd: Less branching in asm version of adjustOffset #897

Closed
wants to merge 1 commit into from

Conversation

greatroar
Copy link
Contributor

@greatroar greatroar commented Dec 10, 2023

We can do most of the work with conditional moves and we can handle the case offset != 0 with a single comparison against 2.

Also, shorten a conditional move in adjustOffsetInMemory from a MOVQ+CMOVQEQ to an ADCQ.

Benchmark results, Ivy Bridge:

                                       │     old      │                cmov              │
                                       │     B/s      │     B/s       vs base               │
Decoder_DecodeAll/kppkn.gtb.zst-8        447.7Mi ± 2%   445.0Mi ± 0%       ~ (p=0.355 n=20)
Decoder_DecodeAll/geo.protodata.zst-8    1.147Gi ± 0%   1.152Gi ± 1%  +0.39% (p=0.012 n=20)
Decoder_DecodeAll/plrabn12.txt.zst-8     354.0Mi ± 1%   351.6Mi ± 1%       ~ (p=0.076 n=20)
Decoder_DecodeAll/lcet10.txt.zst-8       424.9Mi ± 1%   420.2Mi ± 0%  -1.10% (p=0.000 n=20)
Decoder_DecodeAll/asyoulik.txt.zst-8     351.4Mi ± 1%   350.5Mi ± 0%       ~ (p=1.000 n=20)
Decoder_DecodeAll/alice29.txt.zst-8      350.5Mi ± 1%   349.5Mi ± 1%       ~ (p=0.525 n=20)
Decoder_DecodeAll/html_x_4.zst-8         1.437Gi ± 1%   1.440Gi ± 0%       ~ (p=0.607 n=20)
Decoder_DecodeAll/paper-100k.pdf.zst-8   4.193Gi ± 1%   4.219Gi ± 1%  +0.60% (p=0.007 n=20)
Decoder_DecodeAll/fireworks.jpeg.zst-8   8.878Gi ± 0%   8.798Gi ± 0%  -0.89% (p=0.000 n=20)
Decoder_DecodeAll/urls.10K.zst-8         596.6Mi ± 1%   596.0Mi ± 1%       ~ (p=0.337 n=20)
Decoder_DecodeAll/html.zst-8             933.2Mi ± 0%   936.1Mi ± 1%       ~ (p=0.659 n=20)
Decoder_DecodeAll/comp-data.bin.zst-8    394.0Mi ± 1%   390.9Mi ± 1%       ~ (p=0.149 n=20)
geomean                                  839.6Mi        837.4Mi       -0.26%

Tiger Lake (laptop):

                                        │     old      │                 cmov                 │
                                        │     B/s      │      B/s       vs base               │
Decoder_DecodeAll/kppkn.gtb.zst-16        408.2Mi ± 1%    416.9Mi ± 2%  +2.13% (p=0.000 n=20)
Decoder_DecodeAll/geo.protodata.zst-16    1.291Gi ± 1%    1.306Gi ± 2%       ~ (p=0.277 n=20)
Decoder_DecodeAll/plrabn12.txt.zst-16     332.9Mi ± 2%    339.3Mi ± 2%  +1.95% (p=0.030 n=20)
Decoder_DecodeAll/lcet10.txt.zst-16       394.1Mi ± 1%    398.6Mi ± 1%  +1.16% (p=0.006 n=20)
Decoder_DecodeAll/asyoulik.txt.zst-16     348.2Mi ± 1%    350.0Mi ± 0%       ~ (p=0.101 n=20)
Decoder_DecodeAll/alice29.txt.zst-16      320.1Mi ± 2%    319.5Mi ± 1%       ~ (p=0.265 n=20)
Decoder_DecodeAll/html_x_4.zst-16         2.109Gi ± 1%    2.138Gi ± 2%       ~ (p=0.068 n=20)
Decoder_DecodeAll/paper-100k.pdf.zst-16   5.258Gi ± 1%    5.291Gi ± 1%       ~ (p=0.121 n=20)
Decoder_DecodeAll/fireworks.jpeg.zst-16   12.40Gi ± 1%    12.54Gi ± 1%  +1.08% (p=0.004 n=20)
Decoder_DecodeAll/urls.10K.zst-16         553.6Mi ± 1%    563.5Mi ± 2%  +1.80% (p=0.000 n=20)
Decoder_DecodeAll/html.zst-16             987.2Mi ± 1%   1004.3Mi ± 2%       ~ (p=0.068 n=20)
Decoder_DecodeAll/comp-data.bin.zst-16    492.3Mi ± 2%    496.1Mi ± 1%       ~ (p=0.417 n=20)
geomean                                   907.9Mi         918.5Mi       +1.17%

We can do most of the work with conditional moves and we can handle the
case offset != 0 with a single comparison against 2. The check for
temp == 0 can be done using a conditional move, but that gives a small
slowdown.

Also, shorten a conditional move in adjustOffsetInMemory from a
MOVQ+CMOVQEQ to an ADCQ.

Benchmark results, Ivy Bridge:

	                                       │     old      │                cmov                 │
	                                       │     B/s      │     B/s       vs base               │
	Decoder_DecodeAll/kppkn.gtb.zst-8        441.4Mi ± 2%   442.8Mi ± 1%  +0.32% (p=0.011 n=10)
	Decoder_DecodeAll/geo.protodata.zst-8    1.148Gi ± 1%   1.154Gi ± 1%  +0.49% (p=0.002 n=10)
	Decoder_DecodeAll/plrabn12.txt.zst-8     347.9Mi ± 0%   351.9Mi ± 0%  +1.15% (p=0.000 n=10)
	Decoder_DecodeAll/lcet10.txt.zst-8       417.4Mi ± 0%   420.2Mi ± 0%  +0.68% (p=0.000 n=10)
	Decoder_DecodeAll/asyoulik.txt.zst-8     347.1Mi ± 0%   350.4Mi ± 0%  +0.97% (p=0.000 n=10)
	Decoder_DecodeAll/alice29.txt.zst-8      346.3Mi ± 1%   347.3Mi ± 0%  +0.30% (p=0.023 n=10)
	Decoder_DecodeAll/html_x_4.zst-8         1.440Gi ± 0%   1.443Gi ± 1%       ~ (p=0.165 n=10)
	Decoder_DecodeAll/paper-100k.pdf.zst-8   4.191Gi ± 0%   4.197Gi ± 0%       ~ (p=0.739 n=10)
	Decoder_DecodeAll/fireworks.jpeg.zst-8   8.891Gi ± 0%   8.797Gi ± 0%  -1.05% (p=0.000 n=10)
	Decoder_DecodeAll/urls.10K.zst-8         589.6Mi ± 0%   596.1Mi ± 1%  +1.10% (p=0.015 n=10)
	Decoder_DecodeAll/html.zst-8             926.1Mi ± 1%   935.7Mi ± 0%  +1.03% (p=0.000 n=10)
	Decoder_DecodeAll/comp-data.bin.zst-8    389.6Mi ± 0%   395.4Mi ± 0%  +1.49% (p=0.000 n=10)
	geomean                                  832.6Mi        837.4Mi       +0.57%

Tiger Lake (laptop):

	                                        │     old      │                 cmov                 │
	                                        │     B/s      │      B/s       vs base               │
	Decoder_DecodeAll/kppkn.gtb.zst-16        408.2Mi ± 1%    416.9Mi ± 2%  +2.13% (p=0.000 n=20)
	Decoder_DecodeAll/geo.protodata.zst-16    1.291Gi ± 1%    1.306Gi ± 2%       ~ (p=0.277 n=20)
	Decoder_DecodeAll/plrabn12.txt.zst-16     332.9Mi ± 2%    339.3Mi ± 2%  +1.95% (p=0.030 n=20)
	Decoder_DecodeAll/lcet10.txt.zst-16       394.1Mi ± 1%    398.6Mi ± 1%  +1.16% (p=0.006 n=20)
	Decoder_DecodeAll/asyoulik.txt.zst-16     348.2Mi ± 1%    350.0Mi ± 0%       ~ (p=0.101 n=20)
	Decoder_DecodeAll/alice29.txt.zst-16      320.1Mi ± 2%    319.5Mi ± 1%       ~ (p=0.265 n=20)
	Decoder_DecodeAll/html_x_4.zst-16         2.109Gi ± 1%    2.138Gi ± 2%       ~ (p=0.068 n=20)
	Decoder_DecodeAll/paper-100k.pdf.zst-16   5.258Gi ± 1%    5.291Gi ± 1%       ~ (p=0.121 n=20)
	Decoder_DecodeAll/fireworks.jpeg.zst-16   12.40Gi ± 1%    12.54Gi ± 1%  +1.08% (p=0.004 n=20)
	Decoder_DecodeAll/urls.10K.zst-16         553.6Mi ± 1%    563.5Mi ± 2%  +1.80% (p=0.000 n=20)
	Decoder_DecodeAll/html.zst-16             987.2Mi ± 1%   1004.3Mi ± 2%       ~ (p=0.068 n=20)
	Decoder_DecodeAll/comp-data.bin.zst-16    492.3Mi ± 2%    496.1Mi ± 1%       ~ (p=0.417 n=20)
	geomean                                   907.9Mi         918.5Mi       +1.17%
@greatroar greatroar marked this pull request as draft December 10, 2023 12:44
@greatroar
Copy link
Contributor Author

Hold on. I did something wrong with the benchmark figures on Ivy Bridge. Running some more.

@greatroar
Copy link
Contributor Author

Corrected benchmark figures for Ivy Bridge posted above. The benefit of this is a bit dubious...

@klauspost
Copy link
Owner

Let me test it out. Generally I don't observe any benefit from fairly predictable CMOVs. But since you have older HW to test on we can take it in if there are no regressions.

@klauspost
Copy link
Owner

My benchmarks are far too noisy to see anything significant.

@greatroar greatroar closed this Jan 13, 2024
@greatroar greatroar deleted the cmov branch January 13, 2024 11:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants