Improve compression speed on small blocks #4165
Conversation
Avoid an unpredictable branch: use a conditional move to generate an address that is guaranteed to be safe, and compare unconditionally.

Instead of

if (idx < limit && x[idx] == val)   // mispredicted idx < limit branch

do

addr = cmov(safe, x+idx);
if (*addr == val && idx < limit)    // almost always false, so well predicted

Using microbenchmarks from https://github.com/google/fleetbench, I get about ~10% speed-up:

name                                                        old cpu/op   new cpu/op   delta
BM_ZSTD_COMPRESS_Fleet/compression_level:-7/window_log:15   1.46ns ± 3%  1.31ns ± 7%   -9.88%  (p=0.000 n=35+38)
BM_ZSTD_COMPRESS_Fleet/compression_level:-7/window_log:16   1.41ns ± 3%  1.28ns ± 3%   -9.56%  (p=0.000 n=36+39)
BM_ZSTD_COMPRESS_Fleet/compression_level:-5/window_log:15   1.61ns ± 1%  1.43ns ± 3%  -10.70%  (p=0.000 n=30+39)
BM_ZSTD_COMPRESS_Fleet/compression_level:-5/window_log:16   1.54ns ± 2%  1.39ns ± 3%   -9.21%  (p=0.000 n=37+39)
BM_ZSTD_COMPRESS_Fleet/compression_level:-3/window_log:15   1.82ns ± 2%  1.61ns ± 3%  -11.31%  (p=0.000 n=37+40)
BM_ZSTD_COMPRESS_Fleet/compression_level:-3/window_log:16   1.73ns ± 3%  1.56ns ± 3%   -9.50%  (p=0.000 n=38+39)
BM_ZSTD_COMPRESS_Fleet/compression_level:-1/window_log:15   2.12ns ± 2%  1.79ns ± 3%  -15.55%  (p=0.000 n=34+39)
BM_ZSTD_COMPRESS_Fleet/compression_level:-1/window_log:16   1.99ns ± 3%  1.72ns ± 3%  -13.70%  (p=0.000 n=38+38)
BM_ZSTD_COMPRESS_Fleet/compression_level:0/window_log:15    3.22ns ± 3%  2.94ns ± 3%   -8.67%  (p=0.000 n=38+40)
BM_ZSTD_COMPRESS_Fleet/compression_level:0/window_log:16    3.19ns ± 4%  2.86ns ± 4%  -10.55%  (p=0.000 n=40+38)
BM_ZSTD_COMPRESS_Fleet/compression_level:1/window_log:15    2.60ns ± 3%  2.22ns ± 3%  -14.53%  (p=0.000 n=40+39)
BM_ZSTD_COMPRESS_Fleet/compression_level:1/window_log:16    2.46ns ± 3%  2.13ns ± 2%  -13.67%  (p=0.000 n=39+36)
BM_ZSTD_COMPRESS_Fleet/compression_level:2/window_log:15    2.69ns ± 3%  2.46ns ± 3%   -8.63%  (p=0.000 n=37+39)
BM_ZSTD_COMPRESS_Fleet/compression_level:2/window_log:16    2.63ns ± 3%  2.36ns ± 3%  -10.47%  (p=0.000 n=40+40)
BM_ZSTD_COMPRESS_Fleet/compression_level:3/window_log:15    3.20ns ± 2%  2.95ns ± 3%   -7.94%  (p=0.000 n=35+40)
BM_ZSTD_COMPRESS_Fleet/compression_level:3/window_log:16    3.20ns ± 4%  2.87ns ± 4%  -10.33%  (p=0.000 n=40+40)

I've also measured the impact on internal workloads and saw a similar ~10% improvement in performance, measured by CPU usage per byte of data.
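For illustration, here is a minimal, self-contained C sketch of this reformulation. The names are hypothetical, and the ternary is expected, though not guaranteed, to compile into a conditional move; the actual patch pins this down with inline assembly on x64.

#include <stddef.h>
#include <string.h>

/* Harmless fallback location to load from when the index is out of range;
 * its content merely needs to be unlikely to match real data. */
static const unsigned char k_safe[4] = { 0x12, 0x34, 0x56, 0x78 };

static int found_match4(const unsigned char* x, size_t idx, size_t limit, unsigned val)
{
    /* Select an address that is always safe to dereference (ideally via cmov). */
    const unsigned char* addr = (idx < limit) ? x + idx : k_safe;
    unsigned loaded;
    memcpy(&loaded, addr, sizeof(loaded));
    /* Compare unconditionally; this branch is almost always false, hence well
     * predicted. The range check is only re-evaluated on an actual 4-byte match. */
    return (loaded == val) && (idx < limit);
}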
make hot variables more local
for easier swapping with a parameter between cmov and branch, and use a simple heuristic based on wlog to select between them. note: performance is not good on clang (yet)
suggested by @terrelln
feels more logical, better contained
I've been banging my head against this optimization problem as well. I haven't gotten something globally better, but I have found one trick you may want to include:

I was looking at the disassembly of my version of this in zstd_dfast, and I saw that it was using a separate register for matchAddress than for the match pointer in the calling frame. It does this, I guess, because matchAddress gets potentially mutated by the cmov, but the calling frame's match isn't supposed to be. And the unmodified value of match is used subsequently, if a match is found.

But because match is only ever used subsequently if a match is found, and because the cmov only changes the value of matchAddress if the match is invalid, it's actually fine for these to be the same register, and for us to clobber the caller's match if the match fails.

We can tell the compiler this is ok by rewriting these functions to take the matchAddress by reference rather than by value, and by writing back the result of the cmov into the pointed-to match variable.
E.g.,
MEM_STATIC int
ZSTD_findMatch_cmov4(const BYTE* currentPtr, const BYTE** matchAddressPtr, U32 matchIdx, U32 lowLimit)
{
    const BYTE* matchAddress = *matchAddressPtr; // <----
    char valid;
    __asm__ (
        "cmp %2, %3\n"      /* compare matchIdx against lowLimit */
        "setb %1\n"         /* valid = (matchIdx > lowLimit), i.e. the index is in range */
        "cmovae %4, %0"     /* if out of range, fall back to currentPtr, which is always safe to read */
        : "+r"(matchAddress), "=r"(valid)
        : "r"(matchIdx), "r"(lowLimit), "r"(currentPtr)
    );
    *matchAddressPtr = matchAddress; // <----
    return MEM_read32(currentPtr) == MEM_read32(matchAddress) && valid;
}
called like
FORCE_INLINE_TEMPLATE
ZSTD_ALLOW_POINTER_OVERFLOW_ATTR
size_t ZSTD_compressBlock_doubleFast_noDict_generic(
        ZSTD_matchState_t* ms, seqStore_t* seqStore, U32 rep[ZSTD_REP_NUM],
        void const* src, size_t srcSize, U32 const mls /* template */)
{
    ...
    if (ZSTD_findMatch_cmov4(ip, &matchs0, idxs0, prefixLowestIndex)) {
        goto _search_next_long;
    }
    ...
}
This avoids a needless spill & reload that otherwise gets triggered by using an additional register.
I tried the proposed variant, but couldn't observe any performance benefit, tested across many compiler versions.
I tried to find some "more clever" heuristic to switch between the two tests. Typically, it seems that finding a candidate in range is more difficult at the beginning of each file. It doesn't work: switching between these 2 types of tests is worse in benchmarks than sticking to either one of them. Consequently, I'm keeping the current heuristic, which selects the branch test when the window log is large.
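For context, here is a minimal sketch of the shape of such a wlog-based selection. The helper names and the cutoff value below are hypothetical, chosen only to illustrate the idea; the actual patch selects between ZSTD_match4Found_cmov and ZSTD_match4Found_branch, as shown in the diff further down.

/* Sketch only: pick the cmov-based test for small windows (short inputs /
 * small blocks, where in-range candidates are rare), and the plain branch
 * for large windows (where the branch is well predicted and cheaper). */
typedef int (*matchTest_f)(U32 matchIdx, U32 lowLimit);

static int matchTest_cmov_stub(U32 matchIdx, U32 lowLimit)   { return matchIdx > lowLimit; }  /* stand-in body */
static int matchTest_branch_stub(U32 matchIdx, U32 lowLimit) { return matchIdx > lowLimit; }  /* stand-in body */

MEM_STATIC matchTest_f ZSTD_selectMatchTest_sketch(U32 windowLog)
{
    U32 const kMaxWindowLogForCmov = 17;   /* hypothetical cutoff, not the patch's value */
    return (windowLog <= kMaxWindowLogForCmov) ? matchTest_cmov_stub : matchTest_branch_stub;
}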
lib/compress/zstd_double_fast.c
Outdated
* However the expression below compiles into a conditional move. Since
* match is unlikely and we only *branch* on idxl0 > prefixLowestIndex
* if there is a match, all branches become predictable. */
matchl0_safe = ZSTD_selectAddr(prefixLowestIndex, idxl0, &dummy[0], matchl0);
All these arguments are reversed. It works, but is weird.
If the ordering is required for performance, we should rename these arguments.
Suggested change:
- matchl0_safe = ZSTD_selectAddr(prefixLowestIndex, idxl0, &dummy[0], matchl0);
+ matchl0_safe = ZSTD_selectAddr(idxl0, prefixLowestIndex, matchl0, &dummy[0]);
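For reference, a plausible shape for such a helper under the suggested argument order, as a sketch only; the real ZSTD_selectAddr may instead use inline assembly, like the cmov example earlier in this thread.

/* Sketch: return `candidate` when the index is within the valid range
 * (index > lowLimit), and `fallback` (e.g. &dummy[0]) otherwise.
 * Written so that compilers are likely, though not guaranteed, to emit a
 * conditional move rather than a branch. */
MEM_STATIC const BYTE*
ZSTD_selectAddr_sketch(U32 index, U32 lowLimit, const BYTE* candidate, const BYTE* fallback)
{
    return (index > lowLimit) ? candidate : fallback;
}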
lib/compress/zstd_double_fast.c
Outdated
goto _search_next_long;
}
/* Same optimization as matchl0 above */
matchs0_safe = ZSTD_selectAddr(prefixLowestIndex, idxs0, &dummy[0], matchs0);
Suggested change:
- matchs0_safe = ZSTD_selectAddr(prefixLowestIndex, idxs0, &dummy[0], matchs0);
+ matchs0_safe = ZSTD_selectAddr(idxs0, prefixLowestIndex, matchs0, &dummy[0]);
lib/compress/zstd_fast.c
Outdated
/* Array of ~random data, should have low probability of matching data.
 * Load from here if the index is invalid.
 * Used to avoid unpredictable branches. */
static const BYTE dummy[] = {0x12,0x34,0x56,0x78 };
Suggested change:
- static const BYTE dummy[] = {0x12,0x34,0x56,0x78 };
+ static const BYTE dummy[] = {0x12,0x34,0x56,0x78};
lib/compress/zstd_fast.c
Outdated
@@ -190,6 +232,7 @@ size_t ZSTD_compressBlock_fast_noDict_generic(
size_t step;
const BYTE* nextStep;
const size_t kStepIncr = (1 << (kSearchStrength - 1));
const ZSTD_match4Found findMatch = useCmov ? ZSTD_match4Found_cmov : ZSTD_match4Found_branch;
nit: Rename to foundMatch(), because it isn't actively searching for a match.
Suggested change:
- const ZSTD_match4Found findMatch = useCmov ? ZSTD_match4Found_cmov : ZSTD_match4Found_branch;
+ const ZSTD_match4Found foundMatch = useCmov ? ZSTD_match4Found_cmov : ZSTD_match4Found_branch;
All suggestions integrated.
This is a follow up from #4144, from @TocarIP.
#4144 improves compression speed on small blocks, by forcing usage of a cmov via assembly code (x64 only). It works great on small blocks, but since the strategy is applied unconditionally, it is also active for long inputs, in which case it introduces a speed regression (-3 to -6%).
This PR is a follow up, which attempts to bring the best of both worlds: instantiate 2 variants, one using cmov and the other using a branch, and select the appropriate variant at runtime.
This PR implements this strategy for the fast strategy. It works fine on gcc. Initially, clang didn't like it, unfortunately.
Scenario: silesia.tar, cut into blocks of 32 KB, i7-9700k (no turbo), Ubuntu 24.04 (benchmark table: dev vs this PR, not preserved here).
After investigation, it appears that once the initial formulation if (test1 && test2) { ... } is changed into if (inlined_test(...)) { ... } with inlined_test() { return test1 && test2; }, the order of tests (test1 then test2) is no longer preserved. But, because of inlining, they still generate individual branches! So clang was re-ordering the branches, making the range check branch first, thus negating the benefits of #4144.
One solution, suggested by @terrelln, is to use a memory barrier to force execution in the written order.
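For illustration, the classic empty-asm idiom used as a compiler-only barrier with gcc/clang looks like the sketch below; the exact macro name and its placement in the zstd sources may differ, and the usage shown here is hypothetical.

/* Emits no instruction, but forbids the compiler from moving memory accesses
 * across it, which can be used to discourage re-ordering of the two tests. */
#define COMPILER_BARRIER() __asm__ __volatile__("" ::: "memory")

/* Hypothetical usage: load and compare first, keep the range check after. */
static int match4Found_barrier_sketch(const BYTE* currentPtr, const BYTE* matchAddress,
                                      U32 matchIdx, U32 lowLimit)
{
    int const sameValue = (MEM_read32(currentPtr) == MEM_read32(matchAddress));
    COMPILER_BARRIER();   /* keep the (well predicted) value test before the range test */
    return sameValue && (matchIdx > lowLimit);
}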
This suggestion has worked fine (benchmark table: dev vs this PR, not preserved here).
On the other hand, with this PR, performance on longer inputs is no longer negatively impacted.
Scenario: silesia.tar, entire file, i7-9700k (no turbo), Ubuntu 24.04 (benchmark table: dev vs this PR, not preserved here).
edit:
As a good surprise, the current code seems to work fine as is on Apple's M1, without needing any inline assembly. It's unclear though whether this generalizes well over all aarch64 architectures and compilers.
Scenario: silesia.tar, cut into blocks of 32 KB, M1 Pro, Ubuntu 24.04 (benchmark table: dev vs this PR, not preserved here).
note 2:
The negative impact of the cmov patch on compression speed for long inputs is much less pronounced at level 3 than it was at level 1, somewhere in the -1/-3% range. The speed benefits on small data, though, are well present (> +10% at 32 KB blocks).
Scenario: silesia.tar, cut into blocks of 32 KB, level 3 (benchmark table: dev vs this PR, not preserved here).
Consequently, I believe it's not worth the time and binary size to generate some special dfast variant for "long inputs".
Potential follow up:
The current patch applies the new strategy to the noDict variants of fast and dfast only. It seems interesting to also test, and eventually adopt, this strategy for dictionary compression too.