Improve compression speed on small blocks #4165

Merged: 10 commits merged into dev on Oct 11, 2024

Conversation

@Cyan4973 (Contributor) commented on Oct 8, 2024

This is a follow-up to #4144, from @TocarIP.

#4144 improves compression speed on small blocks by forcing the use of a cmov via assembly code (x64 only).
It works great on small blocks, but since the strategy is applied unconditionally, it is also active for long inputs, where it introduces a speed regression (-3% to -6%).

This PR is a follow-up that attempts to bring the best of both worlds:
instantiate two variants, one using a cmov and the other using a branch,
and select the appropriate variant at runtime.

This PR implements this selection for the fast strategy.

It works fine on gcc.

Unfortunately, clang initially didn't like it.

Scenario: silesia.tar, cut into blocks of 32 KB, i7-9700k (no turbo), Ubuntu 24.04:

| compiler | dev | #4144 | this PR |
| --- | --- | --- | --- |
| gcc-9 | 280 MB/s | 342 MB/s | 343 MB/s |
| clang-16 | 281 MB/s | 342 MB/s | 271 MB/s |

After investigation, it appears that once the initial formulation if (test1 && test2) { ... } is changed into if (inlined_test(...)) { ... } with inlined_test() { return test1 && test2; }, the order of the tests (test1 then test2) is no longer preserved. Because of inlining, they still generate individual branches, but clang was re-ordering them so that the range check branches first, negating the benefits of #4144.

One solution, suggested by @terrelln, is to use a memory barrier to force evaluation in the written order.
This suggestion has worked fine:

| compiler | dev | #4144 | this PR |
| --- | --- | --- | --- |
| gcc-9 | 280 MB/s | 342 MB/s | 342 MB/s |
| clang-16 | 281 MB/s | 342 MB/s | 342 MB/s |
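To illustrate the idea, here is a minimal sketch with hypothetical names (not the exact code from this PR): an empty asm statement with a "memory" clobber emits no instructions, but the compiler cannot move the loads past it, which in practice keeps the value comparison ahead of the range check.

#include <stdint.h>
#include <string.h>

/* Sketch only: local stand-in for zstd's MEM_read32(). */
static uint32_t read32(const void* p) { uint32_t v; memcpy(&v, p, sizeof(v)); return v; }

/* Hypothetical shape of the "value test first, then range test" check.
 * Assumes 'match' is always safe to dereference (e.g. selected via the
 * cmov trick from #4144). The empty volatile asm is a compiler-only
 * barrier: the two loads above it cannot be sunk below it or made
 * conditional on the range check that follows. */
static int match4Found_sketch(const uint8_t* ip, const uint8_t* match,
                              uint32_t matchIdx, uint32_t lowLimit)
{
    int const valueMatches = (read32(ip) == read32(match));
    __asm__ __volatile__("" ::: "memory");
    return valueMatches && (matchIdx >= lowLimit);
}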

On the other hand, with this PR, performance on longer inputs is no longer negatively impacted:

Scenario: silesia.tar, entire file, i7-9700k (no turbo), Ubuntu 24.04:

| compiler | dev | #4144 | this PR |
| --- | --- | --- | --- |
| gcc-9 | 386 MB/s | 368 MB/s | 384 MB/s |
| clang-16 | 379 MB/s | 368 MB/s | 386 MB/s |

edit:
As a pleasant surprise, the current code seems to work fine as-is on Apple's M1, without needing any inline assembly.
It's unclear, though, whether this generalizes to all aarch64 architectures and compilers.

Scenario: silesia.tar, cut into blocks of 32 KB, M1 Pro, Ubuntu 24.04:

| cpu | dev | #4144 | this PR |
| --- | --- | --- | --- |
| M1 Pro | 410 MB/s | 408 MB/s | 535 MB/s |

note 2:
The negative impact of the cmov patch on compression speed for long inputs is much less pronounced at level 3 than it was at level 1, somewhere in the -1% to -3% range.
The speed benefits on small data, though, are clearly present (> +10% on 32 KB blocks).

Scenario: silesia.tar, cut into blocks of 32 KB, level 3:

| cpu | dev | #4144 | this PR |
| --- | --- | --- | --- |
| i7-9700k | 283 MB/s | 311 MB/s | 317 MB/s |
| M1 Pro | 336 MB/s | 408 MB/s | 408 MB/s |

Consequently, I believe it's not worth the time and binary size to generate a special dfast variant for "long inputs".

Potential follow-up:
the current patch applies the new strategy to the noDict variants of fast and dfast only.
It would be interesting to also test, and possibly adopt, this strategy for dictionary compression.

TocarIP and others added 4 commits September 20, 2024 16:07
Avoid an unpredictable branch: use a conditional move to generate an address
that is guaranteed to be safe, and compare unconditionally.
Instead of

if (idx < limit && x[idx] == val)   // mispredicted idx < limit branch

do

addr = cmov(safe, x + idx)
if (*addr == val && idx < limit)    // almost always false, so well predicted
Using microbenchmarks from https://github.com/google/fleetbench,
I get a ~10% speed-up:

name                                                                                          old cpu/op   new cpu/op    delta
BM_ZSTD_COMPRESS_Fleet/compression_level:-7/window_log:15                                     1.46ns ± 3%   1.31ns ± 7%   -9.88%  (p=0.000 n=35+38)
BM_ZSTD_COMPRESS_Fleet/compression_level:-7/window_log:16                                     1.41ns ± 3%   1.28ns ± 3%   -9.56%  (p=0.000 n=36+39)
BM_ZSTD_COMPRESS_Fleet/compression_level:-5/window_log:15                                     1.61ns ± 1%   1.43ns ± 3%  -10.70%  (p=0.000 n=30+39)
BM_ZSTD_COMPRESS_Fleet/compression_level:-5/window_log:16                                     1.54ns ± 2%   1.39ns ± 3%   -9.21%  (p=0.000 n=37+39)
BM_ZSTD_COMPRESS_Fleet/compression_level:-3/window_log:15                                     1.82ns ± 2%   1.61ns ± 3%  -11.31%  (p=0.000 n=37+40)
BM_ZSTD_COMPRESS_Fleet/compression_level:-3/window_log:16                                     1.73ns ± 3%   1.56ns ± 3%   -9.50%  (p=0.000 n=38+39)
BM_ZSTD_COMPRESS_Fleet/compression_level:-1/window_log:15                                     2.12ns ± 2%   1.79ns ± 3%  -15.55%  (p=0.000 n=34+39)
BM_ZSTD_COMPRESS_Fleet/compression_level:-1/window_log:16                                     1.99ns ± 3%   1.72ns ± 3%  -13.70%  (p=0.000 n=38+38)
BM_ZSTD_COMPRESS_Fleet/compression_level:0/window_log:15                                      3.22ns ± 3%   2.94ns ± 3%   -8.67%  (p=0.000 n=38+40)
BM_ZSTD_COMPRESS_Fleet/compression_level:0/window_log:16                                      3.19ns ± 4%   2.86ns ± 4%  -10.55%  (p=0.000 n=40+38)
BM_ZSTD_COMPRESS_Fleet/compression_level:1/window_log:15                                      2.60ns ± 3%   2.22ns ± 3%  -14.53%  (p=0.000 n=40+39)
BM_ZSTD_COMPRESS_Fleet/compression_level:1/window_log:16                                      2.46ns ± 3%   2.13ns ± 2%  -13.67%  (p=0.000 n=39+36)
BM_ZSTD_COMPRESS_Fleet/compression_level:2/window_log:15                                      2.69ns ± 3%   2.46ns ± 3%   -8.63%  (p=0.000 n=37+39)
BM_ZSTD_COMPRESS_Fleet/compression_level:2/window_log:16                                      2.63ns ± 3%   2.36ns ± 3%  -10.47%  (p=0.000 n=40+40)
BM_ZSTD_COMPRESS_Fleet/compression_level:3/window_log:15                                      3.20ns ± 2%   2.95ns ± 3%   -7.94%  (p=0.000 n=35+40)
BM_ZSTD_COMPRESS_Fleet/compression_level:3/window_log:16                                      3.20ns ± 4%   2.87ns ± 4%  -10.33%  (p=0.000 n=40+40)

I've also measured the impact on internal workloads and saw a similar
~10% improvement in performance, measured by cpu usage per byte of data.
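For reference, a minimal portable sketch of the address-selection idea described in this commit message (hypothetical names; the PR forces the cmov with x64 inline assembly, whereas a plain C ternary like this one is only something the compiler may choose to lower to a cmov):

#include <stdint.h>

/* Sketch: return the real candidate address when idx is in range, otherwise
 * a known-safe fallback (e.g. a small static dummy array), so that the 4-byte
 * comparison can run unconditionally; the idx-vs-limit check then only matters
 * in the rare case where the loaded values match, keeping the branch
 * well predicted. */
static const uint8_t* selectAddr_sketch(uint32_t idx, uint32_t lowLimit,
                                        const uint8_t* candidate,
                                        const uint8_t* safeFallback)
{
    return (idx >= lowLimit) ? candidate : safeFallback;
}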
make hot variables more local
for easier swapping with a parameter
between cmov and branch
and use a simple heuristic based on wlog to select between them.

note: performance is not good on clang (yet)
@felixhandte (Contributor) left a comment:

I've been banging my head against this optimization problem as well. I haven't gotten something globally better, but I have found one trick you may want to include:

I was looking at the disassembly of my version of this in zstd_dfast and I saw that it was using a separate register for matchAddress than for the match pointer in the calling frame. It does this, I guess, because the matchAddress gets potentially mutated by the cmov, but the calling frame's match isn't supposed to be. And the unmodified value of match is used subsequently, if a match is found.

But because match is always only used subsequently if a match is found, and because the cmov only changes the value of matchAddress if the match is invalid, it's actually fine for these to be the same register, and for us to clobber the caller's match if the match fails.

We can tell the compiler this is ok by rewriting these functions to take the matchAddress by reference rather than by value, and by writing back the result of the cmov into the pointed-to match variable.

E.g.,

MEM_STATIC int
ZSTD_findMatch_cmov4(const BYTE* currentPtr, const BYTE** matchAddressPtr, U32 matchIdx, U32 lowLimit)
{
    const BYTE* matchAddress = *matchAddressPtr;  // <----
    char valid;
    __asm__ (
        "cmp %2, %3\n"
        "setb %1\n"
        "cmovae %4, %0"
        : "+r"(matchAddress), "=r"(valid)
        : "r"(matchIdx), "r"(lowLimit), "r"(currentPtr)
    );
    *matchAddressPtr = matchAddress;  // <----
    return MEM_read32(currentPtr) == MEM_read32(matchAddress) && valid;
}

called like

FORCE_INLINE_TEMPLATE
ZSTD_ALLOW_POINTER_OVERFLOW_ATTR
size_t ZSTD_compressBlock_doubleFast_noDict_generic(
        ZSTD_matchState_t* ms, seqStore_t* seqStore, U32 rep[ZSTD_REP_NUM],
        void const* src, size_t srcSize, U32 const mls /* template */)
{
    ...
    if (ZSTD_findMatch_cmov4(ip, &matchs0, idxs0, prefixLowestIndex)) {
        goto _search_next_long;
    }
    ...
}

This avoids a needless spill & reload that otherwise gets triggered by using an additional register.

@Cyan4973 (Contributor, Author), replying to the comment above:

I tried the proposed variant, but couldn't observe any performance benefit when testing across many compiler versions.

@Cyan4973 marked this pull request as ready for review on October 10, 2024, 20:04
@Cyan4973 (Contributor, Author) commented:

I tried to find a "more clever" heuristic to switch between cmov and branch (fast strategy only).

Typically, it seems that finding a candidate in range is harder at the beginning of each file.
Consequently, it would be interesting to start with the cmov strategy and then switch to a branch once there's enough input to search into.

It doesn't work. Switching between these two types of tests benchmarks worse than sticking with either one of them.
The reason is not entirely clear at this stage, but the suspicion is that switching the fast function instance mid-flight hurts the instruction cache and branch prediction.

Consequently, I'm keeping the current heuristic, which selects the branch test when wlog >= 19, and the cmov test for any lower wlog value.
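A minimal sketch of that selection (hypothetical names and an assumed function-pointer signature; the actual variants referenced in the diff below are ZSTD_match4Found_cmov and ZSTD_match4Found_branch):

#include <stdint.h>

/* Hypothetical signature for the two match-test variants. */
typedef int (*match4Found_fn)(const uint8_t* ip, const uint8_t* match,
                              uint32_t matchIdx, uint32_t lowLimit);

/* The heuristic described above: the branchy test for large windows
 * (wlog >= 19), the cmov-based test for anything smaller. */
static match4Found_fn selectMatchTest(match4Found_fn cmovVariant,
                                      match4Found_fn branchVariant,
                                      unsigned windowLog)
{
    return (windowLog >= 19) ? branchVariant : cmovVariant;
}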

@Cyan4973 changed the title from "Improve compression speed on small blocks (WIP)" to "Improve compression speed on small blocks" on Oct 10, 2024
* However expression below compiles into conditional move. Since
* match is unlikely and we only *branch* on idxl0 > prefixLowestIndex
* if there is a match, all branches become predictable. */
matchl0_safe = ZSTD_selectAddr(prefixLowestIndex, idxl0, &dummy[0], matchl0);

All these arguments are reversed. It works, but it's weird.

If the ordering is required for performance, we should rename these arguments.

Suggested change
matchl0_safe = ZSTD_selectAddr(prefixLowestIndex, idxl0, &dummy[0], matchl0);
matchl0_safe = ZSTD_selectAddr(idxl0, prefixLowestIndex, matchl0, &dummy[0]);

goto _search_next_long;
}
/* Same optimization as matchl0 above */
matchs0_safe = ZSTD_selectAddr(prefixLowestIndex, idxs0, &dummy[0], matchs0);

Suggested change
matchs0_safe = ZSTD_selectAddr(prefixLowestIndex, idxs0, &dummy[0], matchs0);
matchs0_safe = ZSTD_selectAddr(idxs0, prefixLowestIndex, matchs0, &dummy[0]);

/* Array of ~random data, should have low probability of matching data.
* Load from here if the index is invalid.
* Used to avoid unpredictable branches. */
static const BYTE dummy[] = {0x12,0x34,0x56,0x78 };

Suggested change
static const BYTE dummy[] = {0x12,0x34,0x56,0x78 };
static const BYTE dummy[] = {0x12,0x34,0x56,0x78};

@@ -190,6 +232,7 @@ size_t ZSTD_compressBlock_fast_noDict_generic(
size_t step;
const BYTE* nextStep;
const size_t kStepIncr = (1 << (kSearchStrength - 1));
const ZSTD_match4Found findMatch = useCmov ? ZSTD_match4Found_cmov : ZSTD_match4Found_branch;

nit: Rename to foundMatch(), because it isn't actively searching for a match.

Suggested change
const ZSTD_match4Found findMatch = useCmov ? ZSTD_match4Found_cmov : ZSTD_match4Found_branch;
const ZSTD_match4Found foundMatch = useCmov ? ZSTD_match4Found_cmov : ZSTD_match4Found_branch;

findMatch -> matchFound
since it's a test, as opposed to an active search operation.
suggested by @terrelln
@Cyan4973 (Contributor, Author) commented:

All suggestions integrated

@Cyan4973 merged commit 8c38bda into dev on Oct 11, 2024
91 checks passed