Improve compression speed on small blocks #4165

Merged: 10 commits merged into dev on Oct 11, 2024

Conversation

@Cyan4973 (Contributor) commented on Oct 8, 2024

This is a follow-up to #4144, from @TocarIP.

#4144 improves compression speed on small blocks by forcing the use of a cmov via assembly code (x64 only).
It works great on small blocks, but since the strategy is applied unconditionally, it is also active for long inputs, where it introduces a speed regression (-3% to -6%).

This PR is a follow-up that attempts to bring the best of both worlds:
instantiate two variants, one using a cmov and the other using a branch,
and select the appropriate variant at runtime.

This PR implements this selection for the fast strategy.

It works fine on gcc.

Unfortunately, clang initially didn't like it.

Scenario: silesia.tar, cut into blocks of 32 KB, i7-9700k (no turbo), Ubuntu 24.04:

| compiler | dev | #4144 | this PR |
| --- | --- | --- | --- |
| gcc-9 | 280 MB/s | 342 MB/s | 343 MB/s |
| clang-16 | 281 MB/s | 342 MB/s | 271 MB/s |

After investigation, it appears that once the initial formulation if (test1 && test2) { ... } is changed into if (inlined_test(...)) { ... } with inlined_test() { return test1 && test2; }, the order of the tests (test1 then test2) is no longer preserved. Because of inlining, they still generate individual branches, but clang was re-ordering them so that the range check branches first, negating the benefits of #4144.

One solution, suggested by @terrelln, is to use a memory barrier to force evaluation in the written order.
This suggestion has worked fine:

| compiler | dev | #4144 | this PR |
| --- | --- | --- | --- |
| gcc-9 | 280 MB/s | 342 MB/s | 342 MB/s |
| clang-16 | 281 MB/s | 342 MB/s | 342 MB/s |
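To illustrate the idea, here is a minimal sketch with hypothetical names (not the exact code from this PR): an empty asm statement with a "memory" clobber emits no instructions, but the compiler cannot move the loads past it, which in practice keeps the value comparison ahead of the range check.

#include <stdint.h>
#include <string.h>

/* Sketch only: local stand-in for zstd's MEM_read32(). */
static uint32_t read32(const void* p) { uint32_t v; memcpy(&v, p, sizeof(v)); return v; }

/* Hypothetical shape of the "value test first, then range test" check.
 * Assumes 'match' is always safe to dereference (e.g. selected via the
 * cmov trick from #4144). The empty volatile asm is a compiler-only
 * barrier: the two loads above it cannot be sunk below it or made
 * conditional on the range check that follows. */
static int match4Found_sketch(const uint8_t* ip, const uint8_t* match,
                              uint32_t matchIdx, uint32_t lowLimit)
{
    int const valueMatches = (read32(ip) == read32(match));
    __asm__ __volatile__("" ::: "memory");
    return valueMatches && (matchIdx >= lowLimit);
}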

On the other hand, with this PR, performance on longer inputs is no longer negatively impacted:

Scenario: silesia.tar, entire file, i7-9700k (no turbo), Ubuntu 24.04:

| compiler | dev | #4144 | this PR |
| --- | --- | --- | --- |
| gcc-9 | 386 MB/s | 368 MB/s | 384 MB/s |
| clang-16 | 379 MB/s | 368 MB/s | 386 MB/s |

edit:
As a pleasant surprise, the current code seems to work fine as-is on Apple's M1, without needing any inline assembly.
It's unclear, though, whether this generalizes to all aarch64 architectures and compilers.

Scenario: silesia.tar, cut into blocks of 32 KB, M1 Pro, Ubuntu 24.04:

| cpu | dev | #4144 | this PR |
| --- | --- | --- | --- |
| M1 Pro | 410 MB/s | 408 MB/s | 535 MB/s |

note 2:
The negative impact of the cmov patch on compression speed for long inputs is much less pronounced at level 3 than it was at level 1, somewhere in the -1% to -3% range.
The speed benefits on small data, though, are clearly present (> +10% on 32 KB blocks).

Scenario: silesia.tar, cut into blocks of 32 KB, level 3:

| cpu | dev | #4144 | this PR |
| --- | --- | --- | --- |
| i7-9700k | 283 MB/s | 311 MB/s | 317 MB/s |
| M1 Pro | 336 MB/s | 408 MB/s | 408 MB/s |

Consequently, I believe it's not worth the time and binary size to generate a special dfast variant for "long inputs".

Potential follow-up:
the current patch applies the new strategy to the noDict variants of fast and dfast only.
It would be interesting to also test, and possibly adopt, this strategy for dictionary compression.

TocarIP and others added 4 commits September 20, 2024 16:07
Avoid an unpredictable branch: use a conditional move to generate an address
that is guaranteed to be safe, and compare unconditionally.
Instead of

if (idx < limit && x[idx] == val)   // mispredicted idx < limit branch

do

addr = cmov(safe, x + idx)
if (*addr == val && idx < limit)    // almost always false, so well predicted
Using microbenchmarks from https://github.com/google/fleetbench,
I get a ~10% speed-up:

name                                                                                          old cpu/op   new cpu/op    delta
BM_ZSTD_COMPRESS_Fleet/compression_level:-7/window_log:15                                     1.46ns ± 3%   1.31ns ± 7%   -9.88%  (p=0.000 n=35+38)
BM_ZSTD_COMPRESS_Fleet/compression_level:-7/window_log:16                                     1.41ns ± 3%   1.28ns ± 3%   -9.56%  (p=0.000 n=36+39)
BM_ZSTD_COMPRESS_Fleet/compression_level:-5/window_log:15                                     1.61ns ± 1%   1.43ns ± 3%  -10.70%  (p=0.000 n=30+39)
BM_ZSTD_COMPRESS_Fleet/compression_level:-5/window_log:16                                     1.54ns ± 2%   1.39ns ± 3%   -9.21%  (p=0.000 n=37+39)
BM_ZSTD_COMPRESS_Fleet/compression_level:-3/window_log:15                                     1.82ns ± 2%   1.61ns ± 3%  -11.31%  (p=0.000 n=37+40)
BM_ZSTD_COMPRESS_Fleet/compression_level:-3/window_log:16                                     1.73ns ± 3%   1.56ns ± 3%   -9.50%  (p=0.000 n=38+39)
BM_ZSTD_COMPRESS_Fleet/compression_level:-1/window_log:15                                     2.12ns ± 2%   1.79ns ± 3%  -15.55%  (p=0.000 n=34+39)
BM_ZSTD_COMPRESS_Fleet/compression_level:-1/window_log:16                                     1.99ns ± 3%   1.72ns ± 3%  -13.70%  (p=0.000 n=38+38)
BM_ZSTD_COMPRESS_Fleet/compression_level:0/window_log:15                                      3.22ns ± 3%   2.94ns ± 3%   -8.67%  (p=0.000 n=38+40)
BM_ZSTD_COMPRESS_Fleet/compression_level:0/window_log:16                                      3.19ns ± 4%   2.86ns ± 4%  -10.55%  (p=0.000 n=40+38)
BM_ZSTD_COMPRESS_Fleet/compression_level:1/window_log:15                                      2.60ns ± 3%   2.22ns ± 3%  -14.53%  (p=0.000 n=40+39)
BM_ZSTD_COMPRESS_Fleet/compression_level:1/window_log:16                                      2.46ns ± 3%   2.13ns ± 2%  -13.67%  (p=0.000 n=39+36)
BM_ZSTD_COMPRESS_Fleet/compression_level:2/window_log:15                                      2.69ns ± 3%   2.46ns ± 3%   -8.63%  (p=0.000 n=37+39)
BM_ZSTD_COMPRESS_Fleet/compression_level:2/window_log:16                                      2.63ns ± 3%   2.36ns ± 3%  -10.47%  (p=0.000 n=40+40)
BM_ZSTD_COMPRESS_Fleet/compression_level:3/window_log:15                                      3.20ns ± 2%   2.95ns ± 3%   -7.94%  (p=0.000 n=35+40)
BM_ZSTD_COMPRESS_Fleet/compression_level:3/window_log:16                                      3.20ns ± 4%   2.87ns ± 4%  -10.33%  (p=0.000 n=40+40)

I've also measured the impact on internal workloads and saw a similar
~10% improvement in performance, measured by cpu usage per byte of data.
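For reference, a minimal portable sketch of the address-selection idea described in this commit message (hypothetical names; the PR forces the cmov with x64 inline assembly, whereas a plain C ternary like this one is only something the compiler may choose to lower to a cmov):

#include <stdint.h>

/* Sketch: return the real candidate address when idx is in range, otherwise
 * a known-safe fallback (e.g. a small static dummy array), so that the 4-byte
 * comparison can run unconditionally; the idx-vs-limit check then only matters
 * in the rare case where the loaded values match, keeping the branch
 * well predicted. */
static const uint8_t* selectAddr_sketch(uint32_t idx, uint32_t lowLimit,
                                        const uint8_t* candidate,
                                        const uint8_t* safeFallback)
{
    return (idx >= lowLimit) ? candidate : safeFallback;
}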
make hot variables more local
for easier swapping with a parameter
between cmov and branch
and use a simple heuristic based on wlog to select between them.

note: performance is not good on clang (yet)
@felixhandte (Contributor) left a comment:

I've been banging my head against this optimization problem as well. I haven't gotten something globally better, but I have found one trick you may want to include:

I was looking at the disassembly of my version of this in zstd_dfast and I saw that it was using a separate register for matchAddress than for the match pointer in the calling frame. It does this, I guess, because the matchAddress gets potentially mutated by the cmov, but the calling frame's match isn't supposed to be. And the unmodified value of match is used subsequently, if a match is found.

But because match is always only used subsequently if a match is found, and because the cmov only changes the value of matchAddress if the match is invalid, it's actually fine for these to be the same register, and for us to clobber the caller's match if the match fails.

We can tell the compiler this is ok by rewriting these functions to take the matchAddress by reference rather than by value, and by writing back the result of the cmov into the pointed-to match variable.

E.g.,

MEM_STATIC int
ZSTD_findMatch_cmov4(const BYTE* currentPtr, const BYTE** matchAddressPtr, U32 matchIdx, U32 lowLimit)
{
    const BYTE* matchAddress = *matchAddressPtr;  // <----
    char valid;
    __asm__ (
        "cmp %2, %3\n"
        "setb %1\n"
        "cmovae %4, %0"
        : "+r"(matchAddress), "=r"(valid)
        : "r"(matchIdx), "r"(lowLimit), "r"(currentPtr)
    );
    *matchAddressPtr = matchAddress;  // <----
    return MEM_read32(currentPtr) == MEM_read32(matchAddress) && valid;
}

called like

FORCE_INLINE_TEMPLATE
ZSTD_ALLOW_POINTER_OVERFLOW_ATTR
size_t ZSTD_compressBlock_doubleFast_noDict_generic(
        ZSTD_matchState_t* ms, seqStore_t* seqStore, U32 rep[ZSTD_REP_NUM],
        void const* src, size_t srcSize, U32 const mls /* template */)
{
    ...
    if (ZSTD_findMatch_cmov4(ip, &matchs0, idxs0, prefixLowestIndex)) {
        goto _search_next_long;
    }
    ...
}

This avoids a needless spill & reload that otherwise gets triggered by using an additional register.

@Cyan4973 (Contributor, Author), replying to the comment above:

I tried the proposed variant, but couldn't observe any performance benefit when testing across many compiler versions.

@Cyan4973 marked this pull request as ready for review on October 10, 2024, 20:04
@Cyan4973 (Contributor, Author) commented:

I tried to find a "more clever" heuristic to switch between cmov and branch (fast strategy only).

Typically, it seems that finding a candidate in range is harder at the beginning of each file.
Consequently, it would be interesting to start with the cmov strategy and then switch to a branch once there's enough input to search into.

It doesn't work. Switching between these two types of tests benchmarks worse than sticking with either one of them.
The reason is not entirely clear at this stage, but the suspicion is that switching the fast function instance mid-flight hurts the instruction cache and branch prediction.

Consequently, I'm keeping the current heuristic, which selects the branch test when wlog >= 19, and the cmov test for any lower wlog value.
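A minimal sketch of that selection (hypothetical names and an assumed function-pointer signature; the actual variants referenced in the diff below are ZSTD_match4Found_cmov and ZSTD_match4Found_branch):

#include <stdint.h>

/* Hypothetical signature for the two match-test variants. */
typedef int (*match4Found_fn)(const uint8_t* ip, const uint8_t* match,
                              uint32_t matchIdx, uint32_t lowLimit);

/* The heuristic described above: the branchy test for large windows
 * (wlog >= 19), the cmov-based test for anything smaller. */
static match4Found_fn selectMatchTest(match4Found_fn cmovVariant,
                                      match4Found_fn branchVariant,
                                      unsigned windowLog)
{
    return (windowLog >= 19) ? branchVariant : cmovVariant;
}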

@Cyan4973 changed the title from "Improve compression speed on small blocks (WIP)" to "Improve compression speed on small blocks" on Oct 10, 2024
* However expression below compiles into conditional move. Since
* match is unlikely and we only *branch* on idxl0 > prefixLowestIndex
* if there is a match, all branches become predictable. */
matchl0_safe = ZSTD_selectAddr(prefixLowestIndex, idxl0, &dummy[0], matchl0);

All these arguments are reversed. It works, but it's weird.

If the ordering is required for performance, we should rename these arguments.

Suggested change
matchl0_safe = ZSTD_selectAddr(prefixLowestIndex, idxl0, &dummy[0], matchl0);
matchl0_safe = ZSTD_selectAddr(idxl0, prefixLowestIndex, matchl0, &dummy[0]);

goto _search_next_long;
}
/* Same optimization as matchl0 above */
matchs0_safe = ZSTD_selectAddr(prefixLowestIndex, idxs0, &dummy[0], matchs0);

Suggested change
matchs0_safe = ZSTD_selectAddr(prefixLowestIndex, idxs0, &dummy[0], matchs0);
matchs0_safe = ZSTD_selectAddr(idxs0, prefixLowestIndex, matchs0, &dummy[0]);

/* Array of ~random data, should have low probability of matching data.
* Load from here if the index is invalid.
* Used to avoid unpredictable branches. */
static const BYTE dummy[] = {0x12,0x34,0x56,0x78 };

Suggested change
static const BYTE dummy[] = {0x12,0x34,0x56,0x78 };
static const BYTE dummy[] = {0x12,0x34,0x56,0x78};

@@ -190,6 +232,7 @@ size_t ZSTD_compressBlock_fast_noDict_generic(
size_t step;
const BYTE* nextStep;
const size_t kStepIncr = (1 << (kSearchStrength - 1));
const ZSTD_match4Found findMatch = useCmov ? ZSTD_match4Found_cmov : ZSTD_match4Found_branch;

nit: Rename to foundMatch(), because it isn't actively searching for a match.

Suggested change
const ZSTD_match4Found findMatch = useCmov ? ZSTD_match4Found_cmov : ZSTD_match4Found_branch;
const ZSTD_match4Found foundMatch = useCmov ? ZSTD_match4Found_cmov : ZSTD_match4Found_branch;

findMatch -> matchFound
since it's a test, as opposed to an active search operation.
suggested by @terrelln
@Cyan4973 (Contributor, Author) commented:

All suggestions integrated

@Cyan4973 merged commit 8c38bda into dev on Oct 11, 2024
91 checks passed