Simplify ClippedReLU #5261

cj5716 · 2024-05-18T08:26:30Z

Removes some max calls

Some speedup stats, courtesy of @AndyGrant (albeit measured in an alternate implementation)
Dev 749240 nps
Base 748495 nps
Gain 0.100%
289936 games

STC:
LLR: 2.94 (-2.94,2.94) <-1.75,0.25>
Total: 203040 W: 52213 L: 52179 D: 98648
Ptnml(0-2): 480, 20722, 59139, 20642, 537
https://tests.stockfishchess.org/tests/view/664805fe6dcff0d1d6b05f2c

closes #5261

No functional change

Removes some max calls, and removes the need for SSSE3 specialisation. LLR: 2.94 (-2.94,2.94) <-1.75,0.25> Total: 203040 W: 52213 L: 52179 D: 98648 Ptnml(0-2): 480, 20722, 59139, 20642, 537 https://tests.stockfishchess.org/tests/view/664805fe6dcff0d1d6b05f2c closes official-stockfish#5261 No functional change

github-actions · 2024-05-18T08:27:03Z

clang-format 17 needs to be run on this PR.
If you do not have clang-format installed, the maintainer will run it when merging.
For the exact version please see https://packages.ubuntu.com/mantic/clang-format-17.

(execution 9146055348 / attempt 1)

@AndyGrant

Removes some max calls, and removes the need for SSSE3 specialisation. Some speedup stats, courtesy of @AndyGrant Dev 749240 nps Base 748495 nps Gain 0.100% 289936 games LLR: 2.94 (-2.94,2.94) <-1.75,0.25> Total: 203040 W: 52213 L: 52179 D: 98648 Ptnml(0-2): 480, 20722, 59139, 20642, 537 https://tests.stockfishchess.org/tests/view/664805fe6dcff0d1d6b05f2c closes official-stockfish#5261 No functional change

@AndyGrant

Removes some max calls, and removes the need for SSSE3 specialisation. Some speedup stats, courtesy of @AndyGrant Dev 749240 nps Base 748495 nps Gain 0.100% 289936 games LLR: 2.94 (-2.94,2.94) <-1.75,0.25> Total: 203040 W: 52213 L: 52179 D: 98648 Ptnml(0-2): 480, 20722, 59139, 20642, 537 https://tests.stockfishchess.org/tests/view/664805fe6dcff0d1d6b05f2c closes official-stockfish#5261 No functional change

@AndyGrant

Removes some max calls, and removes the need for SSSE3 specialisation. Some speedup stats, courtesy of @AndyGrant Dev 749240 nps Base 748495 nps Gain 0.100% 289936 games STC: LLR: 2.94 (-2.94,2.94) <-1.75,0.25> Total: 203040 W: 52213 L: 52179 D: 98648 Ptnml(0-2): 480, 20722, 59139, 20642, 537 https://tests.stockfishchess.org/tests/view/664805fe6dcff0d1d6b05f2c closes official-stockfish#5261 No functional change

@AndyGrant

Removes some max calls, and removes the need for SSSE3 specialisation. Some speedup stats, courtesy of @AndyGrant (albeit measured in an alternate implementation) Dev 749240 nps Base 748495 nps Gain 0.100% 289936 games STC: LLR: 2.94 (-2.94,2.94) <-1.75,0.25> Total: 203040 W: 52213 L: 52179 D: 98648 Ptnml(0-2): 480, 20722, 59139, 20642, 537 https://tests.stockfishchess.org/tests/view/664805fe6dcff0d1d6b05f2c closes official-stockfish#5261 No functional change

mstembera · 2024-05-18T08:54:29Z

Congrats! It looks like changing _mm256_packs_epi32() to _mm256_packus_epi32() makes the largest positive values go from 32767 to 65535. I don't understand how that doesn't change the results but it seems not to. Noting that for positive values both _mm256_srai_epi16() and _mm256_srli_epi16() are equivalent both shifting in 0's.

cj5716 · 2024-05-18T09:01:54Z

Though that is true, for _mm256_srai_epi16() it treats the input vectors as int16s rather than u16s (which they are as outputs from packus), so any value above 32767 is regarded as negative even though it isn't, because the 16th bit is occupied, which is interpreted as the sign bit.

AndyGrant · 2024-05-18T09:12:27Z

Suppose an integer coming out of simd_packs_epi32 has a value of 32767, and coming out of simd_packus_epi32 it was a value of 65535.

For the case of 32767, the next step shifts down by 6, resulting in 511.The next step packs this again, down to int8, using signed saturation, producing a max of 127.

For the case of 65535, the next step shifts down by 6, resulting in 1023. The next step packs this again, down to int8, using signed saturation, producing a max of 127.

Both have the same result. More generally, any input to the crelu whose value is >= 127 << WeightScaleBits = 127 << 6 = 8128 will have the same output.

--

This idea can and has been generalized to other places. 3+ years ago I swapped Ethereal to do this doing the halfkp relu phase. Unfortunately it does not seem to be useful to Stockfish ( or Torch ) for that part of the code. If mullo_epi16 would preserve the signedness, then this could be used to remove 50% of the max operations during the halfkp-pairwise mat-mul relu deal.

mstembera · 2024-05-18T09:27:31Z

Ah ok so the subsequent call to _mm256_packs_epi16() makes it work out in the end. Thanks for the explanation.

mstembera · 2024-05-18T10:15:20Z

This was the only place USE_SSE41 was actually used. Seems we could remove the x86-64-sse41-popcnt and x86-32-sse41-popcnt ARCHs.

vondele · 2024-05-18T11:23:28Z

This was the only place USE_SSE41 was actually used. Seems we could remove the x86-64-sse41-popcnt and x86-32-sse41-popcnt ARCHs.

generally agree with that, now this arch is linked to from the website and used in many places (I believe that was the old 'modern'), so removing that would need some care (CI, website, docs) and thus best done in a separate PR I think.

vondele · 2024-05-18T16:06:47Z

seems like the PR doesn't compile on a number of target (see CI).

AndyGrant · 2024-05-18T16:13:32Z

That would be because this is an sse4.1 thing. At least, _mm_packs_epi32 is SSE4.1, while _mm_packs_epi16 is SSE2. You would think you could maybe swap the order of things around, to accomplish the same result, but you can't. If the final pack was an unsigned saturate, you would allow values > 127, which is a functional change. It would also allow for overflows in future layers due to maddubs, as its important that inputs are within [0..127] and not [0..255].

mstembera · 2024-05-18T19:03:35Z

Indeed _mm_packus_epi32 is SSE4.1 https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.htm#text=_mm_packus_epi32&ig_expand=4880 Looks like the SSE2 portion will have to stay the same.

Sopel97 · 2024-05-18T20:02:28Z

src/nnue/layers/clipped_relu.h

            }
        }
        constexpr IndexType Start = InputDimensions % SimdWidth == 0
-                                    ? InputDimensions / SimdWidth * SimdWidth
+                                    ? InputDimensions


this is not equivalent

If InputDimensions % SimdWidth == 0 is true then InputDimensions / SimdWidth * SimdWidth should result in InputDimensions no?

right, sorry

I'd still consider keeping it because it conveys some meaning

@AndyGrant

Removes some max calls Some speedup stats, courtesy of @AndyGrant (albeit measured in an alternate implementation) Dev 749240 nps Base 748495 nps Gain 0.100% 289936 games STC: LLR: 2.94 (-2.94,2.94) <-1.75,0.25> Total: 203040 W: 52213 L: 52179 D: 98648 Ptnml(0-2): 480, 20722, 59139, 20642, 537 https://tests.stockfishchess.org/tests/view/664805fe6dcff0d1d6b05f2c closes official-stockfish#5261 No functional change

cj5716 · 2024-05-19T00:19:28Z

Now that it has been fixed, I don't think this is much of a simplification anymore, although the avx2 and sse41 paths have been simplified, it required more code duplication for sse2

mstembera · 2024-05-19T00:36:45Z

We don't have to add new SSE4.1 code. We could just leave the sse2 part as is in master or still completely remove the USE_SSE41 stuff as I doubt it's contributing much of anything measurable.

cj5716 · 2024-05-19T00:42:59Z

I've had to duplicate some code here https://github.com/official-stockfish/Stockfish/pull/5261/files#diff-4b4f6c1b9d116396315d89c89e9b100afacb490db79c275338d814b9109083dfL112, unless you mean leaving that part as is as well?

mstembera · 2024-05-19T00:49:35Z

Ya that was my suggestion but it's really up to the maintainers.

vondele · 2024-05-19T07:29:31Z

I'd guess simplifying avx2 and sse41 parts is worth it, eventually I can see we could drop sse2-only specialization?

Download stats of SF here https://hanadigital.github.io/grev/?user=official-stockfish&repo=Stockfish should help decide?
(click on sf16.1 release to have enough data to look at the architecture specific downloads).

@AndyGrant

Removes some max calls Some speedup stats, courtesy of @AndyGrant (albeit measured in an alternate implementation) Dev 749240 nps Base 748495 nps Gain 0.100% 289936 games STC: LLR: 2.94 (-2.94,2.94) <-1.75,0.25> Total: 203040 W: 52213 L: 52179 D: 98648 Ptnml(0-2): 480, 20722, 59139, 20642, 537 https://tests.stockfishchess.org/tests/view/664805fe6dcff0d1d6b05f2c closes official-stockfish#5261 No functional change

cj5716 · 2024-05-19T07:47:09Z

I'd presume the current state is fine then (simplifying avx2 and sse41, maintaining sse2)

cj5716 force-pushed the better-clipped-relu branch from 2a29ad5 to 6742ef5 Compare May 18, 2024 08:27

cj5716 force-pushed the better-clipped-relu branch from 6742ef5 to 0a8372c Compare May 18, 2024 08:29

cj5716 force-pushed the better-clipped-relu branch 2 times, most recently from c6f03c0 to 3d4c2eb Compare May 18, 2024 08:32

cj5716 force-pushed the better-clipped-relu branch from 3d4c2eb to 1ada8a4 Compare May 18, 2024 08:42

vondele requested a review from Sopel97 May 18, 2024 08:43

Sopel97 reviewed May 18, 2024

View reviewed changes

cj5716 force-pushed the better-clipped-relu branch from 1ada8a4 to 0302b72 Compare May 19, 2024 00:18

cj5716 force-pushed the better-clipped-relu branch from 0302b72 to e76a34a Compare May 19, 2024 07:46

vondele added simplification A simplification patch to be merged Will be merged shortly labels May 21, 2024

vondele closed this in 27eb49a May 21, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Simplify ClippedReLU #5261

Simplify ClippedReLU #5261

cj5716 commented May 18, 2024 •

edited

Loading

github-actions bot commented May 18, 2024 •

edited

Loading

mstembera commented May 18, 2024

cj5716 commented May 18, 2024 •

edited

Loading

AndyGrant commented May 18, 2024 •

edited

Loading

mstembera commented May 18, 2024

mstembera commented May 18, 2024

vondele commented May 18, 2024 •

edited

Loading

vondele commented May 18, 2024

AndyGrant commented May 18, 2024 •

edited

Loading

mstembera commented May 18, 2024

Sopel97 May 18, 2024

mstembera May 18, 2024

Sopel97 May 18, 2024

cj5716 commented May 19, 2024

mstembera commented May 19, 2024 •

edited

Loading

cj5716 commented May 19, 2024

mstembera commented May 19, 2024

vondele commented May 19, 2024

cj5716 commented May 19, 2024

Simplify ClippedReLU #5261

Simplify ClippedReLU #5261

Conversation

cj5716 commented May 18, 2024 • edited Loading

github-actions bot commented May 18, 2024 • edited Loading

mstembera commented May 18, 2024

cj5716 commented May 18, 2024 • edited Loading

AndyGrant commented May 18, 2024 • edited Loading

mstembera commented May 18, 2024

mstembera commented May 18, 2024

vondele commented May 18, 2024 • edited Loading

vondele commented May 18, 2024

AndyGrant commented May 18, 2024 • edited Loading

mstembera commented May 18, 2024

Sopel97 May 18, 2024

Choose a reason for hiding this comment

mstembera May 18, 2024

Choose a reason for hiding this comment

Sopel97 May 18, 2024

Choose a reason for hiding this comment

cj5716 commented May 19, 2024

mstembera commented May 19, 2024 • edited Loading

cj5716 commented May 19, 2024

mstembera commented May 19, 2024

vondele commented May 19, 2024

cj5716 commented May 19, 2024

cj5716 commented May 18, 2024 •

edited

Loading

github-actions bot commented May 18, 2024 •

edited

Loading

cj5716 commented May 18, 2024 •

edited

Loading

AndyGrant commented May 18, 2024 •

edited

Loading

vondele commented May 18, 2024 •

edited

Loading

AndyGrant commented May 18, 2024 •

edited

Loading

mstembera commented May 19, 2024 •

edited

Loading