-
Notifications
You must be signed in to change notification settings - Fork 2.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Simplify ClippedReLU #5261
Simplify ClippedReLU #5261
Conversation
Removes some max calls, and removes the need for SSSE3 specialisation. LLR: 2.94 (-2.94,2.94) <-1.75,0.25> Total: 203040 W: 52213 L: 52179 D: 98648 Ptnml(0-2): 480, 20722, 59139, 20642, 537 https://tests.stockfishchess.org/tests/view/664805fe6dcff0d1d6b05f2c closes official-stockfish#5261 No functional change
2a29ad5
to
6742ef5
Compare
clang-format 17 needs to be run on this PR. (execution 9146055348 / attempt 1) |
Removes some max calls, and removes the need for SSSE3 specialisation. Some speedup stats, courtesy of @AndyGrant Dev 749240 nps Base 748495 nps Gain 0.100% 289936 games LLR: 2.94 (-2.94,2.94) <-1.75,0.25> Total: 203040 W: 52213 L: 52179 D: 98648 Ptnml(0-2): 480, 20722, 59139, 20642, 537 https://tests.stockfishchess.org/tests/view/664805fe6dcff0d1d6b05f2c closes official-stockfish#5261 No functional change
6742ef5
to
0a8372c
Compare
Removes some max calls, and removes the need for SSSE3 specialisation. Some speedup stats, courtesy of @AndyGrant Dev 749240 nps Base 748495 nps Gain 0.100% 289936 games LLR: 2.94 (-2.94,2.94) <-1.75,0.25> Total: 203040 W: 52213 L: 52179 D: 98648 Ptnml(0-2): 480, 20722, 59139, 20642, 537 https://tests.stockfishchess.org/tests/view/664805fe6dcff0d1d6b05f2c closes official-stockfish#5261 No functional change
c6f03c0
to
3d4c2eb
Compare
Removes some max calls, and removes the need for SSSE3 specialisation. Some speedup stats, courtesy of @AndyGrant Dev 749240 nps Base 748495 nps Gain 0.100% 289936 games STC: LLR: 2.94 (-2.94,2.94) <-1.75,0.25> Total: 203040 W: 52213 L: 52179 D: 98648 Ptnml(0-2): 480, 20722, 59139, 20642, 537 https://tests.stockfishchess.org/tests/view/664805fe6dcff0d1d6b05f2c closes official-stockfish#5261 No functional change
Removes some max calls, and removes the need for SSSE3 specialisation. Some speedup stats, courtesy of @AndyGrant (albeit measured in an alternate implementation) Dev 749240 nps Base 748495 nps Gain 0.100% 289936 games STC: LLR: 2.94 (-2.94,2.94) <-1.75,0.25> Total: 203040 W: 52213 L: 52179 D: 98648 Ptnml(0-2): 480, 20722, 59139, 20642, 537 https://tests.stockfishchess.org/tests/view/664805fe6dcff0d1d6b05f2c closes official-stockfish#5261 No functional change
3d4c2eb
to
1ada8a4
Compare
Congrats! It looks like changing _mm256_packs_epi32() to _mm256_packus_epi32() makes the largest positive values go from 32767 to 65535. I don't understand how that doesn't change the results but it seems not to. Noting that for positive values both _mm256_srai_epi16() and _mm256_srli_epi16() are equivalent both shifting in 0's. |
Though that is true, for _mm256_srai_epi16() it treats the input vectors as int16s rather than u16s (which they are as outputs from packus), so any value above 32767 is regarded as negative even though it isn't, because the 16th bit is occupied, which is interpreted as the sign bit. |
Suppose an integer coming out of simd_packs_epi32 has a value of 32767, and coming out of simd_packus_epi32 it was a value of 65535. For the case of 32767, the next step shifts down by 6, resulting in 511.The next step packs this again, down to int8, using signed saturation, producing a max of 127. For the case of 65535, the next step shifts down by 6, resulting in 1023. The next step packs this again, down to int8, using signed saturation, producing a max of 127. Both have the same result. More generally, any input to the crelu whose value is -- This idea can and has been generalized to other places. 3+ years ago I swapped Ethereal to do this doing the halfkp relu phase. Unfortunately it does not seem to be useful to Stockfish ( or Torch ) for that part of the code. If mullo_epi16 would preserve the signedness, then this could be used to remove 50% of the max operations during the halfkp-pairwise mat-mul relu deal. |
Ah ok so the subsequent call to _mm256_packs_epi16() makes it work out in the end. Thanks for the explanation. |
This was the only place USE_SSE41 was actually used. Seems we could remove the x86-64-sse41-popcnt and x86-32-sse41-popcnt ARCHs. |
generally agree with that, now this arch is linked to from the website and used in many places (I believe that was the old 'modern'), so removing that would need some care (CI, website, docs) and thus best done in a separate PR I think. |
seems like the PR doesn't compile on a number of target (see CI). |
That would be because this is an sse4.1 thing. At least, |
Indeed _mm_packus_epi32 is SSE4.1 https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.htm#text=_mm_packus_epi32&ig_expand=4880 Looks like the SSE2 portion will have to stay the same. |
src/nnue/layers/clipped_relu.h
Outdated
} | ||
} | ||
constexpr IndexType Start = InputDimensions % SimdWidth == 0 | ||
? InputDimensions / SimdWidth * SimdWidth | ||
? InputDimensions |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is not equivalent
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If InputDimensions % SimdWidth == 0
is true then InputDimensions / SimdWidth * SimdWidth
should result in InputDimensions
no?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
right, sorry
I'd still consider keeping it because it conveys some meaning
Removes some max calls Some speedup stats, courtesy of @AndyGrant (albeit measured in an alternate implementation) Dev 749240 nps Base 748495 nps Gain 0.100% 289936 games STC: LLR: 2.94 (-2.94,2.94) <-1.75,0.25> Total: 203040 W: 52213 L: 52179 D: 98648 Ptnml(0-2): 480, 20722, 59139, 20642, 537 https://tests.stockfishchess.org/tests/view/664805fe6dcff0d1d6b05f2c closes official-stockfish#5261 No functional change
1ada8a4
to
0302b72
Compare
Now that it has been fixed, I don't think this is much of a simplification anymore, although the avx2 and sse41 paths have been simplified, it required more code duplication for sse2 |
We don't have to add new SSE4.1 code. We could just leave the sse2 part as is in master or still completely remove the USE_SSE41 stuff as I doubt it's contributing much of anything measurable. |
I've had to duplicate some code here https://github.com/official-stockfish/Stockfish/pull/5261/files#diff-4b4f6c1b9d116396315d89c89e9b100afacb490db79c275338d814b9109083dfL112, unless you mean leaving that part as is as well? |
Ya that was my suggestion but it's really up to the maintainers. |
I'd guess simplifying avx2 and sse41 parts is worth it, eventually I can see we could drop sse2-only specialization? Download stats of SF here https://hanadigital.github.io/grev/?user=official-stockfish&repo=Stockfish should help decide? |
Removes some max calls Some speedup stats, courtesy of @AndyGrant (albeit measured in an alternate implementation) Dev 749240 nps Base 748495 nps Gain 0.100% 289936 games STC: LLR: 2.94 (-2.94,2.94) <-1.75,0.25> Total: 203040 W: 52213 L: 52179 D: 98648 Ptnml(0-2): 480, 20722, 59139, 20642, 537 https://tests.stockfishchess.org/tests/view/664805fe6dcff0d1d6b05f2c closes official-stockfish#5261 No functional change
0302b72
to
e76a34a
Compare
I'd presume the current state is fine then (simplifying avx2 and sse41, maintaining sse2) |
Removes some max calls
Some speedup stats, courtesy of @AndyGrant (albeit measured in an alternate implementation)
Dev 749240 nps
Base 748495 nps
Gain 0.100%
289936 games
STC:
LLR: 2.94 (-2.94,2.94) <-1.75,0.25>
Total: 203040 W: 52213 L: 52179 D: 98648
Ptnml(0-2): 480, 20722, 59139, 20642, 537
https://tests.stockfishchess.org/tests/view/664805fe6dcff0d1d6b05f2c
closes #5261
No functional change