Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Simplify ClippedReLU #5261

Closed
wants to merge 1 commit into from

Conversation

cj5716
Copy link
Contributor

@cj5716 cj5716 commented May 18, 2024

Removes some max calls

Some speedup stats, courtesy of @AndyGrant (albeit measured in an alternate implementation)
Dev 749240 nps
Base 748495 nps
Gain 0.100%
289936 games

STC:
LLR: 2.94 (-2.94,2.94) <-1.75,0.25>
Total: 203040 W: 52213 L: 52179 D: 98648
Ptnml(0-2): 480, 20722, 59139, 20642, 537
https://tests.stockfishchess.org/tests/view/664805fe6dcff0d1d6b05f2c

closes #5261

No functional change

cj5716 added a commit to cj5716/Stockfish that referenced this pull request May 18, 2024
Removes some max calls, and removes the need for SSSE3 specialisation.

LLR: 2.94 (-2.94,2.94) <-1.75,0.25>
Total: 203040 W: 52213 L: 52179 D: 98648
Ptnml(0-2): 480, 20722, 59139, 20642, 537
https://tests.stockfishchess.org/tests/view/664805fe6dcff0d1d6b05f2c

closes official-stockfish#5261

No functional change
@cj5716 cj5716 force-pushed the better-clipped-relu branch from 2a29ad5 to 6742ef5 Compare May 18, 2024 08:27
Copy link

github-actions bot commented May 18, 2024

clang-format 17 needs to be run on this PR.
If you do not have clang-format installed, the maintainer will run it when merging.
For the exact version please see https://packages.ubuntu.com/mantic/clang-format-17.

(execution 9146055348 / attempt 1)

cj5716 added a commit to cj5716/Stockfish that referenced this pull request May 18, 2024
Removes some max calls, and removes the need for SSSE3 specialisation.

Some speedup stats, courtesy of @AndyGrant
Dev  749240 nps
Base 748495 nps
Gain 0.100%
289936 games

LLR: 2.94 (-2.94,2.94) <-1.75,0.25>
Total: 203040 W: 52213 L: 52179 D: 98648
Ptnml(0-2): 480, 20722, 59139, 20642, 537
https://tests.stockfishchess.org/tests/view/664805fe6dcff0d1d6b05f2c

closes official-stockfish#5261

No functional change
@cj5716 cj5716 force-pushed the better-clipped-relu branch from 6742ef5 to 0a8372c Compare May 18, 2024 08:29
cj5716 added a commit to cj5716/Stockfish that referenced this pull request May 18, 2024
Removes some max calls, and removes the need for SSSE3 specialisation.

Some speedup stats, courtesy of @AndyGrant
Dev  749240 nps
Base 748495 nps
Gain 0.100%
289936 games

LLR: 2.94 (-2.94,2.94) <-1.75,0.25>
Total: 203040 W: 52213 L: 52179 D: 98648
Ptnml(0-2): 480, 20722, 59139, 20642, 537
https://tests.stockfishchess.org/tests/view/664805fe6dcff0d1d6b05f2c

closes official-stockfish#5261

No functional change
@cj5716 cj5716 force-pushed the better-clipped-relu branch 2 times, most recently from c6f03c0 to 3d4c2eb Compare May 18, 2024 08:32
cj5716 added a commit to cj5716/Stockfish that referenced this pull request May 18, 2024
Removes some max calls, and removes the need for SSSE3 specialisation.

Some speedup stats, courtesy of @AndyGrant
Dev  749240 nps
Base 748495 nps
Gain 0.100%
289936 games

STC:
LLR: 2.94 (-2.94,2.94) <-1.75,0.25>
Total: 203040 W: 52213 L: 52179 D: 98648
Ptnml(0-2): 480, 20722, 59139, 20642, 537
https://tests.stockfishchess.org/tests/view/664805fe6dcff0d1d6b05f2c

closes official-stockfish#5261

No functional change
cj5716 added a commit to cj5716/Stockfish that referenced this pull request May 18, 2024
Removes some max calls, and removes the need for SSSE3 specialisation.

Some speedup stats, courtesy of @AndyGrant (albeit measured in an alternate implementation)
Dev  749240 nps
Base 748495 nps
Gain 0.100%
289936 games

STC:
LLR: 2.94 (-2.94,2.94) <-1.75,0.25>
Total: 203040 W: 52213 L: 52179 D: 98648
Ptnml(0-2): 480, 20722, 59139, 20642, 537
https://tests.stockfishchess.org/tests/view/664805fe6dcff0d1d6b05f2c

closes official-stockfish#5261

No functional change
@cj5716 cj5716 force-pushed the better-clipped-relu branch from 3d4c2eb to 1ada8a4 Compare May 18, 2024 08:42
@vondele vondele requested a review from Sopel97 May 18, 2024 08:43
@mstembera
Copy link
Contributor

Congrats! It looks like changing _mm256_packs_epi32() to _mm256_packus_epi32() makes the largest positive values go from 32767 to 65535. I don't understand how that doesn't change the results but it seems not to. Noting that for positive values both _mm256_srai_epi16() and _mm256_srli_epi16() are equivalent both shifting in 0's.

@cj5716
Copy link
Contributor Author

cj5716 commented May 18, 2024

Though that is true, for _mm256_srai_epi16() it treats the input vectors as int16s rather than u16s (which they are as outputs from packus), so any value above 32767 is regarded as negative even though it isn't, because the 16th bit is occupied, which is interpreted as the sign bit.

@AndyGrant
Copy link

AndyGrant commented May 18, 2024

Suppose an integer coming out of simd_packs_epi32 has a value of 32767, and coming out of simd_packus_epi32 it was a value of 65535.

For the case of 32767, the next step shifts down by 6, resulting in 511.The next step packs this again, down to int8, using signed saturation, producing a max of 127.

For the case of 65535, the next step shifts down by 6, resulting in 1023. The next step packs this again, down to int8, using signed saturation, producing a max of 127.

Both have the same result. More generally, any input to the crelu whose value is >= 127 << WeightScaleBits = 127 << 6 = 8128 will have the same output.

--

This idea can and has been generalized to other places. 3+ years ago I swapped Ethereal to do this doing the halfkp relu phase. Unfortunately it does not seem to be useful to Stockfish ( or Torch ) for that part of the code. If mullo_epi16 would preserve the signedness, then this could be used to remove 50% of the max operations during the halfkp-pairwise mat-mul relu deal.

@mstembera
Copy link
Contributor

Ah ok so the subsequent call to _mm256_packs_epi16() makes it work out in the end. Thanks for the explanation.

@mstembera
Copy link
Contributor

This was the only place USE_SSE41 was actually used. Seems we could remove the x86-64-sse41-popcnt and x86-32-sse41-popcnt ARCHs.

@vondele
Copy link
Member

vondele commented May 18, 2024

This was the only place USE_SSE41 was actually used. Seems we could remove the x86-64-sse41-popcnt and x86-32-sse41-popcnt ARCHs.

generally agree with that, now this arch is linked to from the website and used in many places (I believe that was the old 'modern'), so removing that would need some care (CI, website, docs) and thus best done in a separate PR I think.

@vondele
Copy link
Member

vondele commented May 18, 2024

seems like the PR doesn't compile on a number of target (see CI).

@AndyGrant
Copy link

AndyGrant commented May 18, 2024

That would be because this is an sse4.1 thing. At least, _mm_packs_epi32 is SSE4.1, while _mm_packs_epi16 is SSE2. You would think you could maybe swap the order of things around, to accomplish the same result, but you can't. If the final pack was an unsigned saturate, you would allow values > 127, which is a functional change. It would also allow for overflows in future layers due to maddubs, as its important that inputs are within [0..127] and not [0..255].

@mstembera
Copy link
Contributor

Indeed _mm_packus_epi32 is SSE4.1 https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.htm#text=_mm_packus_epi32&ig_expand=4880 Looks like the SSE2 portion will have to stay the same.

}
}
constexpr IndexType Start = InputDimensions % SimdWidth == 0
? InputDimensions / SimdWidth * SimdWidth
? InputDimensions
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is not equivalent

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If InputDimensions % SimdWidth == 0 is true then InputDimensions / SimdWidth * SimdWidth should result in InputDimensions no?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

right, sorry

I'd still consider keeping it because it conveys some meaning

cj5716 added a commit to cj5716/Stockfish that referenced this pull request May 19, 2024
Removes some max calls

Some speedup stats, courtesy of @AndyGrant (albeit measured in an alternate implementation)
Dev  749240 nps
Base 748495 nps
Gain 0.100%
289936 games

STC:
LLR: 2.94 (-2.94,2.94) <-1.75,0.25>
Total: 203040 W: 52213 L: 52179 D: 98648
Ptnml(0-2): 480, 20722, 59139, 20642, 537
https://tests.stockfishchess.org/tests/view/664805fe6dcff0d1d6b05f2c

closes official-stockfish#5261

No functional change
@cj5716 cj5716 force-pushed the better-clipped-relu branch from 1ada8a4 to 0302b72 Compare May 19, 2024 00:18
@cj5716
Copy link
Contributor Author

cj5716 commented May 19, 2024

Now that it has been fixed, I don't think this is much of a simplification anymore, although the avx2 and sse41 paths have been simplified, it required more code duplication for sse2

@mstembera
Copy link
Contributor

mstembera commented May 19, 2024

We don't have to add new SSE4.1 code. We could just leave the sse2 part as is in master or still completely remove the USE_SSE41 stuff as I doubt it's contributing much of anything measurable.

@cj5716
Copy link
Contributor Author

cj5716 commented May 19, 2024

I've had to duplicate some code here https://github.com/official-stockfish/Stockfish/pull/5261/files#diff-4b4f6c1b9d116396315d89c89e9b100afacb490db79c275338d814b9109083dfL112, unless you mean leaving that part as is as well?

@mstembera
Copy link
Contributor

Ya that was my suggestion but it's really up to the maintainers.

@vondele
Copy link
Member

vondele commented May 19, 2024

I'd guess simplifying avx2 and sse41 parts is worth it, eventually I can see we could drop sse2-only specialization?

Download stats of SF here https://hanadigital.github.io/grev/?user=official-stockfish&repo=Stockfish should help decide?
(click on sf16.1 release to have enough data to look at the architecture specific downloads).

Removes some max calls

Some speedup stats, courtesy of @AndyGrant (albeit measured in an alternate implementation)
Dev  749240 nps
Base 748495 nps
Gain 0.100%
289936 games

STC:
LLR: 2.94 (-2.94,2.94) <-1.75,0.25>
Total: 203040 W: 52213 L: 52179 D: 98648
Ptnml(0-2): 480, 20722, 59139, 20642, 537
https://tests.stockfishchess.org/tests/view/664805fe6dcff0d1d6b05f2c

closes official-stockfish#5261

No functional change
@cj5716 cj5716 force-pushed the better-clipped-relu branch from 0302b72 to e76a34a Compare May 19, 2024 07:46
@cj5716
Copy link
Contributor Author

cj5716 commented May 19, 2024

I'd presume the current state is fine then (simplifying avx2 and sse41, maintaining sse2)

@vondele vondele added simplification A simplification patch to be merged Will be merged shortly labels May 21, 2024
@vondele vondele closed this in 27eb49a May 21, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
simplification A simplification patch to be merged Will be merged shortly
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants